Modify

Opened 12 years ago

Closed 12 years ago

#7359 closed enhancement (wontfix)

[patch] Fuzzy (approximate) string searching

Reported by: joshdoe Owned by: team
Priority: normal Milestone:
Component: Core Version:
Keywords: fuzzy string search Cc: bastiK

Description

I think it would be useful to enable a fuzzy (approximate) string search. We could use something like the Levenshtein distance for this. In the past I found several Java implementations for this, which I should link to here. We'd need to allow the user to choose whether or not to use fuzzy searching, ideally on a per-word basis, and to adjust how approximate the match has to be (the edit distance). I believe this should belong in core, but could be in a plugin, using a keyword like fuzzy, perhaps in utilsplugin2.

Attachments (2)

extract_levenshtein.patch (6.1 KB ) - added by joshdoe 12 years ago.
extract levenshtein code to StringMetrics
extract_levenshtein2.patch (6.4 KB ) - added by joshdoe 12 years ago.
updates

Download all attachments as: .zip

Change History (14)

comment:1 by simon04, 12 years ago

JOSM has already a Levenshtein distance implementation (see SimilarNamedWays).

in reply to:  1 ; comment:2 by joshdoe, 12 years ago

Replying to simon04:

JOSM has already a Levenshtein distance implementation (see SimilarNamedWays).

I didn't realize that, though I should have. I'd like to move that function out of SimilarNamedWays.java. I think I should put it in something like tools/TextUtils.java, and create an interface like the following to allow for other implementations of string distance (inspired by SimMetrics):

public interface InterfaceStringMetric {
    // 0 - no match, 1 - perfect match
    public float getSimilarity(String string1, String string2);
    public float getUnNormalisedSimilarity(String string1, String string2);
}

The remaining question is how such a fuzzy search is made? Possibly prepending the search term with a special character, like ~string, or be more verbose and require a "special target", like fuzzy:string?

in reply to:  2 comment:3 by skyper, 12 years ago

Replying to joshdoe:

Replying to simon04:

JOSM has already a Levenshtein distance implementation (see SimilarNamedWays).

I didn't realize that, though I should have. I'd like to move that function out of SimilarNamedWays.java. I think I should put it in something like tools/TextUtils.java, and create an interface like the following to allow for other implementations of string distance (inspired by SimMetrics):

Taking #3733 into account it should not belong to SimilarNamedWays.java as this function hopefully will use a different search in future.

by joshdoe, 12 years ago

Attachment: extract_levenshtein.patch added

extract levenshtein code to StringMetrics

comment:4 by joshdoe, 12 years ago

Summary: Fuzzy (approximate) string searching[patch] Fuzzy (approximate) string searching

I've attached a patch which extracts the Levenshtein code, allowing for use in the conflation plugin and other code, such as with fuzzy matching. I'm not sure the implementation is the best, so I'd appreciate a second look.

comment:5 by simon04, 12 years ago

What about:

  • having StringMetric as abstract class (representing the current interface) and being the only outer visible class
  • use StringMetric.getByName/forName in the code (as in SimilarNamedWays)

What is the intention of the getSimilarity method?

in reply to:  5 ; comment:6 by joshdoe, 12 years ago

Replying to simon04:

What about:

  • having StringMetric as abstract class (representing the current interface) and being the only outer visible class
  • use StringMetric.getByName/forName in the code (as in SimilarNamedWays)

I'd like to keep the interface (leaves open the possibility of plugins providing implementations), but I agree that users shouldn't have to carry around an instance, so maybe something like getSimilarityByName(String string1, String string2, String method).

What is the intention of the getSimilarity method?

getSimilarity is normalized from 0 to 1 so metrics can be easily swapped (not all metrics use a simple edit distance, i.e. number of characters changed). getUnNormalizedSimilarity is provided for those who know what the raw similarity measure means.

in reply to:  6 ; comment:7 by simon04, 12 years ago

Replying to joshdoe:

so maybe something like getSimilarityByName(String string1, String string2, String method).

Or .getStringMetricByName(String method).getSimilarity(...)

To understand the aim of this modularization better: Should external plugins be able to provide an alternative String similarity measure for say SimilarNamedWays?

by joshdoe, 12 years ago

Attachment: extract_levenshtein2.patch added

updates

in reply to:  7 comment:8 by joshdoe, 12 years ago

Replying to simon04:

Replying to joshdoe:

so maybe something like getSimilarityByName(String string1, String string2, String method).

Or .getStringMetricByName(String method).getSimilarity(...)

Much better, done in new patch.

To understand the aim of this modularization better: Should external plugins be able to provide an alternative String similarity measure for say SimilarNamedWays?

I think plugins should be able to register alternative metrics, which core or other plugins can use, just like with projections. Though I haven't implemented such registering yet, let's wait for more than one implementation. :)

comment:9 by stoecker, 12 years ago

@simon04: What's the status of this patch?

comment:10 by stoecker, 12 years ago

Cc: bastiK added

see also #3733.

What is the status of this?

comment:11 by bastiK, 12 years ago

I may have overlooked something, but it seems this patch does not contain any essential changes. I would prefer not to move the function to a separate class, unless there is a good reason for it. I.e. first provide alternative implementations, a GUI to switch these implementations, etc.

Then we can still organize the code, this is really the easiest part.

comment:12 by bastiK, 12 years ago

Resolution: wontfix
Status: newclosed

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain team.
as The resolution will be set.
The resolution will be deleted. Next status will be 'reopened'.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.