Opened 13 years ago
Closed 13 years ago
#7359 closed enhancement (wontfix)
[patch] Fuzzy (approximate) string searching
Reported by: | joshdoe | Owned by: | team |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Core | Version: | |
Keywords: | fuzzy string search | Cc: | bastiK |
Description
I think it would be useful to enable a fuzzy (approximate) string search. We could use something like the Levenshtein distance for this. In the past I found several Java implementations for this, which I should link to here. We'd need to allow the user to choose whether or not to use fuzzy searching, ideally on a per-word basis, and to adjust how approximate the match has to be (the edit distance). I believe this should belong in core, but could be in a plugin, using a keyword like fuzzy
, perhaps in utilsplugin2.
Attachments (2)
Change History (14)
follow-up: 2 comment:1 by , 13 years ago
follow-up: 3 comment:2 by , 13 years ago
Replying to simon04:
JOSM has already a Levenshtein distance implementation (see
SimilarNamedWays
).
I didn't realize that, though I should have. I'd like to move that function out of SimilarNamedWays.java. I think I should put it in something like tools/TextUtils.java
, and create an interface like the following to allow for other implementations of string distance (inspired by SimMetrics):
public interface InterfaceStringMetric { // 0 - no match, 1 - perfect match public float getSimilarity(String string1, String string2); public float getUnNormalisedSimilarity(String string1, String string2); }
The remaining question is how such a fuzzy search is made? Possibly prepending the search term with a special character, like ~string
, or be more verbose and require a "special target", like fuzzy:string
?
comment:3 by , 13 years ago
Replying to joshdoe:
Replying to simon04:
JOSM has already a Levenshtein distance implementation (see
SimilarNamedWays
).
I didn't realize that, though I should have. I'd like to move that function out of SimilarNamedWays.java. I think I should put it in something like
tools/TextUtils.java
, and create an interface like the following to allow for other implementations of string distance (inspired by SimMetrics):
Taking #3733 into account it should not belong to SimilarNamedWays.java as this function hopefully will use a different search in future.
by , 13 years ago
Attachment: | extract_levenshtein.patch added |
---|
extract levenshtein code to StringMetrics
comment:4 by , 13 years ago
Summary: | Fuzzy (approximate) string searching → [patch] Fuzzy (approximate) string searching |
---|
I've attached a patch which extracts the Levenshtein code, allowing for use in the conflation plugin and other code, such as with fuzzy matching. I'm not sure the implementation is the best, so I'd appreciate a second look.
follow-up: 6 comment:5 by , 13 years ago
What about:
- having
StringMetric
as abstract class (representing the current interface) and being the only outer visible class - use
StringMetric.getByName/forName
in the code (as inSimilarNamedWays
)
What is the intention of the getSimilarity
method?
follow-up: 7 comment:6 by , 13 years ago
Replying to simon04:
What about:
- having
StringMetric
as abstract class (representing the current interface) and being the only outer visible class- use
StringMetric.getByName/forName
in the code (as inSimilarNamedWays
)
I'd like to keep the interface (leaves open the possibility of plugins providing implementations), but I agree that users shouldn't have to carry around an instance, so maybe something like getSimilarityByName(String string1, String string2, String method)
.
What is the intention of the
getSimilarity
method?
getSimilarity
is normalized from 0 to 1 so metrics can be easily swapped (not all metrics use a simple edit distance, i.e. number of characters changed). getUnNormalizedSimilarity
is provided for those who know what the raw similarity measure means.
follow-up: 8 comment:7 by , 13 years ago
Replying to joshdoe:
so maybe something like
getSimilarityByName(String string1, String string2, String method)
.
Or .getStringMetricByName(String method).getSimilarity(...)
To understand the aim of this modularization better: Should external plugins be able to provide an alternative String similarity measure for say SimilarNamedWays
?
comment:8 by , 13 years ago
Replying to simon04:
Replying to joshdoe:
so maybe something like
getSimilarityByName(String string1, String string2, String method)
.
Or
.getStringMetricByName(String method).getSimilarity(...)
Much better, done in new patch.
To understand the aim of this modularization better: Should external plugins be able to provide an alternative String similarity measure for say
SimilarNamedWays
?
I think plugins should be able to register alternative metrics, which core or other plugins can use, just like with projections. Though I haven't implemented such registering yet, let's wait for more than one implementation. :)
comment:11 by , 13 years ago
I may have overlooked something, but it seems this patch does not contain any essential changes. I would prefer not to move the function to a separate class, unless there is a good reason for it. I.e. first provide alternative implementations, a GUI to switch these implementations, etc.
Then we can still organize the code, this is really the easiest part.
comment:12 by , 13 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
JOSM has already a Levenshtein distance implementation (see
SimilarNamedWays
).