Opened 16 years ago
Closed 10 years ago
#3733 closed defect (fixed)
[PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives
Reported by: | Owned by: | team | |
---|---|---|---|
Priority: | normal | Milestone: | 14.12 |
Component: | Core validator | Version: | latest |
Keywords: | similar name | Cc: | AM909, mdk |
Description (last modified by )
The SimilarNamedWays test just uses Levenshtein distance to determine if ways have a similar name. This is turning up a lot of false positives for the Iceland data (and presumably other locations). In Iceland it's common to have ways in the same suburb that share the same suffix or prefix. For example:
- Fagraholt
- Hafraholt
- Hlíðarberg
- Hlíðartorg
- Hjallabraut
- Hjallahraun
- Nóatún
- Sóltún
- Austurvegur
- Vesturvegur
Attachments (3)
Change History (27)
follow-up: 2 comment:1 by , 16 years ago
comment:2 by , 16 years ago
Replying to stoecker:
Well, so what improvement do you suggest? Actually these are similar named ways.
These would probably help:
- Use distance/abs(length(str1), length(str2)) instead of just distance.
Perhaps doing this would also mean that the test could be used for strings < 6 chars in length (instead of being skipped as it is now)
- Compute how many parts the two strings have in common.
FOOafe and BARafe have the same distance as FlOrghafe BlaArghafe have the same distance (3) but the former have 1 string in common ("afe") but the latter 2 ("l", "rghafe")
I haven't tested either of those. I just wanted to note the issue.
comment:5 by , 13 years ago
Cc: | added |
---|---|
Description: | modified (diff) |
Keywords: | similar name added |
from #7741 AM909:
In r5181:
The validator's "similar named ways" gives incorrect warnings in (at least) the following situations:
- "East Foothill Drive" and "West Foothill Drive": Should not report when the diffs are just leading or trailing cardinal directions. Can be tested by a second pass at reported errors. Run both args through
s/East |West |North |South | East| West| North| South//g
and then diff again. If no diff, no error should be reported.
- "5th Street" and "6th Street": Should not report when the diffs are only leading numbers. Similar to above, remove all leading numbers (
1st |2nd |First |Second |etc.
) and diff again. No diff = no report.In the bounding box (34.0600548, -117.7425385, 34.1251634, -117.6477814) (BLTR), at least 50 of the 56 reported "similar named ways" fall into one of these two (invalid) scenarios.
mdk:
The same problem about railway platform names like "Gleis 1" and "Gleis 2".
EDIT: fix. @AM909: sorry
comment:6 by , 13 years ago
I played a little bit around with the parameters in SimilarNamedWays.getLevenshteinDistance(String s, String t).
The idea is to "normalize" both strings before calculating the distance. For my part ("Gleis 1" and "Gleis 2") the is a simple rule:
// step 0 normalize digits s = s.replaceAll("\\d+", "0"); t = t.replaceAll("\\d+", "0");
This will also solve "Gleis 5" == "Gleis 12" and "track 2 - 4" == "track 10 - 11", but still "gleis 3" != "Gleis 3".
Other "normalization" rules could be:
// East = West = North = South replaceAll("East|West|North", "South"); replaceAll("east|west|north", "south"); // 1st = 2nd = 3rd = 4th = ... replaceAll("\\d+st|\\d+nd|\\d+rd|\\d+th", "0th"); // First = Second = Third replaceAll("Second|Third", "First"); replaceAll("second|third", "first");
So differences in case and spaces are still recognized.
Perhaps the rules can be defined as regular expressions plus replacements in the Settings.
I don't know the details of who uses getLevenshteinDistance() and if we can generally extend this function. So I can't say where is the best place for such a normalization.
comment:7 by , 13 years ago
Sounds good. Please prepare a patch as you see fit and we will check that it doesn't break anything.
comment:8 by , 13 years ago
Such rules should be loaded from a file in data dir when this is implemented.
follow-up: 10 comment:9 by , 13 years ago
OK, I found some time to implement the function. But it was a little bit more complex than I thought.
First I have to distinguish between "regular expression rules" like
- "\d+" => "0"
- "\d+(st|nd|rd|th)" => "1st"
and "synonym rules" like
- "east" = "west" = "north" = "south"
- "first" = "second" = "third"
The actual implementation (see unit tests) will also handle typos as expected:
- similaryName("First Street", "Second Street") = false, because they are "identical"
- similaryName("First Street", "second Street") = true, because "first" = "second" and the distance between "Second Street" and "second Street" is 1.
- similaryName("Forst Street", "Second Street") = true, because "first" = "second" and the distance between "Forst Street" and "First Street" is 1.
- similaryName("First Street", "Soconds Street") = true, because "first" = "second" and the distance between "Second Street" and "Soconds Street" is 2.
- similaryName("First Street", "Soconds Stret") = false, because "First Street" = "Second Street" = "Third Street", but all of them has a distance of more than 2 compared with "Soconds Stret".
TODO: Parsing a configuration file with the two different rules types and adding the rules (see unit test initialization).
comment:10 by , 13 years ago
Replying to mdk:
OK, I found some time to implement the function. But it was a little bit more complex than I thought.
First I have to distinguish between "regular expression rules" like
- "\d+" => "0"
- "\d+(st|nd|rd|th)" => "1st"
and "synonym rules" like
- "east" = "west" = "north" = "south"
- "first" = "second" = "third"
The actual implementation (see unit tests) will also handle typos as expected:
- similaryName("First Street", "Second Street") = false, because they are "identical"
- similaryName("First Street", "second Street") = true, because "first" = "second" and the distance between "Second Street" and "second Street" is 1.
- similaryName("Forst Street", "Second Street") = true, because "first" = "second" and the distance between "Forst Street" and "First Street" is 1.
- similaryName("First Street", "Soconds Street") = true, because "first" = "second" and the distance between "Second Street" and "Soconds Street" is 2.
- similaryName("First Street", "Soconds Stret") = false, because "First Street" = "Second Street" = "Third Street", but all of them has a distance of more than 2 compared with "Soconds Stret".
TODO: Parsing a configuration file with the two different rules types and adding the rules (see unit test initialization).
Please provide at least a set of rules, otherwise this will be dead code. Ideally you should read it from a config file, but a hard-coded list would be better than nothing.
comment:11 by , 13 years ago
Please provide at least a set of rules, otherwise this will be dead code. Ideally you should read it from a config file, but a hard-coded list would be better than nothing.
The rules in the unit test are real:
similarity_.addRegExprRule("\\d+", "0"); similarity_.addRegExprRule("\\d+(st|nd|rd|th)", "0st"); similarity_.addSynonyms("east", "west", "north", "south"); similarity_.addSynonyms("first", "second", "third");
I have no syntax for the configuration file, just the necessary calls to initialize. They cover all cases from comment:5.
comment:12 by , 13 years ago
Summary: | SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives → [PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives |
---|
comment:13 by , 13 years ago
Summary: | [PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives → [PATCH needs rework] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives |
---|
comment:15 by , 12 years ago
Hmm, this looks good, but is unusable in its current state. Can't you finish the patch? Simply use the default config handling, e.g. using an array setting or one of the other types and have a set of default rules?
comment:16 by , 12 years ago
Sorry for the late answer, but I was very busy.
I have no clue how to use "default config handling". For the first step it would be ok to hard code the rules from comment:11 (see also SimilarNamesTest.java).
For default config handling you must be able to define a list of string pairs (for regular expressions and replacement) and a list of string lists (for synonyms). The rules from comment:11 should be the default rules.
comment:17 by , 12 years ago
For Prefs see doc/org/openstreetmap/josm/data/Preferences.html.
Mainly instead of hardcoded stuff call Main.pref.getWhatever(name, default values) and then josm cares for storing and changing these values in advanced config. When really important a prefs GUI panel can be created later on.
Adding sensible default checks is also required, like e.g. the ones you have in test class.
If the are MANY checks instead of the storage in prefs, you could use a data file, like for tagchecker.
comment:18 by , 10 years ago
Hello,
What's the status of this patch? Is it not being integrated simply because the values are hard-coded? It would be better to have this in place and then refactor it into a config file if needed.
Otherwise, it would probably be better to disable the similar names test until such a patch is integrated, as the test generates many more false positives than it does actual errors.
comment:19 by , 10 years ago
It is still only code without any consequences. Neither are there defaults nor a config to define values. It seems joining the hardcoded test and the checking code may be a first step (using a prefs array for data storage).
comment:20 by , 10 years ago
I've attached a new patch which contains the above, with the rules moved into the validator. Additionally, it contains an extra test class which tests the complete rule.
comment:22 by , 10 years ago
Summary: | [PATCH needs rework] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives → [PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives |
---|
I think it got overlooked. Now it has to wait until after next release.
comment:23 by , 10 years ago
Milestone: | → 14.12 |
---|
Well, so what improvement do you suggest? Actually these are similar named ways.