Opened 4 years ago
Last modified 6 weeks ago
#3733 new defect
[PATCH needs rework] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives
| Reported by: | avar | Owned by: | team |
|---|---|---|---|
| Priority: | normal | Component: | Core validator |
| Version: | latest | Keywords: | similar name |
| Cc: | AM909, mdk |
Description (last modified by skyper)
The SimilarNamedWays test just uses Levenshtein distance to determine if ways have a similar name. This is turning up a lot of false positives for the Iceland data (and presumably other locations). In Iceland it's common to have ways in the same suburb that share the same suffix or prefix. For example:
- Fagraholt
- Hafraholt
- Hlíðarberg
- Hlíðartorg
- Hjallabraut
- Hjallahraun
- Nóatún
- Sóltún
- Austurvegur
- Vesturvegur
Attachments (2)
Change History (17)
comment:1 follow-up: ↓ 2 Changed 4 years ago by stoecker
comment:2 in reply to: ↑ 1 Changed 4 years ago by avar
Replying to stoecker:
Well, so what improvement do you suggest? Actually these are similar named ways.
These would probably help:
- Use distance/abs(length(str1), length(str2)) instead of just distance.
Perhaps doing this would also mean that the test could be used for strings < 6 chars in length (instead of being skipped as it is now)
- Compute how many parts the two strings have in common.
FOOafe and BARafe have the same distance as FlOrghafe BlaArghafe have the same distance (3) but the former have 1 string in common ("afe") but the latter 2 ("l", "rghafe")
I haven't tested either of those. I just wanted to note the issue.
comment:3 Changed 23 months ago by simon04
Ticket #5223 has been marked as a duplicate of this ticket.
comment:4 Changed 13 months ago by skyper
Ticket #7741 has been marked as a duplicate of this ticket.
comment:5 Changed 13 months ago by skyper
- Cc AM909 mdk added
- Description modified (diff)
- Keywords similar name added
from #7741 AM909:
In r5181:
The validator's "similar named ways" gives incorrect warnings in (at least) the following situations:
- "East Foothill Drive" and "West Foothill Drive": Should not report when the diffs are just leading or trailing cardinal directions. Can be tested by a second pass at reported errors. Run both args through s/East |West |North |South | East| West| North| South//g and then diff again. If no diff, no error should be reported.
- "5th Street" and "6th Street": Should not report when the diffs are only leading numbers. Similar to above, remove all leading numbers (1st |2nd |First |Second |etc.) and diff again. No diff = no report.
In the bounding box (34.0600548, -117.7425385, 34.1251634, -117.6477814) (BLTR), at least 50 of the 56 reported "similar named ways" fall into one of these two (invalid) scenarios.
mdk:
The same problem about railway platform names like "Gleis 1" and "Gleis 2".
EDIT: fix. @AM909: sorry
comment:6 Changed 9 months ago by mdk
I played a little bit around with the parameters in SimilarNamedWays.getLevenshteinDistance(String s, String t).
The idea is to "normalize" both strings before calculating the distance. For my part ("Gleis 1" and "Gleis 2") the is a simple rule:
// step 0 normalize digits
s = s.replaceAll("\\d+", "0");
t = t.replaceAll("\\d+", "0");
This will also solve "Gleis 5" == "Gleis 12" and "track 2 - 4" == "track 10 - 11", but still "gleis 3" != "Gleis 3".
Other "normalization" rules could be:
// East = West = North = South
replaceAll("East|West|North", "South");
replaceAll("east|west|north", "south");
// 1st = 2nd = 3rd = 4th = ...
replaceAll("\\d+st|\\d+nd|\\d+rd|\\d+th", "0th");
// First = Second = Third
replaceAll("Second|Third", "First");
replaceAll("second|third", "first");
So differences in case and spaces are still recognized.
Perhaps the rules can be defined as regular expressions plus replacements in the Settings.
I don't know the details of who uses getLevenshteinDistance() and if we can generally extend this function. So I can't say where is the best place for such a normalization.
comment:7 Changed 9 months ago by bastiK
Sounds good. Please prepare a patch as you see fit and we will check that it doesn't break anything.
comment:8 Changed 9 months ago by stoecker
Such rules should be loaded from a file in data dir when this is implemented.
comment:9 follow-up: ↓ 10 Changed 9 months ago by mdk
OK, I found some time to implement the function. But it was a little bit more complex than I thought.
First I have to distinguish between "regular expression rules" like
- "\d+" => "0"
- "\d+(st|nd|rd|th)" => "1st"
and "synonym rules" like
- "east" = "west" = "north" = "south"
- "first" = "second" = "third"
The actual implementation (see unit tests) will also handle typos as expected:
- similaryName("First Street", "Second Street") = false, because they are "identical"
- similaryName("First Street", "second Street") = true, because "first" = "second" and the distance between "Second Street" and "second Street" is 1.
- similaryName("Forst Street", "Second Street") = true, because "first" = "second" and the distance between "Forst Street" and "First Street" is 1.
- similaryName("First Street", "Soconds Street") = true, because "first" = "second" and the distance between "Second Street" and "Soconds Street" is 2.
- similaryName("First Street", "Soconds Stret") = false, because "First Street" = "Second Street" = "Third Street", but all of them has a distance of more than 2 compared with "Soconds Stret".
TODO: Parsing a configuration file with the two different rules types and adding the rules (see unit test initialization).
comment:10 in reply to: ↑ 9 Changed 9 months ago by bastiK
Replying to mdk:
OK, I found some time to implement the function. But it was a little bit more complex than I thought.
First I have to distinguish between "regular expression rules" like
- "\d+" => "0"
- "\d+(st|nd|rd|th)" => "1st"
and "synonym rules" like
- "east" = "west" = "north" = "south"
- "first" = "second" = "third"
The actual implementation (see unit tests) will also handle typos as expected:
- similaryName("First Street", "Second Street") = false, because they are "identical"
- similaryName("First Street", "second Street") = true, because "first" = "second" and the distance between "Second Street" and "second Street" is 1.
- similaryName("Forst Street", "Second Street") = true, because "first" = "second" and the distance between "Forst Street" and "First Street" is 1.
- similaryName("First Street", "Soconds Street") = true, because "first" = "second" and the distance between "Second Street" and "Soconds Street" is 2.
- similaryName("First Street", "Soconds Stret") = false, because "First Street" = "Second Street" = "Third Street", but all of them has a distance of more than 2 compared with "Soconds Stret".
TODO: Parsing a configuration file with the two different rules types and adding the rules (see unit test initialization).
Please provide at least a set of rules, otherwise this will be dead code. Ideally you should read it from a config file, but a hard-coded list would be better than nothing.
comment:11 Changed 9 months ago by mdk
Please provide at least a set of rules, otherwise this will be dead code. Ideally you should read it from a config file, but a hard-coded list would be better than nothing.
The rules in the unit test are real:
similarity_.addRegExprRule("\\d+", "0");
similarity_.addRegExprRule("\\d+(st|nd|rd|th)", "0st");
similarity_.addSynonyms("east", "west", "north", "south");
similarity_.addSynonyms("first", "second", "third");
I have no syntax for the configuration file, just the necessary calls to initialize. They cover all cases from comment:5.
comment:12 Changed 9 months ago by bastiK
- Summary changed from SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives to [PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives
comment:13 Changed 8 months ago by stoecker
- Summary changed from [PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives to [PATCH needs rework] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives
comment:14 Changed 8 months ago by stoecker
see also #7359
comment:15 Changed 6 weeks ago by stoecker
Hmm, this looks good, but is unusable in its current state. Can't you finish the patch? Simply use the default config handling, e.g. using an array setting or one of the other types and have a set of default rules?



Well, so what improvement do you suggest? Actually these are similar named ways.