Opened 17 years ago
Last modified 11 years ago
#3733 closed defect
SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives — at Version 5
| Reported by: | | Owned by: | team |
|---|---|---|---|
| Priority: | normal | Milestone: | 14.12 |
| Component: | Core validator | Version: | latest |
| Keywords: | similar name | Cc: | AM909, mdk |
Description (last modified by )
The SimilarNamedWays test just uses Levenshtein distance to determine if ways have a similar name. This is turning up a lot of false positives for the Iceland data (and presumably other locations). In Iceland it's common to have ways in the same suburb that share the same suffix or prefix. For example:
- Fagraholt
- Hafraholt
- Hlíðarberg
- Hlíðartorg
- Hjallabraut
- Hjallahraun
- Nóatún
- Sóltún
- Austurvegur
- Vesturvegur
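To illustrate why these pairs are flagged, here is a minimal Python sketch of plain Levenshtein distance. It is not JOSM's actual Java code; the function name `levenshtein` and the `pairs` list are only for illustration. Each pair listed above differs by just two substitutions, so any test based purely on a small edit distance will report them.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# The street-name pairs from the description above.
pairs = [
    ("Fagraholt", "Hafraholt"),
    ("Hlíðarberg", "Hlíðartorg"),
    ("Hjallabraut", "Hjallahraun"),
    ("Nóatún", "Sóltún"),
    ("Austurvegur", "Vesturvegur"),
]

for a, b in pairs:
    print(f"{a} / {b}: {levenshtein(a, b)}")  # every pair has distance 2
```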
Change History (5)
follow-up: 2 comment:1 by , 17 years ago
comment:2 by , 17 years ago
Replying to stoecker:
Well, so what improvement do you suggest? Actually these are similar named ways.
These would probably help:
- Use distance / max(length(str1), length(str2)) instead of just distance.
Perhaps doing this would also mean that the test could be used for strings < 6 chars in length (instead of being skipped as it is now).
- Compute how many parts the two strings have in common.
FOOafe and BARafe have the same distance (3) as FlOrghafe and BlaArghafe, but the former pair has only 1 string in common ("afe") while the latter has 2 ("l", "rghafe").
I haven't tested either of those. I just wanted to note the issue.
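Both suggestions can be sketched in Python as follows. This is untested against JOSM and rests on assumptions: the length normalisation divides by the longer string's length (one reading of the first suggestion), and the "parts in common" are approximated with the matching blocks from the standard library's `difflib.SequenceMatcher`. The names `normalized_distance` and `common_parts` are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance relative to the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def common_parts(a: str, b: str) -> list[str]:
    """Substrings shared by a and b, in order, as found by SequenceMatcher."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # The final matching block is a zero-length sentinel, so skip size 0.
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]


for a, b in [("FOOafe", "BARafe"), ("FlOrghafe", "BlaArghafe")]:
    print(a, b, levenshtein(a, b), normalized_distance(a, b), common_parts(a, b))
```

With the example strings from the comment, both pairs have distance 3, but the normalised value drops from 0.5 to 0.3 for the longer pair, and the number of shared parts goes from one to two. How these values would be combined into a threshold is left open here.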
comment:5 by , 14 years ago
| Cc: | added |
|---|---|
| Description: | modified (diff) |
| Keywords: | similar name added |
from #7741 AM909:
In r5181:
The validator's "similar named ways" gives incorrect warnings in (at least) the following situations:
- "East Foothill Drive" and "West Foothill Drive": Should not report when the diffs are just leading or trailing cardinal directions. Can be tested by a second pass at reported errors. Run both args through `s/East |West |North |South | East| West| North| South//g` and then diff again. If no diff, no error should be reported.
- "5th Street" and "6th Street": Should not report when the diffs are only leading numbers. Similar to above, remove all leading numbers (`1st |2nd |First |Second |etc.`) and diff again. No diff = no report.
In the bounding box (34.0600548, -117.7425385, 34.1251634, -117.6477814) (BLTR), at least 50 of the 56 reported "similar named ways" fall into one of these two (invalid) scenarios.
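The proposed second pass could look roughly like the Python sketch below. It is an illustration only, not the validator's code: the helper names `normalize` and `differs_only_by_direction_or_number` are hypothetical, the direction pattern is anchored to the start or end of the name (slightly stricter than the global substitution quoted above), and the ordinal word list is a short assumed sample standing in for the "etc." in the suggestion.

```python
import re

DIRECTIONS = r"(?:East|West|North|South)"

# A cardinal direction at the very start or very end of the name.
DIRECTION_RE = re.compile(rf"^{DIRECTIONS}\s+|\s+{DIRECTIONS}$", re.IGNORECASE)

# A leading ordinal: digits plus suffix ("5th") or a spelled-out word.
# The word list is an assumed sample, not a complete set.
ORDINAL_RE = re.compile(
    r"^(?:\d+(?:st|nd|rd|th)|First|Second|Third)\s+", re.IGNORECASE
)


def normalize(name: str) -> str:
    """Strip leading/trailing directions, then a leading ordinal."""
    name = DIRECTION_RE.sub("", name.strip())
    return ORDINAL_RE.sub("", name)


def differs_only_by_direction_or_number(a: str, b: str) -> bool:
    """True if two distinct names become identical after normalisation."""
    return a != b and normalize(a) == normalize(b)


# Second pass over pairs already reported as "similar": keep only real hits.
reported = [
    ("East Foothill Drive", "West Foothill Drive"),
    ("5th Street", "6th Street"),
    ("Foothill Drive", "Foothill Drove"),
]
remaining = [p for p in reported if not differs_only_by_direction_or_number(*p)]
print(remaining)
```

Only the last pair survives the filter, since the first two differ solely in a direction or a leading number.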
mdk:
The same problem occurs with railway platform names such as "Gleis 1" and "Gleis 2".
EDIT: fix. @AM909: sorry


