Opened 17 years ago

Last modified 11 years ago

#3733 closed defect

SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives — at Version 5

Reported by: avarab@… Owned by: team
Priority: normal Milestone: 14.12
Component: Core validator Version: latest
Keywords: similar name Cc: AM909, mdk

Description (last modified by skyper)

The SimilarNamedWays test just uses Levenshtein distance to determine if ways have a similar name. This is turning up a lot of false positives for the Iceland data (and presumably other locations). In Iceland it's common to have ways in the same suburb that share the same suffix or prefix. For example:

  • Fagraholt
  • Hafraholt
  • Hlíðarberg
  • Hlíðartorg
  • Hjallabraut
  • Hjallahraun
  • Nóatún
  • Sóltún
  • Austurvegur
  • Vesturvegur

Change History (5)

comment:1 by stoecker, 17 years ago

Well, so what improvement do you suggest? Actually these are similar named ways.

in reply to:  1 comment:2 by avarab@…, 17 years ago

Replying to stoecker:

Well, so what improvement do you suggest? Actually these are similar named ways.

These would probably help:

  • Use distance/abs(length(str1), length(str2)) instead of just distance.

Perhaps doing this would also mean that the test could be used for strings < 6 chars in length (instead of being skipped as it is now).

  • Compute how many parts the two strings have in common.

FOOafe and BARafe have the same distance (3) as FlOrghafe and BlaArghafe, but the former pair has 1 substring in common ("afe") while the latter has 2 ("l", "rghafe").

I haven't tested either of those. I just wanted to note the issue.
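Both suggestions above could be sketched roughly as follows. This is only an illustration in Python, not the validator's actual code. Since abs() with two arguments is ambiguous, the sketch assumes the intended divisor is the length of the longer string. The function names are made up for this example, and the "common parts" count uses difflib's matching blocks as a stand-in for whatever the final test would use.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed reading of the ticket)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def common_parts(a: str, b: str) -> list[str]:
    """Non-empty substrings that difflib aligns between the two names, in order."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]
```

With the names from this comment, levenshtein() gives 3 for both pairs, while normalized_distance("FOOafe", "BARafe") is 0.5 and normalized_distance("FlOrghafe", "BlaArghafe") is 0.3, so the normalized value separates the two cases where the raw distance does not. common_parts() returns ["afe"] for the first pair and ["l", "rghafe"] for the second.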

comment:3 by simon04, 15 years ago

Ticket #5223 has been marked as a duplicate of this ticket.

comment:4 by skyper, 14 years ago

Ticket #7741 has been marked as a duplicate of this ticket.

comment:5 by skyper, 14 years ago

Cc: AM909 mdk added
Description: modified (diff)
Keywords: similar name added

from #7741 AM909:

In r5181:

The validator's "similar named ways" gives incorrect warnings in (at least) the following situations:

  • "East Foothill Drive" and "West Foothill Drive": Should not report when the diffs are just leading or trailing cardinal directions. Can be tested by a second pass at reported errors. Run both args through s/East |West |North |South | East| West| North| South//g and then diff again. If no diff, no error should be reported.
  • "5th Street" and "6th Street": Should not report when the diffs are only leading numbers. Similar to above, remove all leading numbers (1st |2nd |First |Second |etc.) and diff again. No diff = no report.
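The second pass described in the two items above might look something like this. It is a sketch in Python rather than validator code. The function names and the short list of spelled-out ordinals are assumptions for illustration. Unlike the s///g expression, the patterns here are anchored, so only leading or trailing directions are removed.

```python
import re

# Leading or trailing cardinal directions (anchored, unlike the ticket's s///g).
DIRECTION = re.compile(r"^(?:East|West|North|South) | (?:East|West|North|South)$")
# Leading ordinals: digit forms such as "5th" plus a few spelled-out ones.
# The spelled-out list is illustrative only and far from complete.
ORDINAL = re.compile(r"^(?:\d+(?:st|nd|rd|th)|First|Second|Third|Fourth|Fifth) ")


def strip_variable_parts(name: str) -> str:
    """Remove leading/trailing directions, then a leading ordinal."""
    name = DIRECTION.sub("", name)
    return ORDINAL.sub("", name)


def is_false_positive(a: str, b: str) -> bool:
    """True when two reported names are identical once the variable parts are gone."""
    return strip_variable_parts(a) == strip_variable_parts(b)
```

Pairs for which is_false_positive() returns True, such as "East Foothill Drive" / "West Foothill Drive" or "5th Street" / "6th Street", would be dropped from the report. A pair like "Main Street" / "Maine Street" would still be reported.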

In the bounding box (34.0600548, -117.7425385, 34.1251634, -117.6477814) (BLTR), at least 50 of the 56 reported "similar named ways" fall into one of these two (invalid) scenarios.

mdk:

The same problem occurs with railway platform names like "Gleis 1" and "Gleis 2".
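The platform case could be handled by a similar check on trailing numbers. This is a small Python sketch with a hypothetical helper name. It assumes that two names differing only in a trailing number should never be reported.

```python
import re

# Optional whitespace followed by digits at the end of the name.
TRAILING_NUMBER = re.compile(r"\s*\d+$")


def differs_only_by_trailing_number(a: str, b: str) -> bool:
    """True when both names end in a number and are equal once it is removed."""
    base_a, removed_a = TRAILING_NUMBER.subn("", a)
    base_b, removed_b = TRAILING_NUMBER.subn("", b)
    return removed_a == 1 and removed_b == 1 and base_a == base_b
```

Requiring a number on both sides keeps a pair such as "Gleis 1" / "Gleis" reportable, since there the difference is a missing number rather than a different one.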

EDIT: fix. @AM909: sorry

Last edited 14 years ago by skyper