Modify

Opened 11 years ago

Closed 6 years ago

#3733 closed defect (fixed)

[PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives

Reported by: avarab@… Owned by: team
Priority: normal Milestone: 14.12
Component: Core validator Version: latest
Keywords: similar name Cc: AM909, mdk

Description (last modified by skyper)

The SimilarNamedWays test just uses Levenshtein distance to determine if ways have a similar name. This is turning up a lot of false positives for the Iceland data (and presumably other locations). In Iceland it's common to have ways in the same suburb that share the same suffix or prefix. For example:

  • Fagraholt
  • Hafraholt
  • Hlíðarberg
  • Hlíðartorg
  • Hjallabraut
  • Hjallahraun
  • Nóatún
  • Sóltún
  • Austurvegur
  • Vesturvegur

Attachments (3)

patch.diff (6.6 KB) - added by mdk 8 years ago.
changes
SimilarNamesTest.java (2.8 KB) - added by mdk 8 years ago.
Unit tests
3733-brianegge.patch (13.5 KB) - added by brianegge 6 years ago.
Incorporates previous patches

Download all attachments as: .zip

Change History (27)

comment:1 Changed 11 years ago by stoecker

Well, so what improvement do you suggest? Actually these are similar named ways.

comment:2 in reply to:  1 Changed 11 years ago by avarab@…

Replying to stoecker:

Well, so what improvement do you suggest? Actually these are similar named ways.

These would probably help:

  • Use distance/abs(length(str1), length(str2)) instead of just distance.

Perhaps doing this would also mean that the test could be used for strings < 6 chars in length (instead of being skipped as it is now)

  • Compute how many parts the two strings have in common.

FOOafe and BARafe have the same distance as FlOrghafe BlaArghafe have the same distance (3) but the former have 1 string in common ("afe") but the latter 2 ("l", "rghafe")

I haven't tested either of those. I just wanted to note the issue.

comment:3 Changed 9 years ago by simon04

Ticket #5223 has been marked as a duplicate of this ticket.

comment:4 Changed 8 years ago by skyper

Ticket #7741 has been marked as a duplicate of this ticket.

comment:5 Changed 8 years ago by skyper

Cc: AM909 mdk added
Description: modified (diff)
Keywords: similar name added

from #7741 AM909:

In r5181:

The validator's "similar named ways" gives incorrect warnings in (at least) the following situations:

  • "East Foothill Drive" and "West Foothill Drive": Should not report when the diffs are just leading or trailing cardinal directions. Can be tested by a second pass at reported errors. Run both args through s/East |West |North |South | East| West| North| South//g and then diff again. If no diff, no error should be reported.
  • "5th Street" and "6th Street": Should not report when the diffs are only leading numbers. Similar to above, remove all leading numbers (1st |2nd |First |Second |etc.) and diff again. No diff = no report.

In the bounding box (34.0600548, -117.7425385, 34.1251634, -117.6477814) (BLTR), at least 50 of the 56 reported "similar named ways" fall into one of these two (invalid) scenarios.

mdk:

The same problem about railway platform names like "Gleis 1" and "Gleis 2".

EDIT: fix. @AM909: sorry

Last edited 8 years ago by skyper (previous) (diff)

comment:6 Changed 8 years ago by mdk

I played a little bit around with the parameters in SimilarNamedWays.getLevenshteinDistance(String s, String t).

The idea is to "normalize" both strings before calculating the distance. For my part ("Gleis 1" and "Gleis 2") the is a simple rule:

// step 0 normalize digits
s = s.replaceAll("\\d+", "0");
t = t.replaceAll("\\d+", "0");

This will also solve "Gleis 5" == "Gleis 12" and "track 2 - 4" == "track 10 - 11", but still "gleis 3" != "Gleis 3".

Other "normalization" rules could be:

// East = West = North = South
replaceAll("East|West|North", "South");
replaceAll("east|west|north", "south");

// 1st = 2nd = 3rd = 4th = ...
replaceAll("\\d+st|\\d+nd|\\d+rd|\\d+th", "0th");

// First = Second = Third
replaceAll("Second|Third", "First");
replaceAll("second|third", "first");

So differences in case and spaces are still recognized.

Perhaps the rules can be defined as regular expressions plus replacements in the Settings.

I don't know the details of who uses getLevenshteinDistance() and if we can generally extend this function. So I can't say where is the best place for such a normalization.

comment:7 Changed 8 years ago by bastiK

Sounds good. Please prepare a patch as you see fit and we will check that it doesn't break anything.

comment:8 Changed 8 years ago by stoecker

Such rules should be loaded from a file in data dir when this is implemented.

comment:9 Changed 8 years ago by mdk

OK, I found some time to implement the function. But it was a little bit more complex than I thought.

First I have to distinguish between "regular expression rules" like

  • "\d+" => "0"
  • "\d+(st|nd|rd|th)" => "1st"

and "synonym rules" like

  • "east" = "west" = "north" = "south"
  • "first" = "second" = "third"

The actual implementation (see unit tests) will also handle typos as expected:

  • similaryName("First Street", "Second Street") = false, because they are "identical"
  • similaryName("First Street", "second Street") = true, because "first" = "second" and the distance between "Second Street" and "second Street" is 1.
  • similaryName("Forst Street", "Second Street") = true, because "first" = "second" and the distance between "Forst Street" and "First Street" is 1.
  • similaryName("First Street", "Soconds Street") = true, because "first" = "second" and the distance between "Second Street" and "Soconds Street" is 2.
  • similaryName("First Street", "Soconds Stret") = false, because "First Street" = "Second Street" = "Third Street", but all of them has a distance of more than 2 compared with "Soconds Stret".

TODO: Parsing a configuration file with the two different rules types and adding the rules (see unit test initialization).

Changed 8 years ago by mdk

Attachment: patch.diff added

changes

Changed 8 years ago by mdk

Attachment: SimilarNamesTest.java added

Unit tests

comment:10 in reply to:  9 Changed 8 years ago by bastiK

Replying to mdk:

OK, I found some time to implement the function. But it was a little bit more complex than I thought.

First I have to distinguish between "regular expression rules" like

  • "\d+" => "0"
  • "\d+(st|nd|rd|th)" => "1st"

and "synonym rules" like

  • "east" = "west" = "north" = "south"
  • "first" = "second" = "third"

The actual implementation (see unit tests) will also handle typos as expected:

  • similaryName("First Street", "Second Street") = false, because they are "identical"
  • similaryName("First Street", "second Street") = true, because "first" = "second" and the distance between "Second Street" and "second Street" is 1.
  • similaryName("Forst Street", "Second Street") = true, because "first" = "second" and the distance between "Forst Street" and "First Street" is 1.
  • similaryName("First Street", "Soconds Street") = true, because "first" = "second" and the distance between "Second Street" and "Soconds Street" is 2.
  • similaryName("First Street", "Soconds Stret") = false, because "First Street" = "Second Street" = "Third Street", but all of them has a distance of more than 2 compared with "Soconds Stret".

TODO: Parsing a configuration file with the two different rules types and adding the rules (see unit test initialization).

Please provide at least a set of rules, otherwise this will be dead code. Ideally you should read it from a config file, but a hard-coded list would be better than nothing.

comment:11 Changed 8 years ago by mdk

Please provide at least a set of rules, otherwise this will be dead code. Ideally you should read it from a config file, but a hard-coded list would be better than nothing.

The rules in the unit test are real:

similarity_.addRegExprRule("\\d+", "0");
similarity_.addRegExprRule("\\d+(st|nd|rd|th)", "0st");
similarity_.addSynonyms("east", "west", "north", "south");
similarity_.addSynonyms("first", "second", "third");

I have no syntax for the configuration file, just the necessary calls to initialize. They cover all cases from comment:5.

comment:12 Changed 8 years ago by bastiK

Summary: SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives[PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives

comment:13 Changed 8 years ago by stoecker

Summary: [PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives[PATCH needs rework] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives

comment:14 Changed 8 years ago by stoecker

see also #7359

comment:15 Changed 7 years ago by stoecker

Hmm, this looks good, but is unusable in its current state. Can't you finish the patch? Simply use the default config handling, e.g. using an array setting or one of the other types and have a set of default rules?

comment:16 Changed 7 years ago by mdk

Sorry for the late answer, but I was very busy.

I have no clue how to use "default config handling". For the first step it would be ok to hard code the rules from comment:11 (see also SimilarNamesTest.java).

For default config handling you must be able to define a list of string pairs (for regular expressions and replacement) and a list of string lists (for synonyms). The rules from comment:11 should be the default rules.

comment:17 Changed 7 years ago by stoecker

For Prefs see doc/org/openstreetmap/josm/data/Preferences.html.

Mainly instead of hardcoded stuff call Main.pref.getWhatever(name, default values) and then josm cares for storing and changing these values in advanced config. When really important a prefs GUI panel can be created later on.

Adding sensible default checks is also required, like e.g. the ones you have in test class.

If the are MANY checks instead of the storage in prefs, you could use a data file, like for tagchecker.

comment:18 Changed 6 years ago by brianegge

Hello,

What's the status of this patch? Is it not being integrated simply because the values are hard-coded? It would be better to have this in place and then refactor it into a config file if needed.

Otherwise, it would probably be better to disable the similar names test until such a patch is integrated, as the test generates many more false positives than it does actual errors.

comment:19 Changed 6 years ago by stoecker

It is still only code without any consequences. Neither are there defaults nor a config to define values. It seems joining the hardcoded test and the checking code may be a first step (using a prefs array for data storage).

Changed 6 years ago by brianegge

Attachment: 3733-brianegge.patch added

Incorporates previous patches

comment:20 Changed 6 years ago by brianegge

I've attached a new patch which contains the above, with the rules moved into the validator. Additionally, it contains an extra test class which tests the complete rule.

comment:21 Changed 6 years ago by mdk

Is there any reason not to integrate this patch now?

comment:22 Changed 6 years ago by bastiK

Summary: [PATCH needs rework] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives[PATCH] SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives

I think it got overlooked. Now it has to wait until after next release.

comment:23 Changed 6 years ago by stoecker

Milestone: 14.12

comment:24 Changed 6 years ago by Don-vip

Resolution: fixed
Status: newclosed

In 7848/josm:

fix #3733 - SimilarNamedWays naïvely uses Levenshtein distance and marks a lot of false positives (patch by mdk, brianegge, modified). Rules still need to be stored in JOSM preferences instead of current hardcoding.

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain team.
as The resolution will be set.
The resolution will be deleted.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.