Modify

Opened 7 weeks ago

Last modified 13 hours ago

#17055 reopened enhancement

[Patch] Validator should complain about simple typos in tag values

Reported by: GerdP Owned by: team
Priority: normal Milestone: 19.01
Component: Core validator Version:
Keywords: template_report Cc:

Description

What steps will reproduce the problem?

  1. Create way with typo like highway=residental (missing 2nd i)
  2. Run validator

What is the expected result?

Warning Misspelled property value - Value 'residental' for key 'highway' looks like 'residential'. (1)

What happens instead?

Only an informational message is produced:
Presets do not contain property value - Value 'residental' for key 'highway' not in presets. (1)

Please provide any additional information below. Attach a screenshot if possible.

See also https://josm.openstreetmap.de/ticket/15203#comment:31
The attached patch uses Utils.getLevenshteinDistance() to detect typical typos and propose a fix.
I found one possible problem so far:
When you type highway=servics it suggests highway=services

Build-Date:2018-12-01 17:14:34
Revision:14478
Is-Local-Build:true

Identification: JOSM/1.5 (14478 SVN en) Windows 10 64-Bit
OS Build number: Windows 10 Home 1803 (17134)
Memory Usage: 1437 MB / 3641 MB (388 MB allocated, but free)
Java version: 1.8.0_191-b12, Oracle Corporation, Java HotSpot(TM) 64-Bit Server VM
Screen: \Display0 1920x1080
Maximum Screen Size: 1920x1080
VM arguments: [-agentlib:jdwp=transport=dt_socket,suspend=y,address=localhost:51821, -ea, -Dfile.encoding=UTF-8]
Dataset consistency test: No problems found

Plugins:
+ OpeningHoursEditor (34535)
+ apache-commons (34506)
+ buildings_tools (34724)
+ download_along (34503)
+ ejml (34389)
+ geotools (34513)
+ jaxb (34506)
+ jts (34524)
+ merge-overlap (34664)
+ o5m (34405)
+ opendata (34698)
+ pbf (34576)
+ poly (34546)
+ reverter (34552)
+ undelete (34568)
+ utilsplugin2 (34506)

Last errors/warnings:
- W: Update plugins - org.openstreetmap.josm.plugins.PluginHandler$UpdatePluginsMessagePanel[,0,0,0x0,invalid,layout=java.awt.GridBagLayout,alignmentX=0.0,alignmentY=0.0,border=,flags=9,maximumSize=,minimumSize=,preferredSize=]
- W: No configuration settings found.  Using hardcoded default values for all pools.
- W: Failed to locate resource '/README'.
- W: Failed to locate resource '/CONTRIBUTION'.
- W: Failed to locate resource '/LICENSE'.

Attachments (8)

17055-v1.patch (1.5 KB) - added by GerdP 7 weeks ago.
filter_words.patch (2.0 KB) - added by GerdP 7 weeks ago.
small hack to create a reduced version of words.cfg
shrink-words.cfg.patch (13.2 KB) - added by GerdP 7 weeks ago.
patch to reduce and update words.cfg (some values added, many ignored values removed)
17055-v2.patch (20.4 KB) - added by GerdP 7 weeks ago.
17055-enh-v1.patch (12.9 KB) - added by GerdP 7 weeks ago.
17055-enh-v2.patch (14.2 KB) - added by GerdP 7 weeks ago.
previous version contained a bug
17055-use-ignore.patch (14.7 KB) - added by GerdP 4 weeks ago.
use values with prefix K: for checks with Levenshtein distance, too
frequent-tags.zip (168.2 KB) - added by GerdP 4 days ago.

Download all attachments as: .zip

Change History (57)

Changed 7 weeks ago by GerdP

Attachment: 17055-v1.patch added

comment:1 Changed 7 weeks ago by Don-vip

If we go this way we should clean the words.cfg file (TagChecker.SPELL_FILE) and make sure there is no regression.

Last edited 7 weeks ago by Don-vip (previous) (diff)

comment:2 Changed 7 weeks ago by GerdP

Sounds good. I'll have a closer look tomorrow.

comment:3 in reply to:  1 Changed 7 weeks ago by stoecker

Replying to Don-vip:

If we go this way we should clean the words.cfg file (TagChecker.SPELL_FILE) and make sure there is no regression.

Remove it? That has not be maintained for ages. I think this is from the time before we had real presets. The approach of this tickets sounds more like a modern approach comparing the input to valid data.

comment:4 Changed 7 weeks ago by Don-vip

Removing it is the ultimate cleaning :) We just need to check we don't loose functionality (by sampling values).

comment:5 Changed 7 weeks ago by GerdP

My understanding is that words.cfg is only used for tags with misspelled keys, not for wrong tag values with known keys.
Try with e.g. highway=residentail

Last edited 7 weeks ago by GerdP (previous) (diff)

Changed 7 weeks ago by GerdP

Attachment: filter_words.patch added

small hack to create a reduced version of words.cfg

Changed 7 weeks ago by GerdP

Attachment: shrink-words.cfg.patch added

patch to reduce and update words.cfg (some values added, many ignored values removed)

Changed 7 weeks ago by GerdP

Attachment: 17055-v2.patch added

comment:6 Changed 7 weeks ago by GerdP

17055-v2.patch improves:

  • reduced words.cfg
  • reduced memory footprint because harmonizedKeys no longer contains pairs where key and value are equal
  • if more than one alternative for a tag value is found, all are listed and auto fix is disabled
  • updated unit tests

My my gut feeling is that Utils.getLevenshteinDistance() will not work for the keys, but I'll try to code it later.

comment:7 Changed 7 weeks ago by GerdP

As expected we will get many false positives when we simply use all tag keys in the presets in combination with getLevenshteinDistance():

Key 'draft' looks like 'craft'. (3)
Key 'link' looks like 'line'. (1)
Key 'ref_name' looks like 'reg_name'. (59)
Key 'sat' looks like 'salt'. (49)
Key 'type2' looks like 'type'. (1)

Also performance is poor compared to the current code because we have to calculate the Levenshtein distance ~600 times for each key of each primitive when we use all keys from the presets.
I see possibilities:
1) Don't use getLevenshteinDistance() with keys
2) Specify a list of keys for which the getLevenshteinDistance() can be used. That could be a simple list or maybe an attribute in the presets xml.

I think I still prefer the solution implemented with 17055-v2.patch

comment:8 Changed 7 weeks ago by stoecker

Let's try it (your suggestion) ...

If bad we'll get reports soon 🤗

Last edited 7 weeks ago by stoecker (previous) (diff)

comment:9 Changed 7 weeks ago by GerdP

Resolution: fixed
Status: newclosed

In 14490/josm:

fix #17055 17055-v2.patch with one empty line removed

comment:10 Changed 7 weeks ago by Klumbumbus

How to disable clear false positives like Key 'ref_name' looks like 'reg_name'. (59)? Simply add ref_name as correctly spelled key to words.cfg?

comment:11 Changed 7 weeks ago by GerdP

These wrong positives were reported when I tried to ignore words.cfg (did not post a patch for that). You should not see them with r14490.
On the other hand I just noticed that typo highway=residentail is still not flagged :-(

Changed 7 weeks ago by GerdP

Attachment: 17055-enh-v1.patch added

Changed 7 weeks ago by GerdP

Attachment: 17055-enh-v2.patch added

previous version contained a bug

comment:12 Changed 7 weeks ago by GerdP

Resolution: fixed
Status: closedreopened

With 17055-enh-v2.patch I tried to improve TagChecker so that it also suggests a fix when the harmonized tag value looks like a truncated value listed in the presets.
Examples:
highway=u -> highway=unclassified

Problem: Many values are ambiguous, e.g.
highway=res could be residential or rest_area
highway=ser could be serivce or services
highway=tr could be track, trunk, or trunk_link
In those cases a warning with "Ambiguous truncated property value" is produced.
My goal here is not to find the most likely value but to produce a warning instead of information.
What do you think about that?

There is still a problem with the getLevenshteinDistance(). I see possibly wrong warnings like
Value '125' for key 'fire_hydrant:diameter' looks like one of [100, 150, 250].
Value 'none' for key 'sidewalk' looks like 'no'.

It is probably not a good idea to use that for short values.

comment:13 in reply to:  12 Changed 7 weeks ago by Klumbumbus

Replying to GerdP:

There is still a problem with the getLevenshteinDistance(). I see possibly wrong warnings like
Value '125' for key 'fire_hydrant:diameter' looks like one of [100, 150, 250].

fire_hydrant:diameter=125 is in the presets. The getLevenshteinDistance() should only be tested on tags which are not in the presets.

Value 'none' for key 'sidewalk' looks like 'no'.

none is not "officially deprecated", but no is the preferred value, so this warning would be ok in my opinion.

comment:14 Changed 7 weeks ago by GerdP

125 is not in the defaultpresets.xml. See

            <combo key="fire_hydrant:diameter" text="Diameter (in mm)" values="50,80,100,150,200,250,300,400" />

Taginfo says that sidewalk=no and sidewalk=none are nearly equally used. Maybe we can have both in the presets?

comment:15 in reply to:  14 Changed 7 weeks ago by Klumbumbus

Replying to GerdP:

125 is not in the defaultpresets.xml.

Right, sorry, I mixed it up with 150.

Taginfo says that sidewalk=no and sidewalk=none are nearly equally used. Maybe we can have both in the presets?

Looking at the history of the sidewalk wiki page the preferred tag changed several times over the years. We should have only one in our preset and stay with the current no.

comment:16 Changed 7 weeks ago by GerdP

In 14508/josm:

see #17055 improve TagChecker

  • allow Levenshtein distance of 2 so that residentail is found as typo
  • disable Fix for typos where the correct value is not known, e.g. services can be service or services
  • don't suggest values if all preset values are short, e.g.

layer=6 should not say "6 looks like one of [1,2,3,4,5]"

comment:17 Changed 7 weeks ago by GerdP

I didn't commit the changes reg. truncated values, they produced more noise than helpful warnings.
What I'd really like to have would be a flag that says if the list of preset values is quite complete and therefore other values are probably wrong.
This would allow to produce a warning when the value is not in the presets.
I simply don't like that we only produce an information for well known tag keys like highway, e.g. highway=dirt, highway=footpath or highway=u when we do all kinds of complex checks for highways like "Way end node near other highway".

An alternative woulld be to write a MapCSS check which lists all the values that appear in the presets, but I think that would mean a duplication of the lists.
I think a MapCSS rule is needed for something like highway=driveway, where the validator should suggest to replace it with
highway=service + service=driveway?

comment:18 Changed 7 weeks ago by naoliv

I am unsure if I am the only person who sometimes press Enter too quickly (before auto-completing a value with tab), but I usually have wrong surface=a or surface=u values in my data set.

For surface=u it's suggesting mud, while it most probably should be unpaved
Is it possible to have surface=asurface=asphalt too?

comment:19 Changed 7 weeks ago by GerdP

I coded something like that but was not satisfied with the results.
Can you explain why you have to press tab for autocompletion? I don't need to do that.

comment:20 Changed 7 weeks ago by naoliv

Next time I promise I will test exactly what I do before writing :-)
Indeed, I don't use tab but somehow I end up with surface=a and surface=u objects.
(and I will create some local rules to catch this an error)

Rectifying my previous message then, probably surface=u shouldn't be understood as surface=mud (nor offered an auto-fix for this specific case)

comment:21 Changed 7 weeks ago by GerdP

OK, I think JOSM should not suggest to fix a value with only one character with one that has three.
Reg. surface=a : You are right, it is probably always asphalt, but presets also contain artificial_turf, so JOSM would only show a list of these possible candidates, but it would not enable the Fix button. That's why I wasn't satisfied with my approach.
See also comment:12.

comment:22 Changed 6 weeks ago by GerdP

In 14517/josm:

see #17055 improve TagChecker

  • don't suggest fix value when given value is short and Levenshtein distance is 2

surface=u should not be fixed with surface=mud but highway=tra can be fixed with highway=track

comment:23 Changed 6 weeks ago by naoliv

A way with sidewalk=none is giving a warning saying that none seems to be no, but https://wiki.openstreetmap.org/wiki/Sidewalks#Sidewalk_as_refinement_to_a_highway says that "none is also used synonymously to no"

Probably there shouldn't be a warning for this case too.

comment:24 in reply to:  23 Changed 6 weeks ago by GerdP

Replying to naoliv:

A way with sidewalk=none is giving a warning saying that none seems to be no, but https://wiki.openstreetmap.org/wiki/Sidewalks#Sidewalk_as_refinement_to_a_highway says that "none is also used synonymously to no"

Probably there shouldn't be a warning for this case too.

I also noticed this, see comment:13 to comment:15. I also don't like the warning in this case, but I am unsure how to handle it.
I want to see the warning when one types e.g. oneway=none so one possible solution would be to add a line

K:sidewalk:no

to the end of file ignoretags.cfg.

comment:25 Changed 6 weeks ago by GerdP

In 14531/josm:

see #17055 improve TagChecker: Don't say "looks like" when all characters of given value are different, e.g. "Value '10' for key 'fee' looks like 'no'." makes no sense. The Levenshtein distance from 10 to no is only 2, but the length is also only 2.

comment:26 Changed 5 weeks ago by naoliv

shop=gas is warning with Value 'gas' for key 'shop' looks like one of [bag, car, yes].

Gerd, will this warning magically disappear if a preset is created for shop=gas?

comment:27 Changed 5 weeks ago by GerdP

Yes, I think so. Still, this should also be handled in the code. Looking at it...

comment:28 Changed 5 weeks ago by GerdP

In 14571/josm:

see #17055 When harmonized value is only 3 characters reduce allowed Levenshtein distance from 2 to 1.

When shop=gas is not in presets, we don't want a message like "Value 'gas' for key 'shop' looks like one of [bag, car, yes]."
just because one character of those values matches.

comment:29 Changed 5 weeks ago by GerdP

I wonder if we should flag all tag values with just one character (not a digit), so that we produce a warning instead of information for
e.g. highway=u or surface=a, but only for known preset keys.
We have a similar test that says "uncommon short key" for short unknown keys, e.g. for s=asphalt

comment:30 Changed 4 weeks ago by naoliv

shop=pasta is giving a warning saying that it looks like pastry, but they are different:

https://wiki.openstreetmap.org/wiki/Tag:shop%3Dpasta
https://wiki.openstreetmap.org/wiki/Tag:shop%3Dpastry

shop=party is giving a warning saying that it looks like [art, pastry] while it also shouldn't

https://wiki.openstreetmap.org/wiki/Tag:shop%3Dparty

comment:31 Changed 4 weeks ago by GerdP

I fear you will always find more here because the presets will probably never cover all cases.
Maybe we could create a list of "good" values for which we don't have a preset using taginfo? I think we already have a script TagInfoExtract.groovy to find possibly missing presets, but it seems the result is not stored anywhere.

comment:32 Changed 4 weeks ago by naoliv

The only thing that I fear is people pressing Fix without actually taking a look at the message and at the value (in this example, changing pasta to pastry)
Maybe it's safer to disable the autofix and only issue a warning?
Or, I don't know if it's possible, to have a fix option only for cases where there is no doubt at all.

comment:33 Changed 4 weeks ago by GerdP

OK, I'll remove the autofix option for now.

comment:34 Changed 4 weeks ago by GerdP

In 14585/josm:

see #17055 Disable autofix for possibly misspelled tag values

Some keys like shop have a long list of documented values and not all are in the presets. It can happen that a good value is flagged and a fix and an improper fix value is suggested.
This should be reverted once we have a better source for "correct" values. The presets are too incomplete.

comment:35 Changed 4 weeks ago by GerdP

Maybe I should also change the message text to something like
"Value '...' for key '...' is not in presets, did you mean '...'?"

comment:36 in reply to:  31 ; Changed 4 weeks ago by Klumbumbus

Replying to GerdP:

I fear you will always find more here because the presets will probably never cover all cases.
Maybe we could create a list of "good" values for which we don't have a preset using taginfo? I think we already have a script TagInfoExtract.groovy to find possibly missing presets, but it seems the result is not stored anywhere.

"Good" values which don't have a preset yet (e.g. due to low usage numbers like shop=pasta) can be added to source:/josm/trunk/data/validator/ignoretags.cfg Does your misspelled values test consider this list?

I think TagInfoExtract.groovy is used only for the integration test. (Currently tags without wiki page are ignored too, but this constraint may be removed after adding some tags to the presets and feeding the ignore list a bit, see #16900.)

comment:37 in reply to:  36 ; Changed 4 weeks ago by GerdP

Replying to Klumbumbus:

"Good" values which don't have a preset yet (e.g. due to low usage numbers like shop=pasta) can be added to source:/josm/trunk/data/validator/ignoretags.cfg Does your misspelled values test consider this list?

Well, it is not executed for the values in this list. Might not be the best way to consider the list.

I think TagInfoExtract.groovy is used only for the integration test. (Currently tags without wiki page are ignored too, but this constraint may be removed after adding some tags to the presets and feeding the ignore list a bit, see #16900.)

The list doesn't contain the examples shop=party or shop=pastry, so we'd probably need a longer list.

comment:38 in reply to:  37 Changed 4 weeks ago by Klumbumbus

Replying to GerdP:

Replying to Klumbumbus:

"Good" values which don't have a preset yet (e.g. due to low usage numbers like shop=pasta) can be added to source:/josm/trunk/data/validator/ignoretags.cfg Does your misspelled values test consider this list?

Well, it is not executed for the values in this list. Might not be the best way to consider the list.

Yes, with "consider" I meant that the test is not executed for the values in this list.

I think TagInfoExtract.groovy is used only for the integration test. (Currently tags without wiki page are ignored too, but this constraint may be removed after adding some tags to the presets and feeding the ignore list a bit, see #16900.)

The list doesn't contain the examples shop=party or shop=pastry, so we'd probably need a longer list.

Yes these two (and probably some more we don't know atm) should be added to the list to remove the false positive warnings.

comment:39 in reply to:  35 Changed 4 weeks ago by Klumbumbus

Replying to GerdP:

Maybe I should also change the message text to something like
"Value '...' for key '...' is not in presets, did you mean '...'?"

That sounds better. (Maybe ...in the presets...)

comment:40 Changed 4 weeks ago by GerdP

OK, will change it. Reg. the list I think about a different approach:
We could generate a list of all frequently used tags and run them through TagChecker. All false positives should be added to
a special list which would than be used to suggest correct spelling. The test is not so much about correct tagging, so I would not care if e.g. sidewalk=none or sidewalk=no is in the presets as long as both values are often used.
How does that sound?

comment:41 Changed 4 weeks ago by Klumbumbus

Sounds good, however it might be hard to find an proper automatic rule. How to define "often used"?

comment:42 Changed 4 weeks ago by GerdP

Probably there is no simple answer. As a first approach I'd say all tags used > 1000 times are often used. Let me code something to process such a list, I probably cannot help filling it.

Thought about the warning text again. The "you" is not so good when someone else created the tag.
Next try:

single value in {2} :   "Value ''{0}'' for key ''{1}'' is not in the presets, maybe ''{2}'' is meant?"
multiple values in {2}: "Value ''{0}'' for key ''{1}'' is not in the presets, maybe one of {2} is meant?"
Last edited 4 weeks ago by GerdP (previous) (diff)

Changed 4 weeks ago by GerdP

Attachment: 17055-use-ignore.patch added

use values with prefix K: for checks with Levenshtein distance, too

comment:43 Changed 4 weeks ago by GerdP

With the 17055-use-ignore.patch the lines from ignoretags.cfg starting with K: are treated as "often used".
So, if you add K:shop=party to it
and run the validator with shop=party it will be quiet, for shop=panty it will show

Unknown property value - Value 'panty' for key 'shop' is not in the presets, maybe 'party' is meant? (1)

and for shop=pasty you get

Unknown property value - Value 'pasty' for key 'shop' is not in the presets, maybe one of [party, pastry] is meant? (1)

open questions:
1) Should we use another file for the often used values? If yes, should we use the "K:" prefix?
2) Should we always raise a warning when a tag key is found in that list but not the value?
Depending on this the message texts would change again, I think "not in the presets" would then be "unknown".

comment:44 Changed 3 weeks ago by Don-vip

Milestone: 19.01

comment:45 in reply to:  42 Changed 13 days ago by simon04

comment:46 Changed 13 days ago by GerdP

Yes, it is close to what we need, but it stops for value count < 10000. I think that's too early.

comment:47 Changed 13 days ago by simon04

Alternatively, you could download the entire SQLite database from https://taginfo.openstreetmap.org/download and perform custom queries.

Changed 4 days ago by GerdP

Attachment: frequent-tags.zip added

comment:48 Changed 4 days ago by GerdP

Thanks for the hint and sorry for the late answer. I've downloaded the complete DB (1.1GB) and played with a few SQL to extract frequently used tags.
My current thinking is this:
Execute a rather simple sql like

'select * from tags where count_all > 1000';

or maybe a bit more complex:

select * from tags where count_all > 1000 and key not like '%name%' and key not like '%source%' and key not like 'tiger%' and key not like 'yh:%' and key not like 'addr:%' and key not like 'is_in%' and key not like 'contact:%' and key not like 'turn:lanes%' and key not in ('circumference','color','colour','created_by','description','ele','fixme','height','width','level','opening_hours','website');

to extract a list of frequently used tags, lets name it frequent-tags.txt. Attached is the result for the latter.
The table tags looks like this:

CREATE TABLE tags (
  key              VARCHAR,
  value            VARCHAR,
  count_all        INTEGER DEFAULT 0,
  count_nodes      INTEGER DEFAULT 0,
  count_ways       INTEGER DEFAULT 0,
  count_relations  INTEGER DEFAULT 0,
  in_wiki          INTEGER DEFAULT 0,
  in_wiki_en       INTEGER DEFAULT 0
);

The result contains the columns separated by |.
My current thinking is this: I could either try to improve the sql so that it filters more tags which are not usefull for the Levenshtein text ("val looks like ...") so that we can add that result directly to the data\validator directory and parse it on each start of JOSM like we do with ignoretags.cfg. The result could be some HashSets, maybe separated for nodes, ways, and relations, which would be used to suggest different spelling.
I'll try to code that now. Question is how the input file is updated.
I think it would be best if that file would be created on the taginfo server once we know exactly how the format should look like?

comment:49 Changed 13 hours ago by GerdP

I've played with this for a while now. I have a list of frequently used tags where the values are strings. Many of them are also in the presets, BUT also many are not.

I don't think that I can simply use the list from taginfo without review, as it sometimes contains bad data from imports.
Example: leaf_type=broadl-leaved appears > 27000 times but is simply wrong (should be broadleaved)

I don't want to produce a hint like
Unknown property value - Value 'broadlleaved' for key 'leaf_type' is unknown, maybe one of [broadl-leaved, broadleaved] is meant? (1)
So the list has to be checked carefully.
I fear this will take some more days to produce a good start.
I think about involving the community here.

Modify Ticket

Change Properties
Set your email in Preferences
Action
as reopened The owner will remain team.
as The resolution will be set.
to The owner will be changed from team to the specified user.
The owner will change to GerdP
as duplicate The resolution will be set to duplicate.The specified ticket will be cross-referenced with this ticket

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.