Opened 7 years ago
Closed 7 years ago
#17521 closed enhancement (fixed)
Complain about invisible characters (unicode bidi control) in tags
| Reported by: | mkoniecz | Owned by: | team |
|---|---|---|---|
| Priority: | normal | Milestone: | 19.03 |
| Component: | Core validator | Version: | |
| Keywords: | template_report unicode bidi control character | Cc: |
Description
What steps will reproduce the problem?
- Load version 4 of https://www.openstreetmap.org/way/40545876/history
- Run validator
What is the expected result?
Validator complains about invisible characters and offers to remove them.
What happens instead?
Unknown property value - Value 'ground' for key 'surface' is unknown, maybe 'ground' is meant? (1)
It complains, but in a way that is extremely confusing.
Please provide any additional information below. Attach a screenshot if possible.
This specific issue should soon be gone as a result of https://wiki.openstreetmap.org/wiki/Mechanical_Edits/Mateusz_Konieczny_-_bot_account/elimination_of_nonprintable_characters_at_start_or_end_of_tags, but this one malformed 'ground' value alone was present 6k times.
URL: https://josm.openstreetmap.de/svn/trunk
Repository UUID: 0c6e7542-c601-0410-84e7-c038aed88b3b
Last Changed Date: 2019-03-24 22:30:59 +0100 (Sun, 24 Mar 2019)
Build-Date: 2019-03-25 02:30:52
Revision: 14927
Relative URL: ^/trunk
Identification: JOSM/1.5 (14927 en) Linux Ubuntu 16.04.6 LTS
Memory Usage: 392 MB / 869 MB (166 MB allocated, but free)
Java version: 1.8.0_201-b09, Oracle Corporation, Java HotSpot(TM) 64-Bit Server VM
Screen: :0.0 1920x1080
Maximum Screen Size: 1920x1080
Dataset consistency test: No problems found
Plugins:
+ OpeningHoursEditor (34867)
+ buildings_tools (34904)
+ continuosDownload (82)
+ imagery_offset_db (34867)
+ measurement (34867)
+ reverter (34946)
+ todo (30306)
Validator rules:
+ ${HOME}/Desktop/tmp/unnecessary.validator.mapcss
Last errors/warnings:
- W: Invalid jar file ''<josm.userdata>/plugins/reverter.jar.new'' (exists: false, canRead: false)
- W: No configuration settings found. Using hardcoded default values for all pools.
- W: java.net.SocketException: Socket closed
- E: java.net.SocketException: Socket closed
Attachments (2)
Change History (15)
comment:1 by , 7 years ago
| Description: | modified (diff) |
|---|
comment:2 by , 7 years ago
| Description: | modified (diff) |
|---|
by , 7 years ago
| Attachment: | sample.osm added |
|---|
comment:3 by , 7 years ago
comment:4 by , 7 years ago
Do you have a hint how to detect those characters in Java? In this case it is 0x202c. I assume there are more?
comment:5 by , 7 years ago
comment:6 by , 7 years ago
Not really. I've already started to code a similar method `containsNonPrintable()`, but so far I have found no general rule to detect characters which are not displayed. None of the code snippets that I have found so far return true for the sample.
comment:7 by , 7 years ago
| Resolution: | → duplicate |
|---|---|
| Status: | new → closed |
Closed as duplicate of #15645.
comment:8 by , 7 years ago
by , 7 years ago
comment:10 by , 7 years ago
| Keywords: | unicode bidi control character added |
|---|---|
| Summary: | Complain about invisible characters in tags → Complain about invisible characters (unicode bidi control) in tags |
It's a Unicode bidi control character: https://en.wikipedia.org/wiki/Unicode_control_characters#Bidirectional_text_control
ASCII control characters are already detected (that's effectively tracked in #15645).
comment:11 by , 7 years ago
JDK implementation (in sun.text.bidi.BidiBase) is:
    static boolean IsBidiControlChar(int c) {
        /* check for range 0x200c to 0x200f (ZWNJ, ZWJ, LRM, RLM) or
                           0x202a to 0x202e (LRE, RLE, PDF, LRO, RLO) */
        return (((c & 0xfffffffc) == 0x200c) || ((c >= 0x202a) && (c <= 0x202e)));
    }
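As a rough illustration of how the JDK check quoted above could be applied to a tag value, here is a minimal sketch. The class `BidiControlCheck` and the methods `containsBidiControl` and `stripBidiControl` are hypothetical names made up for this example. They are not part of JOSM or the JDK and do not claim to be the fix that was eventually committed. The sample value assumes that U+202C sits at the end of `ground`, which this ticket does not state explicitly.

```java
public final class BidiControlCheck {
    private BidiControlCheck() {
        // utility class, not meant to be instantiated
    }

    /** Same ranges as the quoted JDK check: U+200C..U+200F and U+202A..U+202E. */
    public static boolean isBidiControlChar(int c) {
        return (c & 0xfffffffc) == 0x200c || (c >= 0x202a && c <= 0x202e);
    }

    /** Hypothetical helper: true if any code point of the value is a bidi control character. */
    public static boolean containsBidiControl(String value) {
        return value.codePoints().anyMatch(BidiControlCheck::isBidiControlChar);
    }

    /** Hypothetical helper: returns the value with all bidi control characters removed. */
    public static String stripBidiControl(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        value.codePoints()
             .filter(c -> !isBidiControlChar(c))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: "ground" followed by U+202C (POP DIRECTIONAL FORMATTING)
        String tagValue = "ground\u202c";
        System.out.println(containsBidiControl(tagValue));
        System.out.println(stripBidiControl(tagValue).equals("ground"));
    }
}
```

A validator test built along these lines could report the tag with a dedicated message and offer the stripped value as an automatic fix, instead of the confusing "maybe 'ground' is meant?" suggestion. Note that the range also covers ZWNJ and ZWJ, which are legitimate in some scripts, so a real check might want to treat those separately.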
comment:12 by , 7 years ago
| Milestone: | → 19.03 |
|---|



Thanks for finding this. I've already noticed these strange tags in taginfo but wasn't able to find one in OSM.
I've attached a sample file based on the wrong tag in the mentioned way.