Opened 7 months ago

Closed 7 months ago

#17521 closed enhancement (fixed)

Complain about invisible characters (unicode bidi control) in tags

Reported by: mkoniecz Owned by: team
Priority: normal Milestone: 19.03
Component: Core validator Version:
Keywords: template_report unicode bidi control character Cc:

Description (last modified by mkoniecz)

What steps will reproduce the problem?

  1. Load https://www.openstreetmap.org/way/40545876/history in version 4
  2. Run validator

What is the expected result?

Validator complains about invisible characters and offers to remove them.

What happens instead?

Unknown property value - Value 'ground‬' for key 'surface' is unknown, maybe 'ground' is meant? (1)

It complains, but in a way that is extremely confusing.

Please provide any additional information below. Attach a screenshot if possible.

This specific issue should soon be gone as a result of https://wiki.openstreetmap.org/wiki/Mechanical_Edits/Mateusz_Konieczny_-_bot_account/elimination_of_nonprintable_characters_at_start_or_end_of_tags, but this one malformed 'ground' value alone was present 6k times.

URL:https://josm.openstreetmap.de/svn/trunk
Repository:UUID: 0c6e7542-c601-0410-84e7-c038aed88b3b
Last:Changed Date: 2019-03-24 22:30:59 +0100 (Sun, 24 Mar 2019)
Build-Date:2019-03-25 02:30:52
Revision:14927
Relative:URL: ^/trunk

Identification: JOSM/1.5 (14927 en) Linux Ubuntu 16.04.6 LTS
Memory Usage: 392 MB / 869 MB (166 MB allocated, but free)
Java version: 1.8.0_201-b09, Oracle Corporation, Java HotSpot(TM) 64-Bit Server VM
Screen: :0.0 1920x1080
Maximum Screen Size: 1920x1080
Dataset consistency test: No problems found

Plugins:
+ OpeningHoursEditor (34867)
+ buildings_tools (34904)
+ continuosDownload (82)
+ imagery_offset_db (34867)
+ measurement (34867)
+ reverter (34946)
+ todo (30306)

Validator rules:
+ ${HOME}/Desktop/tmp/unnecessary.validator.mapcss

Last errors/warnings:
- W: Invalid jar file ''<josm.userdata>/plugins/reverter.jar.new'' (exists: false, canRead: false)
- W: No configuration settings found.  Using hardcoded default values for all pools.
- W: java.net.SocketException: Socket closed
- E: java.net.SocketException: Socket closed

Attachments (2)

sample.osm (468 bytes) - added by GerdP 7 months ago.
17521.osm (1.3 KB) - added by Don-vip 7 months ago.

Change History (15)

comment:1 Changed 7 months ago by mkoniecz

Description: modified (diff)

comment:2 Changed 7 months ago by mkoniecz

Description: modified (diff)

Changed 7 months ago by GerdP

Attachment: sample.osm added

comment:3 Changed 7 months ago by GerdP

Thanks for finding this. I've already noticed these strange tags in taginfo but wasn't able to find one in OSM.
I've attached a sample file based on the wrong tag in the mentioned way.

comment:4 Changed 7 months ago by GerdP

Do you have a hint on how to detect those characters in Java? In this case it is 0x202c. I assume there are more?

comment:6 Changed 7 months ago by GerdP

Not really. I've already started to code a similar method `containsNonPrintable()`, but so far I have found no general rule
to detect characters that are not displayed. None of the code snippets I have found so far would return true for the sample.
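
To illustrate why common checks miss the sample, here is a minimal sketch. It is not the code that was later committed for this ticket. The class name `InvisibleCharProbe` and the method `containsFormatChar` are hypothetical. U+202C has Unicode general category Cf (format), so it is neither an ISO control character nor whitespace, and `String.trim()` leaves it in place. Testing the general category via `Character.getType` is one possible rule, under the assumption that flagging every Cf character is acceptable.

```java
final class InvisibleCharProbe {
    /** Returns true if the string contains a code point of Unicode general category Cf (format). */
    static boolean containsFormatChar(String s) {
        return s.codePoints().anyMatch(cp -> Character.getType(cp) == Character.FORMAT);
    }

    public static void main(String[] args) {
        // Value from the report: "ground" followed by U+202C (POP DIRECTIONAL FORMATTING)
        String value = "ground\u202c";
        int last = value.codePointAt(value.length() - 1);

        System.out.println(Character.isISOControl(last));  // false: not a C0/C1 control
        System.out.println(Character.isWhitespace(last));  // false: not whitespace
        System.out.println(value.trim().equals("ground")); // false: trim() does not remove it
        System.out.println(containsFormatChar(value));     // true: category Cf
    }
}
```

Note that category Cf is broader than the bidi controls: it also covers characters such as the soft hyphen and the zero width joiner, which can be legitimate inside some values, so a check like this may report more than intended.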

comment:7 Changed 7 months ago by Don-vip

Resolution: duplicate
Status: new → closed

Closed as duplicate of #15645.

comment:8 Changed 7 months ago by mkoniecz

Given its title and description, #15645 seems to be about fixing already detected cases.

Here the problem is not even detected.

Though maybe #15645 is meant for any validation related to whitespace.

comment:9 Changed 7 months ago by Don-vip

Resolution: duplicate
Status: closed → reopened

Right.

Changed 7 months ago by Don-vip

Attachment: 17521.osm added

comment:10 Changed 7 months ago by Don-vip

Keywords: unicode bidi control character added
Summary: Complain about invisible characters in tags → Complain about invisible characters (unicode bidi control) in tags

It's a Unicode bidi control character: https://en.wikipedia.org/wiki/Unicode_control_characters#Bidirectional_text_control

ASCII control characters are already detected (that's effectively tracked in #15645)

comment:11 Changed 7 months ago by Don-vip

JDK implementation (in sun.text.bidi.BidiBase) is:

    static boolean IsBidiControlChar(int c)
    {
        /* check for range 0x200c to 0x200f (ZWNJ, ZWJ, LRM, RLM) or
                           0x202a to 0x202e (LRE, RLE, PDF, LRO, RLO) */
        return (((c & 0xfffffffc) == 0x200c) || ((c >= 0x202a) && (c <= 0x202e)));
    }
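
Since `sun.text.bidi.BidiBase` is an internal JDK class, application code would need its own copy of the check. The following is a minimal sketch under that assumption, not the change that closed this ticket. The class `BidiControlChars` and its methods `isBidiControl` and `strip` are hypothetical names. It uses the same two ranges as the quoted snippet, written as plain comparisons, and removes matching code points from a tag value.

```java
final class BidiControlChars {
    /** Same ranges as the quoted JDK snippet: U+200C..U+200F and U+202A..U+202E. */
    static boolean isBidiControl(int c) {
        return (c >= 0x200c && c <= 0x200f) || (c >= 0x202a && c <= 0x202e);
    }

    /** Returns the value with every code point in those ranges removed. */
    static String strip(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        value.codePoints()
             .filter(cp -> !isBidiControl(cp))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Value from the report: "ground" followed by U+202C
        String value = "ground\u202c";
        System.out.println(value.length());        // 7
        System.out.println(strip(value));          // ground
        System.out.println(strip(value).length()); // 6
    }
}
```

Two caveats apply to this sketch. The quoted ranges do not include the directional isolate controls U+2066 to U+2069, so a validator might want to consider those as well. The ranges also include ZWNJ and ZWJ, which can be meaningful in the middle of a value in some scripts, so removing them only at the start or end of a value may be the safer choice.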

comment:12 Changed 7 months ago by Don-vip

Milestone: 19.03

comment:13 Changed 7 months ago by Don-vip

Resolution: fixed
Status: reopened → closed

In 14933/josm:

fix #15645, fix #17521 - Detect and fix non-printing control characters
