Opened 6 months ago

Closed 6 months ago

Last modified 6 months ago

#17595 closed defect (fixed)

False positive "Tag value contains non-printing character" for Persian script

Reported by: Claudius Owned by: Don-vip
Priority: normal Milestone: 19.04
Component: Core validator Version:
Keywords: template_report persian unicode regression Cc:

Description (last modified by Klumbumbus)

What steps will reproduce the problem?

  1. Load osmwww:node/3305202081 (Tested with node version v8 which contains the tag wikipedia=fa:منجیل‌آباد )
  2. Run validator

What is the expected result?

No "Tag value contains non-printing character" warning to be shown

What happens instead?

"Tag value contains non-printing character" is shown. The name Manjilabad, written in Persian script as منجیل‌آباد, consists of perfectly valid and printable characters.

Relative:URL: ^/trunk
Repository:UUID: 0c6e7542-c601-0410-84e7-c038aed88b3b
Last:Changed Date: 2019-04-11 21:18:16 +0200 (Thu, 11 Apr 2019)
Revision:14986
Build-Date:2019-04-12 01:30:51
URL:https://josm.openstreetmap.de/svn/trunk

Identification: JOSM/1.5 (14986 de) Windows 10 64-Bit
OS Build number: Windows 10 Home 1809 (17763)
Memory Usage: 653 MB / 2048 MB (142 MB allocated, but free)
Java version: 11.0.1+13, AdoptOpenJDK, OpenJDK 64-Bit Server VM
Screen: \Display0 2736x1824
Maximum Screen Size: 2736x1824
Dataset consistency test: No problems found

Plugins:
+ OpeningHoursEditor (34867)

Tagging presets:
+ https://josm.openstreetmap.de/josmfile?page=Presets/OpenPisteMap&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Presets/OneClick&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Presets/Iranian_Preset&zip=1

Map paint styles:
+ https://josm.openstreetmap.de/josmfile?page=Styles/MaxspeedIcons&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Styles/Lane_and_Road_Attributes&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Styles/LayerChecker&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Styles/AdvertisingStyle&zip=1
+ https://raw.githubusercontent.com/species/josm-preset-traffic_sign_direction/master/direction.mapcss
+ https://github.com/osmlab/appledata/archive/josm_paint_inline_validation.zip

Last errors/warnings:
- W: No configuration settings found.  Using hardcoded default values for all pools.

Attachments (0)

Change History (20)

comment:1 Changed 6 months ago by GerdP

This message is not about Persian script characters. Please try the Fix button and check the result.

comment:2 Changed 6 months ago by Don-vip

Keywords: unicode block persian added
Owner: changed from team to Don-vip
Status: changed from new to assigned

There is no autofix for this check. It is impossible to guess if the string contains an extra character that must be deleted or if someone used a wrong character instead of the correct one.

comment:3 Changed 6 months ago by GerdP

There is an autofix. I am just not sure whether it works for this case, where left-to-right and right-to-left text are mixed.

comment:4 Changed 6 months ago by Klumbumbus

Description: modified (diff)

comment:5 Changed 6 months ago by Don-vip

Keywords: unicode block removed

Ah, I confused this with another check. With the autofix, "fa:منجیل‌آباد" becomes "fa:منجیلآباد".

comment:7 Changed 6 months ago by GerdP

Please double check. For me both links work and I think they both link to the same page.

comment:8 Changed 6 months ago by mkoniecz

For me the first links to https://fa.wikipedia.org/wiki/%D9%85%D9%86%D8%AC%DB%8C%D9%84%E2%80%8C%D8%A2%D8%A8%D8%A7%D8%AF with the article and second to https://fa.wikipedia.org/wiki/%D9%85%D9%86%D8%AC%DB%8C%D9%84%D8%A2%D8%A8%D8%A7%D8%AF with "There is currently no text in this page."
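
The difference between the two links can be made visible by percent-encoding both titles. The following is a minimal sketch, not part of JOSM: the class and method names are hypothetical, and it assumes Java 10 or later for `URLEncoder.encode(String, Charset)`. The only difference between the two encoded strings is the `%E2%80%8C` sequence, which is the UTF-8 encoding of U+200C (zero-width non-joiner).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ZwnjUrlDemo {
    // Title from the ticket, spelled with escapes so the invisible character is explicit.
    static final String WITH_ZWNJ =
            "\u0645\u0646\u062C\u06CC\u0644" + "\u200C" + "\u0622\u0628\u0627\u062F";
    // Same title after the autofix removed the zero-width non-joiner.
    static final String WITHOUT_ZWNJ = WITH_ZWNJ.replace("\u200C", "");

    // Percent-encode a title as UTF-8 (hypothetical helper for illustration).
    static String encode(String title) {
        return URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode(WITH_ZWNJ));
        System.out.println(encode(WITHOUT_ZWNJ));
    }
}
```

Because the two strings differ by one code point, they name two different wiki pages, even though they may look nearly identical on screen.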

comment:9 in reply to:  8 Changed 6 months ago by GerdP

Replying to mkoniecz:

Yes, sorry, my fault. I think now I understand better what dyslexics means :O
For me both pages looked similar and I did not even try to find out what the words mean :(

comment:10 Changed 6 months ago by FTABoy

Hi guys, the problem is with the "zero-width non-joiner" (ZWNJ) character, also known as "half space", in the 'wikipedia' tag. Please read this wiki page about the character: https://en.wikipedia.org/wiki/Zero-width_non-joiner
We do not always use this character when writing (typing) Persian, but I think it is a valid character in Farsi. I also think this problem is not limited to the 'wikipedia' tag; other tags such as name have the same issue. In my opinion this character should not be detected as invalid.
PS: I speak Persian/Farsi, so feel free to ask any questions about it.
Regards
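
A short sketch of why a generic "non-printing character" check can trip over this character: in Unicode, U+200C (ZWNJ) and U+200D (ZWJ) belong to the general category Cf (format), so any test that rejects format or control characters will flag them. Whether the JOSM validator used exactly this category test is an assumption; the class and method names below are hypothetical.

```java
class ZwnjCategory {
    // True if the code point is in Unicode general category Cf (format).
    static boolean isFormatChar(int codePoint) {
        return Character.getType(codePoint) == Character.FORMAT;
    }

    public static void main(String[] args) {
        System.out.println(isFormatChar(0x200C)); // zero-width non-joiner
        System.out.println(isFormatChar(0x0644)); // Arabic letter lam
    }
}
```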

Last edited 6 months ago by FTABoy

comment:11 Changed 6 months ago by Sosha

I don't speak or write English very well, so I'm very sorry.
In Iran and other countries (such as Afghanistan), we use the "zero-width non-joiner" in writing. It is usually used in names, e.g. "محمود‌آباد", but it also appears in many ordinary words such as "ثبت‌نام" or "می‌خواهم".
Please fix it.
Thanks.

comment:12 Changed 6 months ago by GerdP

@Vincent: Maybe we can change the test so that it only creates the error when the "fixed" value contains only Latin-1 characters?
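
The suggestion above could be sketched roughly as follows. This is an illustration only, not the validator's code: it assumes the autofix simply deletes ZWNJ (U+200C) and ZWJ (U+200D), and all names are hypothetical. A warning is raised only when stripping changes the value and the stripped result stays within the Latin-1 range (U+0000 to U+00FF).

```java
class Latin1Check {
    // Assumed autofix: delete zero-width non-joiner and zero-width joiner.
    static String stripJoiners(String value) {
        return value.replace("\u200C", "").replace("\u200D", "");
    }

    // True if every UTF-16 code unit is in the Latin-1 range.
    static boolean isLatin1Only(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    // Warn only when a joiner was present and the fixed value is pure Latin-1.
    static boolean shouldWarn(String value) {
        String fixed = stripJoiners(value);
        return !fixed.equals(value) && isLatin1Only(fixed);
    }
}
```

With this rule a stray joiner inside a Latin-script value is still reported, while Persian values such as the one in this ticket are left alone.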

comment:13 Changed 6 months ago by iman

Hi,
In practice, ZWNJ is like a space with a width of zero: it prevents adjacent characters from joining each other. Although it is a valid character in Persian, there are cases where it appears at an invalid position in a word, so it would be nice if we could keep the warnings for those invalid cases.

As an example, think of "aa*aaa" as a Persian word containing a ZWNJ, where the asterisk stands for the ZWNJ.

The only valid case is aa*aaa
(between two adjacent letters that can join each other).

More common cases that are invalid:

  • doubled (or further repeated) ZWNJ, like a doubled space: aa**aaa
  • at the start or end of a word: *aa*aaa or aa*aaa*
  • immediately before or after a space character: aa* aaa or aa *aaa (this can happen inside a word, because we normally type ZWNJ with Shift+Space)
  • maybe a trickier one:
    • We have seven letters (و, ژ, ز, ر, ذ, د, ا) that do not connect to a following letter, so writing a ZWNJ after them is useless and not needed. Assume b is one of them; then this is invalid: ab*aaa (this case can also occur in other languages with similar but not identical letters)

thanks
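
The rules listed in this comment can be sketched as a small checker. This is a hypothetical illustration of the four cases described above, not code from JOSM; the class, method, and constant names are assumptions, and word boundaries are approximated by the string edges and whitespace.

```java
class ZwnjPositions {
    static final char ZWNJ = '\u200C';
    // The seven letters named above that never join to a following letter:
    // alef, dal, thal, reh, zain, jeh, waw.
    static final String NON_JOINING = "\u0627\u062F\u0630\u0631\u0632\u0698\u0648";

    // Returns true if any ZWNJ in the string sits at one of the invalid positions.
    static boolean hasMisplacedZwnj(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != ZWNJ) {
                continue;
            }
            // Start or end of the string.
            if (i == 0 || i == s.length() - 1) {
                return true;
            }
            char prev = s.charAt(i - 1);
            char next = s.charAt(i + 1);
            // Doubled ZWNJ.
            if (prev == ZWNJ || next == ZWNJ) {
                return true;
            }
            // Directly before or after whitespace (start or end of a word).
            if (Character.isWhitespace(prev) || Character.isWhitespace(next)) {
                return true;
            }
            // After a letter that would not join to the next letter anyway.
            if (NON_JOINING.indexOf(prev) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

The last rule is specific to Persian; as the comment notes, other languages using similar letters may need a different set.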

comment:14 Changed 6 months ago by GerdP

The test was introduced to fix #17521, that's why I would try to limit it to those cases. I see no way to handle all the special cases of hundreds of languages.

comment:15 Changed 6 months ago by iman

No problem, maybe we should check it using some special quality assurance tools.

comment:16 in reply to:  13 Changed 6 months ago by Don-vip

Replying to iman:

Thank you for this explanation! I will detect and fix the simplest case (start or end of the complete string) to fix this ticket and open a new one to see if we can do better.
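
The "simplest case" mentioned here (a joiner at the very start or end of the complete string) might look like the sketch below. It is not the code committed for this ticket; the names are hypothetical, and it covers both ZWNJ (U+200C) and ZWJ (U+200D). Joiners inside the string are deliberately left untouched.

```java
class EdgeJoiners {
    static boolean isJoiner(char c) {
        return c == '\u200C' || c == '\u200D';
    }

    // Detect a joiner at the first or last position of the whole value.
    static boolean hasEdgeJoiner(String value) {
        return !value.isEmpty()
                && (isJoiner(value.charAt(0)) || isJoiner(value.charAt(value.length() - 1)));
    }

    // Fix: drop leading and trailing joiners, keep everything in between.
    static String stripEdgeJoiners(String value) {
        int start = 0;
        int end = value.length();
        while (start < end && isJoiner(value.charAt(start))) {
            start++;
        }
        while (end > start && isJoiner(value.charAt(end - 1))) {
            end--;
        }
        return value.substring(start, end);
    }
}
```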

Last edited 6 months ago by Don-vip

comment:17 Changed 6 months ago by Don-vip

Keywords: unicode added
Milestone: 19.04

comment:18 Changed 6 months ago by Don-vip

Resolution: fixed
Status: changed from assigned to closed

In 14991/josm:

fix #17595 - smarter detection of ZWNJ/ZWJ unicode characters

comment:19 Changed 6 months ago by Don-vip

Component: changed from Core to Core validator
Keywords: regression added

comment:20 Changed 6 months ago by Don-vip

Follow-up: #17609
