Opened 6 years ago
Last modified 5 years ago
#17609 new enhancement
Detect invalid usage of ZWNJ/ZWJ characters
Reported by: | Don-vip | Owned by: | team |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Core validator | Version: | |
Keywords: | unicode persian arabic | Cc: | |
Description
Follow-up of #17595. We should detect incorrect uses of wikipedia:Zero-width_joiner and wikipedia:Zero-width_non-joiner characters in OSM tags:
> In practice, ZWNJ is like a space but with zero width, and it prevents adjacent characters from joining each other. Although it is a valid character in Persian, there are cases where it appears at an invalid position in a word. So it would be nice if we could keep warnings for the invalid cases.
>
> As an example, take "aa*aaa" as a Persian word containing a ZWNJ, where the asterisk stands for the ZWNJ.
>
> The only valid case is aa*aaa
> (between two adjacent letters that can join each other).
>
> More common invalid cases:
>
> * A doubled (or further repeated) ZWNJ, like a doubled space: aa**aaa
> * At the start or end of a word: *aa*aaa or aa*aaa*
> * Immediately before or after a space character: aa* aaa or aa *aaa (this can happen inside a word, because ZWNJ is normally typed with Shift+Space)
> * Maybe a trickier one:
>   * Seven letters (و, ژ, ز, ر, ذ, د, ا) do not connect to a following letter, so writing a ZWNJ after them is useless. Assuming b is one of them, this is invalid: ab*aaa (this case can also occur in other languages with similar, but not identical, letters)
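The four invalid cases quoted above could be checked roughly as follows. This is only a minimal sketch in Python, not the validator's actual implementation: the function name `find_invalid_zwnj`, the reason strings, and the decision to treat the start and end of the tag value as word boundaries are all assumptions made for illustration. It covers ZWNJ only (not ZWJ), does not treat punctuation as a word boundary, and does not flag a ZWNJ placed before one of the seven letters, since the ticket leaves that case open.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# The seven Persian letters listed in the ticket that do not connect
# to a following letter: و ژ ز ر ذ د ا
NON_JOINING_TO_NEXT = set("\u0648\u0698\u0632\u0631\u0630\u062f\u0627")


def find_invalid_zwnj(value):
    """Return a list of (index, reason) for each suspicious ZWNJ in value.

    Hypothetical helper: the name and reason strings are illustrative only.
    """
    problems = []
    for i, ch in enumerate(value):
        if ch != ZWNJ:
            continue
        prev = value[i - 1] if i > 0 else None
        nxt = value[i + 1] if i + 1 < len(value) else None
        if prev is None or nxt is None:
            # Assumption: the ends of the value count as word boundaries.
            problems.append((i, "at start or end"))
        elif prev == ZWNJ:
            # Only the second and later ZWNJ of a run are reported.
            problems.append((i, "repeated"))
        elif prev.isspace() or nxt.isspace():
            problems.append((i, "next to whitespace"))
        elif prev in NON_JOINING_TO_NEXT:
            problems.append((i, "after non-joining letter"))
    return problems
```

Using the ticket's notation with `a` as a stand-in letter, `find_invalid_zwnj("aa\u200caaa")` returns `[]`, while `find_invalid_zwnj("aa\u200c\u200caaa")` returns `[(3, "repeated")]`.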
Change History (3)
comment:1 by , 6 years ago
comment:2 by , 5 years ago
@iman: can the seven letters connect to a previous letter? With your example, is aa*baa valid?
oh thanks!