Opened 12 years ago

Closed 12 years ago

#8354 closed defect (duplicate)

Supplementary characters out of the Unicode BMP

Reported by: verdy_p Owned by: team
Priority: normal Milestone:
Component: Core Version:
Keywords: Cc:

Description

I found some intriguing data on the node for Paris, France: it contains a name:got entry for the name written in the Gothic script and language.
However, this script can only be written with characters outside the Unicode BMP; in fact they are all in the SMP (Supplementary Multilingual Plane, plane 1).
JOSM then fails to validate this data and treats it as an overlong string even though it is short (𐍀𐌰𐍂𐌹𐍃 according to the name of the Wikipedia article).
Admittedly, JOSM cannot display this name because it currently lacks support for rendering this script.
But its validator should not report the key value as overlong (?).
In fact I cannot even copy-paste the name to replace it and fix the reported issue: JOSM rejects any surrogates (in D800..DFFF), even when they are correctly paired in the pasted data.
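
For illustration, here is a minimal Java sketch of how the same five-letter name yields different lengths depending on what is counted: UTF-16 code units, Unicode code points, or UTF-8 bytes. The class and method names are hypothetical, and this is not JOSM's validator code. It only shows the counting difference using standard `String` methods and does not establish what JOSM actually does.

```java
// Hypothetical demo class, not part of JOSM.
final class CodePointLengthDemo {

    /** Builds the Gothic name from its code points (U+10340 U+10330 U+10342 U+10339 U+10343). */
    static String gothicParis() {
        int[] codePoints = {0x10340, 0x10330, 0x10342, 0x10339, 0x10343};
        return new String(codePoints, 0, codePoints.length);
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String name = gothicParis();
        System.out.println(name.length());         // UTF-16 code units: 10
        System.out.println(codePointLength(name)); // code points: 5
        System.out.println(utf8Length(name));      // UTF-8 bytes: 20
    }
}
```

A length check meant to count characters would use the code point count rather than `String.length()`.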

OSM node id: 17807753

Change History (5)

comment:1 by verdy_p, 12 years ago

Note: For now I am considering deleting the Gothic name, as it causes trouble and is not needed. But the issue will be more significant in China, where toponyms DO require supplementary characters from the Supplementary Ideographic Plane in actual modern use!

JOSM should therefore support the input of data containing valid surrogate pairs (Java strings are encoded internally as UTF-16, even though the OSM protocols and XML file formats use UTF-8 externally and so contain no "surrogates"). It should support them as long as the surrogate pairs are valid:

  • always paired, where a single high surrogate in D800..DBFF is followed by a single low surrogate in DC00..DFFF
  • no isolated surrogates
  • for strict conformance when interchanging XML, in files or communication protocols, the pair does not encode any "non-character" code point such as U+1FFFF (only 32 surrogate pairs are affected)
  • and we should avoid using the supplementary planes for private use (planes 15 and 16, i.e. the encoded code points should all be below U+F0000, which is simple to handle by testing only the value of the leading high surrogate in UTF-16 encoded Java strings)
  • no assumptions should be enforced for now about the validity of other surrogate pairs encoding the range U+10000..U+EFFFF, except that some validators may emit errors about code points U+nFFFE and U+nFFFF (the last two code points of each plane, which map to the two highest values of the trailing surrogate, 0xDFFE and 0xDFFF, when the leading surrogate also maps to the end of a plane: 16 possible values). Only later should there be warnings about unassigned code points (this would require synchronisation with the Unicode standard versions), a list of known deprecated characters to avoid (such as the language tag characters in the special plane), and warnings about the presence of ASSIGNED private use characters (to avoid absolutely in OSM). Unassigned code points should still be accepted, unless JOSM really is ALWAYS in sync with the latest Unicode version (there may be temporary problems when a new version is released).
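
The rules listed above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class `SupplementaryCheck`, the enum `Issue` and the method `firstIssue` are hypothetical names, not JOSM code, and the sketch deliberately ignores BMP non-characters and unassigned code points. It relies only on standard `java.lang.Character` methods.

```java
// Hypothetical helper, not part of JOSM.
final class SupplementaryCheck {

    enum Issue { NONE, ISOLATED_SURROGATE, NONCHARACTER, PRIVATE_USE_PLANE }

    /** Returns the first problem found in the UTF-16 string, or NONE. */
    static Issue firstIssue(String s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be immediately followed by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return Issue.ISOLATED_SURROGATE;
                }
                int cp = Character.toCodePoint(c, s.charAt(i + 1));
                // U+nFFFE and U+nFFFF are non-characters in every plane.
                if ((cp & 0xFFFE) == 0xFFFE) {
                    return Issue.NONCHARACTER;
                }
                // Planes 15 and 16 (U+F0000 and above) are private use.
                if (cp >= 0xF0000) {
                    return Issue.PRIVATE_USE_PLANE;
                }
                i += 2;
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate.
                return Issue.ISOLATED_SURROGATE;
            } else {
                i++;
            }
        }
        return Issue.NONE;
    }

    public static void main(String[] args) {
        int[] gothic = {0x10340, 0x10330, 0x10342, 0x10339, 0x10343};
        System.out.println(firstIssue(new String(gothic, 0, gothic.length)));
        System.out.println(firstIssue(String.valueOf((char) 0xD800)));
        System.out.println(firstIssue(new String(Character.toChars(0x1FFFF))));
        System.out.println(firstIssue(new String(Character.toChars(0xF0000))));
    }
}
```

Any other well-formed pair in U+10000..U+EFFFF is accepted without further checks, matching the last rule in the list.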

comment:2 by Don-vip, 12 years ago

Reporter: changed from anonymous to verdy_p

comment:3 by Don-vip, 12 years ago

Is it not a duplicate of #3290 ?

comment:4 by verdy_p, 12 years ago

Yes, indeed, because I found this on a node name in Paris where validation tools detected overlong strings containing garbage, created or modified by JOSM not correctly understanding the Gothic name.
Note that JOSM also complains incorrectly about the language code "got" in name:got=* (or in wikipedia=got:* and wikipedia:got=*), which is perfectly valid.

comment:5 by Don-vip, 12 years ago

Resolution: duplicate
Status: changed from new to closed

Closed as duplicate of #3290.
