Opened 5 years ago

Last modified 21 months ago

#3290 new defect

Make sure an XML parser fully supporting UTF-16 is used by JOSM

Reported by: Gubaer Owned by: team
Priority: major Milestone:
Component: Core Version:
Keywords: Cc:

Description (last modified by verdy_p)

See the discussion on dev and josm-dev:

  • there's a problem with XML parsers which don't handle UTF-16 correctly. Apparently, they insert duplicates of surrogate code points in OSM keys or values while parsing. After a couple of IO operations even small OSM files/fragments can become very large. In OSM the problem was spotted because of gothic code points in name:got-tags.
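
For context, here is a minimal Java sketch of why Gothic text triggers this: Gothic letters lie outside the BMP, so Java's UTF-16 strings store each one as a surrogate pair. The class name `SurrogateDemo` and the choice of U+10330 (GOTHIC LETTER AHSA) are illustrative assumptions, not taken from the threads.

```java
final class SurrogateDemo {
    /** Returns U+10330 GOTHIC LETTER AHSA as a Java string (two UTF-16 code units). */
    static String gothicAhsa() {
        return new String(Character.toChars(0x10330));
    }

    /** Counts the UTF-16 code units in s that are surrogates (high or low). */
    static int countSurrogateUnits(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String s = gothicAhsa();
        System.out.println("code units:  " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("surrogates:  " + countSurrogateUnits(s));
    }
}
```

In a well-formed string the surrogate count is exactly twice the number of supplementary code points, so a parser that duplicates surrogates would break that relation.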

JOSM should either ship a compliant parser with its distribution or check/enforce on startup that a known compliant parser is on the classpath.
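
A startup check along these lines could look roughly like the sketch below. It is not JOSM code: the class `XmlParserSelfTest`, its method names, and the test document are hypothetical. The input is fed to the parser as characters, so it exercises surrogate handling in the scanner rather than byte-level UTF-16 decoding.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmlParserSelfTest {
    // U+10330 GOTHIC LETTER AHSA, written as its UTF-16 surrogate pair.
    static final String GOTHIC = "\uD800\uDF30";

    /** Name of the SAX factory implementation that JAXP resolved. */
    static String parserImplementation() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** Parses a tiny document and reports whether the attribute values survive intact. */
    static boolean handlesSupplementaryCharacters() throws Exception {
        String xml = "<osm><tag k=\"name:got\" v=\"" + GOTHIC + "\"/>"
                   + "<tag k=\"name:zh\" v=\"\u4E2D\"/></osm>";
        final StringBuilder seen = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("tag".equals(qName)) {
                    // Record key=value; so duplicated or leaked characters show up.
                    seen.append(atts.getValue("k")).append('=')
                        .append(atts.getValue("v")).append(';');
                }
            }
        });
        return seen.toString().equals("name:got=" + GOTHIC + ";name:zh=\u4E2D;");
    }
}
```

If `handlesSupplementaryCharacters()` returned false, the application could log `parserImplementation()` and warn the user or refuse to load data.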

Attachments (2)

gottic.osm (249 bytes) - added by Don-vip 21 months ago.
Sample file with two tags containing a Chinese character and a Gothic one
(2.0 KB) - added by Don-vip 21 months ago.
Program demonstrating bug with default XML parser


Change History (11)

comment:1 Changed 5 years ago by Gubaer

Here are the links to the two threads:

comment:2 Changed 5 years ago by stoecker

  • Priority changed from critical to major

comment:3 follow-up: Changed 21 months ago by verdy_p

  • Description modified (diff)

Note that by default Java 6/7 are embedding version 2.1.0 of Xerces (but a bizarre forked version of it), which is a bit out of date and has limited capabilities (basically just SAX, plus a limited DOM parser, since it offers only limited resolution of schemas and DTDs so that it can work in standalone mode without validation or full resolution of entities).

Oracle/Sun refused to upgrade it, in order to preserve the existing functionality (or lack of functionality) of its API, but it should at least have been patched long ago to handle UTF-16 correctly. The intent was to let applications use their own separate versions when needed. Xerces and SAX are embedded only to support the core Java API and to secure it (by not allowing external entity resolvers based on external URLs pointing to rogue domains in some XML files: allowing them would have required more security checks within the JRE core, and slower performance).

But with a single-jar installation of JOSM (and no other JNI components), integrating another version of Xerces would mean bundling it inside that jar instead of shipping the usual separate jar.

Xerces is not a huge package; it could be embedded in the distribution with a very modest increase in size (about 4.2MB for the compiled JAR, and less if using the newer JAR format; it used to be about 6.5MB in the old JAR format before Java 6).

If needed, a streamlined version of Xerces just needed for the OSM interchange protocol would be enough (no validation, no DTDs, no parsed entities, and predefined XML schemas; possibly working only in SAX mode, the rest being performed in JOSM itself as there's no need to support the DOM API, when the internal OSM schema is enough to store the parsed documents).

Note that there are also separate issues. Independently of XML parsing, there is also a need to validate strings entered or modified in the editor. Supporting and validating text input in forms requires more development (for now, any attempt to paste non-BMP characters discards them from the input, because it silently rejects all surrogates). If surrogates are allowed, the editor will need to check UTF-16 conformance (notably when the Delete or Backspace key is pressed, when pasting into the editor, or when the editor must truncate strings that are too long). Finally, the data validator could include an additional Unicode conformance test (including tests for characters to avoid, such as undesired controls and non-characters like U+FFFE).
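
The checks described above could be sketched in Java as follows. This is only an illustration under stated assumptions: the class `Utf16Checks` and its methods are hypothetical, not existing JOSM code, and `truncate` assumes a non-negative length limit counted in UTF-16 code units.

```java
final class Utf16Checks {
    /** True if every surrogate in s belongs to a well-formed high+low pair. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without its partner
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    /** True for Unicode noncharacters: U+FDD0..U+FDEF and U+xxFFFE / U+xxFFFF. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Truncates to at most maxUnits UTF-16 code units without splitting a pair. */
    static String truncate(String s, int maxUnits) {
        if (s.length() <= maxUnits) {
            return s;
        }
        int end = maxUnits;
        if (end > 0 && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // the cut would fall inside a pair; drop the whole pair
        }
        return s.substring(0, end);
    }
}
```

Backspace and Delete handling would need the same pair-awareness: removing one half of a pair leaves a string that fails `isWellFormed`.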

There's another issue with rendering text that uses non-BMP characters: Java supports it, but only with some prior setup of supporting fonts. A basic installation of Java does not deliver many fonts; others are assumed to be provided by the underlying OS, but this may fail. It would then require special settings in stylesheets to map some scripts or script ranges to available fonts, or to use preferred fonts for displaying translated "name:lang=*" or "wikipedia:lang=*" or "wikipedia=lang:*" values, instead of the default font used for displaying tag values.

Finally, there remain issues when displaying some complex scripts that require OpenType features (this includes text selection by grapheme clusters, validating those clusters with a cluster boundary checker, handling BiDi text...). But this will go into another tracking bug for internationalization (needed notably to support Indic scripts: OSM currently includes mostly English names, but very few names in Hindi, Thai or Lao for example; Chinese, Japanese and Korean are much less complex to support, except for non-BMP sinograms in Chinese).

There are also unsolved issues with Arabic and Hebrew, with tags containing these scripts mixed with Latin and neutral punctuation, possibly mirrored. A correct display in a single string value would require BiDi controls, but overrides and embedding controls are not a good long-term solution; only support for BiDi isolates (with FSI/PDI) will make strings easily reusable and embeddable in any context. However, the FSI..PDI controls are still not standardized by Unicode; they may appear in the next release of Unicode with a major update of the BiDi algorithm, which will deprecate the legacy use of RLO..LRO and RLE/LRE..PDF controls. Here again this will require a separate tracking bug, to support this in JOSM as well as in Flash for Potlatch, online in HTML in the OSM data browser pages, and in renderers like Mapnik. And nothing will be done before there's an update in Unicode, CSS, and HTML5 to support BiDi isolates.

comment:4 Changed 21 months ago by Don-vip

Ticket #8354 has been marked as a duplicate of this ticket.

comment:5 in reply to: ↑ 3 Changed 21 months ago by Don-vip

Replying to verdy_p:

Note that by default Java 6/7 are embedding version 2.1.0 of Xerces

According to the JAXP web site, JAXP 1.4.6 has been embedded since Java 7 update 4 and contains (a fork of) Xerces 2.7.1 with "all blocker/critical fixes" of version 2.10.0.

If this problem is still present with the latest Java 7 update, it should be reported as a bug to the JAXP team. It should not be up to JOSM to fix this kind of issue.

Changed 21 months ago by Don-vip

Sample file with two tags containing a Chinese character and a Gothic one

Changed 21 months ago by Don-vip

Program demonstrating bug with default XML parser

comment:6 Changed 21 months ago by Don-vip

comment:7 Changed 21 months ago by verdy_p

Minor editorial bug in your sample Java code: you encode Chinese as "ch" (the language code for another language) instead of "zh"... This still does not invalidate your bug report, which is more critical than expected because it exposes data from another unrelated attribute by concatenating them, which means severe unexpected data corruption when handling XML files containing supplementary characters.
So Java has another bug...

comment:8 Changed 21 months ago by anonymous

For now the only workaround for us is to install a decent version of Xerces in the "lib/endorsed" subdirectory of Java, to replace the default implementation in the JRE (changing the classpath, even at run-time, is not enough, unless we use our own classloader to override the package from which Xerces is loaded with another package path supplied by the application).
I still wonder why Java does not use the standard classpath for its own implementation of Xerces (notably because it also reuses the same default package name).
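
One alternative to relying on the default lookup is to ask JAXP for a specific factory class by name, which `SAXParserFactory.newInstance(String, ClassLoader)` has allowed since Java 6. The sketch below assumes a stand-alone Apache Xerces-J jar is visible to the given class loader; the class `ParserSelection` and its method are hypothetical, and whether this is sufficient on the JREs discussed here is untested.

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;

final class ParserSelection {
    // Factory class shipped in the stand-alone Apache Xerces-J jar.
    static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    /** Prefers a bundled Xerces if the given class loader can see it, else the JRE default. */
    static SAXParserFactory preferredFactory(ClassLoader loader) {
        try {
            return SAXParserFactory.newInstance(XERCES_FACTORY, loader);
        } catch (FactoryConfigurationError e) {
            // Xerces is not available to this class loader; fall back to the default.
            return SAXParserFactory.newInstance();
        }
    }
}
```

A caller would pass the application's own class loader, for example `ParserSelection.preferredFactory(ParserSelection.class.getClassLoader())`, and could log the returned factory's class name to confirm which implementation was picked.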

comment:9 Changed 21 months ago by Don-vip

No response from the JAXP team, so I have submitted another bug directly to
