Opened 11 years ago

Closed 6 years ago

Last modified 5 years ago

#3290 closed defect (othersoftware)

Make sure an XML parser fully supporting UTF-16 is used by JOSM

Reported by: Gubaer Owned by: team
Priority: major Milestone:
Component: Core Version:
Keywords: javabug 9 xml unicode stax Cc:

Description (last modified by verdy_p)

See the discussion on dev and josm-dev:

  • there's a problem with XML parsers which don't handle UTF-16 correctly. Apparently, they insert duplicates of surrogate code points in OSM keys or values while parsing. After a couple of IO operations even small OSM files/fragments can become very large. In OSM the problem was spotted because of gothic code points in name:got-tags.

JOSM should either ship a compliant parser with its distribution or check/enforce on startup that a known compliant parser is on the classpath.
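A startup check along these lines could be sketched as follows. This is only a minimal illustration, not JOSM code: the class name `XmlParserSelfTest`, the method name, and the sample document are hypothetical, and it is an assumption that this particular input triggers the defect on an affected JRE. It parses a tiny fragment with the default StAX implementation and compares the attribute values with what was written.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical startup self-test; not part of JOSM. */
final class XmlParserSelfTest {

    // GOTHIC LETTER AHSA (U+10330): a supplementary character,
    // stored in a Java String as a surrogate pair (two chars).
    static final String GOTHIC = new String(Character.toChars(0x10330));
    // A CJK character from the BMP, for comparison.
    static final String CJK = "\u4E2D";

    /** Returns true if both attribute values survive StAX parsing unchanged. */
    static boolean parserPreservesSupplementaryCharacters() {
        String xml = "<node><tag k='name:zh' v='" + CJK + "'/>"
                   + "<tag k='name:got' v='" + GOTHIC + "'/></node>";
        String zh = null;
        String got = null;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "tag".equals(reader.getLocalName())) {
                        String key = reader.getAttributeValue(null, "k");
                        String value = reader.getAttributeValue(null, "v");
                        if ("name:zh".equals(key)) zh = value;
                        if ("name:got".equals(key)) got = value;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return false; // a parser that cannot read the sample is not usable either
        }
        // A compliant parser returns exactly the values that were written.
        return CJK.equals(zh) && GOTHIC.equals(got);
    }

    public static void main(String[] args) {
        System.out.println("Parser preserves supplementary characters: "
                + parserPreservesSupplementaryCharacters());
    }
}
```

If the check returns false, the application could warn the user or refuse to load and save data with that parser.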

Attachments (2)

gottic.osm (249 bytes) - added by Don-vip 8 years ago.
Sample file with two tags containing a Chinese character and a Gothic one
(2.0 KB) - added by Don-vip 8 years ago.
Program demonstrating bug with default XML parser

Change History (18)

comment:1 Changed 11 years ago by Gubaer

Here are the links to the two threads:

comment:2 Changed 11 years ago by stoecker

Priority: critical → major

comment:3 Changed 8 years ago by verdy_p

Description: modified (diff)

Note that by default Java 6/7 embed version 2.1.0 of Xerces (in a bizarre forked version), which is somewhat out of date and has limited capabilities: basically just SAX plus a limited DOM parser, since it offers only limited resolution of schemas and DTDs and is meant to work in standalone mode, without validation or full resolution of entities.

Oracle/Sun refused to upgrade it in order to preserve the existing functionality (or lack of functionality) of its API, but it should at least have been patched long ago to handle UTF-16 correctly. The intent was to let applications use their own separate versions when needed. Xerces and SAX are embedded only to support the core Java API and to secure it, by not allowing external entity resolvers based on external URLs pointing to rogue domains in some XML files: adding that would have required more security checks within the JRE core and would have slowed performance.

But with a single-jar installation of JOSM (and no other JNI components), integrating another version of Xerces would mean bundling it inside the JOSM jar instead of shipping the usual separate jar.

Xerces is not a huge package; it could be embedded in the distribution with a very modest increase in size: about 4.2MB for the compiled JAR, and less with the newer JAR format (it used to be about 6.5MB in the old JAR format before Java 6).

If needed, a streamlined version of Xerces just needed for the OSM interchange protocol would be enough (no validation, no DTDs, no parsed entities, and predefined XML schemas; possibly working only in SAX mode, the rest being performed in JOSM itself as there's no need to support the DOM API, when the internal OSM schema is enough to store the parsed documents).

Note that there are also separate issues. Independently of XML parsing, strings entered or modified in the editor also need to be validated. Supporting and validating text input in forms requires more development (for now, any attempt to paste non-BMP characters discards them from the input because all surrogates are silently rejected). If surrogates are allowed, the editor will need to check UTF-16 conformance (notably when the Delete or Backspace key is pressed, when text is pasted into the editor, or when the editor must truncate overly long strings). Finally, the data validator could include an additional Unicode conformance test (including tests for characters to avoid, such as undesired controls and non-characters like U+FFFE).
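To make the conformance checks mentioned above more concrete, here is a minimal sketch. The class `Utf16Checks` and its method names are hypothetical and not JOSM API; the sketch only covers unpaired surrogates, non-characters, and truncation that does not split a surrogate pair. Handling of undesired control characters is left out.

```java
/** Hypothetical helpers; names are illustrative, not JOSM API. */
final class Utf16Checks {

    /** True if s contains no unpaired surrogates and no Unicode non-characters. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
            int cp = s.codePointAt(i);
            if (isNoncharacter(cp)) {
                return false;
            }
            i += Character.charCount(cp); // skips both halves of a valid pair
        }
        return true;
    }

    /** U+FDD0..U+FDEF, plus the last two code points of every plane (e.g. U+FFFE, U+FFFF). */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Truncates to at most maxChars UTF-16 code units without splitting a surrogate pair. */
    static String truncate(String s, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must not be negative");
        }
        if (s.length() <= maxChars) {
            return s;
        }
        int end = maxChars;
        if (end > 0 && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // the cut would fall inside a pair, so drop the whole character
        }
        return s.substring(0, end);
    }
}
```

The same boundary rule used in `truncate` would apply to Delete and Backspace handling: the editor removes both halves of a pair or neither.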

There's another issue for rendering text using non-BMP characters: Java supports it, but only after some prior setup of supporting fonts. A basic installation of Java does not deliver many fonts; others are assumed to be provided by the underlying OS, but this may fail and would require special settings in stylesheets to map some scripts or script ranges to available fonts, or to use preferred fonts for displaying some translated "name:lang=*" or "wikipedia:lang=*" or "wikipedia=lang:*" values instead of the default font used for displaying tag values.

Finally, there remain issues when displaying some complex scripts that require OpenType features (this includes text selection by grapheme cluster, validating those clusters with a cluster boundary checker, handling BiDi text...). But this will go into another tracking bug for internationalization (needed notably to support Indic scripts: OSM currently includes mostly English names, but very few names in Hindi, Thai or Lao for example; Chinese, Japanese and Korean are much less complex to support, except non-BMP sinograms for Chinese).

There are also unsolved issues with Arabic and Hebrew, with tags containing these scripts mixed with Latin and neutral punctuation, possibly mirrored. A correct display in a single string value would require BiDi controls, but overrides and embedding controls are not a good long-term solution; only support for BiDi isolates (with FSI/PDI) will make strings easily reusable and embeddable in any context. However, FSI..PDI controls are still not standardized by Unicode; they may appear in the next release of Unicode with a major update of the BiDi algorithm, which will deprecate the legacy use of RLO..LRO and RLE/LRE..PDF controls. Here again this will require a separate tracking bug, to support this in JOSM as well as in Flash for Potlatch, online in HTML in the OSM data browser pages, and in renderers like Mapnik. And nothing will be done before Unicode, CSS, and HTML5 are updated to support BiDi isolates.

comment:4 Changed 8 years ago by Don-vip

Ticket #8354 has been marked as a duplicate of this ticket.

comment:5 in reply to:  3 Changed 8 years ago by Don-vip

Replying to verdy_p:

Note that by default Java 6/7 are embedding version 2.1.0 of Xerces

According to the JAXP web site, JAXP 1.4.6 has been embedded since Java 7 update 4 and contains (a fork of) Xerces 2.7.1 with "all blocker/critical fixes" of version 2.10.0.

If this problem is still present with the latest Java 7 update, it should be reported as a bug to the JAXP team. It should not be up to JOSM to fix this kind of issue.

Changed 8 years ago by Don-vip

Attachment: gottic.osm added

Sample file with two tags containing a Chinese character and a Gothic one

Changed 8 years ago by Don-vip

Attachment: added

Program demonstrating bug with default XML parser

comment:6 Changed 8 years ago by Don-vip

comment:7 Changed 8 years ago by verdy_p

Minor editorial bug in your sample Java code: you encode Chinese as "ch" (the language code for another language) instead of "zh"... This still does not invalidate your bug report, which is more critical than expected because it exposes data from another, unrelated attribute by concatenating them; this means severe, unexpected data corruption when handling XML files containing supplementary characters.
So Java has another bug...

comment:8 Changed 8 years ago by anonymous

For now the only workaround for us is to install a decent version of Xerces in the "lib/endorsed" subdirectory of Java to replace the default implementation in the JRE (changing the classpath, even at run-time, is not enough unless we use our own classloader to override the package from which Xerces will be loaded with another package path supplied by the application).
I still wonder why Java does not use the standard classpath for its own implementation of Xerces (notably because it also reuses the same default package name).
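When experimenting with such a workaround, it helps to see which implementation JAXP actually resolves at run time. The following is a small diagnostic sketch; the class `ParserImplementationReport` is hypothetical and only uses the standard `newInstance()` factory lookups. It does not by itself prove whether a replacement parser was picked up correctly.

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;

/** Hypothetical diagnostic; prints the parser factories JAXP resolves to. */
final class ParserImplementationReport {

    /** Class name of the SAX parser factory selected by the JAXP lookup. */
    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** Class name of the StAX input factory selected by the JAXP lookup. */
    static String staxFactoryClass() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SAX factory:  " + saxFactoryClass());
        System.out.println("StAX factory: " + staxFactoryClass());
        System.out.println("Java version: " + System.getProperty("java.version"));
    }
}
```

A class name under `com.sun.org.apache.xerces.internal` indicates the implementation bundled with the JRE rather than a separately installed Xerces.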

comment:9 Changed 8 years ago by Don-vip

No response from JAXP team, I have submitted another bug directly to

comment:10 in reply to:  9 Changed 6 years ago by bastiK

Replying to Don-vip:

No response from JAXP team, I have submitted another bug directly to

Any reaction here?

Bug is still present in Oracle Java 8u40.

comment:11 Changed 6 years ago by Don-vip

No, the situation with JAXP is obscure. It looks like it is managed by a single guy who does not do much on this library... I have sent this bug twice but it has never been accepted :( It is very frustrating to see so little support for a major standard Java library.

comment:12 Changed 6 years ago by Don-vip

I have sent this bug report for a third time...

comment:13 Changed 6 years ago by Don-vip

We should test with latest JDK9 snapshot, this change seems related:

comment:14 Changed 6 years ago by Don-vip

Keywords: javabug 9 xml unicode stax added
Resolution: othersoftware
Status: new → closed

Yep, it has been fixed in JDK9! (tested with b39). So it's definitely javabug:8058175

comment:15 Changed 5 years ago by Don-vip

Ticket #11418 has been marked as a duplicate of this ticket.

comment:16 Changed 5 years ago by Don-vip

Sometimes an e-mail is enough. The fix is going to be backported to Java 8:
