#3290 closed defect (othersoftware)
Make sure an XML parser fully supporting UTF-16 is used by JOSM
Reported by: | Gubaer | Owned by: | team |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | Core | Version: | |
Keywords: | javabug 9 xml unicode stax | Cc: |
Description (last modified by )
See the discussion on dev and josm-dev:
- there's a problem with XML parsers which don't handle UTF-16 correctly. Apparently, they insert duplicates of surrogate code points in OSM keys or values while parsing. After a couple of IO operations even small OSM files/fragments can become very large. In OSM the problem was spotted because of Gothic code points in name:got tags.
- Xerces-J 2.6.2 seems to be affected
- Xerces-J 2.9.1 seems to be OK
- Woodstox StAX XML parser seems to be OK
JOSM should either ship a compliant parser with its distribution or check/enforce on startup that a known compliant parser is on the classpath.
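A startup check along the lines suggested above could be sketched as follows. This is only a minimal illustration, not what JOSM does: the class and method names are hypothetical, it uses the standard StAX API (`javax.xml.stream`), and it assumes that a short in-memory document is enough to expose a faulty parser, which may not hold for every affected implementation.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: checks that the default StAX parser returns
// attribute values containing supplementary characters unchanged.
final class SupplementaryCharCheck {

    // U+10330 GOTHIC LETTER AHSA, stored in Java as a surrogate pair.
    static final String GOTHIC = new String(Character.toChars(0x10330));

    static boolean parserPreservesSupplementaryChars() {
        String expected = GOTHIC + GOTHIC + "a";
        String xml = "<tag k=\"name:got\" v=\"" + expected + "\"/>";
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Both attributes must come back exactly as written.
                        return "name:got".equals(reader.getAttributeValue(null, "k"))
                                && expected.equals(reader.getAttributeValue(null, "v"));
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // A parser that cannot read the sample at all is treated as non-compliant.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Parser preserves supplementary characters: "
                + parserPreservesSupplementaryChars());
    }
}
```

An application could run such a check once at startup and warn the user (or switch to a bundled parser) when it returns false.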
Attachments (2)
Change History (18)
comment:1 by , 16 years ago
comment:2 by , 16 years ago
Priority: | critical → major |
---|
follow-up: 5 comment:3 by , 12 years ago
Description: | modified (diff) |
---|
Note that by default Java 6/7 are embedding version 2.1.0 of Xerces (or rather a bizarre forked version of it), which is rather out of date and has limited capabilities (basically just SAX plus a limited DOM parser, with limited resolution of schemas and DTDs, meant to work in standalone mode without needing validation or full resolution of entities).
Oracle/Sun refused to upgrade it, just to preserve the existing functionality (or lack of functionality) of its API, but it should at least have been patched long ago to handle UTF-16 correctly. The intent was to let applications use their own separate versions when needed. Xerces and SAX are embedded only to support the core Java API and to secure it (by not allowing external entity resolvers based on external URLs pointing to rogue domains in some XML files: adding this would have required more security checks within the JRE core, and slower performance).
But with a single-jar installation of JOSM (and no other JNI components), integrating another version of Xerces would mean bundling it within the JOSM jar instead of shipping the usual separate jar.
Xerces is not a huge package; it could be embedded in the distribution with a very modest increase in size (about 4.2MB for the compiled JAR, and less if using the newer JAR format; it used to be about 6.5MB in the old JAR format before Java 6).
If needed, a streamlined version of Xerces just needed for the OSM interchange protocol would be enough (no validation, no DTDs, no parsed entities, and predefined XML schemas; possibly working only in SAX mode, the rest being performed in JOSM itself as there's no need to support the DOM API, when the internal OSM schema is enough to store the parsed documents).
Note that there are also separate issues. Independently of XML parsing, there's also the need to validate strings entered or modified in the editor. Supporting and validating text input in forms requires more development (for now any attempt to paste non-BMP characters will discard them from the input because it silently rejects all surrogates). If you allow surrogates, the editor will need to check UTF-16 conformance (notably when pressing the Delete or Backspace key, when pasting into the editor, or if the editor must truncate overly long strings). Finally, the data validator could contain an additional Unicode conformance test (including tests for characters to avoid, such as undesired controls and non-characters like U+FFFE).
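The checks described above could look roughly like the following sketch. The class and method names are hypothetical and this is not existing JOSM code. It assumes string length limits are counted in UTF-16 code units, and it covers only unpaired surrogates, non-characters, and truncation, not undesired control characters.

```java
// Hypothetical helper illustrating UTF-16 conformance checks on tag values.
final class Utf16Checks {

    // True if every surrogate in the string is part of a valid high/low pair.
    static boolean isWellFormed(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without a following low surrogate
                }
                i++; // skip the low surrogate of this pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }

    // True if the string contains a Unicode non-character:
    // U+FDD0..U+FDEF, or the last two code points of any plane (U+xFFFE, U+xFFFF).
    static boolean containsNoncharacter(String s) {
        return s.codePoints().anyMatch(
                cp -> (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF));
    }

    // Truncates to at most maxUnits UTF-16 code units without splitting a surrogate pair.
    static String truncate(String s, int maxUnits) {
        if (s.length() <= maxUnits) {
            return s;
        }
        int end = maxUnits;
        if (end > 0 && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // the cut would fall inside a pair, so drop the whole pair
        }
        return s.substring(0, end);
    }
}
```

The same pair-aware logic would apply to Delete and Backspace handling: the editor removes both code units of a pair rather than one.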
There's another issue for rendering texts using non-BMP characters: Java supports it but only with some prior setup of supporting fonts (a basic installation of Java will not deliver many fonts, others are assumed to be provided by the underlying OS, but this may fail and would require some special settings in stylesheets to map some scripts or script ranges to some available fonts, or to use preferred fonts for displaying some translated "name:lang=*" or "wikipedia:lang=*" or "wikipedia=lang:*", instead of the default font used for displaying tag values).
Finally, there remain issues when displaying some complex scripts that require OpenType features (this includes text selection in grapheme clusters, validating those clusters with a cluster boundary checker, handling BiDi texts...). But this will go into another tracking bug for internationalization (needed notably to support Indic scripts: OSM currently includes mostly English names, but very few names in Hindi, Thai or Lao for example; Chinese, Japanese and Korean are much less complex to support, except non-BMP sinograms for Chinese).
There are also unsolved issues with Arabic and Hebrew, with tags containing these scripts mixed with Latin and neutral punctuation possibly mirrored; a correct display in a single string value would require BiDi controls, but overrides and embedding controls are not a good long-term solution: only the support of BiDi isolates (with FSI/PDI) will make strings easily reusable and embeddable in any context (but FSI..PDI controls are still not standardized by Unicode; they may appear in the next release of Unicode with a major update of the BiDi algorithm, which will deprecate the legacy use of RLO..LRO and RLE/LRE..PDF controls). Here again this will require a separate tracking bug, to support this in JOSM as well as in Flash for Potlatch, or online in HTML in the OSM data browser pages, and in renderers like Mapnik. And nothing will be done before there's an update in Unicode, CSS, and HTML5 to support BiDi isolates.
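To illustrate the isolate approach mentioned above, here is a minimal sketch. It assumes the code points U+2068 (FIRST STRONG ISOLATE) and U+2069 (POP DIRECTIONAL ISOLATE), which were assigned in Unicode 6.3, after this comment was written. The class and method names are hypothetical, and whether the controls have any visible effect depends on the rendering engine.

```java
// Hypothetical helper: wraps a tag value in BiDi isolate controls so that it
// can be embedded in a larger string without affecting the surrounding text.
final class BidiIsolate {

    static final char FSI = '\u2068'; // FIRST STRONG ISOLATE
    static final char PDI = '\u2069'; // POP DIRECTIONAL ISOLATE

    static String isolate(String value) {
        return FSI + value + PDI;
    }

    public static void main(String[] args) {
        // The isolated value is embedded between Latin text and punctuation.
        String label = "name: " + isolate("\u05E9\u05DC\u05D5\u05DD") + " (ref 12)";
        System.out.println(label.length());
    }
}
```

The controls would be added only for display; the stored tag value stays unchanged.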
comment:5 by , 12 years ago
Replying to verdy_p:
Note that by default Java 6/7 are embedding version 2.1.0 of Xerces
According to the JAXP web site (http://jaxp.java.net/), JAXP 1.4.6 has been embedded since Java 7 update 4 and contains (a fork of) Xerces 2.7.1 with "all blocker/critical fixes" of version 2.10.0.
If this problem is still present with the latest Java 7 update, it should be reported as a bug to the JAXP team. It should not be up to JOSM to fix this kind of issue.
by , 12 years ago
Attachment: | gottic.osm added |
---|
Sample file with two tags containing a Chinese character and a Gothic one
comment:7 by , 12 years ago
Minor editorial bug in your sample Java code: you encode Chinese as "ch" (the language code for another language) instead of "zh"... This still does not invalidate your bug report, which is more critical than expected because it exposes data from another unrelated attribute by concatenating them. This means severe unexpected data corruption when handling XML files containing supplementary characters.
So Java has another bug...
comment:8 by , 12 years ago
For now the only workaround for us is to install a decent version of Xerces to replace the default implementation in the JRE, in the "lib/endorsed" subdirectory of Java (changing the classpath, even at run-time, is not enough, unless we use our own classloader to override the package so that Xerces is loaded from another package path supplied by the application).
I still wonder why Java does not use the standard classpath for its own implementation of Xerces (notably because it also reuses the same default package name).
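When experimenting with such workarounds, it helps to see which parser implementation the standard JAXP factories actually resolve to. The following sketch, with a hypothetical class name, only reports the factory classes; it does not change which parser is used.

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;

// Hypothetical diagnostic: reports which SAX and StAX factory implementations
// the JAXP lookup mechanism selects in the running JVM.
final class XmlParserReport {

    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    static String staxFactoryClass() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SAX factory:  " + saxFactoryClass());
        System.out.println("StAX factory: " + staxFactoryClass());
    }
}
```

Running it before and after a configuration change shows whether the replacement parser is really the one being picked up.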
follow-up: 10 comment:9 by , 12 years ago
No response from JAXP team, I have submitted another bug directly to http://bugs.sun.com/
comment:10 by , 10 years ago
Replying to Don-vip:
No response from JAXP team, I have submitted another bug directly to http://bugs.sun.com/
Any reaction here?
Bug is still present in Oracle Java 8u40.
comment:11 by , 10 years ago
No, the situation with JAXP is obscure. It looks like it is managed by a single guy who does not do much on this library... I have sent this bug twice but it has never been accepted :( This is very frustrating to see so very little support on a major standard Java library.
comment:13 by , 10 years ago
We should test with latest JDK9 snapshot, this change seems related:
http://hg.openjdk.java.net/jdk9/dev/jaxp/rev/705a3999eb77
comment:14 by , 10 years ago
Keywords: | javabug 9 xml unicode stax added |
---|---|
Resolution: | → othersoftware |
Status: | new → closed |
Yep, it has been fixed in JDK9! (tested with b39). So it's definitely javabug:8058175
comment:16 by , 10 years ago
Sometimes an e-mail is enough. The fix is going to be backported to Java 8:
http://mail.openjdk.java.net/pipermail/jdk8u-dev/2015-November/004430.html
Here are the links to the two threads: