Changes between Initial Version and Version 3 of Ticket #3290


Timestamp:
2013-01-16T04:13:09+01:00 (13 years ago)
Author:
verdy_p
Comment:

Note that by default Java 6/7 embed version 2.1.0 of Xerces (in a bizarre forked version), which is somewhat out of date and has limited capabilities: basically just SAX, plus a limited DOM parser, since it offers only limited resolution of schemas and DTDs. It is meant to work in standalone mode, without validation or full resolution of entities.

Oracle/Sun refused to upgrade it in order to preserve the existing functionality (or lack of functionality) of its API, but it should at least have been patched long ago to handle UTF-16 correctly. Leaving it unchanged was meant to allow applications to use their own separate versions when needed. Xerces and SAX are embedded only to support the core Java API and to secure it, by not allowing external entity resolvers based on external URLs that point to rogue domains in some XML files: adding that would have required more security checks within the JRE core, and would have made it slower.

But with a single-JAR installation of JOSM (and no other JNI components), integrating another version of Xerces would mean bundling it inside the JOSM JAR, instead of shipping it as the usual separate JAR.

Xerces is not a huge package; it could be embedded in the distribution with a very modest increase in size: about 4.2MB for the compiled JAR, and less with the newer JAR format (it used to be about 6.5MB in the old JAR format, before Java 6).

If needed, a streamlined version of Xerces containing only what the OSM interchange protocol requires would be enough: no validation, no DTDs, no parsed entities, and predefined XML schemas. It could possibly work only in SAX mode, with the rest performed in JOSM itself, as there is no need to support the DOM API when the internal OSM schema is enough to store the parsed documents.
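
As a rough illustration of the "SAX only, no validation, no DTDs" mode described above, here is a minimal sketch using the standard JAXP API. The class name `MinimalOsmSax` and the method `countNodes` are hypothetical and not part of JOSM; the `disallow-doctype-decl` feature is a Xerces feature that the JDK's bundled parser recognizes, but another parser on the classpath might not.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only (not JOSM code).
final class MinimalOsmSax {
    /** Counts <node> elements with a non-validating SAX parser that rejects any DOCTYPE. */
    static int countNodes(String xml) {
        final int[] count = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(false);
            factory.setValidating(false);
            // No DOCTYPE means no DTD and no user-defined entities to resolve.
            // Assumption: the parser in use supports this Xerces feature.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if ("node".equals(qName)) {
                        count[0]++;
                    }
                }
            });
        } catch (Exception e) {
            // Parse errors and unsupported features are reported the same way here.
            throw new IllegalArgumentException("cannot parse document", e);
        }
        return count[0];
    }
}
```

Everything beyond element and attribute events (building primitives, checking references) would then stay in the application, which is the division of labour suggested above.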

Note that there are also separate issues. Independently of XML parsing, there is also the need to validate strings entered or modified in the editor. Supporting and validating text input in forms requires more development (for now, any attempt to paste non-BMP characters discards them from the input, because it silently rejects all surrogates). If surrogates are allowed, the editor will need to check UTF-16 conformance (notably when the Delete or Backspace key is pressed, when pasting into the editor, or if the editor must truncate strings that are too long). Finally, the data validator could contain an additional Unicode conformance test (including tests for characters to avoid, such as undesired controls and non-characters like U+FFFE).
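
The checks mentioned above can be sketched as follows. This is only an illustration: the class `Utf16Checks` and its methods are hypothetical names, not existing JOSM code, and the sketch covers unpaired surrogates, non-characters and safe truncation only (not undesired control characters).

```java
// Hypothetical helper, for illustration only (not JOSM code).
final class Utf16Checks {
    /** Returns true if s contains no unpaired surrogate and no Unicode non-character. */
    static boolean isClean(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // codePointAt returns the surrogate itself when it is not part of a valid pair.
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                return false;
            }
            // Non-characters: U+FDD0..U+FDEF and the last two code points of each plane.
            if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Truncates s to at most maxUnits UTF-16 code units without splitting a surrogate pair. */
    static String truncate(String s, int maxUnits) {
        if (s.length() <= maxUnits) {
            return s;
        }
        int end = maxUnits;
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // the cut would fall inside a pair: drop the whole pair instead
        }
        return s.substring(0, end);
    }
}
```

The same pair-aware logic would apply to Delete and Backspace: the editor should remove both code units of a pair, never just one.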

There's another issue with rendering text that uses non-BMP characters: Java supports it, but only after some prior setup of supporting fonts. A basic installation of Java does not deliver many fonts; the others are assumed to be provided by the underlying OS, but this may fail, and it would require special settings in stylesheets to map some scripts or script ranges to available fonts, or to use preferred fonts for displaying translated "name:lang=*" or "wikipedia:lang=*" or "wikipedia=lang:*" values, instead of the default font used for displaying tag values.

Finally, there remain issues when displaying some complex scripts that require OpenType features (this includes text selection by grapheme clusters, validating those clusters with a cluster boundary checker, handling BiDi text...). But this will go into another tracking bug for internationalization, needed notably to support Indic scripts: OSM currently includes mostly English names, but very few names in Hindi, Thai or Lao for example. Chinese, Japanese and Korean are much less complex to support, except for non-BMP sinograms in Chinese.
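
For the grapheme cluster part, a minimal sketch of a cluster-aware Backspace using the JDK's `java.text.BreakIterator` might look like this. The class `ClusterEdit` is a hypothetical name, and the JDK's boundary rules may not match the current Unicode segmentation rules for every script, so this is a starting point rather than a complete solution for Indic text.

```java
import java.text.BreakIterator;

// Hypothetical helper, for illustration only (not JOSM code).
final class ClusterEdit {
    /** Removes the last user-perceived character, as delimited by BreakIterator. */
    static String backspace(String s) {
        if (s.isEmpty()) {
            return s;
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        it.last();                   // position at the end of the text
        int start = it.previous();   // boundary before the last cluster
        return s.substring(0, start == BreakIterator.DONE ? 0 : start);
    }
}
```

With this approach a base letter followed by a combining mark, or a surrogate pair, is deleted as one unit instead of leaving a stray mark or a lone surrogate behind.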

There are also unsolved issues with Arabic and Hebrew, with tags containing these scripts mixed with Latin and with neutral punctuation that may be mirrored. A correct display in a single string value would require BiDi controls, but overrides and embedding controls are not a good long-term solution; only support for BiDi isolates (with FSI/PDI) will make strings easily reusable and embeddable in any context. However, the FSI..PDI controls are still not standardized by Unicode; they may appear in the next release of Unicode with a major update of the BiDi algorithm, which will deprecate the legacy use of the RLO..LRO and RLE/LRE..PDF controls. Here again this will require a separate tracking bug, to support this in JOSM as well as in Flash for Potlatch, online in HTML in the OSM data browser pages, and in renderers like Mapnik. And nothing will be done before Unicode, CSS and HTML5 are updated to support BiDi isolates.
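
Until isolates are available, a validator could at least flag the tag values that are affected. The sketch below uses the JDK's `java.text.Bidi` class for that; `BidiProbe` and its methods are hypothetical names, and the sketch only detects mixed-direction values, it does not try to fix their display.

```java
import java.text.Bidi;

// Hypothetical helper, for illustration only (not JOSM code).
final class BidiProbe {
    /** Returns true if the value contains characters that need the BiDi algorithm at all. */
    static boolean needsBidi(String value) {
        char[] chars = value.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    /** Returns true if the value mixes left-to-right and right-to-left runs. */
    static boolean isMixedDirection(String value) {
        if (value.isEmpty()) {
            return false;
        }
        // The base direction is taken from the first strong character, defaulting to LTR.
        Bidi bidi = new Bidi(value, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.isMixed();
    }
}
```

Values for which `isMixedDirection` returns true are the ones whose rendering depends on the surrounding context, which is where the isolate controls discussed above would eventually help.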
