Earthly Powers
A ramble on characters in XML documents
When I was a rookie XML 1.0 user, I was not aware that there were restrictions on the characters allowed in element/attribute names, text content and attribute values. It raised a mild eyebrow when I found out and looked more closely at the W3C XML 1.0 Recommendation! XML's foundation is Unicode characters, right? So why the subset?
Take, for example, the range specified for a character that is allowed as part of text content:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
So that means no 'control' characters below #x20, like 'NULL' or 'BELL', apart from tab, line feed and carriage return (see here for a good description of the issues, which may explain why control characters were disallowed in XML 1.0). A character code of '0' is not allowed as part of the text content of an XML document. I think this makes sense from the perspective of C/C++, since '0' is used as the terminator for strings, and allowing it would cause all sorts of issues.
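The Char production above translates almost directly into code. Here is a minimal sketch in Java; the class and method names (Xml10Chars, isXml10Char, firstIllegalIndex) are my own for illustration and do not come from any library.

```java
// Sketch: membership test for the XML 1.0 Char production.
// Class and method names are illustrative, not part of any library.
final class Xml10Chars {
    private Xml10Chars() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first disallowed code point, or -1 if there is none. */
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // handles surrogate pairs
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

A check like this could be run over text before it is written out as XML 1.0 content, so that a stray 'NULL' or 'BELL' is caught early rather than by whoever parses the document later.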
So the following XML document is not well-formed:
<element>&#0;</element>
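A parser will confirm this. The sketch below assumes the document refers to character code '0' through a numeric character reference, and uses the JAXP parser bundled with the JDK; the class and method names (NullCharDemo, parses) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Sketch: ask the JDK's built-in XML parser whether a document is well-formed.
// Class and method names are illustrative only.
final class NullCharDemo {
    private NullCharDemo() {}

    /** Returns true if the parser accepts the document, false on a parse error. */
    static boolean parses(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports a fatal error for documents that are not well-formed.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parses("<element>&#0;</element>")); // reference to code 0
        System.out.println(parses("<element>&#9;</element>")); // reference to tab
    }
}
```

The default error handler may also print the parser's own diagnostic to standard error before the exception is caught.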
For more information on this I highly recommend looking at Tim Bray's most excellent annotated XML 1.0.
Having said that, the W3C XML 1.1 Recommendation opened the door for 'control' characters! The character range of a character is now:
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
The 'NULL' character code is still disallowed. I think XML 1.1 is an improvement on XML 1.0, since XML 1.0 restricted the character codes that were allowed in element/attribute names. Now more languages can be used for element/attribute names.
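Going back to the XML 1.1 Char production, the difference from XML 1.0 is easy to see in code. Again this is only a sketch, with a class and method name (Xml11Chars, isXml11Char) made up for illustration:

```java
// Sketch: membership test for the XML 1.1 Char production.
// Class and method names are illustrative, not part of any library.
final class Xml11Chars {
    private Xml11Chars() {}

    /** Returns true if the code point matches the XML 1.1 Char production. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

With this version 'BELL' (code 7) falls inside the first range, while 'NULL' (code 0) is still outside every range.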
Posted at 12:15PM Jun 05, 2006 by Paul Sandoz in General | Comments[0]