Tucu's Weblog
[Alejandro Abdelnur]
  I don't contradict myself,
  I just change my mind.

Friday September 10, 2004

Detecting XML charset encoding, getting it right (or at least trying to)

I was trying to get the charset encoding detection right for some Rome samples and ran into a few nits. A couple of quick fixes didn't fly. It became clear to me that this was about more than fixing things for the samples: it's a problem everybody using XML data over HTTP transport has to deal with.

Current Java libraries and utilities (that I'm aware of) don't do this by default. If you grab a JAXP SAX parser, it will detect the charset encoding of an XML document by looking at the first bytes of the stream, as defined in Section 4.3.3 and Appendix F.1 of the XML 1.0 (Third Edition) specification. But Appendix F.1 is non-normative, and not all Java XML parsers implement it right now. For example, the JAXP SAX parser implementation in J2SE 1.4.2 does not handle Appendix F.1, and Xerces 2.6.2 just added support for it.
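
To make the "first bytes of the stream" idea concrete, here is a minimal sketch of the two byte-level checks, restricted to the UTF-8 and UTF-16 families this entry deals with. The class and method names (XmlSniffer, detectBom, guessFromProlog) are made up for illustration; this is not what any particular parser does, and Appendix F.1 covers more encodings than these.

```java
/** Hypothetical helper (not part of JAXP or Rome): sniffs the first bytes of an XML stream. */
final class XmlSniffer {

    /** Returns "UTF-8", "UTF-16BE" or "UTF-16LE" if the data starts with that BOM, else null. */
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE"; // assumption: UTF-32 is ignored
        return null;
    }

    /**
     * Guesses the encoding family from the byte layout of a leading "<?xm" or "<?".
     * The caller is assumed to have skipped any BOM first. Returns null if nothing matches.
     */
    static String guessFromProlog(byte[] head) {
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";
        if (startsWith(head, 0x3C, 0x3F, 0x78, 0x6D)) return "UTF-8"; // or any ASCII-compatible charset
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Note that the "UTF-8" guess really means "some ASCII-compatible encoding"; only the encoding declaration in the prolog can narrow it down further.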

Still, this does not solve the whole problem. The (JAXP SAX) XML parsers are not aware of the HTTP transport rules for charset encoding resolution as defined by RFC 3023. They are not because they operate on a byte stream or a char stream without any knowledge of the stream transport protocol, HTTP in this case.
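
The piece the parser never sees is the Content-Type header. As a rough sketch of what has to be pulled out of it (the MIME type and the optional charset parameter), something like the following would do. ContentTypeInfo and its members are hypothetical names, and the XML type checks are my reading of the types RFC 3023 registers, so treat them as an approximation.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the parts of a Content-Type header relevant to XML. */
final class ContentTypeInfo {
    private static final Pattern CHARSET =
            Pattern.compile(";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    final String mime;    // lower-cased MIME type, or null if the header is missing or empty
    final String charset; // upper-cased charset parameter, or null if absent

    ContentTypeInfo(String header) {
        String h = (header == null) ? "" : header;
        int semi = h.indexOf(';');
        String type = (semi < 0 ? h : h.substring(0, semi)).trim().toLowerCase(Locale.ROOT);
        mime = type.isEmpty() ? null : type;
        Matcher m = CHARSET.matcher(h);
        charset = m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    /** application/xml and friends, plus application/*+xml (approximation of RFC 3023). */
    boolean isAppXml() {
        return mime != null && (mime.equals("application/xml")
                || mime.equals("application/xml-dtd")
                || mime.equals("application/xml-external-parsed-entity")
                || (mime.startsWith("application/") && mime.endsWith("+xml")));
    }

    /** text/xml and friends, plus text/*+xml (approximation of RFC 3023). */
    boolean isTextXml() {
        return mime != null && (mime.equals("text/xml")
                || mime.equals("text/xml-external-parsed-entity")
                || (mime.startsWith("text/") && mime.endsWith("+xml")));
    }
}
```

For instance, new ContentTypeInfo("application/atom+xml; charset=utf-8") would report an application XML type with charset "UTF-8", while a bare "text/xml" has no charset at all.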

As I was trying to solve this problem, folks from within Sun and from Rome (whom I was bugging with my half-baked questions and solutions) pointed me to a couple of nice articles by Mark Pilgrim: "Determining the character encoding of a feed" on his blog and "XML on the Web Has Failed" on XML.com.

Let's begin with a raw XML stream (i.e. reading an XML document from a file). Following Section 4.3.3 and Appendix F.1 of the XML 1.0 specification, the charset encoding of an XML document is determined as follows:

BOM         : Byte Order Mark, UTF-8, UTF-16BE or UTF-16LE if present, NULL otherwise
XMLGuessEnc : best guess using the byte representation of the first characters in '<?xml ...' 
              (it can be UTF-8, UTF-16BE or UTF-16LE)
XMLEnc      : value of the <?xml encoding="..."?> if present, NULL otherwise

if BOM is NULL and XMLEnc is NULL
    if  XMLGuessEnc is ('UTF-16BE' or 'UTF-16LE')
       ERROR (encoding mismatch)                                    [#1.0]
    else
       encoding is 'UTF-8'                                          [#1.1]

if BOM is NULL and XMLEnc is ('UTF-8' or 'UTF-16BE' or 'UTF-16LE')
    ERROR (XML requires BOM for 'UTF-16*' charsets)                 [#1.2]

if BOM is NULL 
    encoding is XMLEnc                                              [#1.3]

if BOM is not NULL and XMLEnc is NULL
    if BOM is XMLGuessEnc
        encoding is BOM                                             [#1.4]
    else
        ERROR (encoding mismatch)                                   [#1.5]

if BOM is not NULL and XMLEnc is not NULL 
    if BOM is XMLGuessEnc and 
       ((BOM is 'UTF-16*' and XMLEnc is 'UTF-16') or XMLEnc is BOM)
        encoding is BOM                                             [#1.6]
    else
        ERROR (encoding mismatch)                                   [#1.7]
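
For readers who prefer code, the rules above can be transcribed into Java roughly as follows. This is a literal sketch of rules #1.0 to #1.7 as posted, not Rome's actual implementation; the class name RawXmlEncoding, the exception type and the method signature are hypothetical, and the three inputs are assumed to be already computed and upper-cased.

```java
/** Hypothetical transcription of rules #1.0 to #1.7 for a raw XML stream. */
final class RawXmlEncoding {

    /** Signals one of the ERROR outcomes; the message starts with the rule number. */
    static final class EncodingException extends RuntimeException {
        EncodingException(String rule, String message) {
            super(rule + ": " + message);
        }
    }

    /** bom, guess (XMLGuessEnc) and xmlEnc may each be null, as in the pseudocode. */
    static String resolve(String bom, String guess, String xmlEnc) {
        if (bom == null && xmlEnc == null) {
            if (isUtf16Endian(guess)) {
                throw new EncodingException("#1.0", "encoding mismatch");
            }
            return "UTF-8";                                        // #1.1
        }
        if (bom == null) {
            // #1.2 is transcribed as posted, including 'UTF-8';
            // the comments on this entry dispute this rule.
            if (xmlEnc.equals("UTF-8") || isUtf16Endian(xmlEnc)) {
                throw new EncodingException("#1.2", "BOM required");
            }
            return xmlEnc;                                         // #1.3
        }
        // From here on a BOM is present.
        if (xmlEnc == null) {
            if (bom.equals(guess)) {
                return bom;                                        // #1.4
            }
            throw new EncodingException("#1.5", "encoding mismatch");
        }
        boolean declared16 = bom.startsWith("UTF-16") && xmlEnc.equals("UTF-16");
        if (bom.equals(guess) && (declared16 || xmlEnc.equals(bom))) {
            return bom;                                            // #1.6
        }
        throw new EncodingException("#1.7", "encoding mismatch");
    }

    private static boolean isUtf16Endian(String enc) {
        return "UTF-16BE".equals(enc) || "UTF-16LE".equals(enc);
    }
}
```

With this sketch, a document with no BOM and no encoding declaration resolves to UTF-8, and a document with a UTF-16BE BOM that declares encoding="UTF-16" resolves to UTF-16BE.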

Now for the case where the XML document is served over HTTP. Following Section 4.3.3, Appendix F.1 and Appendix F.2 of the XML 1.0 specification, plus RFC 3023, the charset encoding of an XML document served over HTTP is determined as follows:

ContentType : Content-Type HTTP header
CTMime      : MIME type defined in the ContentType
CTEnc       : charset encoding defined in the ContentType, NULL otherwise
BOM         : Byte Order Mark, UTF-8, UTF-16BE or UTF-16LE if present, NULL otherwise
XMLGuessEnc : best guess using the byte representation of the first characters in '<?xml ...' 
              (it can be UTF-8, UTF-16BE or UTF-16LE)
XMLEnc      : value of the <?xml encoding="..."?> if present, NULL otherwise
APP-XML     : RFC 3023 defined 'application/*xml' types
TEXT-XML    : RFC 3023 defined 'text/*xml' types

if (CTEnc is 'UTF-16BE' or 'UTF-16LE') and BOM is not NULL
  ERROR                                                             [#2.0]

if CTMime is APP-XML and CTEnc is NULL
  use raw XML stream charset encoding detection logic               [#2.1]

if CTMime is TEXT-XML and CTEnc is NULL
  encoding is 'US-ASCII'                                            [#2.2]

if CTMime is (APP-XML or TEXT-XML) and CTEnc is 'UTF-16'
  if BOM is 'UTF-16BE' or 'UTF-16LE'
    encoding is BOM                                                 [#2.3]
  else if BOM is NULL and XMLGuessEnc is ('UTF-16BE' or 'UTF-16LE')
    encoding is XMLGuessEnc                                         [#2.4]
  else
    ERROR (encoding mismatch)                                       [#2.5]

if CTMime is (APP-XML or TEXT-XML) 
  encoding is CTEnc                                                 [#2.6]

ERROR (undefined CTMime handling for XML documents)                 [#2.7]
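
The HTTP rules can be sketched in Java the same way. Again this is only a transcription of rules #2.0 to #2.7 as posted, under a few assumptions: the names HttpXmlEncoding, Kind and resolve are hypothetical, the inputs are already extracted and upper-cased, and the raw stream detection of rule #2.1 is passed in as a Supplier so the block stays self-contained.

```java
import java.util.function.Supplier;

/** Hypothetical transcription of rules #2.0 to #2.7 for XML served over HTTP. */
final class HttpXmlEncoding {

    /** Classification of CTMime: APP-XML, TEXT-XML or anything else. */
    enum Kind { APP_XML, TEXT_XML, OTHER }

    /**
     * ctEnc, bom and guess (XMLGuessEnc) may be null, as in the pseudocode.
     * rawDetection is only consulted for rule #2.1.
     */
    static String resolve(Kind kind, String ctEnc, String bom, String guess,
                          Supplier<String> rawDetection) {
        if (isUtf16Endian(ctEnc) && bom != null) {
            throw new IllegalStateException("#2.0: BOM present with " + ctEnc);
        }
        if (kind == Kind.APP_XML && ctEnc == null) {
            return rawDetection.get();                             // #2.1
        }
        if (kind == Kind.TEXT_XML && ctEnc == null) {
            return "US-ASCII";                                     // #2.2
        }
        boolean xml = (kind != Kind.OTHER);
        if (xml && "UTF-16".equals(ctEnc)) {
            if (isUtf16Endian(bom)) {
                return bom;                                        // #2.3
            }
            if (bom == null && isUtf16Endian(guess)) {
                return guess;                                      // #2.4
            }
            throw new IllegalStateException("#2.5: encoding mismatch");
        }
        if (xml) {
            return ctEnc;                                          // #2.6
        }
        throw new IllegalStateException("#2.7: not an XML MIME type");
    }

    private static boolean isUtf16Endian(String enc) {
        return "UTF-16BE".equals(enc) || "UTF-16LE".equals(enc);
    }
}
```

The case worth noticing is #2.2: a text/xml response without a charset parameter resolves to US-ASCII no matter what the document itself declares.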

All this logic, both raw and HTTP charset encoding detection of XML documents, will be available in Rome in the upcoming 0.4 release (it is currently available from CVS). The class implementing this logic is com.syndication.io.XmlReader, a subclass of java.io.Reader. An XmlReader can be created using a File, an InputStream, a URL, a URLConnection or an InputStream with a Content-Type String. Passing this reader to the Rome input classes (or to any XML parser) will take care of the charset encoding detection.
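
To illustrate why a Reader is a convenient place for this, here is a stripped-down sketch of the general idea: peek at the first bytes, decide on a charset, push the unread bytes back and hand out a java.io.Reader that decodes correctly. It only looks at the BOM and otherwise assumes UTF-8, so it covers far less than the rules above; BomAwareReader is a made-up name and this is not the Rome class.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical, BOM-only illustration of an encoding-detecting Reader factory. */
final class BomAwareReader {

    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {              // read up to 3 bytes, tolerating short reads
            int r = pb.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        Charset cs = StandardCharsets.UTF_8;   // assumption: UTF-8 when there is no BOM
        int skip = 0;                          // number of BOM bytes to drop
        if (n >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
            skip = 3;
        } else if (n >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            skip = 2;
        } else if (n >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            skip = 2;
        }
        pb.unread(head, skip, n - skip);       // give back everything that was not BOM
        return new InputStreamReader(pb, cs);
    }

    private static int u(byte b) {
        return b & 0xFF;
    }
}
```

Because the result is a plain Reader, it can be wrapped in an org.xml.sax.InputSource and handed to a SAX parser, which then works on characters and no longer has to guess the encoding.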

Ideally all this should be built into the (JAXP SAX) XML parser; I'd say in the InputSource class. I'll check with the Java XML folks to see what they think.

If you think there is an error somewhere in this logic, please let us know by dropping an email to the Rome mailing list, dev@rome.dev.java.net.

(2004-09-10 23:32:00.0)

Comments:

I think there is a typo in the if in #1.2. It should not include UTF-8. IOW, I don't think a BOM is needed for UTF-8, as it is not meaningful (no 8BE and 8LE variants there). #2.0 looks like #1.2 (without UTF-8) expressed in negative logic, and there is an extra not there, if I can make sense of the "intent" of the code. This is another typo. (i.e., that UTF-16* streams *require* BOM)

Posted by Santiago Gala on September 11, 2004 at 02:10 AM PDT #

I disagree with results [1.0] and [1.2] (and possibly more, I stopped reading there). BOM is not required by XML for any character set. The only possible encoding mismatches are those where XMLEnc declares an encoding that is clearly impossible given the BOM or XMLGuessEnc.

Posted by Bob Foster on September 12, 2004 at 09:11 AM PDT #
