Monday September 27, 2004
Tucu's Weblog [Alejandro Abdelnur] I don't contradict myself, I just change my mind.
We've just released version 0.4 of Rome and Rome Fetcher. They are still marked as Alpha, but we consider them stable enough for serious use; we just want to do a sanity check (mostly of class, interface, method and package names) before we go to a Beta release, which we hope will be the next one. We've been busy with this one: just check the Rome Changes Log and the Rome Fetcher Changes Log, 35 entries. A few highlights:

Detecting XML charset encoding, again
I've got some comments (and corrections) on my previous posting, Detecting XML charset encoding, getting it right (or at least trying to). Based on the feedback received I've rewritten the algorithms (they are simpler now and, hopefully, correct). They are already implemented in Rome: currently they are in CVS and they will be part of the upcoming release. As before, comments are welcome.

Raw XML charset encoding detection
Detecting the charset encoding of an XML document without external information (e.g. reading an XML document from a file):

    BOMEnc      : byte order mark. Possible values: 'UTF-8', 'UTF-16BE',
                  'UTF-16LE' or NULL
    XMLGuessEnc : best guess using the byte representation of the first
                  bytes of XML declaration ('<?xml...?>') if present.
                  Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
    XMLEnc      : encoding in the XML declaration ('<?xml encoding="..."?>').
                  Possible values: anything or NULL

    if BOMEnc is NULL
        if XMLGuessEnc is NULL or XMLEnc is NULL
            encoding is 'UTF-8'                                                [1.0]
        else
        if XMLEnc is 'UTF-16' and (XMLGuessEnc is 'UTF-16BE' or XMLGuessEnc is 'UTF-16LE')
            encoding is XMLGuessEnc                                            [1.1]
        else
            encoding is XMLEnc                                                 [1.2]
    else
    if BOMEnc is 'UTF-8'
        if XMLGuessEnc is not NULL and XMLGuessEnc is not 'UTF-8'
            ERROR, encoding mismatch                                           [1.3]
        if XMLEnc is not NULL and XMLEnc is not 'UTF-8'
            ERROR, encoding mismatch                                           [1.4]
        encoding is 'UTF-8'
    else
    if BOMEnc is 'UTF-16BE' or BOMEnc is 'UTF-16LE'
        if XMLGuessEnc is not NULL and XMLGuessEnc is not BOMEnc
            ERROR, encoding mismatch                                           [1.5]
        if XMLEnc is not NULL and XMLEnc is not 'UTF-16' and XMLEnc is not BOMEnc
            ERROR, encoding mismatch                                           [1.6]
        encoding is BOMEnc
    else
        ERROR, cannot happen given BOMEnc possible values (see above)          [1.7]
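
For readers who prefer something they can run, here is a minimal Python sketch of the decision logic above. It is not the Rome implementation; the function name `resolve_raw_encoding` and the `EncodingMismatch` exception are made up for illustration, and the sketch assumes the three inputs have already been computed and normalized to upper case (or `None` for NULL).

```python
from typing import Optional

UTF16_FIXED = ("UTF-16BE", "UTF-16LE")


class EncodingMismatch(ValueError):
    """Hypothetical error type: BOM, guessed and declared encodings disagree."""


def resolve_raw_encoding(bom_enc: Optional[str],
                         xml_guess_enc: Optional[str],
                         xml_enc: Optional[str]) -> str:
    """Pick the charset of a raw XML stream following cases [1.0] to [1.7]."""
    if bom_enc is None:
        if xml_guess_enc is None or xml_enc is None:
            return "UTF-8"                                   # [1.0]
        if xml_enc == "UTF-16" and xml_guess_enc in UTF16_FIXED:
            return xml_guess_enc                             # [1.1]
        return xml_enc                                       # [1.2]
    if bom_enc == "UTF-8":
        if xml_guess_enc is not None and xml_guess_enc != "UTF-8":
            raise EncodingMismatch("BOM vs guessed encoding")    # [1.3]
        if xml_enc is not None and xml_enc != "UTF-8":
            raise EncodingMismatch("BOM vs declared encoding")   # [1.4]
        return "UTF-8"
    if bom_enc in UTF16_FIXED:
        if xml_guess_enc is not None and xml_guess_enc != bom_enc:
            raise EncodingMismatch("BOM vs guessed encoding")    # [1.5]
        if xml_enc is not None and xml_enc not in ("UTF-16", bom_enc):
            raise EncodingMismatch("BOM vs declared encoding")   # [1.6]
        return bom_enc
    # [1.7] unreachable while BOMEnc is limited to UTF-8 and UTF-16BE/LE
    raise ValueError("unsupported BOM encoding: %r" % bom_enc)
```

For example, `resolve_raw_encoding(None, "UTF-16LE", "UTF-16")` takes branch [1.1] and yields the guessed byte order, while a UTF-8 BOM combined with a declared `ISO-8859-1` raises the mismatch error of [1.4].
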

Byte Order Mark encoding and XML guessed encoding detection rules are explained in the XML 1.0 Third Edition, Appendix F.1. Note that in this algorithm BOMEnc and XMLGuessEnc are restricted to UTF-8 and UTF-16* encodings.
#1.0. There is no BOM, and either no encoding or encoding family can be guessed from the first bytes in the stream or the XML declaration carries no encoding; the algorithm defaults to UTF-8.
#1.1. Strictly following XML 1.0 Third Edition Section 4.3.3, Character Encoding in Entities (2nd paragraph), no BOM and a UTF-16 XML declaration encoding is an error: the BOM is required to identify whether the encoding is BE or LE. But if XMLEnc was read, it means that the byte order of the stream was guessed from the first bytes in the stream; note that this is possible only if the document starts with an XML declaration. The logic verifies that there is no conflicting encoding information and relaxes the BOM requirement if an XML declaration (which allows guessing the byte order) is present.
#1.2. XMLEnc is present and it is not UTF-16 (although it can be UTF-16BE or UTF-16LE); the guessed encoding is used only to read the encoding in the XML declaration. Detecting encoding mismatches here would require awareness of charset families; instead the algorithm relies on the charset encoding routines that will process the stream to discover and report mismatches. IMPORTANT: To handle other encodings (e.g. the UCS-4 or EBCDIC encoding families) the algorithm should be extended here.
#1.3, #1.4, #1.5, #1.6. There is an explicit encoding mismatch.
#1.7. Given the currently assumed BOMEnc values this case cannot happen. IMPORTANT: To handle other encodings (e.g. the UCS-4 or EBCDIC encoding families) the algorithm should be extended here.

HTTP XML charset encoding detection
Detecting the charset encoding of an XML document with external information provided by HTTP:

    ContentType : Content-Type HTTP header
    CTMime      : MIME type defined in the ContentType
    CTEnc       : charset encoding defined in the ContentType,
                  NULL otherwise
    BOMEnc      : byte order mark. Possible values: 'UTF-8', 'UTF-16BE',
                  'UTF-16LE' or NULL
    XMLGuessEnc : best guess using the byte representation of the first
                  bytes of XML declaration ('<?xml...?>') if present.
                  Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
    XMLEnc      : encoding in the XML declaration ('<?xml encoding="..."?>').
                  Possible values: anything or NULL
    APP-XML     : RFC 3023 defined 'application/*xml' types
    TEXT-XML    : RFC 3023 defined 'text/*xml' types

    if CTMime is APP-XML or CTMime is TEXT-XML
        if CTEnc is NULL
            if CTMime is APP-XML
                encoding is determined using the Raw XML charset encoding detection algorithm  [2.0]
            else
            if CTMime is TEXT-XML
                encoding is 'US-ASCII'                                         [2.1]
        else
        if (CTEnc is 'UTF-16BE' or CTEnc is 'UTF-16LE') and BOMEnc is not NULL
            ERROR, RFC 3023 explicitly forbids this                            [2.2]
        else
        if CTEnc is 'UTF-16'
            if BOMEnc is 'UTF-16BE' or BOMEnc is 'UTF-16LE'
                encoding is BOMEnc                                             [2.3]
            else
                ERROR, missing BOM or encoding mismatch                        [2.4]
        else
            encoding is CTEnc                                                  [2.5]
    else
        ERROR, handling for other MIME types is undefined                      [2.6]
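
The HTTP rules can be sketched the same way. This is a rough Python illustration, not Rome's code: `resolve_http_encoding`, `is_app_xml`, `is_text_xml` and the `raw_fallback` callback are hypothetical names, and my reading of "RFC 3023 defined" types as the listed names plus any `+xml` suffix is an assumption. Inputs are expected to be normalized already (MIME type in lower case, encodings in upper case, `None` for NULL).

```python
from typing import Callable, Optional

UTF16_FIXED = ("UTF-16BE", "UTF-16LE")


def is_app_xml(mime: str) -> bool:
    """APP-XML: application/xml and friends, or any application/*+xml type."""
    named = ("application/xml", "application/xml-dtd",
             "application/xml-external-parsed-entity")
    return mime in named or (mime.startswith("application/")
                             and mime.endswith("+xml"))


def is_text_xml(mime: str) -> bool:
    """TEXT-XML: text/xml and friends, or any text/*+xml type."""
    named = ("text/xml", "text/xml-external-parsed-entity")
    return mime in named or (mime.startswith("text/") and mime.endswith("+xml"))


def resolve_http_encoding(ct_mime: Optional[str],
                          ct_enc: Optional[str],
                          bom_enc: Optional[str],
                          raw_fallback: Callable[[], str]) -> str:
    """Pick the charset of an XML document served over HTTP, cases [2.0]-[2.6].

    raw_fallback stands in for the raw detection algorithm; it is only
    called for case [2.0].
    """
    mime = ct_mime or ""
    app_xml, text_xml = is_app_xml(mime), is_text_xml(mime)
    if not (app_xml or text_xml):
        raise ValueError("undefined handling for MIME type %r" % ct_mime)  # [2.6]
    if ct_enc is None:
        if app_xml:
            return raw_fallback()                                          # [2.0]
        return "US-ASCII"                                                  # [2.1]
    if ct_enc in UTF16_FIXED and bom_enc is not None:
        raise ValueError("BOM not allowed with %s" % ct_enc)               # [2.2]
    if ct_enc == "UTF-16":
        if bom_enc in UTF16_FIXED:
            return bom_enc                                                 # [2.3]
        raise ValueError("missing BOM or encoding mismatch")               # [2.4]
    return ct_enc                                                          # [2.5]
```

Passing the raw algorithm in as a callback keeps the two decision tables independent, so the HTTP rules can be exercised without touching the stream.
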

Byte Order Mark encoding and XML guessed encoding detection rules are explained in the XML 1.0 Third Edition, Appendix F.1. Note that in this algorithm BOMEnc and XMLGuessEnc are restricted to UTF-8 and UTF-16* encodings.
#2.0. The HTTP content type declares a MIME type of application type and XML sub-type. There is no HTTP Content-Type charset encoding information. The charset encoding is determined using the Raw XML charset encoding detection algorithm. Refer to RFC 3023 Section 3.2, Application/xml Registration.
#2.1. The HTTP content type declares a MIME type of text type and XML sub-type. There is no HTTP Content-Type charset encoding information. The charset encoding is US-ASCII. Refer to RFC 3023 Section 3.1, Text/xml Registration.
#2.2. For text-XML and application-XML MIME types, if the HTTP content type declares 'UTF-16BE' or 'UTF-16LE' as the charset encoding, a BOM is prohibited. Refer to RFC 3023 Section 4, The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset.
#2.3. For text-XML and application-XML MIME types, if the HTTP content type declares 'UTF-16' as the charset encoding, a BOM is required and it must be used to determine the byte order. Refer to RFC 3023 Section 4, The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset.
#2.4. For text-XML and application-XML MIME types, if the HTTP content type declares 'UTF-16' as the charset encoding, a BOM must be present and it must be 'UTF-16BE' or 'UTF-16LE'. Refer to RFC 3023 Section 4, The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset.
#2.5. For text-XML and application-XML MIME types, if the HTTP content type declares a charset encoding other than the UTF-16 variants, that charset encoding is used; no other special handling is required.
#2.6. There is no defined logic to determine the charset encoding based on the MIME type given by the HTTP content type.
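
Since both algorithms take BOMEnc and XMLGuessEnc as inputs, here is a small Python sketch of how they could be obtained from the first bytes of the stream, using the byte patterns listed in Appendix F.1 restricted to UTF-8 and UTF-16. The function names `detect_bom` and `guess_from_declaration` are hypothetical, and the sketch assumes the caller hands in at least the first four bytes of the stream.

```python
from typing import Optional


def detect_bom(head: bytes) -> Optional[str]:
    """Return BOMEnc for the leading bytes of a stream, or None if no BOM.

    Only the UTF-8 and UTF-16 marks are recognized; a UCS-4 little-endian
    mark (FF FE 00 00) would be misread as UTF-16LE by this sketch.
    """
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None


def guess_from_declaration(head: bytes) -> Optional[str]:
    """Return XMLGuessEnc from the bytes that follow the BOM (if any).

    The guess relies on how the characters '<?' (and 'xm') of an XML
    declaration look in each encoding family.
    """
    if head.startswith(b"\x00<\x00?"):      # 00 3C 00 3F
        return "UTF-16BE"
    if head.startswith(b"<\x00?\x00"):      # 3C 00 3F 00
        return "UTF-16LE"
    if head.startswith(b"<?xm"):            # 3C 3F 78 6D, ASCII compatible
        return "UTF-8"
    return None
```

Note that the 'UTF-8' guess really means "some ASCII-compatible encoding": it is only good enough to read the XML declaration, which is why the declared encoding wins in case 1.2.
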
(2004-09-17 15:11:17.0)

Detecting XML charset encoding, getting it right (or at least trying to)
I was trying to get the charset encoding detection right for some Rome samples and I ran into a few nits. A couple of quick fixes didn't fly. It became clear to me that this was more than fixing things for the samples: it's a problem everybody using XML data over HTTP transport has to deal with.
Current Java libraries or utilities (that I'm aware of) don't do this by default. If you grab a JAXP SAX parser, it will detect the charset encoding of an XML document by looking at the first bytes of the stream, as defined in Section 4.3.3 and Appendix F.1 of the XML 1.0 (Third Edition) specification. But Appendix F.1 is non-normative and not all Java XML parsers implement it today. For example, the JAXP SAX parser implementation in J2SE 1.4.2 does not handle Appendix F.1, and Xerces 2.6.2 has only just added support for it.
Still, this does not solve the whole problem. The (JAXP SAX) XML parsers are not aware of the HTTP transport rules for charset encoding resolution defined by RFC 3023, because they operate on a byte stream or a char stream without any knowledge of the stream's transport protocol, HTTP in this case.
As I was trying to solve this problem, folks from within Sun and from Rome (whom I was bugging with my half-baked questions and solutions) pointed me to a couple of nice articles by Mark Pilgrim: Determining the character encoding of a feed, in his blog, and XML on the Web Has Failed, on XML.com.
Let's begin with a raw XML stream (e.g. reading an XML document from a file). Following Section 4.3.3 and Appendix F.1 of the XML 1.0 specification, the charset encoding of an XML document is determined as follows:

    BOM         : Byte Order Mark, utf-8, utf-16be or utf-16le if present, NULL otherwise
    XMLGuessEnc : best guess using the byte representation of the first characters in '<?xml ...'
                  (it can be UTF-8, UTF-16BE or UTF-16LE)
    XMLEnc      : value of the <?xml encoding="..."?> if present, NULL otherwise

    if BOM is NULL and XMLEnc is NULL
        if XMLGuessEnc is ('UTF-16BE' or 'UTF-16LE')
            ERROR (encoding mismatch)                          [#1.0]
        else
            encoding is 'UTF-8'                                [#1.1]
    if BOM is NULL and XMLEnc is ('UTF-16' or 'UTF-16BE' or 'UTF-16LE')
        ERROR (XML requires BOM for 'UTF-16*' charsets)        [#1.2]
    if BOM is NULL
        encoding is XMLEnc                                     [#1.3]
    if BOM is not NULL and XMLEnc is NULL
        if BOM is XMLGuessEnc
            encoding is BOM                                    [#1.4]
        else
            ERROR (encoding mismatch)                          [#1.5]
    if BOM is not NULL and XMLEnc is not NULL
        if BOM is XMLGuessEnc and
           ((BOM is 'UTF-16*' and XMLEnc is 'UTF-16') or XMLEnc is BOM)
            encoding is BOM                                    [#1.6]
        else
            ERROR (encoding mismatch)                          [#1.7]
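
The XMLEnc input used in these rules has to be read out of the XML declaration itself, which only works once a guess for the byte representation is available. A minimal Python sketch of that step follows; `declared_encoding` is a hypothetical helper, the regular expression is my own approximation of the declaration syntax, and the sketch assumes any BOM has already been stripped from `head`.

```python
import re
from typing import Optional

# Matches '<?xml ... encoding="NAME"' at the very start of the document.
# The encoding name pattern follows the XML EncName production.
_DECL_ENCODING = re.compile(
    r"""^<\?xml[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._\-]*)\1""")


def declared_encoding(head: bytes, guess_enc: str) -> Optional[str]:
    """Return XMLEnc (upper-cased) from the first bytes of a document.

    head      : leading bytes of the stream, without a BOM
    guess_enc : XMLGuessEnc ('UTF-8', 'UTF-16BE' or 'UTF-16LE'), used only
                to turn the declaration bytes into characters
    """
    # errors="replace" keeps a truncated trailing character from raising.
    text = head.decode(guess_enc, errors="replace")
    match = _DECL_ENCODING.match(text)
    return match.group(2).upper() if match else None
```

A document without a declaration, or with a declaration that has no encoding attribute, gives `None`, which is the NULL case of the rules above.
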

And now the case of an XML document served over HTTP. Following Section 4.3.3, Appendix F.1 and Appendix F.2 of the XML 1.0 specification, plus RFC 3023, the charset encoding of an XML document served over HTTP is determined as follows:

    ContentType : Content-Type HTTP header
    CTMime      : MIME type defined in the ContentType
    CTEnc       : charset encoding defined in the ContentType, NULL otherwise
    BOM         : Byte Order Mark, utf-8, utf-16be or utf-16le if present, NULL otherwise
    XMLGuessEnc : best guess using the byte representation of the first characters in '<?xml ...'
                  (it can be UTF-8, UTF-16BE or UTF-16LE)
    XMLEnc      : value of the <?xml encoding="..."?> if present, NULL otherwise
    APP-XML     : RFC 3023 defined 'application/*xml' types
    TEXT-XML    : RFC 3023 defined 'text/*xml' types

    if (CTEnc is 'UTF-16BE' or 'UTF-16LE') and BOM is not NULL
        ERROR                                                  [#2.0]
    if CTMime is APP-XML and CTEnc is NULL
        use raw XML stream charset encoding detection logic    [#2.1]
    if CTMime is TEXT-XML and CTEnc is NULL
        encoding is 'US-ASCII'                                 [#2.2]
    if CTMime is (APP-XML or TEXT-XML) and CTEnc is 'UTF-16'
        if BOM is 'UTF-16BE' or 'UTF-16LE'
            encoding is BOM                                    [#2.3]
        if BOM is NULL and XMLGuessEnc is ('UTF-16BE' or 'UTF-16LE')
            encoding is XMLGuessEnc                            [#2.4]
        else
            ERROR (encoding mismatch)                          [#2.5]
    if CTMime is (APP-XML or TEXT-XML)
        encoding is CTEnc                                      [#2.6]
    ERROR (undefined CTMime handling for XML documents)        [#2.7]
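
The CTMime and CTEnc inputs come from splitting the Content-Type header. A simplified Python sketch of that split is below; `split_content_type` is a hypothetical helper and it deliberately ignores corner cases such as a quoted parameter value that itself contains a semicolon.

```python
from typing import Optional, Tuple


def split_content_type(header: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a Content-Type header value into (CTMime, CTEnc).

    The MIME type is lower-cased and the charset upper-cased so they can be
    compared directly against the names used in the rules; either element
    is None when it is absent.
    """
    if not header:
        return None, None
    parts = [part.strip() for part in header.split(";")]
    mime = parts[0].lower() or None
    charset = None
    for param in parts[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Drop optional quotes around the value.
            charset = value.strip().strip("\"'").upper() or None
            break
    return mime, charset
```

For instance, `split_content_type('text/xml; charset="utf-8"')` gives `("text/xml", "UTF-8")`, and a header without a charset parameter gives `None` for CTEnc, which is what triggers cases #2.1 and #2.2.
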

All this logic, for both raw and HTTP charset encoding detection of XML documents, will be available in Rome in the upcoming 0.4 release (it is currently available from CVS). The class implementing this logic is com.sun.syndication.io.XmlReader, a subclass of java.io.Reader. An XmlReader can be created using a File, an InputStream, a URL, a URLConnection or an InputStream with a Content-type String. Passing this reader to the Rome input classes (or any XML parser) will take care of the charset encoding detection. Ideally all this should be built into the (JAXP SAX) XML parser, I'd say in the InputSource class. I'll check with the Java XML folks to see what they think.
If you think there is an error somewhere in this logic, please let us know by sending an email to the Rome mailing list, dev@rome.dev.java.net.
(2004-09-10 23:32:00.0)