Monday September 27, 2004
Tucu's Weblog [Alejandro Abdelnur] I don't contradict myself, I just change my mind.
We've just released version 0.4 of Rome and Rome Fetcher. They are still marked as Alpha, but we consider them already stable enough for some serious use; we just want to do a sanity check (mostly of class, interface, method and package names) before we go with a Beta release (which we hope will be the next one). We've been busy with this one, just check the Rome Changes Log and the Rome Fetcher Changes Log: 35 entries. A few highlights:
Detecting XML charset encoding, getting it right (or at least trying to)
I was trying to get the charset encoding detection right for some Rome samples and I ran into a few nits. A couple of quick fixes didn't fly. It became clear to me that this was more than fixing things for the samples. It's a problem everybody using XML data and HTTP transport has to deal with.
Current Java libraries or utilities (that I'm aware of) don't do this by default. If you grab a JAXP SAX parser, it will detect the charset encoding of an XML document by looking at the first bytes of the stream, as defined in Section 4.3.3 and Appendix F.1 of the XML 1.0 (Third Edition) specification. But Appendix F.1 is non-normative and not all Java XML parsers do it right now. For example, the JAXP SAX parser implementation in J2SE 1.4.2 does not handle Appendix F.1, and Xerces 2.6.2 just added support for it.
Still, this does not solve the whole problem. The (JAXP SAX) XML parsers are not aware of the HTTP transport rules for charset encoding resolution as defined by RFC 3023. They are not aware of them because they operate on a byte stream or a char stream without any knowledge of the stream's transport protocol, HTTP in this case.
As I was trying to solve this problem, folks from within Sun and from Rome (whom I was bugging with my half-baked questions and solutions) pointed me to a couple of nice articles by Mark Pilgrim: Determining the character encoding of a feed, in his blog, and XML on the Web Has Failed, on XML.com.
Let's begin with a raw XML stream (i.e. reading an XML document from a file). Following Section 4.3.3 and Appendix F.1 of the XML 1.0 specification, the charset encoding of an XML document is determined as follows:
BOM         : Byte Order Mark, UTF-8, UTF-16BE or UTF-16LE if present, NULL otherwise
XMLGuessEnc : best guess using the byte representation of the first characters in '<?xml ...'
              (it can be UTF-8, UTF-16BE or UTF-16LE)
XMLEnc      : value of the <?xml encoding="..."?> if present, NULL otherwise

if BOM is NULL and XMLEnc is NULL
    if XMLGuessEnc is ('UTF-16BE' or 'UTF-16LE')
        ERROR (encoding mismatch) [#1.0]
    else
        encoding is 'UTF-8' [#1.1]
if BOM is NULL and XMLEnc is ('UTF-16' or 'UTF-16BE' or 'UTF-16LE')
    ERROR (XML requires BOM for 'UTF-16*' charsets) [#1.2]
if BOM is NULL
    encoding is XMLEnc [#1.3]
if BOM is not NULL and XMLEnc is NULL
    if BOM is XMLGuessEnc
        encoding is BOM [#1.4]
    else
        ERROR (encoding mismatch) [#1.5]
if BOM is not NULL and XMLEnc is not NULL
    if BOM is XMLGuessEnc and
       ((BOM is 'UTF-16*' and XMLEnc is 'UTF-16') or XMLEnc is BOM)
        encoding is BOM [#1.6]
    else
        ERROR (encoding mismatch) [#1.7]
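The rules above can be sketched in Java roughly as follows. This is only an illustration, not the code that ships in Rome: the class name RawXmlEncodingSketch and its methods are made up for this example, the rules are evaluated top to bottom with the first match winning, and rule #1.2 is read as applying to the 'UTF-16' family, as its error message suggests.

```java
import java.util.Locale;

// Hypothetical sketch of rules #1.0 to #1.7; not Rome's actual implementation.
final class RawXmlEncodingSketch {

    /** BOM: encoding named by a leading byte order mark, or null when there is none. */
    static String bomEncoding(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean isUtf16Variant(String enc) {
        return "UTF-16BE".equals(enc) || "UTF-16LE".equals(enc);
    }

    /**
     * Applies the rules in order. bom and xmlGuessEnc are expected in upper case
     * (or null); xmlEnc is the declared encoding (or null) and is upper-cased here.
     */
    static String resolve(String bom, String xmlGuessEnc, String xmlEnc) {
        String declared = (xmlEnc == null) ? null : xmlEnc.toUpperCase(Locale.ROOT);
        if (bom == null && declared == null) {
            if (isUtf16Variant(xmlGuessEnc)) {
                throw new IllegalArgumentException("encoding mismatch");              // #1.0
            }
            return "UTF-8";                                                           // #1.1
        }
        if (bom == null && declared.startsWith("UTF-16")) {
            throw new IllegalArgumentException("XML requires BOM for UTF-16* charsets"); // #1.2
        }
        if (bom == null) {
            return declared;                                                          // #1.3
        }
        if (declared == null) {
            if (bom.equals(xmlGuessEnc)) {
                return bom;                                                           // #1.4
            }
            throw new IllegalArgumentException("encoding mismatch");                  // #1.5
        }
        boolean genericUtf16 = bom.startsWith("UTF-16") && declared.equals("UTF-16");
        if (bom.equals(xmlGuessEnc) && (genericUtf16 || declared.equals(bom))) {
            return bom;                                                               // #1.6
        }
        throw new IllegalArgumentException("encoding mismatch");                      // #1.7
    }
}
```

For instance, a document with no BOM that declares encoding="ISO-8859-1" resolves to ISO-8859-1 through rule #1.3, while a UTF-8 BOM combined with that same declaration is rejected by rule #1.7.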
Now the case where the XML document is served over HTTP. Following Section 4.3.3, Appendix F.1 and Appendix F.2 of the XML 1.0 specification, plus RFC 3023, the charset encoding of an XML document served over HTTP is determined as follows:
ContentType : Content-Type HTTP header
CTMime      : MIME type defined in the ContentType
CTEnc       : charset encoding defined in the ContentType, NULL otherwise
BOM         : Byte Order Mark, UTF-8, UTF-16BE or UTF-16LE if present, NULL otherwise
XMLGuessEnc : best guess using the byte representation of the first characters in '<?xml ...'
              (it can be UTF-8, UTF-16BE or UTF-16LE)
XMLEnc      : value of the <?xml encoding="..."?> if present, NULL otherwise
APP-XML     : RFC 3023 defined 'application/*xml' types
TEXT-XML    : RFC 3023 defined 'text/*xml' types

if (CTEnc is 'UTF-16BE' or 'UTF-16LE') and BOM is not NULL
    ERROR [#2.0]
if CTMime is APP-XML and CTEnc is NULL
    use raw XML stream charset encoding detection logic [#2.1]
if CTMime is TEXT-XML and CTEnc is NULL
    encoding is 'US-ASCII' [#2.2]
if CTMime is (APP-XML or TEXT-XML) and CTEnc is 'UTF-16'
    if BOM is 'UTF-16BE' or 'UTF-16LE'
        encoding is BOM [#2.3]
    if BOM is NULL and XMLGuessEnc is ('UTF-16BE' or 'UTF-16LE')
        encoding is XMLGuessEnc [#2.4]
    else
        ERROR (encoding mismatch) [#2.5]
if CTMime is (APP-XML or TEXT-XML)
    encoding is CTEnc [#2.6]
ERROR (undefined CTMime handling for XML documents) [#2.7]
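Here is a rough Java sketch of the HTTP rules, again evaluated top to bottom with the first match winning. It is not Rome's code: the class HttpXmlEncodingSketch, its methods and the rawDetection callback (standing in for the raw XML stream logic of rule #2.1) are all names invented for this illustration, and the MIME matching is my reading of the RFC 3023 types.

```java
import java.util.Locale;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of rules #2.0 to #2.7; not Rome's actual implementation.
final class HttpXmlEncodingSketch {

    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^;\"\\s]+)", Pattern.CASE_INSENSITIVE);

    /** CTMime: the MIME type part of the Content-Type header, lower-cased. */
    static String mimeOf(String contentType) {
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0) ? contentType.substring(0, semi) : contentType;
        return mime.trim().toLowerCase(Locale.ROOT);
    }

    /** CTEnc: the charset parameter of the Content-Type header, upper-cased, or null. */
    static String charsetOf(String contentType) {
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    static boolean isAppXml(String mime) {
        return mime.equals("application/xml") || mime.equals("application/xml-dtd")
                || mime.equals("application/xml-external-parsed-entity")
                || (mime.startsWith("application/") && mime.endsWith("+xml"));
    }

    static boolean isTextXml(String mime) {
        return mime.equals("text/xml") || mime.equals("text/xml-external-parsed-entity")
                || (mime.startsWith("text/") && mime.endsWith("+xml"));
    }

    private static boolean isUtf16Variant(String enc) {
        return "UTF-16BE".equals(enc) || "UTF-16LE".equals(enc);
    }

    /** rawDetection is only invoked when rule #2.1 applies. */
    static String resolve(String contentType, String bom, String xmlGuessEnc,
                          Supplier<String> rawDetection) {
        String mime = mimeOf(contentType);
        String ctEnc = charsetOf(contentType);
        boolean xmlMime = isAppXml(mime) || isTextXml(mime);
        if (isUtf16Variant(ctEnc) && bom != null) {
            throw new IllegalArgumentException("BOM not allowed with " + ctEnc);     // #2.0
        }
        if (isAppXml(mime) && ctEnc == null) {
            return rawDetection.get();                                                // #2.1
        }
        if (isTextXml(mime) && ctEnc == null) {
            return "US-ASCII";                                                        // #2.2
        }
        if (xmlMime && "UTF-16".equals(ctEnc)) {
            if (isUtf16Variant(bom)) {
                return bom;                                                           // #2.3
            }
            if (bom == null && isUtf16Variant(xmlGuessEnc)) {
                return xmlGuessEnc;                                                   // #2.4
            }
            throw new IllegalArgumentException("encoding mismatch");                  // #2.5
        }
        if (xmlMime) {
            return ctEnc;                                                             // #2.6
        }
        throw new IllegalArgumentException("undefined handling for " + mime);         // #2.7
    }
}
```

Note how a bare text/xml header yields US-ASCII (rule #2.2) no matter what the document itself declares, which is the RFC 3023 behaviour that a parser working only on the stream cannot know about.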
All this logic, both raw and HTTP charset encoding detection of XML documents, will be available in Rome in the upcoming 0.4 release (it is currently available from CVS). The class implementing this logic is com.syndication.io.XmlReader, a subclass of java.io.Reader. An XmlReader can be created using a File, an InputStream, a URL, a URLConnection or an InputStream with a Content-type String. Passing this reader to the Rome input classes (or to any XML parser) will take care of the charset encoding detection. Ideally all this should be built into the (JAXP SAX) XML parser, I'd say in the InputSource class. I'll check with the Java XML folks to see what they think.
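To show why handing the parser a Reader is enough, here is a small sketch that uses only JAXP from the JDK. It does not use Rome: a plain InputStreamReader stands in for XmlReader, and the class name ReaderParsingSketch and the detectedEncoding parameter (the result of whatever detection was run beforehand) are assumptions made for this example. When an InputSource wraps a character stream, the SAX parser reads the characters as given and does no byte-level encoding detection of its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a Reader with an already-resolved charset feeds a JAXP SAX parser.
final class ReaderParsingSketch {

    /** Parses xmlBytes decoded with detectedEncoding and returns all character data. */
    static String parseText(byte[] xmlBytes, String detectedEncoding) {
        // Stand-in for XmlReader: the charset is decided before the parser sees anything.
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), Charset.forName(detectedEncoding));
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(reader),
                    new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            text.append(ch, start, length);
                        }
                    });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return text.toString();
    }
}
```

A document stored as ISO-8859-1 without an encoding declaration would normally be treated as UTF-8 when parsed from the raw bytes; with the reader built for the detected charset the parser gets the right characters.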
If you think there is an error somewhere in this logic, please let us know by dropping an email to the Rome mailing list, dev@rome.dev.java.net.
(2004-09-10 23:32:00.0)
Rome v0.3 is fresh from the oven
New stuff. Changes, most of them in the implementation, a few in the API. Summary of changes:
Check the Changes Log for details. Enjoy.
(2004-07-28 10:33:45.0)
We've just posted Rome v0.2
We've been fighting to get everything ready for the release. Summary of the changes:
Some of them were done based on feedback we've received. Others try to bring consistency to the naming. And others enable certain usage patterns in applications using Rome. All the changes (15 of them) are documented in the Changes Log. I hope feedback keeps coming; that will help us make Rome better and easier to use. Enough of this, now I have to go back to my day job.
(2004-06-16 13:54:44.0)
To myself, for the Nth time, *never underestimate the effort behind doing a release*
And I'm talking about a small project, Rome v0.2 (it's almost there). Getting the code changes integrated is the easy part. Getting all the supporting materials ready is a Pharaonic task: general documentation, changes log, tutorials, javadocs, updating the samples, testing the samples, preparing the distribution bundles and staging all this on the site. Ensuring (or at least trying to ensure) that everything has been updated. We don't want to get into the "we can do that later" approach, as most of the time later becomes asymptotic to never. With Rome we are still organizing the site, deciding where things should go, etc. We are using the Rome Wiki as the main (and for now only) documentation source, and we have to decide which documents we want to keep alive (changing as we go) and which documents we want to freeze with every release. Maybe things will get easier after we have all this settled. I'm making a mental note (too bad I keep losing them) to comment again on this for Rome v0.3, to see if things got better.
(2004-06-16 08:20:19.0)
Rome is making some noise already
It was a nice way of starting the day today: I found out there is some blogging about Rome going on. Some posts comment on its appearance. Others express some concerns about it. That's great, we welcome both. The latter will help us improve Rome. I hope they keep coming. P@ posted a blog entry earlier today, -Rome was not built in one day-, making some clarifications.
(2004-06-10 11:12:44.0)
Trying to end Syndication feeds Hell, at least for Java developers
After a month of work or so, we did it. 'We' are Patrick, Elaine and myself. 'What we did' is Rome. You may not care who we are, but [if you are a Java developer and you are doing anything that involves RSS or Atom syndication feeds] you'll likely care about Rome.
Rome is a set of Atom/RSS Java utilities that makes it easy to work in Java with most syndication formats. It supports all flavors of RSS (0.90, 0.91, 0.92, 0.93, 0.94, 1.0 and 2.0) and Atom 0.3 feeds. It includes a set of beans, parsers and generators for the various flavors of syndication feeds, as well as converters to convert from one feed type to another. It also includes a generic normalized SyndFeed class that lets you work with any feed type without bothering about the incoming or outgoing feed type.
Rome is one of those projects I'd have preferred not to spend time working on. It came to be out of frustration, or better said, to stop frustration. What kind of frustration am I talking about? First, the nonsense of syndication feeds. Second, the absence of a library that does a good job of abstracting all that nonsense away from the developer. And third, the quality of the available documentation (this last one is in fact a general problem). So we decided to do something about it, and that something is Rome. By addressing the second we took care of the first (insert second part of the title here). And for the third, the documentation, we made a special effort from the very beginning.
Rome is just starting; there are plenty of things to improve, to fix and to add. Check it out and let us know what you think.
(2004-06-08 16:50:04.0)