Tucu's Weblog
[Alejandro Abdelnur]
  I don't contradict myself,
  I just change my mind.

20050929 Thursday September 29, 2005

Doing the 'Elvis has left the building' routine

Tomorrow, Friday September 30th, is my last day at Sun. It has been 8.47 years, by far an all-time record, including personal relationships. First it was Sun IT, then eCommerce and finally Portal, 100% Java almost from the beginning. JDK 1.1 was just out of the oven; my first program was a notepad using AWT, the second one was a peer-to-peer chat room using RMI (Atul still makes fun of me saying 'Come see, did some Java'), and the third one was a mini application server based on RMI as well; they even made me file a patent on it. I've been lucky to work under the *protection* of Mita, Abhay and Yee and with tons of great engineers, many of whom became very good friends in the process. I went all the way to India for the wedding of one of them and to the dreadful land of Texas for the termination (yes, got married as well) of another. With JSR168, WSRP and ROME I've had the chance to work, learn and argue with sharp people from all over. Earlier this year I worked in India for a few months, a truly fantastic experience, the people, the place; the low point was my failed attempt at convincing Chandra, the driver, to call me 'Alejandro' instead of 'Sir'. He would just reply with a 'Yes Sir' to my requests.

What I'm trying to say is that regardless of my reasons for leaving, putting all things on the scale, it tips all the way to IT WAS GREAT, IT WAS FUN.

So, what's next for me? My belongings are in storage, they've been there since March, got a ticket for next week, a small backpack, a camera, a diving computer and about 4 months to spend in Southeast Asia.

I guess this will be the last entry in this blog. I'll show up somewhere else when I'm back.

For sure ROME will be well taken care of while I'm gone. And in the case of JSR168 NEXT, I hope a 'Things to Do in Denver When You Are Dead' happens soon.

Cheers.

Tucu (tucu0[THE FUNNY a]yahoo.com)

(2005-09-29 10:59:03.0) Permalink

20050908 Thursday September 08, 2005

ROME 0.7 is available

ROME v0.7 Beta and ROME Fetcher v0.7 Beta are available.

What is new:

  • A cool logo
  • Bug fixes
  • Several Date and time parsing improvements including support for custom parsing masks
  • Fix for leaking URL Connections which could cause problems in long running applications using ROME Fetcher

For the full list of changes, fixes and additions refer to the ROME Changes Log and the ROME Fetcher Changes Log.

ROME still does not support Atom 1.0 or the Content module. We are planning to add support for them in a future release.

(2005-09-08 23:30:20.0) Permalink Comments [36]

20050818 Thursday August 18, 2005

Waiting for RSS 10000 spec

Today somebody announced RSS 3.0 Lite. As I've posted to the ROME alias: I didn't go through it in detail yet, but it looks like somebody is either having fun or needs a bit of common sense. Why would you have an attribute isEmpty to indicate the element is empty?

Even the Register took notice of it; check Mark Woodman's R4A2 blog about it.

Please keep them coming.

(2005-08-18 14:30:00.0) Permalink Comments [3]

20050401 Friday April 01, 2005

New release of ROME and ROME Fetcher, v0.6 Beta

ROME v0.6 Beta and ROME Fetcher v0.6 Beta are available in the ROME site.

Highlights of this release:

  • Date-time parsing fixes
  • XML prolog parsing and generation fixes
  • Added an XML healer to handle HTML literal entities
  • Configured RSS2.0 to handle DC Module
  • RSS 1.0 parser and generator fixes
  • DC Module elements now support multi-values
  • Fetcher now provides support for processing of retrieved feeds inside event handlers

For the full list of changes, fixes and additions refer to the Changes Log.

(2005-04-01 10:43:00.0) Permalink Comments [16]

20050329 Tuesday March 29, 2005

ROME wants a logo

First get something real, then go for the imagery.

We are looking for a Logo now (I think P@ has a new iron to ruin). From ROME user alias:

All,

We are pleased to announce a ROME Logo Contest starting today.

   http://wiki.java.net/bin/view/Javawsxml/RomeLogoContest

We encourage anyone who is interested (even people who don't use ROME)
to submit a logo design that will represent the ROME Project.   Check
out the above link for details.

So fire up GIMP/Inkscape/Photoshop/Illustrator and show us your
design-fu.    Otherwise, please help us get the word out so that we
can get a lot of submissions.

Best Regards,

   Mark

-- 
Mark Woodman
http://inkblots.markwoodman.com
(2005-03-29 08:40:45.0) Permalink Comments [3]

20050110 Monday January 10, 2005

ROME v0.5 Beta released

Version 0.5 of ROME and ROME Fetcher are available.

With version 0.5 we've moved from Alpha releases to Beta releases. After half a year, several changes and fixes, and a good number of developers using ROME, we consider the ROME API and code stable enough to go Beta.

We've also gotten a full name, Rss and atOM utilitiEs (ROME). Don't ask.

Highlights of v0.5 Beta release:

  • Removed common package and classes from the public API
  • Removed Enum class, using plain constants now
  • XmlReader now has a lenient behavior for charset encoding detection
  • Bug fixes
  • The Fetcher adds experimental support for RFC3229 Delta encoding
(2005-01-10 10:30:59.0) Permalink Comments [3]

20040927 Monday September 27, 2004

Rome v0.4 is available

We've just released version 0.4 of Rome and Rome Fetcher. They are still marked as Alpha, but we consider them already stable enough for some serious use. We just want to do a sanity check (mostly on class, interface, method and package names) before we go with a Beta release (which we hope will be the next one).

We've been busy with this one, just check Rome Changes Log and Rome Fetcher Changes Log, 35 entries. A few highlights:

  • Changed naming convention of the bean interfaces and implementations (i.e.: SyndFeedI/SyndFeed are now SyndFeed/SyndFeedImpl)
  • New XmlReader that handles charset encoding as defined by the XML 1.0 specification and RFC 3023
  • Support for RSS 0.91 Netscape and RSS 0.91 Userland as distinct feed types
  • All feed parsers/generators have Modules support
  • Added checks to generators reducing the chances of generating invalid feeds
  • Several implementation pieces have been refactored or rewritten
  • Comprehensive Unit testing
  • Bug fixes
  • More documentation and samples
  • Dependencies: upgraded to JDOM 1.0 and removed the dependency on Xerces
  • The Fetcher adds support for Apache HTTP Client for retrieving feeds
(2004-09-27 01:07:18.0) Permalink Comments [2]

20040917 Friday September 17, 2004

Detecting XML charset encoding, again

I've got some comments (and corrections) on my previous posting Detecting XML charset encoding, getting it right (or at least trying to). Based on the received feedback I've rewritten the algorithms (they are simpler now, and -hopefully- correct). They are already implemented in Rome. Currently they are in CVS and they will be part of the upcoming release.

As before, comments are welcome.

Raw XML charset encoding detection

Detecting charset encoding of a XML document without external information (i.e. reading a XML document from a file):

BOMEnc     : byte order mark. Possible values: 'UTF-8', 'UTF-16BE', 
             'UTF-16LE' or NULL
XMLGuessEnc: best guess using the byte representation of the first 
             bytes of XML declaration ('<?xml...?>') if present. 
             Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
XMLEnc     : encoding in the XML declaration ('<?xml encoding="..."?>'). 
             Possible values: anything or NULL

if BOMEnc is NULL
  if XMLGuessEnc is NULL or XMLEnc is NULL
    encoding is 'UTF-8'                                                               [1.0]
  else
  if XMLEnc is 'UTF-16' and (XMLGuessEnc is 'UTF-16BE' or XMLGuessEnc is 'UTF-16LE')
    encoding is XMLGuessEnc                                                           [1.1]
  else
    encoding is XMLEnc                                                                [1.2]
else
if BOMEnc is 'UTF-8'
  if XMLGuessEnc is not NULL and XMLGuessEnc is not 'UTF-8'
    ERROR, encoding mismatch                                                          [1.3]
  if XMLEnc is not NULL and XMLEnc is not 'UTF-8'
    ERROR, encoding mismatch                                                          [1.4]
  encoding is 'UTF-8'
else
if BOMEnc is 'UTF-16BE' or BOMEnc is 'UTF-16LE'
  if XMLGuessEnc is not NULL and XMLGuessEnc is not BOMEnc
    ERROR, encoding mismatch                                                          [1.5]
  if XMLEnc is not NULL and XMLEnc is not 'UTF-16' and XMLEnc is not BOMEnc
    ERROR, encoding mismatch                                                          [1.6]
  encoding is BOMEnc
else
  ERROR, cannot happen given BOMEnc possible values (see above)                       [1.7]

Byte Order Mark encoding and XML guessed encoding detection rules are clearly explained in the XML 1.0 Third Edition Appendix F.1. Note that in this algorithm BOMEnc and XMLGuessEnc are restricted to UTF-8 and UTF-16* encodings.
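
To make BOMEnc and XMLGuessEnc concrete, here is a minimal Java sketch of the byte sniffing described in Appendix F.1, restricted to the UTF-8 and UTF-16 cases this algorithm uses. The class and method names (XmlEncodingSniffer, bomEncoding, xmlGuessEncoding) are made up for illustration and are not the Rome API. It assumes the caller has already buffered the first bytes of the stream and, for the guess, passes the offset just past any BOM.

```java
// Illustrative sketch only: names are hypothetical, not part of Rome.
final class XmlEncodingSniffer {

    private XmlEncodingSniffer() {
    }

    // BOMEnc: returns "UTF-8", "UTF-16BE", "UTF-16LE", or null when the
    // buffer does not start with one of those byte order marks.
    static String bomEncoding(byte[] data) {
        if (startsWith(data, 0, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(data, 0, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(data, 0, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    // XMLGuessEnc: looks at how the leading "<?" of an XML declaration is
    // laid out in bytes, starting at 'offset' (just past the BOM, if any).
    // Returns null when no XML declaration can be recognized.
    static String xmlGuessEncoding(byte[] data, int offset) {
        if (startsWith(data, offset, 0x3C, 0x3F, 0x78, 0x6D)) { // "<?xm"
            return "UTF-8";
        }
        if (startsWith(data, offset, 0x00, 0x3C, 0x00, 0x3F)) { // "<?" big endian
            return "UTF-16BE";
        }
        if (startsWith(data, offset, 0x3C, 0x00, 0x3F, 0x00)) { // "<?" little endian
            return "UTF-16LE";
        }
        return null;
    }

    // True when 'data' contains the given unsigned byte values at 'offset'.
    private static boolean startsWith(byte[] data, int offset, int... prefix) {
        if (data.length - offset < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[offset + i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Other encoding families listed in Appendix F.1 (UCS-4, EBCDIC) are deliberately left out, matching the restriction stated above.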

#1.0. There is no BOM, an encoding or encoding family cannot be guessed from the first bytes in the stream, defaulting to UTF-8.

#1.1. Strictly following XML 1.0 Third Edition Section 4.3.3 Charset Encoding in Entities (2nd paragraph) no BOM and UTF-16 XML declaration encoding is an error. The BOM is required to identify if the encoding is BE or LE. But if XMLEnc was read it means that the encoding byte order of the stream was guessed from the first bytes in the stream, note that this is possible only if the document starts with a XML declaration. The logic verifies that there is no conflicting encoding information and relaxes the BOM requirement if a XML declaration (that allows guessing the byte order) is present.

#1.2. XMLEnc is present, it is not UTF-16 (although it can be UTF-16BE or UTF-16LE), the guessed encoding is used to read the XML declaration encoding. Detecting encoding mismatches here would require awareness of charset families; instead the algorithm relies on the charset encoding routines that will process the stream to discover and report mismatches. IMPORTANT: To handle other encodings (i.e. UCS-4 or EBCDIC encoding families) the algorithm should be extended here.

#1.3, #1.4, #1.5, #1.6 There is an explicit encoding mismatch.

#1.7. Given the currently assumed BOMEnc values this case cannot happen. IMPORTANT: To handle other encodings (i.e. UCS-4 or EBCDIC encoding families) the algorithm should be extended here.
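
For readers who prefer code to pseudocode, the raw detection rules above can be sketched in Java roughly as follows. This is only an illustration and not the actual Rome implementation: the class name RawXmlEncodingDetector is hypothetical, the three inputs are assumed to be already computed and normalized to upper case, and mismatches are reported with an unchecked IllegalArgumentException purely to keep the sketch short.

```java
// Illustrative sketch only: not the Rome XmlReader code.
final class RawXmlEncodingDetector {

    private RawXmlEncodingDetector() {
    }

    // bomEnc and xmlGuessEnc: "UTF-8", "UTF-16BE", "UTF-16LE" or null.
    // xmlEnc: value from the XML declaration (upper case) or null.
    static String detect(String bomEnc, String xmlGuessEnc, String xmlEnc) {
        if (bomEnc == null) {
            if (xmlGuessEnc == null || xmlEnc == null) {
                return "UTF-8";                                    // [1.0]
            }
            if (xmlEnc.equals("UTF-16")
                    && (xmlGuessEnc.equals("UTF-16BE") || xmlGuessEnc.equals("UTF-16LE"))) {
                return xmlGuessEnc;                                // [1.1]
            }
            return xmlEnc;                                         // [1.2]
        }
        if (bomEnc.equals("UTF-8")) {
            if (xmlGuessEnc != null && !xmlGuessEnc.equals("UTF-8")) {
                throw mismatch("1.3", bomEnc, xmlGuessEnc, xmlEnc);
            }
            if (xmlEnc != null && !xmlEnc.equals("UTF-8")) {
                throw mismatch("1.4", bomEnc, xmlGuessEnc, xmlEnc);
            }
            return "UTF-8";
        }
        if (bomEnc.equals("UTF-16BE") || bomEnc.equals("UTF-16LE")) {
            if (xmlGuessEnc != null && !xmlGuessEnc.equals(bomEnc)) {
                throw mismatch("1.5", bomEnc, xmlGuessEnc, xmlEnc);
            }
            if (xmlEnc != null && !xmlEnc.equals("UTF-16") && !xmlEnc.equals(bomEnc)) {
                throw mismatch("1.6", bomEnc, xmlGuessEnc, xmlEnc);
            }
            return bomEnc;
        }
        // [1.7] cannot happen while BOMEnc is limited to the values above.
        throw new IllegalArgumentException("Unsupported BOM encoding: " + bomEnc);
    }

    // Builds the exception for the explicit mismatch rules 1.3 to 1.6.
    private static IllegalArgumentException mismatch(
            String rule, String bomEnc, String xmlGuessEnc, String xmlEnc) {
        return new IllegalArgumentException("Encoding mismatch [" + rule + "]: BOM="
                + bomEnc + ", guess=" + xmlGuessEnc + ", declared=" + xmlEnc);
    }
}
```

For example, a file with no BOM whose declaration says UTF-16 and whose first bytes look little endian resolves to UTF-16LE through rule 1.1, while a UTF-8 BOM combined with a declared ISO-8859-1 is rejected by rule 1.4.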

HTTP XML charset encoding detection

Detecting charset encoding of a XML document with external information provided by HTTP:

ContentType: Content-Type HTTP header
CTMime     : MIME type defined in the ContentType
CTEnc      : charset encoding defined in the ContentType, 
             NULL otherwise
BOMEnc     : byte order mark. Possible values: 'UTF-8', 'UTF-16BE', 
             'UTF-16LE' or NULL
XMLGuessEnc: best guess using the byte representation of the first 
             bytes of XML declaration ('<?xml...?>') if present. 
             Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
XMLEnc     : encoding in the XML declaration ('<?xml encoding="..."?>'). 
             Possible values: anything or NULL
APP-XML    : RFC 3023 defined 'application/*xml' types
TEXT-XML   : RFC 3023 defined 'text/*xml' types

if CTMime is APP-XML or CTMime is TEXT-XML
  if CTEnc is NULL
    if CTMime is APP-XML
      encoding is determined using the Raw XML charset encoding detection algorithm   [2.0]
    else
    if CTMime is TEXT-XML
      encoding is 'US-ASCII'                                                          [2.1]
  else
  if (CTEnc is 'UTF-16BE' or CTEnc is 'UTF-16LE') and BOMEnc is not NULL
    ERROR, RFC 3023 explicitly forbids this                                           [2.2]
  else
  if CTEnc is 'UTF-16'
    if BOMEnc is 'UTF-16BE' or BOMEnc is 'UTF-16LE' 
      encoding is BOMEnc                                                              [2.3]
    else
      ERROR, missing BOM or encoding mismatch                                         [2.4]
  else
    encoding is CTEnc                                                                 [2.5]
else
  ERROR, handling for other MIME types is undefined                                   [2.6]

Byte Order Mark encoding and XML guessed encoding detection rules are clearly explained in the XML 1.0 Third Edition Appendix F.1. Note that in this algorithm BOMEnc and XMLGuessEnc are restricted to UTF-8 and UTF-16* encodings.

#2.0. HTTP content type declares a MIME of application type and XML sub-type. There is no HTTP Content-type charset encoding information. The charset encoding is determined using the Raw XML charset encoding detection algorithm. Refer to RFC 3023 Section 3.2. Application/xml Registration.

#2.1. HTTP content type declares a MIME of text type and XML sub-type. There is no HTTP Content-type charset encoding information. The charset encoding is US-ASCII. Refer to RFC 3023 Section 3.1. Text/xml Registration.

#2.2. For text-XML and application-XML MIME types if the HTTP content type declares 'UTF-16BE' or 'UTF-16LE' as the charset encoding, a BOM is prohibited. Refer to RFC 3023 Section 4. The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset.

#2.3. For text-XML and application-XML MIME types if the HTTP content type declares 'UTF-16' as the charset encoding, a BOM is required and it must be used to determine the byte order. Refer to RFC 3023 Section 4. The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset.

#2.4. For text-XML and application-XML MIME types if the HTTP content type declares 'UTF-16' as the charset encoding, a BOM must be present and it must be 'UTF-16BE' or 'UTF-16LE'. Refer to RFC 3023 Section 4. The Byte Order Mark (BOM) and Conversions to/from the UTF-16 Charset.

#2.5. For text-XML and application-XML MIME types if the HTTP content type declares a charset encoding other than the UTF-16 variants, that charset encoding is used, no other special handling is required.

#2.6. There is no defined logic to determine the charset encoding based on the MIME type given by the HTTP content type.
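
The HTTP rules can be sketched in Java along the same lines. Again this is an illustration under assumptions and not the Rome code: the class HttpXmlEncodingDetector and its XmlMime enum are hypothetical, the MIME type is assumed to be classified beforehand, encoding names are assumed to be upper case, and the raw detection for rule 2.0 is passed in as a Supplier so the sketch stays self-contained.

```java
import java.util.function.Supplier;

// Illustrative sketch only: not the Rome XmlReader code.
final class HttpXmlEncodingDetector {

    // How the Content-Type MIME was classified against RFC 3023.
    enum XmlMime { APP_XML, TEXT_XML, OTHER }

    private HttpXmlEncodingDetector() {
    }

    // ctEnc: charset from the Content-Type header (upper case) or null.
    // bomEnc: "UTF-8", "UTF-16BE", "UTF-16LE" or null.
    // rawDetection: runs the raw XML detection when rule 2.0 applies.
    static String detect(XmlMime ctMime, String ctEnc, String bomEnc,
                         Supplier<String> rawDetection) {
        if (ctMime != XmlMime.APP_XML && ctMime != XmlMime.TEXT_XML) {
            throw new IllegalArgumentException(
                    "Handling for non-XML MIME types is undefined");        // [2.6]
        }
        if (ctEnc == null) {
            if (ctMime == XmlMime.APP_XML) {
                return rawDetection.get();                                   // [2.0]
            }
            return "US-ASCII";                                               // [2.1]
        }
        if ((ctEnc.equals("UTF-16BE") || ctEnc.equals("UTF-16LE")) && bomEnc != null) {
            throw new IllegalArgumentException(
                    "RFC 3023 forbids a BOM with charset " + ctEnc);         // [2.2]
        }
        if (ctEnc.equals("UTF-16")) {
            if ("UTF-16BE".equals(bomEnc) || "UTF-16LE".equals(bomEnc)) {
                return bomEnc;                                               // [2.3]
            }
            throw new IllegalArgumentException(
                    "Missing BOM or encoding mismatch for UTF-16");          // [2.4]
        }
        return ctEnc;                                                        // [2.5]
    }
}
```

Passing the raw detection as a Supplier means the stream only has to be inspected further when the header carries no charset, which is the only case where the HTTP rules defer to the document itself.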

(2004-09-17 15:11:17.0) Permalink Comments [1]

20040910 Friday September 10, 2004

Detecting XML charset encoding, getting it right (or at least trying to)

I was trying to get the charset encoding detection right for some Rome samples and I've run into a few NITs. A couple of quick fixes didn't fly. It became clear to me this was more than fixing things for the samples. It's a problem everybody using XML data and HTTP transport has to deal with.

Current Java libraries or utilities (that I'm aware of) don't do this by default. If you grab a JAXP SAX parser it will detect the charset encoding of an XML document by looking at the first bytes of the stream, as defined in Section 4.3.3 and Appendix F.1 of the XML 1.0 (Third Edition) specification. But Appendix F.1 is non-normative and not all Java XML parsers do it right now. For example, the JAXP SAX parser implementation in J2SE 1.4.2 does not handle Appendix F.1, and Xerces 2.6.2 just added support for it.

Still, this does not solve the whole problem. The (JAXP SAX) XML parsers are not aware of the HTTP transport rules for charset encoding resolution as defined by RFC 3023. They are not because they operate on a byte stream or a char stream without any knowledge of the stream transport protocol, HTTP in this case.
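
To show what the parser is missing, here is a minimal Java sketch of pulling the MIME type and charset out of a Content-Type header and classifying the MIME type as one of the RFC 3023 XML types. The class ContentTypeInfo is hypothetical, the header parsing is deliberately simplistic (it splits on ';' and does not handle every quoting case), and the classification is my reading of RFC 3023 rather than an authoritative list.

```java
import java.util.Locale;

// Illustrative sketch only: simplified Content-Type handling.
final class ContentTypeInfo {

    final String mime;     // lower-cased MIME type, or null if absent
    final String charset;  // upper-cased charset parameter, or null if absent

    private ContentTypeInfo(String mime, String charset) {
        this.mime = mime;
        this.charset = charset;
    }

    // Parses values such as: text/xml; charset="utf-8"
    static ContentTypeInfo parse(String contentType) {
        if (contentType == null) {
            return new ContentTypeInfo(null, null);
        }
        String[] parts = contentType.split(";");
        String mime = parts.length > 0 ? parts[0].trim().toLowerCase(Locale.ROOT) : "";
        String charset = null;
        for (int i = 1; i < parts.length && charset == null; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // drop the quotes
                }
                charset = value.isEmpty() ? null : value.toUpperCase(Locale.ROOT);
            }
        }
        return new ContentTypeInfo(mime.isEmpty() ? null : mime, charset);
    }

    // application/xml, its two registered variants, and application/*+xml.
    boolean isAppXml() {
        return mime != null && (mime.equals("application/xml")
                || mime.equals("application/xml-dtd")
                || mime.equals("application/xml-external-parsed-entity")
                || (mime.startsWith("application/") && mime.endsWith("+xml")));
    }

    // text/xml, its registered variant, and text/*+xml.
    boolean isTextXml() {
        return mime != null && (mime.equals("text/xml")
                || mime.equals("text/xml-external-parsed-entity")
                || (mime.startsWith("text/") && mime.endsWith("+xml")));
    }
}
```

None of this information reaches a parser that is handed only an InputStream, which is why the transport rules have to be applied before the bytes are decoded.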

As I was trying to solve this problem, folks from within Sun and from Rome (whom I was bugging with my half-baked questions and solutions) pointed me to a couple of nice articles by Mark Pilgrim: Determining the character encoding of a feed in his blog and XML on the Web Has Failed on XML.com.

Let's first begin with a raw XML stream (i.e. reading a XML document from a file). Following Section 4.3.3 and Appendix F.1 of the XML 1.0 specification the charset encoding of an XML document is determined as follows:

BOM         : Byte Order Mark, utf-8, utf-16be or utf-16le if present, NULL otherwise
XMLGuessEnc : best guess using the byte representation of the first characters in '<?xml ...' 
              (it can be UTF-8, UTF-16BE or UTF-16LE)
XMLEnc      : value of the <?xml encoding="..."?> if present, NULL otherwise

if BOM is NULL and XMLEnc is NULL
    if  XMLGuessEnc is ('UTF-16BE' or 'UTF-16LE')
       ERROR (encoding mismatch)                                    [#1.0]
    else
       encoding is 'UTF-8'                                          [#1.1]

if BOM is NULL and XMLEnc is ('UTF-16' or 'UTF-16BE' or 'UTF-16LE')
    ERROR (XML requires BOM for 'UTF-16*' charsets)                 [#1.2]

if BOM is NULL 
    encoding is XMLEnc                                              [#1.3]

if BOM is not NULL and XMLEnc is NULL
    if BOM is XMLGuessEnc
        encoding is BOM                                             [#1.4]
    else
        ERROR (encoding mismatch)                                   [#1.5]

if BOM is not NULL and XMLEnc is not NULL 
    if BOM is XMLGuessEnc and 
       ((BOM is 'UTF-16*' and XMLEnc is 'UTF-16') or XMLEnc is BOM)
        encoding is BOM                                             [#1.6]
    else
        ERROR (encoding mismatch)                                   [#1.7]

And now the case where the XML document is served over HTTP. Following Section 4.3.3, Appendix F.1 and Appendix F.2 of the XML 1.0 specification, plus RFC 3023, the charset encoding of an XML document served over HTTP is determined as follows:

ContentType : Content-Type HTTP header
CTMime      : MIME type defined in the ContentType
CTEnc       : charset encoding defined in the ContentType, NULL otherwise
BOM         : Byte Order Mark, utf-8, utf-16be or utf-16le if present, NULL otherwise
XMLGuessEnc : best guess using the byte representation of the first characters in '<?xml ...' 
              (it can be UTF-8, UTF-16BE or UTF-16LE)
XMLEnc      : value of the <?xml encoding="..."?> if present, NULL otherwise
APP-XML     : RFC 3023 defined 'application/*xml' types
TEXT-XML    : RFC 3023 defined 'text/*xml' types

if (CTEnc is 'UTF-16BE' or 'UTF-16LE') and BOM is not NULL
  ERROR                                                             [#2.0]

if CTMime is APP-XML and CTEnc is NULL
  use raw XML stream charset encoding detection logic               [#2.1]

if CTMime is TEXT-XML and CTEnc is NULL
  encoding is 'US-ASCII'                                            [#2.2]

if CTMime is (APP-XML or TEXT-XML) and CTEnc is 'UTF-16'
  if BOM is 'UTF-16BE' or 'UTF-16LE'
    encoding is BOM                                                 [#2.3]
  if BOM is NULL and XMLGuessEnc is ('UTF-16BE' or 'UTF-16LE')
    encoding is XMLGuessEnc                                         [#2.4]
  else
    ERROR (encoding mismatch)                                       [#2.5]

if CTMime is (APP-XML or TEXT-XML) 
  encoding is CTEnc                                                 [#2.6]

ERROR (undefined CTMime handling for XML documents)                 [#2.7]

All this logic, both raw and HTTP charset encoding detection of XML documents, will be available in Rome in the upcoming 0.4 release (it is currently available from CVS). The class implementing this logic is com.sun.syndication.io.XmlReader, a subclass of java.io.Reader. An XmlReader can be created using a File, an InputStream, a URL, a URLConnection or an InputStream with a Content-type String. Passing this reader to the Rome input classes (or any XML parser) will take care of the charset encoding detection.

Ideally all this should be built into the (JAXP SAX) XML parser, I'd say in the InputSource class. I'll check with the Java XML folks to see what they think.

If you think there is an error somewhere in this logic, please let us know by dropping an email to the Rome mailing list, dev@rome.dev.java.net.

(2004-09-10 23:32:00.0) Permalink Comments [2]

20040728 Wednesday July 28, 2004

Rome v0.3 is out

Rome v0.3 is fresh from the oven.

New stuff. Changes, most of them in the implementation, a few in the API.

Summary of changes:

  • There is a new subproject, Rome Fetcher (#6)
  • New CopyFrom functionality to enable easy integration with custom Synd* bean implementations, such as persistent beans (#16)
  • Improved and simplified plugin mechanism for custom modules (#1,#2,#3,#4)
  • New samples and tutorials, including how to add a custom module (#20,#21)
  • Bug fixes (#11,#12,#13,#14,#15,#17,#18,#19,#22)
  • API changes to make things simpler to developers (#8, #9, #10)
  • Various code and workspace corrections (#5, #7)
  • We have some Unit testing in place

Check the Changes Log for details.

Enjoy.

(2004-07-28 10:33:45.0) Permalink Comments [2]

20040616 Wednesday June 16, 2004

RSS Specifications and schemas (well, not really)

This morning, in an internal Sun alias, somebody asked

Are there specifications available for the various RSS formats (not including the atom work at the IETF)? ... Dare I even hope for something like a schema?

YEAH RIGHT, I want that too. My reply to that poor soul was something like this

If you dare calling them specifications here they are: RSS 0.90, RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0 and RSS 2.0.

I'm still on the quest for DTDs or XML-schemas for all of them but the folks that wrote these *specs* apparently were too busy for such distractions.

And there is even more, you have even different versions with the same version number, isn't that great? Mark Pilgrim did a very good analysis of the different RSS versions.

I'm sure others will find useful having all these links together.

(2004-06-16 15:45:12.0) Permalink Comments [2]

Rome v0.2 is out

We've just posted Rome v0.2.

We've been fighting to get everything ready for the release.

Summary of the changes:

  • Removed dependencies on other components except JDOM
  • Hid implementation packages/classes from public API
  • API interfaces/classes renaming
  • Some API signature changes

Some of them were done based on feedback we've received. Others to try to bring consistency to the naming. And others to enable certain usage patterns in applications using Rome.

All the changes (15 of them) are documented in the Changes Log.

I hope feedback keeps coming, that will help us make Rome better and easier to use.

Enough of this, now I have to go back to my day job.

(2004-06-16 13:54:44.0) Permalink

To myself, for the Nth time, *never underestimate the effort behind doing a release*

And I'm talking about a small project, Rome v0.2 (It's almost there).

Getting the code changes integrated is the easy part. Getting all the supporting materials ready is a Pharaonic task. General documentation, changes log, tutorials, javadocs, updating the samples, testing the samples, preparing the distribution bundles and staging all this on the site. Ensuring (or at least trying to) everything has been updated.

We don't want to get into the 'we can do that later' approach, as most of the time later becomes asymptotic to never.

With Rome we are still organizing the site. Deciding where things should go etc. We are using Rome Wiki as the main (and only for now) documentation source, we have to decide what documents we want to keep alive (changing as we go) and what documents we want to freeze with every release. Maybe things will get easier after we have all this settled. I'm making a mental note (too bad I keep losing them) to comment again on this for Rome v0.3, to see if things got better.

(2004-06-16 08:20:19.0) Permalink

20040610 Thursday June 10, 2004

Rome is making some noise already

It was a nice way of starting the day today, I've found out there is some blogging about Rome going on.

Some of them are commenting on its appearance. Others are expressing some concerns about it. That's great, we welcome both. The latter ones will help us improve Rome. I hope they keep coming.

P@ posted a blog earlier today -Rome was not built in one day- making some clarifications.

(2004-06-10 11:12:44.0) Permalink

20040608 Tuesday June 08, 2004

Trying to end Syndication feeds Hell, at least for Java developers

After a month of work or so we did it. 'We' are Patrick, Elaine and myself. 'What we did' is Rome.

You may not care who we are, but [if you are a Java developer and you are doing anything that involves RSS or Atom syndication feeds] you'll likely care about Rome.

Rome is a set of Atom/RSS Java utilities that makes it easy to work in Java with most syndication formats. It supports all flavors of RSS (0.90, 0.91, 0.92, 0.93, 0.94, 1.0 and 2.0) and Atom 0.3 feeds. It includes a set of beans, parsers and generators for the various flavors of syndication feeds, as well as converters to convert from one feed type to another. It also includes a generic normalized SyndFeed class that lets you work with any feed type without bothering about the incoming or outgoing feed type.

Rome is one of those projects I'd have preferred not to have spent time working on. It came to be out of frustration, or better said, to stop frustration.

What kind of frustration am I talking about?

First, the nonsense of syndication feeds. Second, the absence of a library that does a good job abstracting all that nonsense from the developer. And third, the quality of available documentation (this last one is in fact a general problem).

So, we decided to do something about it, and that something is Rome. By addressing the second we took care of the first (insert second part of the title here). And for the third -the documentation- we made a special effort from the very beginning.

Rome is just starting, there are plenty of things to improve, to fix and to add.

Check it out and let us know what you think.

(2004-06-08 16:50:04.0) Permalink

