In an earlier post I was talking about the subtleties of SGML. Well, it turns out I was wrong in thinking that our SGML lexer wasn't up to scratch. It was working perfectly well!

The actual problem was that we were being presented with a file that contained only <!ENTITY...> declarations, which isn't valid SGML on its own. The declarations needed to be within the context of a <!DOCTYPE...> subset for the file contents to be valid. The file was being pulled in from the main .book DocBook file via a parameter entity, %textents;, referenced in the doctype subset, so when read in the context of the .book file it was all perfectly legal, and the lexer happily processed it.
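
To make the layout concrete, here's a rough sketch of the two files and a naive check for a "declarations only" file. The file names, entity names and DTD reference are all made up for illustration, and the regex is an assumption that ignores corner cases such as a '>' inside a quoted literal.

```python
import re

# Hypothetical main .book file: %textents; is declared and referenced
# inside the DOCTYPE internal subset, so the declarations it pulls in
# are legal when read in this context.
BOOK_FILE = """<!DOCTYPE book SYSTEM "docbook.dtd" [
<!ENTITY % textents SYSTEM "textents.sgm">
%textents;
]>
<book><title>&booktitle;</title></book>
"""

# Hypothetical referenced file: nothing but entity declarations, which
# is not a valid SGML document when read on its own.
ENTITY_FILE = """<!ENTITY booktitle "An Example Book">
<!ENTITY prodname "Example Product">
"""

# A comment, or a markup declaration up to the first '>' (naive).
DECLARATION = re.compile(r"<!--.*?-->|<![^>]*>", re.DOTALL)


def is_declarations_only(text):
    """Return True if text holds only declarations, comments and whitespace."""
    if not DECLARATION.search(text):
        return False
    remainder = DECLARATION.sub("", text)
    return remainder.strip() == ""
```

Here is_declarations_only(ENTITY_FILE) is true and is_declarations_only(BOOK_FILE) is false, which is the distinction a file-at-a-time tool can't see without knowing where the file was referenced from.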

Of course, that doesn't help our TM system, which isn't smart enough to read a file in the context of the book that references it. I guess Norm and Tony were dead right: writing something to process SGML from scratch is quite an undertaking. Thankfully, we don't encounter cases like that too often, so a workaround might be possible for this book (pubstool id 817-4414, if anyone's interested).

We've got another, non-conformant lexer that we could use in place of the strict SGML one, which can chew on pretty much anything (we needed it when writing the HTML filter; it's shocking how much invalid HTML we have to process). So whenever the strict SGML parser throws errors on encountering invalid files, we give the non-conformant one a go on the same input. Does anyone have any better ideas, given that we're constrained by the fact that we want to continue to process a file at a time, rather than read the whole book, resolving entity references whenever we come across them?
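
Roughly, the try-strict-then-fall-back idea looks like the sketch below. This is not our actual code: StrictParseError, toy_strict_lex and toy_lenient_lex are hypothetical stand-ins, and the "strict" one only rejects the single case discussed here (a file that starts with an entity declaration).

```python
import re


class StrictParseError(Exception):
    """Raised by the (hypothetical) strict lexer on invalid input."""


def toy_lenient_lex(text):
    # Stand-in for the non-conformant lexer: split into tags and text
    # runs without validating anything.
    return re.findall(r"<[^>]*>|[^<]+", text)


def toy_strict_lex(text):
    # Stand-in for the strict lexer: refuse entity declarations that
    # appear outside a DOCTYPE subset, accept everything else.
    if text.lstrip().startswith("<!ENTITY"):
        raise StrictParseError("entity declaration outside a DOCTYPE subset")
    return toy_lenient_lex(text)


def lex_with_fallback(text, strict_lex=toy_strict_lex,
                      lenient_lex=toy_lenient_lex):
    """Try the strict lexer first; on failure, retry with the lenient one.

    Returns (mode, tokens) so callers can tell which lexer succeeded.
    """
    try:
        return "strict", strict_lex(text)
    except StrictParseError:
        return "lenient", lenient_lex(text)
```

Returning the mode alongside the tokens means a file that only got through on the lenient pass can be logged or flagged for a closer look, rather than silently treated as valid.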


Comments:

I guess it is a case of having the SYSTEM entity resolver be context aware. If it is being used in the context of an internal subset, then the file referenced should be parsed using a DTD parser, rather than an SGML one. Then you need a DTD parser, which isn't so bad, as the DTD grammar is quite straightforward.

Posted by Johnny C. on July 22, 2004 at 01:41 PM IST #


This blog copyright 2009 by timf