How to write a TM System (part 5)
This is the final part in my series on how to write a TM System. We've done all of the major components now - but I haven't talked about one of the most important ones.
The output format
Now that you've got a method of taking an input document, splitting it up into sentences, looking up each sentence in a database of translations and returning exact and fuzzy matches, the final question remains - what do you do with all of this data?
Well, you obviously need to present this data to translators in a way that makes it simple for them to review and complete the translations. Remember, our aim here is to make life as easy as possible for the translator. Suggesting translations and automatically re-using old ones is certainly a step in the right direction. There's one more thing that we need to take care of, though. Consider the following:
click on image for a larger version
There's a load of file formats represented there that a translator would need to be able to understand (can anyone name them all?), and they would need tools to process each of them. I don't think that's the best use of a translator's time - I think a translator should be able to concentrate on the text that's being translated, rather than having to be proficient at Frame or Docbook, or to understand the nuances of every different XML format you throw at them. Having them edit the original file format means that they have to ensure that they don't corrupt files, that the encoding is correct, and that the file displays correctly (which may mean compiling software message files and building them into an application) - a host of things that could perhaps be done by people more experienced in the product being translated.
With this (and many other problems facing the translation industry) in mind, a small group of dedicated individuals from Sun, Oracle, Novell and some translation vendors came up with XLIFF, which has since been put under the umbrella of OASIS. XLIFF aims to solve the above problem of dealing with multiple formats by abstracting the translatable text from the formatting of the document. With a little bit of work, it's possible to come up with something like this, which I'm sure you'll agree is a little easier for the translator to deal with:
click on the image for a larger version
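In case the screenshot doesn't come through, here's a rough sketch of the idea in Python. It assumes XLIFF 1.2 element names (`file`, `body`, `trans-unit`, `source`, `target`); the `build_xliff` helper, the file name and the sample sentences are invented for illustration and aren't taken from any real filter. It also ignores inline formatting, which a real filter would have to preserve.

```python
import xml.etree.ElementTree as ET

# Namespace used by XLIFF 1.2 documents (assumed version for this sketch).
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_xliff(original, source_lang, target_lang, segments):
    """Wrap already-segmented text in a minimal XLIFF document.

    `segments` is a list of (source_text, suggested_target_or_None) pairs.
    This helper is hypothetical: it only shows the shape of the output.
    """
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for number, (source, target) in enumerate(segments, start=1):
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit",
                             {"id": str(number)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = source
        # Only emit a target when the TM lookup suggested a translation.
        if target is not None:
            ET.SubElement(unit, f"{{{XLIFF_NS}}}target").text = target
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_xliff("guide.html", "en", "de", [
        ("Click OK.", "Klicken Sie auf OK."),
        ("Save the file.", None),
    ]))
```

Whatever the original format was, the translator only ever sees a list of source segments with optional suggested targets.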
Of course, that's only one side of the story. We also need to be able to convert XLIFF documents back to their original file format, and we'd also like to be able to generate TMX files from the completed translations, so that we can import them into our translation memory database for use by other projects.
That's it - with this work, we can really increase the productivity of our translators and save time and effort when producing localised products!
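To give a feel for the TMX half of that, here's a minimal sketch in Python. It assumes XLIFF 1.2 input and TMX 1.4 output; the `xliff_to_tmx` helper and the header values (`creationtool` and friends) are placeholders made up for the example, and inline formatting inside segments is simply dropped.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}
# ElementTree spells the xml:lang attribute with the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def xliff_to_tmx(xliff_text):
    """Turn completed trans-units into TMX translation units (sketch only)."""
    root = ET.fromstring(xliff_text)
    file_el = root.find("x:file", XLIFF_NS)
    source_lang = file_el.get("source-language")
    target_lang = file_el.get("target-language")

    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",  # placeholder value
        "creationtoolversion": "0.1",      # placeholder value
        "segtype": "sentence",
        "o-tmf": "xliff",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")

    for unit in file_el.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        # Skip segments the translator has not completed yet.
        if source is None or target is None or not (target.text or "").strip():
            continue
        tu = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, source.text),
                           (target_lang, target.text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Only segments with a non-empty target end up in the TMX file, so half-finished work doesn't leak into the translation memory.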
What's more, now that we have a central database of translations, there are all sorts of other interesting things we could do to increase efficiency even more. For example, perhaps we haven't found an exact match for a segment, but based on the large amount of data we've accumulated, wouldn't it be nice if we could search inside that segment for terms that have been translated elsewhere, and perhaps suggest those terms to the translator via the translation editor?
How about machine translation? Since we've got a large corpus, we might even be able to apply example-based machine translation techniques to translate things automatically.
I may well cover these in future posts. Stay tuned!
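As a very rough sketch of that term-suggestion idea (in Python, and not something we've built), assume the known terms have already been extracted into a simple dictionary. The `suggest_terms` function and the sample term base below are hypothetical, and the matching is a plain case-insensitive whole-word search with no stemming or inflection handling.

```python
import re


def suggest_terms(segment, term_base):
    """Return (term, translation) pairs for known terms found in `segment`.

    `term_base` maps source-language terms to translations seen elsewhere.
    Results are ordered by where each term first appears in the segment.
    """
    lowered = segment.lower()
    hits = []
    for term, translation in term_base.items():
        # \b keeps "file" from matching inside "profile".
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        match = re.search(pattern, lowered)
        if match:
            hits.append((match.start(), term, translation))
    return [(term, translation) for _, term, translation in sorted(hits)]


if __name__ == "__main__":
    terms = {"dialog box": "Dialogfeld", "print": "drucken", "file": "Datei"}
    for term, translation in suggest_terms(
            "Open the Print dialog box to print the file.", terms):
        print(term, "->", translation)
```

A translation editor could show these pairs alongside a segment that has no exact or fuzzy match.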
Posted by Maxym Mykhalchuk on August 16, 2004 at 04:36 PM IST #
Hi Maxym,
Yep, I know what you mean - one of the things we really had to go for was maximum scalability with our indexing system and Steve's index gave us just that.
For TM lookup, we use Steve's index to narrow down a few hundred thousand documents to maybe 30 or 40 as quickly as possible. With a 700,000-entry index, we're doing this in about 1.5 seconds on a medium-spec machine, which is pretty good. After that, we can use dynamic programming to further reduce those 30 or 40 matches to just 5...
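To illustrate just that second stage, here's a minimal Python sketch. It is not the code described above: the word-level edit distance, the similarity formula and the `rerank` helper are all assumptions picked for the example, and the candidate list stands in for whatever the index returns.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, one table row at a time."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(query, candidate):
    """Word-level similarity in [0, 1]; 1.0 means the same words in order."""
    q_words = query.lower().split()
    c_words = candidate.lower().split()
    longest = max(len(q_words), len(c_words))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(q_words, c_words) / longest


def rerank(query, candidates, keep=5):
    """Score each candidate from the index and keep the best `keep`."""
    scored = [(similarity(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]
```

The dynamic programming step costs time proportional to the product of the two segment lengths, which is why it only makes sense once the index has already cut the candidates down to a few dozen.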
You might be able to replicate something like this by using a combination of Apache Lucene (or somesuch) and some of the techniques above...
I'm still trying to see if I can find ways in which we can share our tools with you guys, because I'm pretty sure they would be of use. It's slow going, but there have been some recent events that may speed things up. Stay tuned!
Posted by Tim Foster on August 16, 2004 at 05:50 PM IST #