Over the last few days, I've been putting together some material about the XLIFF translation editor we built. Using a GPL'd tool called vnc2swf, I was able to create this demo of our translation editor. I hit a few hiccoughs along the way (as a direct result of me writing everything in Java these days) but I think the results are pretty good. There's no soundtrack I'm afraid, but feel free to play any music you like in the background :-)

The demo shows me loading an XLIFF file into the editor (previously converted from the source HTML), then messing about with exact and fuzzy matches found in our TM, and finally back-converting the file and loading it in a browser. Enjoy! It's probably best viewed at a resolution of at least 1024x768.

So, having seen the demo, your next question is going to be "Hey Tim, that's cool, where can I get the editor?" - well, you're still going to have to hang on for a bit (sorry). We're currently running through a bit more legal stuff, but we're nearly there. Just had a great tutorial/intro from someone over at java.net, so things are definitely moving in the right direction there too. Let me know if you find the demo interesting or have any questions about it: I'd be happy to answer them. There's a prize for "extreme cleverness" if anyone can spot the glitch in the demo, so pay attention!

btw, I'm not sure what the final name of the editor is going to be, so for now I'm calling it the Sun Translation Editor: I don't think that'll get me in trouble with anyone.

UPDATE : These tools are now available, as the Open Language Tools project, on java.net - I wrote an introduction to the tools here on blogs.sun.com which you may be interested in reading.


Comments:

Nice demo, Tim! Hoping to see it on my desktop sometime soon... The demo obviously can't show it all, but I hope there are a few functions in it that I spontaneously felt would be nice, when watching the demo:
- Keyboard shortcuts to set fuzzy, approve and move on etc.
- A toggle for hiding all that non-translatable stuff like tags etc., or at least make them less obtrusive.
- An option for colour coding the match (fuzzy, 100%, other match percentages) and translation status.
Let's hope it doesn't take too long now before we can all test it... Keep up the good work! //Gudmund

Posted by Gudmund on February 05, 2005 at 11:45 AM GMT #

I too liked the demo. I too found all the tags irritating and would like to see a way to hide tags. Anyway, I am really looking forward to testing it myself. BTW, what formats will be supported? You mentioned HTML, DocBook, XML. What about Star Office (or OO.o)? And sorry for the last comment I posted on your blog. I know, it appeared like 4 times. For some reason I believed it wasn't posted and so I hit the post button a couple of times. My apologies.

Posted by Sonja on February 05, 2005 at 12:00 PM GMT #

Hey folks, glad you like. So, the tags thing - I vaguely remember something about an "abbreviated tags" option at one stage, but I've a feeling it was only for the "Print" option (so translators could quickly review what they've done on hardcopy) but I do agree with you, they could be distracting. This particular file might have been an extreme case : most documents we translate with the editor are much more "free-flow" text, and this problem isn't as bad.

I agree with your comments on colour-coding things - right now, we've just got icons along the left of the editing panes : it'd be quite easy to replace these with more obvious icons, a bit more work to get the colour coding done...

Wrt. keyb shortcuts, I think they're there already (just need to find where they are : the navigate menu, I think)

As for formats, the editor itself supports one format: XLIFF (well, two if you count TMX output) - we've got command-line filters for html, sgml, po, msg, properties and java. There's a generic XML filter too, which (with a little work) could be made to extract the "content.xml" from OOo files and find translatable content - I'll have a quick look at this next week, as it'd certainly be useful. Should be just a case of writing a config file for the filter (telling it which elements are translatable), and then a quick Java wrapper to take care of the zipping/unzipping of the .sxw files...
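To make the .sxw idea a bit more concrete, here is a minimal sketch in Python (not the Java wrapper or the real XML filter, neither of which is shown here). It assumes the OpenOffice.org 1.x text namespace and treats only text:p and text:h as translatable; that element list stands in for the filter's config file and is purely an assumption, as are the function names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the OOo 1.x namespace for text elements in content.xml.
TEXT_NS = "http://openoffice.org/2000/text"

# Hypothetical stand-in for the filter config: which elements are translatable.
TRANSLATABLE = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def extract_translatable(sxw_bytes):
    """Unzip an .sxw held in memory and return the text of translatable elements."""
    with zipfile.ZipFile(io.BytesIO(sxw_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    found = []
    for elem in root.iter():
        if elem.tag in TRANSLATABLE:
            # itertext() also picks up text inside inline children such as text:span.
            text = "".join(elem.itertext()).strip()
            if text:
                found.append(text)
    return found


def make_sample_sxw():
    """Build a tiny zip containing only a content.xml, for demonstration."""
    content = (
        '<office:document-content '
        'xmlns:office="http://openoffice.org/2000/office" '
        f'xmlns:text="{TEXT_NS}">'
        '<office:body>'
        '<text:h text:level="1">Release notes</text:h>'
        '<text:p>The editor reads <text:span>XLIFF</text:span> files.</text:p>'
        '<text:p> </text:p>'
        '</office:body></office:document-content>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("content.xml", content)
    return buf.getvalue()


if __name__ == "__main__":
    for line in extract_translatable(make_sample_sxw()):
        print(line)
```

A real filter would need to keep the inline markup (so it can be restored on back-conversion) rather than flattening it to plain text as this sketch does.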

Anyway, thanks for the feedback, it's really appreciated ! More info on release dates as soon as I know more !

Posted by Tim Foster on February 05, 2005 at 02:28 PM GMT #

Thanks for your demo Tim,

Interesting to see the concept and link between 'Projects' and Mini-TMs in the editor. Is this using a file-based TM, stored with the project-specific data?

Agree with the above comments on colour-coding (or more intuitive icons).

As for the glitch, I might be wrong on this one: is it the text box in the keyboard-shortcuts editor not being enabled when you open the dialog and something is already selected?

Posted by Asgeir Frimannsson on February 05, 2005 at 09:27 PM GMT #

Hey Asgeir, no problem. Yep, it's a file-based TM, dead simple file format (not even TMX, simpler than that) which gets stored with the project data (currently just a name and language pair). The glitch you mention wasn't the one I was thinking of, but it does look suspicious.

I was worried about the segmentation of the text "[blah] operating system performance gains ..." - the closing doublequote wasn't pulled inside the segment, and we got the next segment containing just the '"' character. Back to the drawing board :-)

Posted by Tim Foster on February 07, 2005 at 02:00 PM GMT #

Okay, just checked that - the problem was that the quoted text had a space after the closing ellipsis but before the closing quote, causing our segmenter a bit of confusion.

Here's the test case showing me running the segmenter across a sample file (I've got it to only show segments, rather than the full output) - you can see that without the space after the ellipsis, we manage things just fine:

timf@cuprum[523] cat tim.txt
"This is a sentence ..." This is another sentence correct segmentation. "This is a 3rd sentence ... " This will break.
timf@cuprum[524] sh viewstrings tim.txt | grep "segment :"
segment : "This is a sentence ..."
segment : This is another sentence correct segmentation.
segment : "This is a 3rd sentence ...
segment : " This will break.
timf@cuprum[525]
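
For illustration only, here is a minimal Python sketch of one way to handle that case. It is not the segmenter used in the editor; the function name and the rule itself are assumptions. The idea is to look past any spaces after the sentence terminator, and if the next character is a double quote that closes a quote opened inside the current segment, pull it into the segment.

```python
import re

# A run of sentence-ending punctuation; "..." counts as one terminator.
TERMINATOR = re.compile(r"[.!?]+")


def segment(text):
    """Split text into sentences, keeping a closing double quote with its sentence."""
    segments = []
    start = 0  # where the current segment begins
    pos = 0    # where to resume searching for terminators
    while True:
        match = TERMINATOR.search(text, pos)
        if match is None:
            break
        end = match.end()
        # Look past optional spaces for a quote character.
        look = end
        while look < len(text) and text[look] == " ":
            look += 1
        # An odd number of quotes so far means one is still open,
        # so the quote we found must be the closing one.
        if (look < len(text) and text[look] == '"'
                and text[start:look].count('"') % 2 == 1):
            end = look + 1
        # Only break when the terminator is followed by whitespace or end of text.
        if end == len(text) or text[end].isspace():
            segments.append(text[start:end].strip())
            start = end
        pos = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


SAMPLE = ('"This is a sentence ..." This is another sentence correct segmentation. '
          '"This is a 3rd sentence ... " This will break.')

if __name__ == "__main__":
    for s in segment(SAMPLE):
        print("segment :", s)
```

This sketch knows nothing about abbreviations ("e.g. " would end a segment), single quotes or nested quoting, so it only shows the quote-parity idea and nothing more.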

Posted by Tim Foster on February 07, 2005 at 02:12 PM GMT #

The demo looks nice (although I'd have used GTK+ look&feel ;-). I really can't wait to test-drive the editor myself. Perhaps when the editor is released, the translation memory can be complemented by OmegaT (http://www.omegat.org/)?

Posted by Reinout van Schouwen on February 08, 2005 at 02:09 PM GMT #

Thanks for the comments Reinout! Yep, hopefully at some stage we'll get the TM system open sourced too. There are barriers to that at the moment: currently it uses Oracle-specific APIs that we'd need to remove, and the SunLabs indexing system we're using would need to either be open sourced, or released as binary-only (beyond my control, unfortunately). Still, we're hoping that for most uses, the mini-TM we have in the editor might be enough (at least to get started on!)

Posted by Tim Foster on February 08, 2005 at 02:24 PM GMT #

Interesting reading your comments on segmentation. On a simple translation system we built (for xhtml) we used a heuristic where the translatable segments were just contiguous dom nodes, starting and ending with the immediate descendants containing 'letter' chars (this has the effect of breaking at block level elements but including inline elements in html, without specifically naming them). Works pretty well.
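
One possible reading of that heuristic, sketched in Python with xml.dom.minidom. This is an interpretation, not the code of the system described above, and every name in it is hypothetical: an element that has a direct text child containing letters becomes one segment, running from its first to its last child with letter content; any other element is just searched recursively.

```python
from xml.dom import minidom


def text_of(node):
    """Concatenated text content of a node (empty for comments and the like)."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def contains_letters(node):
    """True if the node's text content has at least one letter."""
    return any(ch.isalpha() for ch in text_of(node))


def find_segments(element):
    """Return segment texts found under element, in document order."""
    children = list(element.childNodes)
    has_direct_text = any(
        c.nodeType == c.TEXT_NODE and contains_letters(c) for c in children
    )
    if has_direct_text:
        # Segment spans the contiguous children from the first to the
        # last one that contains letters, inline elements included.
        lettered = [i for i, c in enumerate(children) if contains_letters(c)]
        run = children[lettered[0]:lettered[-1] + 1]
        text = "".join(text_of(c) for c in run)
        return [" ".join(text.split())]  # normalise whitespace
    found = []
    for child in children:
        if child.nodeType == child.ELEMENT_NODE:
            found.extend(find_segments(child))
    return found


SAMPLE = (
    "<body>"
    "<h1>Install guide</h1>"
    "<p><b>Note:</b> back up your <em>data</em> first.</p>"
    "<div><p>Then reboot.</p></div>"
    "</body>"
)

if __name__ == "__main__":
    doc = minidom.parseString(SAMPLE)
    for seg in find_segments(doc.documentElement):
        print(seg)
```

Note that no element names appear in the code: the body and div are skipped simply because they have no letter-bearing text of their own, while b and em stay inside the paragraph's segment. The sketch flattens inline tags to plain text, which a real system would presumably keep.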

I'd analyzed the text to see if sentence-level reuse actually happened and it seemed very rare, and paragraph breaking html is much easier (as it's already been done!). How much of a difference does sentence- versus paragraph-segmentation make to the level of reuse you get in your projects?

Posted by Brian Ewins on February 08, 2005 at 07:15 PM GMT #

Hey Brian,
Thanks for the thoughts - I had a look at your approach on xml.com a while back. It definitely makes sense for some cases. I admit to never actually running the numbers between sentence vs. paragraph segmentation for our material though - so "it depends" is probably the only answer I can give as to how much leverage we're getting.

First off, I think it depends on the sort of changes you expect to see made to a document between revisions. If it's mostly new paragraphs or deleted paragraphs, then yes, paragraph-level segmentation is the right way to go. Likewise, if you're producing "one-off" documents, then there's unlikely to be much overlap with previously translated material: one could imagine that translating one page of a random book against a TM created from a previous page probably wouldn't yield much leverage.

However, the way our source material is written at the moment, writers tend to add/change bits of paragraphs: perhaps adding a word here, a bit of punctuation there - so in our case, it makes a lot more sense to segment at the sentence level wherever possible. Add to that the fact that a lot of the large manuals written at Sun evolve over time, and we figured that we would be able to get the best reuse of material by going with sentences.
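
A toy illustration of that point in Python (made-up text, not numbers from our projects): one word changes in a three-sentence paragraph, and we count exact matches against a TM built from the previous revision. The sentence splitter here is deliberately naive and all the names are hypothetical.

```python
def split_sentences(paragraph):
    """Naive splitter: break after '. ' and restore the full stop."""
    parts = [p.strip() for p in paragraph.split(". ")]
    return [p if p.endswith(".") else p + "." for p in parts if p]


def exact_reuse(new_segments, tm):
    """Fraction of new segments that have an exact match in the TM."""
    if not new_segments:
        return 0.0
    hits = sum(1 for seg in new_segments if seg in tm)
    return hits / len(new_segments)


OLD = "Insert the CD. Run the installer. Reboot the system."
NEW = "Insert the DVD. Run the installer. Reboot the system."

# Paragraph-level TM: the whole previous paragraph is the only entry.
paragraph_reuse = exact_reuse([NEW], {OLD})

# Sentence-level TM: one entry per sentence of the previous revision.
sentence_reuse = exact_reuse(split_sentences(NEW), set(split_sentences(OLD)))

if __name__ == "__main__":
    print("paragraph-level exact reuse:", paragraph_reuse)
    print("sentence-level exact reuse:", round(sentence_reuse, 2))
```

This only counts exact matches. With fuzzy matching the changed paragraph would still score as a high fuzzy match, so the real-world gap depends on how fuzzy matches are handled and priced.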

The final reason we chose sentence-level segmentation is that it's fairly close to the level of segmentation you get in software message files: if we have a TM full of sentences, we're more likely to be able to leverage software message translations off that database, helping to ensure translation consistency between docs and software (which is a really good thing!)

Yep, paragraph-level segmentation would probably have been much easier to implement (sigh...) but going with sentences seemed to make sense to us - it's only a shame that I was the poor unfortunate stuck with doing the implementation! ;-)

Posted by Tim Foster on February 10, 2005 at 06:05 PM GMT #

Interesting to read about the segmentation rules. I seem to share your experiences, and much prefer the path you've taken. Since mileage might vary, I guess the ideal situation would be a possibility to choose segmentation rule before file import.

If there were/is such a thing as sub-segment handling, with the editor/CAT assembling and substituting segment parts from smaller snippets, I guess the level of TM leverage between the two segmenting methods might shrink a bit.

Posted by Gudmund on February 20, 2005 at 10:46 AM GMT #


This blog copyright 2009 by timf