Live XLIFF Editor demo !
Over the last few days, I've been putting together some stuff about the XLIFF translation editor we built. Using a GPLd tool called vnc2swf, I was able to create this demo of our translation editor. A few hiccoughs were encountered along the way (as a direct result of me writing everything in Java these days) but I think the results are pretty good. There's no soundtrack I'm afraid, but feel free to play any music you like in the background :-)
The demo shows me loading an XLIFF file into the editor (previously converted from the source html) and then some messing about with exact and fuzzy matches found from our TM, finally backconverting the file and loading it in a browser - enjoy ! It's probably best if you have a resolution of at least 1024x768 to view this.
So, having seen the demo, your next question is going to be "Hey Tim, that's cool, where can I get the editor ?" - well, you're still going to have to hang on for a bit (sorry). We're currently running through a bit more legal stuff, but we're nearly there. Just had a great tutorial/intro from someone over at java.net, so things are definitely moving in the right direction there too. Let me know if you find the demo interesting or have any questions about it; I'd be happy to answer them. There's a prize for "extreme cleverness" if anyone can spot the glitch in the demo, so pay attention !
btw. I'm not sure what the final name of the editor is going to be, so for now I'm calling it the Sun Translation Editor : I don't think that'll get me in trouble with anyone.
UPDATE : These tools are now available, as the Open Language Tools project, on java.net - I wrote an introduction to the tools here on blogs.sun.com which you may be interested in reading.
Posted by Gudmund on February 05, 2005 at 11:45 AM GMT #
Posted by Sonja on February 05, 2005 at 12:00 PM GMT #
I agree with your comments on colour-coding things - right now, we've just got icons along the left of the editing panes : it'd be quite easy to replace these with more obvious icons, a bit more work to get the colour coding done...
Wrt. keyboard shortcuts, I think they're there already (just need to find where they are : the navigate menu, I think)
As for formats, the editor itself supports one format : XLIFF (well, two if you count TMX output) - we've got command-line filters for html, sgml, po, msg, properties and java. There's a generic XML filter too, which (with a little work) could be made to extract the "content.xml" from OOo files and find translatable content - I'll have a quick look at this next week, as it'd certainly be useful. Should be just a case of writing a config file for the filter (telling it which elements are translatable), and then a quick java wrapper to take care of the zipping/unzipping of the .sxw files...
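For the unzipping half of that wrapper, something along these lines would probably do. This is only a rough sketch, not code from the editor: the class and method names are made up for illustration, it assumes the .sxw file is an ordinary zip archive with a "content.xml" entry encoded as UTF-8, and it doesn't touch the XML filter or its config file at all.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: pulls content.xml out of an .sxw (zip) stream.
final class SxwContentExtractor {

    /** Returns the text of the content.xml entry, or null if the archive has none. */
    static String extractContentXml(InputStream sxw) {
        try (ZipInputStream zip = new ZipInputStream(sxw)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int read;
                    // Reading from the ZipInputStream yields the current entry's bytes.
                    while ((read = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                    // Assumption: content.xml is UTF-8 encoded.
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a one-entry archive in memory, handy for trying the extractor without a real file. */
    static byte[] zipOf(String entryName, String body) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(body.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

The extracted string could then be handed to the XML filter, and the translated result zipped back into a copy of the archive the same way in reverse.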
Anyway, thanks for the feedback, it's really appreciated ! More info on release dates as soon as I know more !
Posted by Tim Foster on February 05, 2005 at 02:28 PM GMT #
Thanks for your demo Tim,
Interesting to see the concept and link between 'Projects' and Mini-TMs in the editor. Is this using a file-based TM, stored with the project-specific data?
Agree with the above comments on colour-coding (or more intuitive icons).
As for the glitch, I might be wrong on this one: Is it with the text-box in the keyboard-shortcuts editor, not being enabled when you open the dialog and something is already selected?
Posted by Asgeir Frimannsson on February 05, 2005 at 09:27 PM GMT #
I was worried about the segmentation of the text "[blah] operating system performance gains ..." - the closing doublequote wasn't pulled inside the segment, and we got the next segment containing just the '"' character. Back to the drawing board :-)
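To show the sort of rule I mean, here's a toy segmenter that pulls a closing doublequote into the segment it ends, even when there's a space between the ellipsis and the quote. It's a sketch only, not the segmenter used in the editor: the class name is hypothetical, it only knows about straight doublequotes, and it treats every run of '.', '!' or '?' followed by whitespace as a boundary (so abbreviations and mid-sentence ellipses would trip it up).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy segmenter, for illustration only.
final class QuoteAwareSegmenter {

    private static boolean isTerminator(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    private static void addIfNotBlank(List<String> segments, String candidate) {
        String trimmed = candidate.trim();
        if (!trimmed.isEmpty()) {
            segments.add(trimmed);
        }
    }

    static List<String> segment(String text) {
        List<String> segments = new ArrayList<>();
        int n = text.length();
        int start = 0;
        boolean insideQuote = false; // true while an opening '"' is still unmatched
        int i = 0;
        while (i < n) {
            char c = text.charAt(i);
            if (c == '"') {
                insideQuote = !insideQuote;
                i++;
            } else if (isTerminator(c)) {
                // Swallow the whole run of terminators, e.g. "..." or "?!".
                int end = i + 1;
                while (end < n && isTerminator(text.charAt(end))) {
                    end++;
                }
                // Look past any spaces for a doublequote that closes an open quotation.
                int probe = end;
                while (probe < n && text.charAt(probe) == ' ') {
                    probe++;
                }
                if (insideQuote && probe < n && text.charAt(probe) == '"') {
                    end = probe + 1; // pull the closing quote inside this segment
                    insideQuote = false;
                }
                // Only break if we're at the end of the text or whitespace follows.
                if (end == n || Character.isWhitespace(text.charAt(end))) {
                    addIfNotBlank(segments, text.substring(start, end));
                    start = end;
                }
                i = end;
            } else {
                i++;
            }
        }
        addIfNotBlank(segments, text.substring(start));
        return segments;
    }
}
```

Run over a made-up sample such as `He said "performance gains ... " Then he left.`, this gives two segments, with the closing quote attached to the first one rather than stranded on its own.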
Posted by Tim Foster on February 07, 2005 at 02:00 PM GMT #
Here's the test case showing me running the segmenter across a sample file (I've got it to only show segments, rather than the full output) - you can see that without the space after the ellipsis, we manage things just fine :
Posted by Tim Foster on February 07, 2005 at 02:12 PM GMT #
Posted by Reinout van Schouwen on February 08, 2005 at 02:09 PM GMT #
Posted by Tim Foster on February 08, 2005 at 02:24 PM GMT #
I'd analyzed the text to see if sentence-level reuse actually happened and it seemed very rare, and paragraph breaking html is much easier (as it's already been done!). How much of a difference does sentence- versus paragraph- segmentation make to the level of reuse you get in your projects?
Posted by Brian Ewins on February 08, 2005 at 07:15 PM GMT #
Thanks for the thoughts - I had a look at your approach on xml.com a while back. It definitely makes sense for some cases. I admit to never actually running the numbers between sentence vs. paragraph segmentation for our material though - so "it depends" is probably the only answer I can give as to how much leverage we're getting.
First off, I think it depends on the sort of changes you expect to see made to a document between revisions. If it's mostly new paragraphs or deleted paragraphs, then yes, paragraph-level segmentation is the right way to go. Likewise, if you're producing "one-off" documents, then there's unlikely to be much leverage from previously translated material : one could imagine that leveraging the text of one page of a random book against a TM created from a previous page probably wouldn't yield much.
However, the way our source material is written at the moment, writers tend to add/change bits of paragraphs : perhaps adding a word here, a bit of punctuation there - so in our case, it makes a lot more sense to segment at the sentence level wherever possible. Add to that the fact that a lot of the large manuals written at Sun evolve over time, and we figured that we would be able to get the best reuse of material by going with sentences.
The final reason we chose sentence-level segmentation, is that it's fairly close to the level of segmentation you get in software message files : if we have a TM full of sentences, we're more likely to be able to leverage software message translations off that database, helping to ensure translation consistency between docs and software (which is a really good thing!)
Yep, paragraph level segmentation would probably have been much easier to implement (sigh...) but going with sentences seemed to make sense to us - it's only a shame that I was the poor unfortunate stuck with doing the implementation ! ;-)
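To put the "small edits inside a paragraph" argument in concrete terms, here's a toy comparison of exact-match reuse at the two granularities. It's a sketch with made-up text and hypothetical class and method names, not a measurement of our material, and the sentence splitter is deliberately naive (it just breaks after '.', '!' or '?' followed by whitespace).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of exact-match leverage, not a real TM lookup.
final class LeverageSketch {

    /** Fraction of segments in the revised text that have an exact match in the TM. */
    static double exactMatchRate(List<String> tm, List<String> revised) {
        if (revised.isEmpty()) {
            return 0.0;
        }
        Set<String> memory = new HashSet<>(tm);
        long hits = revised.stream().filter(memory::contains).count();
        return (double) hits / revised.size();
    }

    /** Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace. */
    static List<String> sentences(String paragraph) {
        return Arrays.asList(paragraph.trim().split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        String previous = "Install the package. Reboot the system. Log in as root.";
        String revised = "Install the package. Reboot the system now. Log in as root.";

        // Paragraph level: the one-word change means the whole paragraph misses.
        double paragraphRate = exactMatchRate(Arrays.asList(previous), Arrays.asList(revised));
        // Sentence level: the two untouched sentences still match.
        double sentenceRate = exactMatchRate(sentences(previous), sentences(revised));

        System.out.println("paragraph-level exact matches: " + paragraphRate);
        System.out.println("sentence-level exact matches:  " + sentenceRate);
    }
}
```

With this sample, one added word drops the paragraph-level exact-match rate to zero, while two of the three sentences are still reusable. Fuzzy matching would soften the paragraph-level result in practice, so treat the numbers as illustrating the shape of the argument only.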
Posted by Tim Foster on February 10, 2005 at 06:05 PM GMT #
If there were/is such a thing as sub-segment handling, with the editor/CAT assembling and substituting segment parts from smaller snippets, I guess the level of TM leverage between the two segmenting methods might shrink a bit.
Posted by Gudmund on February 20, 2005 at 10:46 AM GMT #