Simos was thinking more about what may or may not be a fairly zany idea that I've been playing around with in my ample free time of late. He suggested that it might be interesting to see what sort of word distribution we have in the GNOME UI at the moment (so as to determine which words would be most beneficial in the bilingual dictionary that said zany idea uses).

I managed to dig up some sources for GNOME 2.10, and since I didn't want to build all the POT files, I just took the pa.po translations (which were listed as 100% translated), and concentrated on the msgid strings.

I wrote a quick bit of Java (~150 lines), which, using the Open Language Tools PO parser, pulls the msgids, uses a Java BreakIterator to split them into words, blasts them to lower case, and writes out a frequency distribution. The program stdout is here along with an OpenOffice doc containing the list of the words, and the frequency with which they appear in the UI.
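For the curious, here is a minimal sketch of the counting step, not the actual program. It assumes the msgid strings have already been pulled out of the PO file (the Open Language Tools parser calls are left out, since I don't want to misquote that API), and the class and method names (WordFreq, words, count) are made up for illustration. Only java.text.BreakIterator and the standard collections are used.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: the input list stands in for msgids read from a PO file.
class WordFreq {

    /** Splits text on BreakIterator word boundaries and lower-cases each word. */
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // The iterator also yields whitespace and punctuation runs;
            // keep only tokens that begin with a letter or digit.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                out.add(token.toLowerCase(Locale.ENGLISH));
            }
        }
        return out;
    }

    /** Builds a word -> occurrence count map over all msgids. */
    static Map<String, Integer> count(List<String> msgids) {
        Map<String, Integer> freq = new HashMap<>();
        for (String msgid : msgids) {
            for (String w : words(msgid)) {
                freq.merge(w, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // Assumed sample input; real input would come from the PO parser.
        List<String> msgids = List.of("Open the file", "Close the window", "Save the File");
        // Print counts, most frequent first.
        count(msgids).entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

Note that a token starting with an underscore or other symbol is dropped by the letter-or-digit check, so real msgids with keyboard mnemonics would probably need some cleaning first.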

Now, if we got the top X words translated and put into a dict-formatted dictionary, then perhaps my idea of trying to bridge the digital divide by providing just enough translation isn't so zany after all?

As always, thoughts and comments welcome.

By the way, it's nice to know the most common English word in the GNOME 2.10 UI is "the" - who'd have thunk ;-)

Update: of course, I should have said "... top X nouns translated ..." above: translating other parts of speech probably wouldn't help in this case.


Comments:

This continues to be a rather fun exercise. This bit in fact touches on another topic, one that I have no idea how to tackle well (not that I haven't thought about it or tried :) ): creating a glossary of words that you should focus on in your terminology development phase.

This might be a good idea though. Extracting nouns (how do you know what is a noun?). But we need a way to find the most useful words (not necessarily the most frequent) that relate to the domain covered by this software... and are not already defined in a terminology list.

Posted by Dwayne Bailey on August 25, 2005 at 07:22 PM IST #

Hey Dwayne, glad you're finding it entertaining :-)

In terms of creating a glossary, there are some fairly heavyweight NLP methods you can use to identify terms in software (called "term mining") which MT folks tend to know about. Term mining software usually relies on a decent text parsing engine, which is able to identify parts of speech...

The Wikipedia page on NLP might be a good place to start.

Of course, I'm being pretty lazy here: using pretty lightweight methods (with lots of sellotape and chewing gum) to explore concepts before trying to dive into better implementations. Will blog more on this subject though, as I'm exploring a few other ideas that I think could be cool...

Posted by Tim Foster on August 26, 2005 at 04:09 PM IST #

Well, I think that there are pretty simple algorithms to mine terms - like mutual information (MI) and n-grams. I found that n-grams combined with frequency do produce good term lists for English (and the terms sometimes also cover things like "Press Enter to continue" - this is however highly desirable if you feed this [translated] term list into a glossary app and use it for pretranslation -- that's the idea of "Quick TM" in Heartsome and terminology matching in WordFast or Trados). You have to filter this list, however, by using stopwords, because otherwise you'd get a lot of junk (like false positives for "the" in languages that don't have any definite articles). Sorry for all those brackets and things, Best, Marcin

Posted by Marcin on August 27, 2005 at 07:23 PM IST #

Hey Marcin, thanks (as ever) for the tips! It should be pretty easy to modify my word-freq lister to create lists of n-grams as well: I'll have a go at doing that (after I write up the results of my next experiment, as soon as I have it). No problem about the brackets (and stuff) - we all do it ;-)
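
A rough sketch of what that n-gram variant might look like, under a few assumptions: the tokenizer is a crude regex split rather than the BreakIterator approach, the stopword rule (drop any n-gram that starts or ends with a stopword) is just one plausible reading of the filtering Marcin describes, and the names (NGramFreq, tokens, count) are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: counts word n-grams in msgid strings, with a stopword filter.
class NGramFreq {

    /** Lower-cases text and splits it on runs of non-letter, non-digit characters. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ENGLISH).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Counts n-grams per msgid, skipping those that start or end with a stopword. */
    static Map<String, Integer> count(List<String> msgids, int n, Set<String> stopwords) {
        Map<String, Integer> freq = new HashMap<>();
        for (String msgid : msgids) {
            List<String> t = tokens(msgid);
            // Slide a window of n tokens; n-grams never span two msgids.
            for (int i = 0; i + n <= t.size(); i++) {
                List<String> gram = t.subList(i, i + n);
                if (stopwords.contains(gram.get(0)) || stopwords.contains(gram.get(n - 1))) {
                    continue;
                }
                freq.merge(String.join(" ", gram), 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // Assumed sample input and a tiny stopword list, for illustration only.
        List<String> msgids = List.of("Press Enter to continue", "Press Enter to exit the program");
        Set<String> stopwords = Set.of("the", "to");
        count(msgids, 2, stopwords)
            .forEach((gram, c) -> System.out.println(c + "\t" + gram));
    }
}
```

A real stopword list would be much longer, and a longer phrase like "Press Enter to continue" would only survive as a 4-gram here, since its first and last words are not stopwords.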

Posted by Tim Foster on August 29, 2005 at 02:04 PM IST #


This blog copyright 2009 by timf