ಫೊಸ್.ಇನ್ - FOSS.IN - "Talk is cheap, Show me the code"

Creating a language support architecture on the desktop for new languages was the hottest topic of the day. India has 22 officially recognized languages, yet computing support is available for only 11 of them so far. The audience argued that English literacy is still a requirement for localization: you need it to contribute or even to get in touch with the community. The i18n and l10n communities are among the oldest around, going back to the move from books to digital information, and they are still growing rapidly. Material in fields such as poetry, history, commerce and literature has started to reach the digitized world and to be seen from different perspectives.
Take the file system as an example: supporting its metaphors is a bit tricky, because a name such as "Default Folder" does not make sense in other languages.
So I18N (Internationalization) is not a feature but an architecture, and it has to be designed in at the initial stages of the product. I18n is mainly divided into input, output and processing components, so enabling a new language first requires knowing these components at the file level, plus clear documentation on what to change and when.
- Input - keyboard mappings, CLDR data, transliteration mappings, etc.
- Output - rendering engines and fonts
- Processing - content types and the handling of metaphors
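One way to picture "architecture, not a feature" is to keep user-visible strings out of the code from day one. The sketch below is only an illustration under assumptions: the message IDs, the `CATALOGS` table and the `lookup_message` helper are hypothetical names, and the Hindi string is a placeholder that has not been reviewed by a translator.

```python
# Hypothetical message catalogs keyed by stable message IDs rather than by
# English text, so each language is free to pick its own metaphor.
CATALOGS = {
    "en": {"folder.default": "Default Folder", "folder.trash": "Trash"},
    # Illustrative placeholder translation, not reviewed by a translator.
    "hi": {"folder.default": "मुख्य फ़ोल्डर"},
}

FALLBACK_LOCALE = "en"


def lookup_message(message_id: str, locale: str) -> str:
    """Return the string for message_id in locale, falling back to English."""
    for candidate in (locale, FALLBACK_LOCALE):
        text = CATALOGS.get(candidate, {}).get(message_id)
        if text is not None:
            return text
    # Last resort: show the ID itself so the missing string is easy to spot.
    return message_id


if __name__ == "__main__":
    print(lookup_message("folder.default", "hi"))
    print(lookup_message("folder.trash", "hi"))  # falls back to English
```

Because the code only ever refers to `folder.default`, adding a new language means adding a catalog, not touching the program.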
It also helps to know the relationship between languages and scripts. It is many-to-many in both directions: one language can be written in several scripts, and one script can serve several languages.
Example: The Devanagari script is used
for several languages, including Bhojpuri, Bihari, Hindi, Kashmiri,
Marathi, Nepali and Sanskrit.
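The many-to-many relationship can be sketched as a small lookup table that is inverted on demand. This is a minimal sketch, not a complete registry: the table covers only a few of the languages named above, and listing Perso-Arabic as a second script for Kashmiri is an addition made here to show the "many scripts per language" direction. The `script_of` helper is a rough heuristic that reads the first word of the Unicode character name.

```python
from collections import defaultdict
import unicodedata

# Language -> scripts. Only a handful of entries, for illustration.
LANGUAGE_SCRIPTS = {
    "Hindi": ["Devanagari"],
    "Marathi": ["Devanagari"],
    "Nepali": ["Devanagari"],
    "Sanskrit": ["Devanagari"],
    "Kashmiri": ["Devanagari", "Perso-Arabic"],
}


def invert(language_scripts):
    """Build the script -> languages view from the language -> scripts table."""
    script_languages = defaultdict(list)
    for language, scripts in language_scripts.items():
        for script in scripts:
            script_languages[script].append(language)
    return dict(script_languages)


SCRIPT_LANGUAGES = invert(LANGUAGE_SCRIPTS)


def script_of(char):
    """Rough guess at a character's script from its Unicode name."""
    # unicodedata.name raises ValueError for characters without a name.
    return unicodedata.name(char).split()[0].title()


if __name__ == "__main__":
    print(SCRIPT_LANGUAGES["Devanagari"])
    print(script_of("क"))
```

Seeing a Devanagari character therefore tells you the script but not the language, which is why language has to be tracked separately from script.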
Machine Translation of Indic Languages using apertium - SlideShare by Pranava Swaroop S
The presentation was good and showed that most machine translation systems (MTS) are closed and cannot be shared, for example Babel Fish and Yahoo. Swaroop gave a brief introduction to Apertium, an open-source shallow-transfer machine translation engine and toolbox. MTS work mainly on morphological analysis from natural language processing (NLP), using techniques such as corpus-linguistics tagging and finite-state machines. Unfortunately, different MTS tools use different, non-standard kinds of tagging for machine translation.
Example: the Shakti engine uses a different kind of tagging than the Babel Fish engine.
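To make "finite-state morphological analysis" a little more concrete, here is a toy analyser. It walks a word through a character trie (each trie node is a state) and, whenever it reaches a stem, checks whether the rest of the word is an allowed suffix. The lexicon, the suffix table and the function names are all made up for this sketch. The output notation imitates Apertium's stream format, but none of this is Apertium code.

```python
FINAL = "#final"  # multi-character key, so it cannot clash with a letter

# Toy lexicon: surface stem -> lemma plus part-of-speech tag.
STEMS = {"cat": "cat<n>", "dog": "dog<n>"}
# Suffixes allowed after a noun stem, with the tag each one emits.
NOUN_SUFFIXES = {"": "<sg>", "s": "<pl>"}


def build_trie(stems):
    """Turn the lexicon into a trie; every node acts as one state."""
    root = {}
    for stem, analysis in stems.items():
        node = root
        for ch in stem:
            node = node.setdefault(ch, {})
        node[FINAL] = analysis
    return root


TRIE = build_trie(STEMS)


def analyse(word):
    """Return every lemma+tags reading the toy lexicon allows for word."""
    readings = []
    node = TRIE
    for i in range(len(word) + 1):
        # At a stem-final state, try to consume the remainder as a suffix.
        if FINAL in node and word[i:] in NOUN_SUFFIXES:
            readings.append(node[FINAL] + NOUN_SUFFIXES[word[i:]])
        if i == len(word) or word[i] not in node:
            break
        node = node[word[i]]
    return readings


def to_stream(word):
    """Format the readings as ^surface/reading$; unknown words get a *."""
    readings = analyse(word) or ["*" + word]
    return "^" + word + "/" + "/".join(readings) + "$"


if __name__ == "__main__":
    for w in ("cats", "dog", "xyz"):
        print(to_stream(w))
```

The tags such as `<n>` and `<pl>` are exactly the part that differs from engine to engine, which is the interoperability problem raised in the talk.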
The limitations of present MTS tools are:
- Most MTS are stagnant and inactive, which leaves them poor at capturing a sentence in its context.
- They are undocumented, undeployable, unsupported, closed frameworks with non-standardized tagging; only limited resources are digitized, and those are not interoperable.
- No proper version control for linguistic corpora
- Closed systems and closed collaborative effort
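The non-standardized tagging problem is easy to picture: two engines can agree on an analysis and still disagree on what to call it. The sketch below maps two tag sets onto one shared set. Both tag sets and the shared one are invented for illustration and are not the real tag sets of any engine mentioned in this post.

```python
# Invented tag sets for two hypothetical engines, mapped to a shared set.
ENGINE_A_TO_COMMON = {"NN": "n", "NNS": "n.pl", "VB": "v"}
ENGINE_B_TO_COMMON = {"noun": "n", "noun-plural": "n.pl", "verb": "v"}


def normalise(tagged_words, mapping):
    """Rewrite (word, tag) pairs into the shared tag set.

    Tags that have no mapping are kept but flagged with an "unk:" prefix,
    so gaps in the mapping stay visible instead of being dropped.
    """
    result = []
    for word, tag in tagged_words:
        result.append((word, mapping.get(tag, "unk:" + tag)))
    return result


if __name__ == "__main__":
    from_a = normalise([("cats", "NNS"), ("sleep", "VB")], ENGINE_A_TO_COMMON)
    from_b = normalise([("cats", "noun-plural"), ("sleep", "verb")], ENGINE_B_TO_COMMON)
    print(from_a == from_b)
```

Every pair of engines needs such a table when there is no shared standard, which is why an openly agreed tag set saves so much work.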
Apertium is one of the tools that overcomes these limitations, and it is growing at a rapid pace. It is an open source, modular, shallow-transfer machine translation engine with well-tested format management, finite-state lexical processing, statistical lexical disambiguation and finite-state pattern matching. So what we all ultimately need is:
- Open Collaborative effort
- Readily available and standardized
- Well documented
- Portable
- and shared active corpus database
Coming back to the Indian languages: 22 languages are listed in the constitution, and all of them are still in the process of being digitized. There is a need for rapid conversion and immediate generation of digital data, as on the BBC site, and a need for collaboration on writing linguistic rules. Most of the Indic languages are close neighbors, and their grammatical constructs are almost the same. Because shallow transfer works best for languages that are not far apart, Apertium suits machine translation between these close neighbors.
Example: English to Hindi Dictionary
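When two languages share almost the same grammar, even a plain bilingual dictionary lookup goes a long way. The sketch below does only that word-for-word lexical step for Hindi to Urdu, with no reordering or agreement rules, so it is far simpler than a real shallow-transfer system. The three dictionary entries and the `transfer_words` name are illustrative. Unknown words are marked with a leading `*`, a convention also seen in Apertium output.

```python
# Tiny Hindi -> Urdu bilingual dictionary; the entries are illustrative.
BILINGUAL = {
    "यह": "یہ",       # this
    "किताब": "کتاب",   # book
    "है": "ہے",        # is
}


def transfer_words(sentence):
    """Translate word by word, keeping the source word order."""
    translated = []
    for word in sentence.split():
        # Words missing from the dictionary pass through, flagged with "*".
        translated.append(BILINGUAL.get(word, "*" + word))
    return " ".join(translated)


if __name__ == "__main__":
    print(transfer_words("यह किताब है"))
    print(transfer_words("यह कलम है"))  # the middle word is not in the dictionary
```

The word order never changes here, which is exactly why this shortcut only holds for close neighbors; a pair like English and Hindi needs structural transfer rules on top.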
Most of the Indic languages belong to one of the following large families or branches:
- Indo-European
- Indo-Iranian
- Indo-Aryan
- Dravidian
Some of the known languages in these groups already have well-formed corpora and translation rules. Whether other languages can inherit some of these properties is an open question that still has to be researched.
Example of a bilingual corpus: "Urdu-Hindi Machine Translation System"
There are some very impressive standouts from the open source community; see the comparison of machine translation applications.

