Sampark System: Automated Translation among Indian Languages

India has 122 languages, of which 22 are constitutionally recognized. More than 850 million people worldwide speak the following Indian languages: Hindi, Bengali, Telugu, Marathi, Tamil and Urdu. With the availability of e-content and the development of language technology, it has become possible to overcome the language barrier. The complexity and diversity of Indian languages present many interesting computational challenges in building automatic translation systems.

Sampark is a multipart machine translation system developed with the combined efforts of 11 institutions under the umbrella of the consortium project "Indian Language to Indian Language Machine Translation" (ILMT), funded by the TDIL program of the Dept. of IT, Govt. of India.

The ILMT project has developed language technology for 9 Indian languages, resulting in MT for 18 language pairs, counting each translation direction separately. These are: 14 between Hindi and Urdu / Punjabi / Telugu / Bengali / Tamil / Marathi / Kannada (7 pairings, both directions) and 4 between Tamil and Malayalam / Telugu (2 pairings, both directions).
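The count of 18 can be checked by enumerating every translation direction from the pairings listed above. The following is a minimal sketch; the names `PAIRINGS` and `translation_directions` are illustrative and not part of Sampark.

```python
# Pivot languages and their partner languages, as listed in the text.
PAIRINGS = {
    "Hindi": ["Urdu", "Punjabi", "Telugu", "Bengali", "Tamil", "Marathi", "Kannada"],
    "Tamil": ["Malayalam", "Telugu"],
}


def translation_directions(pairings):
    """Return every (source, target) direction, counting both ways."""
    directions = []
    for pivot, partners in pairings.items():
        for partner in partners:
            directions.append((pivot, partner))
            directions.append((partner, pivot))
    return directions


directions = translation_directions(PAIRINGS)
languages = {lang for direction in directions for lang in direction}

print(len(directions))  # 18 translation directions
print(len(languages))   # 9 distinct languages
```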

Approach:

First, Sampark uses the Computational Paninian Grammar (CPG) approach for analyzing language and combines it with machine learning. It thus brings together traditional rule-based and dictionary-based algorithms with statistical machine learning. At present six systems are ready:

-Punjabi to Hindi
-Hindi to Punjabi
-Telugu to Tamil
-Urdu to Hindi
-Hindi to Urdu
-Hindi to Telugu

The Sampark system is based on the analyze-transfer-generate paradigm. First, the source language is analyzed; then vocabulary and structure are transferred to the target language; finally, the target language is generated. Each phase consists of multiple "modules", with 13 major ones. An advantage of this approach is that a particular language analyzer, one for Punjabi, for example, can be developed once, independent of other languages, and then paired with generators for languages other than Hindi. Because Indian languages are similar and share grammatical structures, only shallow parsing is done. The transfer grammar component has been kept simple. Domain-specific aspects have been handled by building suitable domain dictionaries.
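The reuse of one analyzer with several generators can be illustrated with a toy pipeline. This is only a sketch of the analyze-transfer-generate idea, not Sampark's code: every name here (`Token`, `toy_analyzer`, `make_transfer`, `toy_generator`, `build_translator`) is hypothetical, and the lexicon entries are placeholder strings rather than real words.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Token:
    surface: str      # word form as it appeared in the input
    lemma: str = ""   # filled in by analysis, rewritten by transfer


def toy_analyzer(sentence: str) -> List[Token]:
    """Stand-in for source analysis: split on spaces, lowercase as 'lemma'."""
    return [Token(surface=w, lemma=w.lower()) for w in sentence.split()]


def make_transfer(lexicon: Dict[str, str]) -> Callable[[List[Token]], List[Token]]:
    """Build a lexical transfer step from a bilingual dictionary."""
    def transfer(tokens: List[Token]) -> List[Token]:
        # Unknown lemmas pass through unchanged.
        return [Token(t.surface, lexicon.get(t.lemma, t.lemma)) for t in tokens]
    return transfer


def toy_generator(tokens: List[Token]) -> str:
    """Stand-in for target generation: join the transferred lemmas."""
    return " ".join(t.lemma for t in tokens)


def build_translator(analyze, transfer, generate) -> Callable[[str], str]:
    """Compose the three phases into one translate function."""
    def translate(sentence: str) -> str:
        return generate(transfer(analyze(sentence)))
    return translate


# One analyzer, written once, paired with two different target lexicons.
to_a = build_translator(toy_analyzer, make_transfer({"s1": "a1", "s2": "a2"}), toy_generator)
to_b = build_translator(toy_analyzer, make_transfer({"s1": "b1", "s2": "b2"}), toy_generator)
```

Here `to_a` and `to_b` share the same analysis step; only the transfer dictionary differs, which mirrors the claim that an analyzer is developed once and then paired with several targets.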

The 13 major modules together form a hybrid system that combines rule-based approaches with statistical methods, in which the software in essence discovers its own rules through "training" on text tagged by human language experts.
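As a rough illustration of what "hybrid" can mean, the sketch below learns the most frequent tag for each word from human-tagged text and falls back to a hand-written rule for unseen words. It is an assumption-laden toy: the function names, the tag labels, the single rule, and the choice to consult the learned model before the rules are all invented here, and the source does not describe how Sampark's modules actually combine the two.

```python
from collections import Counter, defaultdict


def train_tag_model(tagged_sentences):
    """Learn the most frequent tag for each word from human-tagged text."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def rule_based_tag(word):
    """Hand-written fallback rule (hypothetical): digits are numerals."""
    if word.isdigit():
        return "NUM"
    return "UNK"


def hybrid_tag(words, model):
    """Use the learned tag when the word was seen in training, else the rule."""
    return [(w, model[w] if w in model else rule_based_tag(w)) for w in words]


# Placeholder training data: word/tag pairs as a human annotator might supply.
training = [
    [("w1", "NN"), ("w2", "VB")],
    [("w1", "NN"), ("w1", "VB")],
]
model = train_tag_model(training)
tagged = hybrid_tag(["w1", "w2", "2024", "w9"], model)
```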

The second attribute of this work is the system's software architecture. Due to the complexity of the NLP system and the heterogeneity of the available modules, it was decided that the ILMT system should be developed using a Blackboard Architecture to provide interoperability between heterogeneous modules. Hence it was decided that all modules would operate on a common data representation called Shakti Standard Format (SSF), either in memory or as a text stream.

This approach helps to control the complexity of the overall system and makes the input and output of every module transparent. The textual SSF output of a module is not only for human consumption; it is also used as input by the subsequent module in the data stream. The readability of SSF helps in development and debugging because the input and output of any module can be easily seen. Even if a module fails, SSF allows the remaining modules to run without affecting the normal operation of the system. In that case the output SSF has an unfilled value for an attribute, and downstream modules continue to operate on the data stream.
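The behaviour described above can be sketched with a small in-memory blackboard. This is a simplified illustration and does not reproduce the actual SSF syntax: the record layout, the attribute names (`pos`, `chunk`, `length`), the module functions, and the tab-separated dump format are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]   # one token with its attribute values
Blackboard = List[Record]           # shared representation all modules use


def new_blackboard(sentence: str, attributes: List[str]) -> Blackboard:
    """Create one record per token with every attribute left unfilled."""
    return [{"token": w, **{a: None for a in attributes}} for w in sentence.split()]


def pos_module(board: Blackboard) -> None:
    for rec in board:
        rec["pos"] = "NUM" if rec["token"].isdigit() else "WORD"


def chunk_module(board: Blackboard) -> None:
    # Simulated failure: this module never fills its "chunk" attribute.
    raise RuntimeError("simulated module failure")


def length_module(board: Blackboard) -> None:
    for rec in board:
        rec["length"] = str(len(rec["token"]))


def run_pipeline(board: Blackboard, modules: List[Callable[[Blackboard], None]]) -> List[str]:
    """Run modules in order; a failing module is recorded and skipped."""
    failed = []
    for module in modules:
        try:
            module(board)
        except Exception:
            failed.append(module.__name__)
    return failed


def dump(board: Blackboard) -> str:
    """Readable text form of the blackboard; unfilled values print as empty."""
    lines = []
    for i, rec in enumerate(board, start=1):
        attrs = " ".join(
            f"{k}={'' if v is None else v}" for k, v in rec.items() if k != "token"
        )
        lines.append(f"{i}\t{rec['token']}\t{attrs}")
    return "\n".join(lines)


board = new_blackboard("abc 42", ["pos", "chunk", "length"])
failed = run_pipeline(board, [pos_module, chunk_module, length_module])
text = dump(board)
```

After the run, `failed` names the module that raised, the `chunk` attribute remains unfilled in every record, and `length_module` has still done its work. Because `dump` produces plain text, the same state can be inspected by a developer or handed to the next stage.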