A commonly asked question within the localization industry is which is better: Rule-Based or Statistical Machine Translation systems. While both approaches have their merits, the question in my mind is which offers the better future potential and the best value for LSPs considering an offering that includes an element of Machine Translation?
According to Don DePalma and his team at Common Sense Advisory, if you’re an LSP and haven’t been asked to provide an RFQ (Request for Quotation) that includes an element of Machine Translation, then you’re rapidly becoming the exception!
So as a successful LSP entrepreneur, which is the best wagon to hitch your horses to: Rule Based or Statistical Machine Translation?
First of all, what is Machine Translation?
Machine translation (MT) is automated translation or “translation carried out by a computer” – as defined in the Oxford English Dictionary. It is the process by which computer software is used to translate a text from one natural language to another.
Machine Translation systems have been in development since the 1950s; however, the technology required to build successful MT systems was not available at the time, so research was largely set aside. But in the last 15 years, as computational resources have become more mainstream and the internet has opened up a wider multilingual and global community, interest in Machine Translation has been renewed.
There are three different types of Machine Translation systems available today. These are Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT) and hybrid systems – a combination of RBMT and SMT.
Rule-Based Machine Translation Technology
Rule-based machine translation relies on countless built-in linguistic rules and extensive bilingual dictionaries for each language pair. An RBMT system works by parsing the source text and creating a transitional representation from which the text in the target language is generated. This process requires extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules. RBMT applies this complex rule set to transfer the grammatical structure of the source language into the target language.
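To make the analyse–transfer–generate pipeline concrete, here is a deliberately tiny sketch. The dictionary, the part-of-speech tags, and the single adjective–noun reordering rule are all invented for illustration; a real RBMT system has thousands of rules and full morphological lexicons.

```python
# Toy rule-based translation: analysis (tokenize + tag), lexical and
# structural transfer (dictionary lookup + one reordering rule), generation.

LEXICON = {"the": "el", "red": "rojo", "car": "coche"}   # toy bilingual dictionary
POS = {"the": "DET", "red": "ADJ", "car": "NOUN"}        # toy morphosyntactic lexicon

def translate(sentence: str) -> str:
    tokens = sentence.lower().split()                    # analysis: tokenize
    tagged = [(LEXICON[t], POS[t]) for t in tokens]      # lexical transfer
    out, i = [], 0
    while i < len(tagged):
        # structural transfer rule: English ADJ NOUN -> Spanish NOUN ADJ
        if i + 1 < len(tagged) and tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            out += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)                                 # generation

print(translate("the red car"))  # -> el coche rojo
```

Even at this scale, the maintenance burden is visible: every new word needs a dictionary entry, and every grammatical divergence between the language pair needs a hand-written rule.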
In most cases, there are two steps: an initial investment that significantly increases the quality at a limited cost, and an ongoing investment to increase quality incrementally. While rule-based MT brings companies to a reasonable quality threshold, the quality improvement process is generally long and expensive. This has been a contributing factor to the slow adoption and usage of MT in the localization industry.
Surely, there must be a better approach!
Statistical Machine Translation Technology
Statistical Machine Translation (SMT) utilizes statistical translation models generated from the analysis of monolingual and bilingual content. Essentially this approach uses computing power to build sophisticated data models to translate one source language into another. This makes the use of SMT a far simpler option, and a significant factor in the broader adoption of statistical machine translation technology in the localization industry.
Building SMT models is a relatively quick and simple process. Using current systems, users can upload training material and have an MT engine generated in a matter of hours. While it is generally thought that a minimum of two million words is required to train an engine for a specific domain, it is possible to reach an acceptable quality threshold with much less. The technology relies on bilingual corpora, such as translation memories and glossaries, for the system to learn language patterns, while monolingual data is used to improve the fluency of the output, since the engine has more text examples to choose from. SMT engines will produce higher-quality output if trained on domain-specific data, such as medical, financial, or technical content.
SMT technology is CPU intensive and requires an extensive hardware configuration to run translation models for acceptable performance levels. However, the introduction of cloud services, and the increasing availability of bilingual corpora are having a dramatic effect on the popularity of SMT systems, which is leading to a higher adoption rate in the language services industry.
RBMT vs. SMT
- RBMT can achieve good results, but the development costs of a good-quality system are very high. In terms of investment, the customization cycle needed to reach the quality threshold can be long and costly.
- RBMT systems can be built with much less data than SMT systems, instead using dictionaries and language rules to translate. This sometimes results in a lack of fluency.
- Language is constantly changing, which means rules must be managed and updated where necessary in RBMT systems.
- SMT systems can be built in much less time and do not require linguistic experts to apply language rules to the system.
- SMT models require state-of-the-art computer processing power and storage capacity to build and manage large translation models.
- SMT systems can mimic the style of the training data, generating output based on the frequency of patterns, which allows them to produce more fluent output.
Statistical Machine Translation technology is growing in acceptance and is, by far, the clear leader of the two technologies. The increasing availability of cloud-based computing is providing a solution to the high processing power and storage capacity required to run SMT technology effectively, making SMT a game changer for the localization industry.
Training data for SMT engines is becoming more widely available, thanks to the internet and the increasing volumes of multilingual content being created by both companies and private internet users. High-quality aligned bilingual corpora are still expensive and time-consuming to create but, once created, become a valuable asset to any organization implementing SMT technology, with translations benefiting from economies of scale over time.
Tony O’Dowd, Founder and Chief Architect, KantanMT.com