A Short Introduction to the Statistical Machine Translation Model

The first ideas of Statistical Machine Translation were introduced by Warren Weaver as far back as 1947. He explained that language had an inherent logic that could be treated in the same way as any logical mathematical challenge. He contended that logical deduction could be used to identify “conclusions” in the target (translated) language based on what already existed in the source (untranslated) language. With the advent of the cloud and the affordability of powerful computers, the theory of Statistical Machine Translation (SMT) became a practical option.

SMT is therefore a machine translation paradigm in which translations are generated by statistical models whose parameters are derived from the analysis of bilingual text corpora (text bodies): source-language texts paired with their existing target-language translations.

Statistical machine translation starts with a very large data set of approved previous translations. This is known as a corpus (plural: corpora) of texts, which is then used to automatically deduce a statistical model of translation. This model is then applied to untranslated source texts to make a probability-driven match and suggest a reasonable translation.

Where are these large data sets sourced? Over many years, most large global organisations (for example the EU, UN, World Bank and World Health Organisation) have developed enormous domain-specific corpora in multiple source/target language combinations. Many of these originated as human-translated texts. They are then made accessible to machine translation users and developers to further refine and use for MT purposes. The process is an evolutionary model, with each existing corpus being refined, added to and updated on an ongoing basis.

This SMT paradigm is based on what’s known as probability theory. The theory expresses the chance (probability) of something occurring, depending on the different variables likely to influence the event. For example, when rolling a die, the probability of it landing on a given number is 1/6. So, a gambler has a one in six chance of landing on their chosen number. The probability of two dice both landing on that chosen number is 1/6 x 1/6 = 1/36. In poker, the professional gambler keeps mental track of the cards being played and, using probability theory, decides whether a gamble has a good chance of winning based on the odds they have calculated in their head.
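To make the dice arithmetic concrete, here is a minimal Python sketch that counts outcomes rather than trusting the formula. The chosen number (4) is an arbitrary assumption for illustration. It also separates two cases that are easy to confuse: both dice landing on one chosen number (1/36) and the two dice simply matching each other on any number (1/6).

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
chosen = 4  # the gambler's chosen number (arbitrary, for illustration only)

# One die: one favourable face out of six equally likely faces.
p_one = Fraction(sum(1 for f in faces if f == chosen), len(faces))

# Two dice: enumerate all 36 equally likely (first, second) outcomes.
rolls = list(product(faces, repeat=2))
p_both_chosen = Fraction(sum(1 for a, b in rolls if a == b == chosen), len(rolls))
p_any_double = Fraction(sum(1 for a, b in rolls if a == b), len(rolls))

print(p_one, p_both_chosen, p_any_double)  # prints: 1/6 1/36 1/6
```

SMT applies the same idea at scale: it counts how often something happened in the data and turns those counts into probabilities.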

In SMT, the MT engineer builds a Translation Model by recording in a table the frequency of the phrases that appear in the training corpus. This table stores each phrase and the number of times it repeats over the entirety of the training corpus. The more frequently a phrase and its translation are repeated in the training corpus, the more probable it is that the target translation is correct. Each phrase (stored in the Phrase Table) can range from one to five words in length. This phrase table is referred to as the Translation Model.
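As a rough illustration of the phrase table idea, the sketch below counts how often each source phrase is paired with each target phrase and turns the counts into relative frequencies. The German/English pairs are invented toy data, and the sketch assumes the phrase pairs have already been aligned; a real SMT toolkit has to extract those pairs from the corpus first. The function name `build_phrase_table` is hypothetical.

```python
from collections import Counter, defaultdict

# Invented toy data: already-aligned (source phrase, target phrase) pairs.
aligned_pairs = [
    ("guten morgen", "good morning"),
    ("guten morgen", "good morning"),
    ("guten morgen", "good day"),
    ("danke", "thank you"),
    ("danke", "thanks"),
    ("danke", "thank you"),
    ("danke", "thank you"),
]

def build_phrase_table(pairs, max_len=5):
    """Count phrase pairs, then convert the counts to relative frequencies."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        # Keep only phrases of one to max_len words, as described above.
        if 1 <= len(source.split()) <= max_len:
            counts[source][target] += 1
    table = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        table[source] = {t: n / total for t, n in targets.items()}
    return table

phrase_table = build_phrase_table(aligned_pairs)

# The most frequent pairing is treated as the most probable translation.
best = max(phrase_table["danke"], key=phrase_table["danke"].get)
print(best, phrase_table["danke"][best])  # prints: thank you 0.75
```

Here "thank you" was seen three times out of four for "danke", so it is preferred over "thanks".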

Consequently, the MT engineer is using a probability model to hit on the right source/target translation combination. The process is evolutionary, as the corpus is refined and adjusted after each translation run to eliminate or adjust any anomalies. The more the corpus is used and corrected, the more refined it becomes. The development of corpus quality is a continuous, organic R&D process applied to a highly valuable translation asset.

Additionally, the MT engineer builds a secondary model using the target translation data (commonly called a language model). This model helps determine the order in which phrases from the Phrase Table need to be assembled in order to optimise translation Fluency, i.e. to give the translated text its natural language flow. Fluency ensures that literal translations (i.e. the words are all there, but the sense of the sentence is not) are replaced by a more natural sounding translation.
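One simple way to picture this secondary model is a bigram model: it learns from target-language text which word tends to follow which, and then gives a higher score to word orders it has seen before. The sketch below is only an assumption about how such a model could look, using four invented English sentences and add-one smoothing; production systems use far larger corpora and more refined smoothing.

```python
import math
from collections import Counter

# Invented target-language training sentences.
target_corpus = [
    "i saw the red house",
    "the red house is old",
    "she bought the red car",
    "i saw the old car",
]

def train_bigram_model(sentences):
    """Count single words (as contexts) and adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )

model = train_bigram_model(target_corpus)
literal = "i saw the house red"  # all the words are there, but the order is off
fluent = "i saw the red house"

# The natural word order receives the higher (less negative) score.
print(log_prob(fluent, model) > log_prob(literal, model))  # prints: True
```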

In order to translate a source sentence, the MT engineer goes through the following decode process (as outlined in the diagram below):

He/she breaks the source language sentence into phrases. (You can see the phrases as individual grey blocks in line two of the diagram.)
He/she then looks up each of these phrases in the Phrase Table/Translation Model and generates the target language translations. (You can see this in line three of the diagram.)
The engineer then uses the secondary model built from the target translation data to re-order these phrases to optimise translation Fluency. (You can see this reordering in line four of the diagram.)
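The three decode steps above can be sketched in a few lines of Python. Everything here is a toy assumption: the phrase table entries, the target corpus and the function names `segment`, `fluency` and `decode` are invented for illustration. The sketch also tries every possible phrase order, which only works for very short sentences; real decoders search far more selectively.

```python
from itertools import permutations

# Invented phrase table: source phrase -> {target phrase: probability}.
phrase_table = {
    "das haus": {"the house": 0.8, "the home": 0.2},
    "habe ich": {"i have": 0.9, "have i": 0.1},
    "gesehen": {"seen": 0.7, "saw": 0.3},
}

# Invented target-language sentences used to judge fluency.
target_corpus = ["i have seen the house", "i have seen the car", "the house is old"]

def bigrams(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    return list(zip(words, words[1:]))

seen_bigrams = {bg for s in target_corpus for bg in bigrams(s)}

def segment(sentence, table, max_len=5):
    """Step 1: greedy longest-match split into phrases found in the table."""
    words, phrases, i = sentence.split(), [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in table or n == 1:  # unknown single words pass through
                phrases.append(candidate)
                i += n
                break
    return phrases

def fluency(sentence):
    """Count adjacent word pairs that also occur in the target corpus."""
    return sum(1 for bg in bigrams(sentence) if bg in seen_bigrams)

def decode(sentence):
    source_phrases = segment(sentence, phrase_table)
    # Step 2: look up the most probable translation of each phrase.
    target_phrases = [
        max(phrase_table[p], key=phrase_table[p].get) if p in phrase_table else p
        for p in source_phrases
    ]
    # Step 3: re-order the phrases and keep the most fluent arrangement.
    candidates = (" ".join(order) for order in permutations(target_phrases))
    return max(candidates, key=fluency)

print(segment("das haus habe ich gesehen", phrase_table))
# prints: ['das haus', 'habe ich', 'gesehen']
print(decode("das haus habe ich gesehen"))
# prints: i have seen the house
```

The literal, word-for-word order would be "the house i have seen"; the fluency score is what moves the phrases into the more natural order.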

In my next blog, I will give a short introduction to the increasingly popular MT model Neural Machine Translation.

Aidan Collins, Marketing Manager at KantanMT.