SMT Quality Challenge

One of the biggest challenges when customizing Statistical Machine Translation (SMT) is improving the engine after its initial development. While you can build a baseline engine using existing Translation Memories (TM), terminology and monolingual training data assets – the real challenge is going beyond this, and achieving even higher levels of quality. More importantly, how can you do this rapidly with minimum cost and effort? A proactive approach to measuring the quality of your training data will greatly assist in doing this.

Kantan BuildAnalytics™ is a new technology that addresses this head-on and helps SMT developers to build engines that are production ready, fast!

What is Kantan BuildAnalytics?

Kantan BuildAnalytics brings a new level of transparency to the SMT building and training process, and KantanMT users can now build higher performing engines for each domain, resulting in less post-editing requirements.

How it works…

When you build a KantanMT engine, some of your training data is automatically extracted and kept to one side. This is called a Reference Data Set – and contains both source and target texts. After a KantanMT engine is built, this Reference Data Set is used to calculate a series of automated quality scores – including BLEU (Bilingual Evaluation Understudy), F-Measure and TER.

This Reference Data Set is also used to perform a Gap Analysis. Gap Analysis is a quick way to determine any missing words in the engine’s phrase-tables. I’ll come back to this later and demonstrate how Gap Analysis can improve the quality performance of KantanMT engines.

But for now, let’s focus on the automated quality scores of BLEU, F-Measure and TER.

BuildAnalytics uses the KantanMT data visualization library to graphically display the distribution of these automated scores based on the Reference Data Set. Since an automated score is calculated for each text segment within the Reference Data Set, this means we get a detailed view of how a KantanMT engine is performing and how it should generate translated output.

By analysing these scores and the Gap Analysis results, and examining the translated output, users of KantanMT are producing higher quality engines because their training data choices are more strategic and refined.

F-Measure

Let’s look at F-Measure first, as this is the most straightforward to understand and visualize. F-Measure scores show how precise a KantanMT engine is when retrieving words, and how many words it can retrieve or recall during translation. This is why it is commonly referred to as a Recall and Precision measurement. By expressing these two measurements as a ratio, it is a good indicator of the engines performance and its ability to translate content.

However, while your KantanMT engine may have a high F-Measure score – it doesn’t mean that these words are recalled in the correctly translated order. We need another metric to give us an indication of how well the engine translated the text and BLEU is one of the most recognized and automated metric for estimating the texts fluency.

BLEU

BLEU is an automatic evaluation metric well known in both the industry and academia, which calculates an estimation of text fluency. Fluency is a measure of the correspondence between a KantanMT engine output and that of a professional translator.

Since the Reference Data Set consists of both source and human translated equivalents, which were created by a professional translator, BLEU score can be calculated by comparing the output of a KantanMT engine to this Reference Data Set.

KantanMT BLEU score — KantanMT engine BLEU score distribution

In practice, BLEU achieves a high correlation with human judgement of quality and remains one of the most popular automated metrics in use today.

TER

TER standards for Translation Error Rate and is used to estimate the amount of post-editing required to transform a generated translation to its original human translation equivalent. In simple terms this is a count of the number of insertions, deletions and substitutions required to transform a segment to match its original human translation equivalent.

KantanMT TER score — KantanMT engine TER score distribution

So the lower this score, the less transformation required which means the less post-editing required too.

Working with Kantan BuildAnalytics™

BuildAnalytics is a really great way to see all these automated scores in action. It uses KantanMT data visualization technology to graphically present these scores, helping developers of KantanMT engines to fine-tune their training data and maximize their engine’s quality performance.

Let’s take a closer look at how this data visualization can be used to gain insights into an engine and determine if it is a high or low performing engine, and what steps we can take to improve it.

Here’s the summary distribution graphs for an engine that contains approx. 3.2m words. It’s a small engine within a technical domain. Its overall scores are:

These Summary Graphs show the distribution of scores, grouped into bands (i.e. <40%, 40-54% etc.), for each automated score. This is very helpful in determining the scores’ overall distribution, and how the KantanMT engine is likely to be performing.

Here are the detailed distribution graphs for each automated score:

By reviewing both the Summary Graphs and the more detailed Distribution Graphs we can make some observations of how this engine would most likely perform. My observations are included as part of the commentary in the table above.

It’s important to point out that no one individual score gives an absolute of how a KantanMT engine will perform. We need to take a holistic view on how to determine a general sense of the performance of the engine by reviewing all automated scores together.

Using Kantan BuildAnalytics users can get a good sense of how a KantanMT engine will perform in a production environment and with a little practice and experimentation, they can use this knowledge to build higher performing MT engines.

Gap Analysis

I mentioned this concept earlier in the post, so let’s take a closer look at this really helpful new feature. Gap Analysis determines how many untranslated words remain in the generated translations. These missing words, or ‘Gaps’ can quickly be identified and filled by introducing the most relevant training data to your KantanMT engine and re-training it.

The Gap Analysis feature not only lists the gaps, it also presents suitable training data, which can be post-edited and resubmitted as training data to improve overall engine’s performance. This makes filling the gaps just that little bit easier!

One more (very important) thing…

Most quality improvements for SMT systems will be created by fine tuning terminology and filling data gaps. Post-editing raw-MT output and a focus on minimizing data gaps will significantly improve the quality performance of your KantanMT engines. This cannot be done without the involvement of professional translators. They have the skills, knowledge and linguistic expertise to finesse terminology, identify gaps and choose better training data. While BuildAnalytics helps SMT developers get engines ready for production, ultimately, it’s the professional translator that should have the final say in how production-ready it truly is!

To get the most from your Machine Translation engine, always keep in mind:

Measuring and improving training data – high quality training data is the first step to building a successful Machine Translation engine.
Take a holistic approach to evaluating performance – automatic evaluation metrics can give a good indicator of how your KantanMT engine will perform, but metrics alone are insufficient for measuring post-editing effort.

Kantan BuildAnalytics is available to Enterprise members of KantanMT, but you can also experience this quality estimation and measurement software by signing up for a free trial on KantanMT.com.