There are several key metrics that developers of Statistical Machine Translation (SMT) engines need to pay attention to. These key performance indicators (KPIs) help you to understand which aspects of an engine are performing well and which need improvement. They also provide insight into how you can improve the overall performance of your SMT engine.
While the following list is not exhaustive, it is a good starting point for understanding SMT performance and how to improve it. Remember that no single KPI should be analysed in isolation when determining the quality of an engine – they need to be analysed together, in a holistic manner, to give an accurate sense of the overall performance of the SMT engine.
F-Measure is a KPI that combines the recall and precision of an SMT engine. Put simply, it measures how many of the expected words the engine produces (recall) and how many of the words it produces are correct (precision). Expressed as a ratio, F-Measure provides a good insight into the language coverage of an SMT engine.
If your F-Measure score is low, it indicates your engine is missing many words – if the score is high, it means that most words have been found in your engine.
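To make the idea concrete, here is a minimal sketch of a word-level F-Measure for a single segment. It assumes simple whitespace tokenisation, lowercasing and an equal weighting of precision and recall; the function name `f_measure` is made up for illustration and this is not how any particular SMT platform computes its score.

```python
from collections import Counter


def f_measure(hypothesis: str, reference: str) -> float:
    """Word-level F-Measure (F1) of one translated segment against one reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0

    # Count the words shared by both segments, ignoring their position.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(hyp)  # share of produced words that are correct
    recall = overlap / len(ref)     # share of reference words that were produced
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
print(f_measure("the cat sat on the mat", reference))
print(f_measure("mat the on sat cat the", reference))
print(f_measure("the cat", reference))
```

Note that the scrambled second segment scores exactly the same as the perfect first one, because this measure only looks at which words are present and not at where they appear.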
While most SMT vendors present a single F-Measure value for an SMT engine, more progressive suppliers of SMT systems provide a distribution analysis of F-Measure scores.
Unfortunately, the level of precision and recall doesn’t give us information about the word order of a segment. For this we need to look at another KPI, commonly referred to as BLEU.
BLEU (Bilingual Evaluation Understudy) is a KPI that measures the fluency of the translated output of an SMT engine, which means it measures how many words and word sequences in a given translation overlap with a reference translation. Higher scores are given to segments which contain longer runs of matching sequential words.
BLEU is a major improvement on F-Measure as it takes word order into account!
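A simplified, single-segment version of BLEU can be sketched as follows. It assumes whitespace tokenisation, one reference translation and no smoothing, so a segment with no matching word sequence at any length up to `max_n` scores zero. The names `ngrams` and `simple_bleu` are illustrative only, and production BLEU is normally computed over a whole test set rather than segment by segment.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive words in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed segment-level BLEU against a single reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram is credited at most as often as it
        # occurs in the reference.
        matched = sum((hyp_counts & ref_counts).values())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalise output that is shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))
print(simple_bleu("mat the on sat cat the", reference))
print(simple_bleu("the cat sat on a mat", reference, max_n=2))
```

The scrambled second segment contains all the right words but none of the right word pairs, so it scores zero here even though a purely word-based measure would rate it perfectly.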
A high BLEU KPI means that an SMT engine is producing highly fluent translations; a low score means that it’s generally producing garbage. BLEU scores of 60% or higher are normally required before any SMT engine is considered production ready.
The BLEU score is easy to use and understand, is language independent and correlates highly with human evaluation, which is why it is the most widely used KPI for determining the quality of SMT engines.
TER stands for Translation Error Rate (also known as Translation Edit Rate) and is an important KPI used to predict the most likely post-editing effort required for an SMT engine. It basically counts the number of insertions, deletions and substitutions (and, in its standard definition, shifts of word sequences) that are required to transform a translation into a reference translation, relative to the length of the reference.
This is essentially what a professional translator would do in order to post-edit a translation to a level of publishable quality. Since post-editing is a time-consuming and costly activity, SMT developers will try to minimize this and aim for low TER scores; less than 40% is a good benchmark for this KPI.
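The edit counting can be sketched with a word-level edit distance divided by the length of the reference. This is a simplification: it assumes whitespace tokenisation and a single reference, and it counts only insertions, deletions and substitutions. Full TER implementations also handle word-sequence shifts, which this sketch leaves out, so its scores can be higher than true TER. The name `simple_ter` is made up for illustration.

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length (no shifts)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edits needed to turn the hypothesis words seen so far
    # into the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,                 # delete a hypothesis word
                curr[j - 1] + 1,             # insert a reference word
                prev[j - 1] + substitution,  # keep or substitute a word
            ))
        prev = curr

    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
print(simple_ter("the cat sat on the mat", reference))
print(simple_ter("the cat sat on a mat", reference))
print(simple_ter("the cat sat", reference))
```

A score of 0 means no edits are needed; the third segment needs three inserted words against a six-word reference, which gives 0.5, or 50%.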
One more thing…
While older SMT systems produce these KPIs as single numerical values, a more modern approach is to look at the distribution of these KPIs across an SMT engine. This provides deeper insights and more accurate analysis of how an SMT engine is most likely going to perform in a production environment.
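As a rough illustration of what a distribution view adds, the sketch below groups per-segment scores (as percentages) into bands. The function name `score_distribution`, the 20-point band width and the sample scores are all invented for this example and do not come from a real engine.

```python
def score_distribution(scores, band_width=20):
    """Count how many per-segment scores (0-100) fall into each band.

    Assumes band_width divides 100 evenly.
    """
    n_bands = 100 // band_width
    counts = [0] * n_bands
    for score in scores:
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range: {score}")
        # A score of exactly 100 belongs to the top band.
        band = min(int(score // band_width), n_bands - 1)
        counts[band] += 1
    return {
        f"{i * band_width}-{(i + 1) * band_width}%": count
        for i, count in enumerate(counts)
    }


# Invented per-segment BLEU scores, for illustration only.
segment_scores = [12, 35, 38, 71, 88, 100, 64]

print(sum(segment_scores) / len(segment_scores))
print(score_distribution(segment_scores))
```

The single average for these sample scores sits just under 60%, yet the banded view shows that three of the seven segments fall below 40%, which is exactly the kind of detail a single number hides.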
1. Always aim for a high BLEU score, a high F-Measure score and a low TER score.
2. Look at all three scores together to get a more holistic assessment of your SMT engine.
3. Examine the distribution of these KPIs across your SMT engine – this will help you to make smarter data choices for future customizations.
In our next blog post we shall take a detailed look at each of these scores and see how we can work with them to improve SMT systems.
Tony O’Dowd, Founder and Chief Architect, KantanMT