In our last blog post I discussed some of the Key Performance Indicators (KPIs) used by SMT developers to estimate the performance quality of their KantanMT engines. These KPIs help developers understand what aspects of their SMT engine are performing well and which need improvement.

In this post I’m going to dive deep into F-Measure, a KPI that can provide insight into the relevancy of your training data, the engine’s overall performance, and the suitability of an SMT engine for a particular domain or content type.

What is F-Measure?

F-Measure is a KPI which measures the precision and recall capabilities of an SMT system. It can also be viewed as a measure of translation accuracy and relevancy.

Bursting Red Balloons

In SMT, we can look at precision as the percentage of retrieved words that are relevant, and at recall (sometimes referred to as sensitivity) as the percentage of relevant words that are retrieved.

This is best explained using a thought experiment. Imagine a box containing 10 red balloons and a few green balloons. Suppose we burst 5 balloons at random and 3 of these are red – we can calculate our precision as 3/5 (60%) and our recall as 3/10 (30%).

These two calculations offer a good estimation of the accuracy with which we are able to burst red balloons – the higher these values are, the better the chances that we will burst more red balloons.

So what has this thought experiment got to do with SMT systems?

Precision & Recall

Precision and recall are closely related to the understanding of accuracy. Since SMT systems are based on pattern recognition, it is helpful to see how accurate they are at retrieving words and, more importantly, how relevant this retrieval is.

F-Measure combines precision and recall into a single score and is expressed as a ratio.
If we go back to our balloon bursting experiment, precision was calculated as 60% and recall as 30%. To express these two values as a ratio, we can use the F-Measure formula as follows:

F-Measure = (precision × recall) / ((precision + recall) / 2) = (0.60 × 0.30) / ((0.60 + 0.30) / 2) = 0.40

Source: Statistical Machine Translation by Philipp Koehn

In simple terms – we’re just not good at bursting red balloons 🙂
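As a rough illustration, the balloon experiment can be worked through in a few lines of Python, starting from the raw counts. This is only a sketch: the function name and its parameters are made up for this example and are not part of any KantanMT tooling.

```python
def precision_recall_f(correct, retrieved, relevant):
    """Return (precision, recall, F-Measure) computed from raw counts.

    correct   -- retrieved items that are relevant (red balloons burst)
    retrieved -- all retrieved items (balloons burst)
    relevant  -- all relevant items (red balloons in the box)
    """
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Balloon experiment: 3 red among the 5 burst, 10 red balloons in the box.
p, r, f = precision_recall_f(correct=3, retrieved=5, relevant=10)
print(f"precision={p:.2f} recall={r:.2f} f_measure={f:.2f}")
```

Because F-Measure is a harmonic mean, it is pulled towards the lower of the two values, so a weak recall cannot be hidden behind a strong precision.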

F-Measure and SMT engines

Using F-Measure we can get a general sense of the accuracy with which an SMT engine can retrieve words. If we examine the distribution of these scores across a set of reference translations, we can get helpful insights which we can use to improve the training data and boost engine performance.
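To make the idea of "retrieving words" concrete, here is a minimal sketch of a word-level F-Measure for a single engine output against its reference translation. It assumes lowercased, whitespace-separated tokens and counts a word as correct if it also appears in the reference, at most as often as it occurs there. This post does not describe how Kantan BuildAnalytics computes its scores, so treat the function below as an illustration only.

```python
from collections import Counter


def word_f_measure(output, reference):
    """Bag-of-words F-Measure between an engine output and a reference."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word,
    # so a reference word cannot be matched more often than it occurs.
    correct = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if correct == 0:
        return 0.0
    precision = correct / len(out_tokens)  # correct words / output length
    recall = correct / len(ref_tokens)     # correct words / reference length
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, not taken from a real engine.
score = word_f_measure("the cat sat on mat", "the cat sat on the mat")
print(f"{score:.2f}")
```

In this example every output word is found in the reference, so precision is perfect, but one reference word is missing from the output, which lowers recall and therefore the F-Measure.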

Here’s an example of an F-Measure distribution:

Screen shot of Kantan BuildAnalytics F-Measure distributions

The overall F-Measure score for this particular SMT engine is 72%. This is a good value, and we can say that this engine is highly accurate at retrieving words for its target language and domain, i.e. it retrieves words with high precision, and the words it retrieves are relevant to the target domain.

Also, the distribution of these scores across the reference translation set shows that the majority of them (60% of the total reference translation set) are in the 70-100% range. The distribution graph also shows that approximately 20% of the reference translations score less than 40%. By examining these low-scoring translations we can check to see if words/terminology are missing, and then create additional training material to improve the performance of the engine.
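A distribution like this can be approximated by grouping per-segment F-Measure scores into ranges. The sketch below uses made-up scores chosen to mirror the proportions described above, and the range boundaries (40% and 70%) are assumptions taken from the figures quoted in this post rather than from BuildAnalytics itself.

```python
LOW = 0.40   # assumed boundary: segments below this deserve a closer look
HIGH = 0.70  # assumed boundary: segments at or above this score well


def score_distribution(scores):
    """Return the share of scores that fall into each of three ranges."""
    counts = {"below 40%": 0, "40-70%": 0, "70-100%": 0}
    for score in scores:
        if score < LOW:
            counts["below 40%"] += 1
        elif score < HIGH:
            counts["40-70%"] += 1
        else:
            counts["70-100%"] += 1
    total = len(scores)
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


# Illustrative per-segment scores, not real engine data.
segment_scores = [0.95, 0.82, 0.74, 0.71, 0.88, 0.77, 0.55, 0.62, 0.31, 0.12]
distribution = score_distribution(segment_scores)
# Positions of the segments to inspect for missing words or terminology.
needs_review = [i for i, s in enumerate(segment_scores) if s < LOW]

for name, share in distribution.items():
    print(f"{name}: {share:.0%}")
print("segments to review:", needs_review)
```

The list of low-scoring segments is the practical output here: those are the reference translations to inspect when deciding what additional training material to create.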

Closing remarks…

F-Measure is a good starting point for understanding the quality of an SMT engine, but it does have a major drawback: while it measures the recall and precision capabilities of an SMT engine, it doesn’t take into account the order in which the words are retrieved.

So, as in the famous sketch with Andre Previn and Morecambe and Wise, we may know all the notes but not necessarily in the right order:

[Screenshot: the Morecambe and Wise sketch with Andre Previn]
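The word-order blind spot is easy to demonstrate with a bag-of-words F-Measure. In the sketch below, which assumes simple whitespace tokenization and uses a helper function written only for this illustration, a scrambled translation receives exactly the same score as a perfectly ordered one.

```python
from collections import Counter


def word_f_measure(output, reference):
    """Bag-of-words F-Measure; word positions are ignored entirely."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return 0.0
    # Count shared words, clipped to how often each occurs in the reference.
    correct = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if correct == 0:
        return 0.0
    precision = correct / len(out_tokens)
    recall = correct / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "we know all the notes in the right order"
in_order = "we know all the notes in the right order"
scrambled = "order right the in notes the all know we"

ordered_score = word_f_measure(in_order, reference)
scrambled_score = word_f_measure(scrambled, reference)
# Both outputs contain the same words, so the two scores are identical.
print(ordered_score == scrambled_score)
```

All the right words are there in both cases, so the metric cannot tell the readable sentence from the unreadable one.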

One more thing…
To improve on F-Measure, a metric must take word order into account; this is sometimes referred to as fluency. In the next post I will look at BLEU (Bilingual Evaluation Understudy) and examine how this metric helps us to further understand the quality of SMT engines.

KantanMT’s new BuildAnalytics technology illustrates the distribution of F-Measure, BLEU, and TER scores across our members’ SMT engines. It also generates a Gap Analysis, highlighting missing words in members’ training data, and provides KantanMT members with a training data rejects report – great information that helps members of KantanMT.com develop a deep understanding of how their SMT engines work, and how to improve their performance.

You can watch a video of Kantan BuildAnalytics here>>