
As the localization industry races to integrate Machine Translation into mainstream workflows to increase productivity, reduce cost and gain a competitive advantage, it’s worth taking the time to consider which type of Neural MT delivers the best results in terms of translation quality and cost.

This is a question that has been occupying our minds here at KantanMT and eBay over the past several months. The fact is, Neural MT comes in many variants – with the different models available yielding remarkably different quality results.

Overview of Neural Network Types

The main models of Neural MT are:

  • Recurrent Neural Networks (RNNs) – these are designed to recognize the sequential characteristics of data and use the detected patterns to predict the next most likely sequence. Training typically processes the sequence in both forward and backward directions, and the state of each step is fed back into the next; hence the descriptor recurrent. RNNs have been the predominant neural network of choice for most MT providers (a minimal code sketch follows the figure below).

[Figure: RNN architecture]

Fig 1: Image Courtesy of Jeremy Jordan
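
To make this concrete, here is a minimal sketch of a bidirectional RNN encoder in PyTorch. It is illustrative only: the use of a GRU, the layer sizes and the vocabulary size are our assumptions, not details of the KantanMT implementation.

```python
import torch
import torch.nn as nn

# Hypothetical bidirectional RNN encoder -- an illustration of recurrence,
# not the KantanMT implementation. Sizes are arbitrary choices.
class RNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True reads the source both forward and backward
        self.rnn = nn.GRU(emb_dim, hidden_dim,
                          bidirectional=True, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embed(src_tokens)
        # hidden carries the sequential state each step feeds into the next
        outputs, hidden = self.rnn(embedded)
        return outputs, hidden

encoder = RNNEncoder(vocab_size=32000)
outputs, hidden = encoder(torch.randint(0, 32000, (8, 20)))
print(outputs.shape)  # torch.Size([8, 20, 1024])
```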

  • Convolutional Neural Networks (CNNs) – these are the main type of network used in computer image processing (e.g., facial recognition and image search), but they can also be used for machine translation. The model exploits the 2D structure of its input data. The training process is simpler than for RNNs, and CNNs require less computational overhead to train (see the sketch after the figure below).

[Figure: CNN architecture]

Fig 2: Image Courtesy of Jeremy Jordan
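
The sketch below shows the same idea applied to text: a stack of 1D convolutions over token embeddings, where every position in the sequence is computed in parallel rather than step by step. Again, the architecture and sizes are assumptions for illustration, not the Kantan CNN.

```python
import torch
import torch.nn as nn

# Hypothetical convolutional text encoder, in the spirit of convolutional
# sequence-to-sequence models. Layer count and sizes are assumptions.
class ConvEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, n_layers=4, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Each stacked layer widens the window of source context seen,
        # and all positions are computed in parallel (no recurrence)
        self.layers = nn.ModuleList(
            nn.Conv1d(emb_dim, emb_dim, kernel, padding=kernel // 2)
            for _ in range(n_layers)
        )

    def forward(self, src_tokens):
        x = self.embed(src_tokens).transpose(1, 2)  # (batch, emb, src_len)
        for conv in self.layers:
            x = torch.relu(conv(x)) + x  # residual connection
        return x.transpose(1, 2)         # (batch, src_len, emb)

encoder = ConvEncoder(vocab_size=32000)
print(encoder(torch.randint(0, 32000, (8, 20))).shape)
# torch.Size([8, 20, 256])
```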

  • Transformer Neural Networks (TNNs) – the dominant approaches to MT connect the encoder and decoder of a recurrent or convolutional network through an attention mechanism. The Transformer model uses only the attention mechanism (i.e., the contextual characteristics of the input data), dispensing entirely with the recurrence and convolution structures of the other models. This simplifies the training process and reduces the computational requirements of TNN modelling (the core attention operation is sketched after the figure below).

[Figure: Transformer architecture]

Fig 3: Image Courtesy of “The Illustrated Transformer” by Jay Alammar
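
The mechanism at the heart of a Transformer is the scaled dot-product attention introduced in “Attention Is All You Need”. A minimal PyTorch version is below; the toy tensor sizes are our own choice.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """The core operation a Transformer uses in place of
    recurrence or convolution ("Attention Is All You Need")."""
    d_k = q.size(-1)
    # Every query position scores every key position in one matrix
    # multiply, so the whole sequence is processed in parallel
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # contextual attention weights
    return weights @ v

# Toy example: 8 sentences, 20 tokens, 64-dimensional heads (sizes assumed)
q = k = v = torch.randn(8, 20, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([8, 20, 64])
```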

The eBay NMT Experiment

To determine which model yields the best translation outcomes, eBay and KantanMT collaborated and set up a controlled experiment using the KantanMT platform, which supports all three types of Neural Models.

The language pair English => Italian was chosen, and the domain was defined as eBay’s Customer Support content. Each Kantan model variant was trained on identical training data, consisting of:

  • eBay’s in-domain translation memory
  • eBay’s glossaries and lists of brand names
  • Supplementary KantanLibrary training corpora

The Test Reference Set was created by the eBay MT Linguistic Team by sampling the eBay Translation Memory to mirror its segment-length distribution (i.e., 10% short segments, 30% medium and 60% long).
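
As an illustration, drawing such a set from a translation memory could look like the sketch below. Only the 10/30/60 split comes from the experiment; the word-count boundaries for short, medium and long segments are hypothetical.

```python
import random

# Hypothetical sketch of sampling a test set that mirrors a segment-length
# distribution. Only the 10/30/60 split comes from the experiment; the
# word-count bucket boundaries are our assumptions.
LENGTH_MIX = {"short": 0.10, "medium": 0.30, "long": 0.60}

def bucket(segment, short_max=5, medium_max=15):
    words = len(segment.split())
    if words <= short_max:
        return "short"
    return "medium" if words <= medium_max else "long"

def sample_test_set(tm_segments, size=500):
    buckets = {name: [] for name in LENGTH_MIX}
    for seg in tm_segments:
        buckets[bucket(seg)].append(seg)
    sample = []
    for name, share in LENGTH_MIX.items():
        # assumes the TM holds enough segments in every bucket
        sample.extend(random.sample(buckets[name], int(size * share)))
    random.shuffle(sample)
    return sample
```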

To provide a comprehensive comparison and ranking of the performance of different models, the translation outputs from the following systems were included in our joint experiment:

  • Kantan TNN (Transformer Neural Network, customized)
  • Kantan CNN (Convolutional Neural Network, customized)
  • Kantan RNN (Recurrent Neural Network, customized)
  • Bing Translate (Transformer Neural Network, generic)
  • Google Translate (Transformer Neural Network, generic)

Human Translation (HT) was also included in this comparison and ranking to determine how neural machine translation outputs compare to translations provided by Professional Translators.

The evaluator was an eBay Italian MT language specialist with domain expertise and experience in ranking and assessing the quality of machine translation outputs.

The following Key Performance Indicators (KPIs) were chosen to determine the comparative fluency and adequacy of each system:

  • Fluency – measures whether the translation follows common grammatical rules and contains the expected word collocations; i.e., whether the machine-translated segment reads the way a human translation would.
  • Adequacy – measures how much of the source meaning is expressed in the machine-translated segment; i.e., whether it conveys as much of the meaning as a translation produced by a human.

Each KPI was rated on a 5-star scale, with 1 star being the lowest rating (i.e., No Fluency) and 5 stars being the highest rating (i.e., Human-Level Fluency).

KantanLQR was used to manage the assessment, randomise and anonymise the Test Reference Set, score the translation outputs, and collate the feedback from the eBay MT linguist.
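
For readers unfamiliar with blind evaluation setups, the sketch below shows one common way outputs can be randomised and anonymised before scoring. The data structures are hypothetical and are not KantanLQR’s internals.

```python
import random

# Hypothetical sketch of preparing a blind evaluation: the evaluator sees
# shuffled, unlabelled tasks, while the key mapping tasks back to systems
# stays hidden until scoring is complete.
def anonymise(test_set, systems):
    """systems maps a name (e.g. 'Kantan TNN') to its list of outputs."""
    tasks, key = [], {}
    for i, source in enumerate(test_set):
        for name, outputs in systems.items():
            task_id = f"{i}-{random.getrandbits(32):08x}"
            key[task_id] = name  # kept hidden from the evaluator
            tasks.append({"id": task_id,
                          "source": source,
                          "target": outputs[i]})
    random.shuffle(tasks)  # no system order is visible to the evaluator
    return tasks, key
```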

The Results

[Figure: Fluency and Adequacy scores for each system]

Our Conclusions

The custom Kantan Transformer Neural Network (Kantan TNN) performed best in terms of Fluency and Adequacy. It outperformed the RNN by 9 percentage points for Fluency (a statistically significant margin) and by 11 percentage points for Adequacy. While there is still some way to go to achieve near-human-level quality (as the HT scores in the graphs show), Transformer Neural Networks deliver significant improvements in MT quality in terms of Fluency and Adequacy, and they offer the best bang for your buck in terms of training time and process simplification.
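
We haven’t detailed the significance test here; one common way to check a gap like this on per-segment scores is a paired bootstrap, sketched below purely for illustration.

```python
import random

# Illustrative paired bootstrap over per-segment scores. This is one
# common significance check, not necessarily the test used in the
# experiment; tnn_scores and rnn_scores would be per-segment ratings.
def paired_bootstrap(tnn_scores, rnn_scores, n_resamples=10_000):
    n, wins = len(tnn_scores), 0
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        if sum(tnn_scores[i] for i in idx) > sum(rnn_scores[i] for i in idx):
            wins += 1
    # Fraction of resamples in which TNN stays ahead; values near 1.0
    # (conventionally >= 0.95) indicate a significant difference
    return wins / n_resamples
```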

Since this blog was first published, the same comparative analysis has been carried out for the English=>German, English=>Spanish and English=>French language combinations, and in all cases Kantan TNNs outperformed CNNs, RNNs, Google Translate and Bing Translate.