Get the Best from Neural MT with Quality Data

In this post, Pat Nagle, Project Manager at KantanMT, discusses Neural MT and the importance of using high-quality data when training MT engines. He looks at the ways in which KantanMT data can be used to get the best translation output.

KantanMT creates hassle-free, convenient solutions for your translation needs. Most of the difficulty surrounding Machine Translation centres on data: do you have enough data to create an engine? Is your data in the correct format? Is it aligned? Has it been cleansed? The KantanMT platform offers several data solutions for these needs, two of which are KantanLibrary™ and KantanFleet™.

KantanLibrary 

KantanLibrary, as the name suggests, is a library of data, catalogued by domain and language pair. Currently our library contains 8 domains along with a collection of dictionaries. Each domain has its own set of language pairs containing high-quality, cleansed and aligned datasets in TMX format, and all of them are made freely available to all users of the KantanMT platform. Approximately 80% of the data contained in Library has been sourced, cleansed and/or aligned by me, and all of it is licence-free.

Library enables clients who do not have their own data, or do not have enough of it, to begin translating within a few hours of choosing KantanMT as their machine translation solution.

How to work with KantanLibrary: a quick introduction

Accessing KantanLibrary takes just a few clicks of your mouse. Once you have created an engine and chosen your language pair, click Training > Library.

1.jpg

The KantanLibrary selection screen will then be displayed, as seen in the image above. As you have already selected the language pair, only the datasets matching your selection will be displayed, and multiple datasets can be selected regardless of domain. Once you are satisfied with your choices, click Save and then Build. This begins the process of creating your custom machine translation engine, which will be ready for translation within a few hours.

If time is of the essence or you do not have the luxury of waiting for an engine to build, don’t panic! KantanMT has the solution for that too. We call it KantanFleet.

KantanFleet

Along with the ease of KantanLibrary, we provide pre-built MT engines spanning the main domains of our Library. These engines have already been built with cleansed, high-quality data chosen by me for your convenience and are a ‘ready to go’ translation solution. The Fleet is also customisable to your project. You can add your own terminology, translation memories, dictionaries, etc. and build the engine again to create a dedicated engine trained specifically with your project in mind.

These engines have also been built for quality. I have a few quality thresholds I adhere to when constructing our Fleet engines. Our quality metric thresholds are:

  • F-Measure: ≥ 70%
  • BLEU: ≥ 60%
  • TER: ≤ 40%
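
As a rough illustration of how the three thresholds combine, the check can be sketched in a few lines of Python. This is not KantanMT platform code: the EngineScores class, the function name and the sample scores are hypothetical, and only the three threshold values come from the list above. TER is an error rate, so it is the one metric where lower is better.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EngineScores:
    """Automatic quality scores for one engine, as percentages (0-100)."""
    f_measure: float
    bleu: float
    ter: float


# Thresholds quoted in the post.
MIN_F_MEASURE = 70.0  # higher is better
MIN_BLEU = 60.0       # higher is better
MAX_TER = 40.0        # lower is better (error rate)


def failed_thresholds(scores: EngineScores) -> List[str]:
    """Return the names of the metrics that miss their threshold."""
    failures = []
    if scores.f_measure < MIN_F_MEASURE:
        failures.append("F-Measure")
    if scores.bleu < MIN_BLEU:
        failures.append("BLEU")
    if scores.ter > MAX_TER:
        failures.append("TER")
    return failures


# Made-up scores for a hypothetical engine: only BLEU falls short.
candidate = EngineScores(f_measure=72.5, bleu=58.0, ter=39.0)
print(failed_thresholds(candidate))  # ['BLEU']
```

An engine passes only when the returned list is empty, that is, when all three conditions hold at once.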

How to work with KantanFleet: a quick introduction

To access KantanFleet, log in to the platform with your credentials. Once you are logged in and can see the Dashboard, go to Engines and click Fleet, as indicated in the screenshot below.

I have been working on expanding Fleet with the construction of a Generic domain. These new engines will combine the top five domains used in Library (Automotive, Financial, Legal, Medical, Technical) in the available language pairs. Currently we have 13 Generic engines available from the KantanFleet section of the platform, with more to be added very soon.

2.jpg

Once the Fleet button is clicked, the KantanFleet selection screen will be displayed. From this screen, you can select the domain of your new engine and click Next to select the language pair and direction. Clicking Next on this screen will display the final details of your pre-built engine, which include its Name, Language Pair, BLEU Score, Total Word Count and Unique Word Count. Once you click Add, the engine will be copied to your account and you can begin translating at once.

Under the umbrella of KantanLabs, the Research and Development department of KantanMT, we have been researching and developing infrastructure for the current innovation in the field of MT: Neural Machine Translation.

Neural Machine Translation and working with KantanMT data

Neural Machine Translation, or Neural MT, is the new industry ‘curiosity’, with many experts in the field of MT taking differing sides in the debate on whether Neural is the next leap forward for Machine Translation. Some of its advocates compare Neural output to human-quality translation. Realistically, this is not the case, but from a first-hand perspective I can testify that Neural MT does in some cases improve the fluency and adequacy of MT output. As with all cases of Machine Translation, though, high-quality, cleansed data is key.

KantanMT’s Neural journey began in late 2016, when Dr. Dimitar Shterionov, Head of MT Research and Development at KantanLabs, began testing our first Neural framework. I joined KantanLabs in early 2017. Under Dimitar’s expert supervision we have progressed to the point where we can now offer the Neural Experience to you, our clients, so you can see if NeuralMT is the right move going forward.

We recently hosted our first KantanFest, where industry experts, academics and NeuralMT users came together to speak, debate and educate on the subject. To view our videos from the proceedings you can head to our YouTube channel.

collage.jpg

We call all our current NeuralFleet engines Single Domain engines. This means that each engine is predominantly constructed with data from a single domain, e.g. Legal or Technical. As I mentioned earlier, we have already built numerous Multi-Domain/Generic engines. These Statistical engines will be converted to their Neural equivalents and will also be made available to you through the KantanMT platform.

KantanMT recently conducted an industry-relevant Quality Evaluation of NeuralMT versus PBSMT. Without going into too much detail, KantanMT constructed PBSMT and Neural engines across five language pairs. Each engine pair (PBSMT and Neural) was trained on identical datasets to get a true comparison of how each framework would utilise the same data. The outputs of both engine types were then evaluated by fifteen Master’s-level Translation students for fluency and adequacy. Our findings have been published and presented at EAMT 2017, and our methodology and evaluation results can be downloaded here.

As we have created both Neural and Statistical engines from the same datasets, any of our customers can initiate their own evaluation on the platform. You don’t have to take our word for it! All the tools, data and engines used in our evaluation are available on the platform.

Neural MT: Under the hood

The translation industry relies on quality metrics to give us a generalised estimate of the output we can expect our engine to produce. With regard to Neural MT, we identified dozens of segments per language that were assigned a quality score (BLEU) of 0%, yet during our evaluation these segments were chosen by our professional translators as being of higher fluency and adequacy than their Statistical counterparts. If we can’t rely on quality scores like BLEU to evaluate Neural MT, what do we use?
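
One plausible way a perfectly acceptable segment can end up with a score of 0% is easy to reproduce. The sketch below is a simplified, unsmoothed, sentence-level BLEU written for illustration only: it is not the scorer used on the platform, the function names are hypothetical and both sentences are invented. Because this form of BLEU takes a geometric mean of the 1-gram to 4-gram precisions, a translation that shares no three-word sequence with the reference scores zero, however fluent and adequate a human reader finds it.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against one reference, in the range 0-1."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram's matches to the number of times the reference has it.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # one empty precision zeroes the whole geometric mean
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish hypotheses that are shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the agreement will end on the agreed date"
hypothesis = "the contract shall terminate on the date agreed"  # same meaning, different wording
print(simple_bleu(reference, hypothesis))  # 0.0
```

The hypothesis matches five of its eight words and one word pair, but no three-word sequence, so the score collapses to zero even though the meaning is preserved.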

Before we go into further detail on how quality scores under-estimate NeuralMT segments, let’s try to demystify the process of creating a NeuralMT engine on the KantanMT platform. As with any aspect of the KantanMT product range, we aim to make creating an NMT engine as convenient as possible.

First of all, accessing our NeuralFleet engines follows the same process as acquiring a KantanFleet engine.

3.jpg

When selecting the domain for your KantanFleet engine, there is an option for NeuralMT. Select this option and click Next. As per the screenshot above, you can then select your language pair and click Next to review a summary of your engine, or click Add to copy your new NeuralMT engine to your account. Once it has finished copying, you can begin translating at once.

If you wish to convert an SMT engine that you have previously created, the process is painless and straightforward. Click the name of your SMT engine on the Dashboard and then select Properties. The Edit Properties screen will then be displayed. Under Type you can select between StatisticalMT and NeuralMT. Select NeuralMT and then click Save. Your new NeuralMT engine is now available on your Dashboard, and its neural nature is indicated by a diamond symbol that appears after its name. Remember, to finalise the creation of your NeuralMT engine, you will need to rebuild it by going to Training > Build.

4.jpg

A NeuralMT engine’s framework has far more computational work to perform than that of an SMT engine. All of KantanMT’s Neural offerings are built using OpenNMT, an industrial-strength, open-source Neural Machine Translation system. Our models are built on Linux, using GPUs (Graphics Processing Units) together with a CPU (Central Processing Unit) to accelerate deep learning. More information on OpenNMT and GPUs can be found here.

Here are some terms that are often used when discussing and working with NeuralMT.

Encoder/Decoder: A bidirectional recurrent neural network (RNN), known as an Encoder, is used by the neural network to encode a source sentence for a second RNN, known as a Decoder, that is used to predict words in the target language.

Hidden Layer: This layer is set between the Encoder and Decoder and is where the actual processing is done. The hidden layer is made up of nodes which contain the function to decode the input as a target output.

Epoch: One complete pass of the training algorithm over the entire data set, from start to finish. The number of epochs is the number of times the algorithm has seen the whole data set.

Iteration: One pass of a single batch (sample) of data through the algorithm. The batch passes through the encoder, hidden layer and decoder (forward pass) and back again through the layers (backward pass). Each time this happens counts as one iteration.
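
The relationship between the two terms is simple arithmetic, sketched here in Python. The corpus size, batch size and epoch count are hypothetical figures chosen for the example, not KantanMT settings, and the function names are illustrative.

```python
def iterations_per_epoch(num_segments: int, batch_size: int) -> int:
    """Batches needed to show every training segment to the model once."""
    if num_segments < 0 or batch_size <= 0:
        raise ValueError("num_segments must be >= 0 and batch_size must be > 0")
    # Ceiling division: a final, smaller batch still counts as one iteration.
    return -(-num_segments // batch_size)


def total_iterations(num_segments: int, batch_size: int, epochs: int) -> int:
    """Iterations performed after the given number of complete epochs."""
    return iterations_per_epoch(num_segments, batch_size) * epochs


# Hypothetical training run: 1,000,000 segments, batches of 64, 6 epochs.
print(iterations_per_epoch(1_000_000, 64))  # 15625
print(total_iterations(1_000_000, 64, 6))   # 93750
```

For a fixed batch size, adding training data increases the number of iterations in every epoch, which is why larger data sets make each epoch take longer.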

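BPE: BPE, or Byte Pair Encoding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within the data. A table of these replacements is kept, as it is required to rebuild the original data. In Neural MT the same idea is applied to characters: frequent adjacent pairs are merged into larger subword units, essentially creating new ‘words’, so that rare words can be represented as sequences of known pieces.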
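
The compression scheme described in this entry can be sketched in a few lines of Python. This is a minimal illustration that works on text characters rather than raw bytes; the function names and the spare symbols Z, Y and X are assumptions made for the example, and it is not the subword tooling used to train the engines.

```python
from collections import Counter
from typing import List, Tuple


def bpe_compress(text: str, spare_symbols: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Repeatedly replace the most frequent adjacent pair with an unused symbol.

    Returns the compressed text and the replacement table as (symbol, pair)
    entries, in the order the replacements were made.
    """
    if any(symbol in text for symbol in spare_symbols):
        raise ValueError("spare symbols must not occur in the text")
    table = []
    for symbol in spare_symbols:
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is gained by replacing it
        text = text.replace(pair, symbol)
        table.append((symbol, pair))
    return text, table


def bpe_decompress(text: str, table: List[Tuple[str, str]]) -> str:
    """Undo the replacements in reverse order to rebuild the original text."""
    for symbol, pair in reversed(table):
        text = text.replace(symbol, pair)
    return text


compressed, table = bpe_compress("aaabdaaabac", spare_symbols="ZYX")
print(compressed)  # XdXac
print(bpe_decompress(compressed, table) == "aaabdaaabac")  # True
```

The eleven-character input shrinks to five characters plus a three-entry table, and the table is all that is needed to restore it.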

Perplexity: A measurement of how well a probability distribution or probability model predicts a sample. A low perplexity indicates that the probability distribution is good at predicting the sample.
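
For readers who prefer to see the calculation, the standard definition of perplexity is the exponential of the average negative log-probability that the model assigned to each correct token. The Python sketch below assumes you already have those per-token probabilities; the function name and the sample values are invented for illustration and do not come from a KantanMT engine.

```python
import math
from typing import Sequence


def perplexity(token_probabilities: Sequence[float]) -> float:
    """exp of the average negative log-probability given to each correct token."""
    if not token_probabilities:
        raise ValueError("need at least one probability")
    total_log = sum(math.log(p) for p in token_probabilities)
    return math.exp(-total_log / len(token_probabilities))


# Invented probabilities a model might assign to the correct next word.
confident_model = [0.9, 0.8, 0.95, 0.85]
unsure_model = [0.2, 0.1, 0.25, 0.15]

print(perplexity(confident_model) < perplexity(unsure_model))  # True
```

A useful way to read the number: a perplexity of 3 is what a model would score if, at every step, it were choosing uniformly among three equally likely tokens.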

Neural MT Quality

Earlier I mentioned that BLEU under-estimates Neural quality. Instead of using a quality metric to estimate how well our engine will translate, we use perplexity. Perplexity values run from high to low: high at the beginning of training, when the model’s probability distribution predicts the data poorly, down to low, where we can say the distribution is good. Another deciding factor is how fast the perplexity reaches its low point. This can indicate how well your engine is learning. Too slow could mean your data is too varied, or that there is too much data being put through each iteration, which then increases the duration of each epoch and so on. As a rule, our perplexity threshold for Neural MT engines is ≤ 3.00.
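
Both checks described above, whether the threshold is met and how quickly, can be sketched in Python. The perplexity histories below are invented for illustration and the function name is hypothetical; only the 3.00 threshold comes from the text.

```python
from typing import Optional, Sequence

PERPLEXITY_THRESHOLD = 3.00  # threshold quoted in the post


def first_epoch_under_threshold(
    perplexities: Sequence[float],
    threshold: float = PERPLEXITY_THRESHOLD,
) -> Optional[int]:
    """Return the 1-based epoch at which perplexity first meets the threshold."""
    for epoch, value in enumerate(perplexities, start=1):
        if value <= threshold:
            return epoch
    return None  # threshold not reached yet


# Invented end-of-epoch perplexities for two hypothetical engines.
fast_learner = [41.7, 12.3, 6.8, 4.2, 3.3, 2.59]
slow_learner = [55.0, 38.4, 29.9, 24.1, 20.6, 18.2]

print(first_epoch_under_threshold(fast_learner))  # 6
print(first_epoch_under_threshold(slow_learner))  # None
```

A result of None after several epochs is the signal to look again at the training data rather than simply waiting longer.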

This value can be viewed throughout your build cycle on the platform. When your NeuralMT engine is building, go to Jobs. Here all your active jobs are displayed. Find your engine build job in the list and click Job Details on the right-hand side of the screen. Once your engine has reached the 20% progress mark, the perplexity score will be visible in Job Details, as indicated by the screenshot below.

5.jpg

The figure above shows a NeuralFleet engine from the Legal domain. In the Job Details screen, the perplexity is shown as 2.59 at the 6th epoch. This engine was then stopped and evaluated. If the engine does not perform well after stopping at this point, you have two options. You can continue building your engine from the last epoch it completed: if I chose to continue with this engine, it would resume its build from the 7th epoch, as it has completed the first 6. The other option is to add more data, possibly tuned, in-domain data, which could improve the perplexity value of the engine.

Neural Machine Translation is the “new kid on the block” when it comes to translation technologies, and I hope I have been able to give you a brief idea of how we at KantanMT work with our Neural engines. More importantly, I hope it has helped to clear up some of the ongoing debate regarding quality. Always remember, “Quality in, Quality out”. Using good-quality data will always help improve your translation outputs, and KantanMT can offer this and more. More engines, both Statistical and Neural, will be added to the KantanMT catalogue in the near future. Watch this space.

If you have any questions, reach out and contact us. We want to know what you think. You can contact me at patn@kantanmt.com, or contact our Professional Services team at professionalservices@kantanmt.com for queries on the platform and what we can offer you.

 

About Pat Nagle:

Pat Nagle is a Project Manager at KantanMT. He holds a B.Sc. in Software Systems Development from Waterford Institute of Technology. He brings to the team more than eight years of experience in Localization and Client Support, in organisations like Lionbridge, Symantec, AOL and Sun Life Insurance.

Pat works with the Professional Services Team and has recently joined KantanLabs, the Research and Development department at KantanMT.
