Machine Translation (MT) has experienced a surge in popularity. However, achieving the right level of quality output can be challenging, even for the most expert MT engineers.

In the webinar ‘Tips for Preparing Training Data for High Quality MT’, KantanMT’s Founder and Chief Architect, Tony O’Dowd, and Selçuk Özcan, Co-founder of Transistent Language Automation Services, discussed how best to prepare training data to build high-quality Statistical Machine Translation (SMT) engines. Here are their answers from the Q&A session.

Reading time: 5 mins

When it comes to Machine Translation, we know that quantity does not always equal quality. In your opinion, how many words will it take to build a fully functional engine?

Tony O’Dowd: Great question! Based on the entire community of Kantan users today, we have more than 7600 engines on our system. Those engines range from very small all the way up to very large. The biggest engines, which are in the eCommerce domain, contain about a billion words each.

If we exclude all the billion-word MT engines, so they don’t distort the results, then the average size of a KantanMT engine today is approximately 5 million source words.

For example, if you look at our clients in the automotive industry, they have engines of around 5 million source words, which produce very high-quality MT output.

How long does it take to build an engine of that size?

TOD: Again, using KantanMT.com as an example, we can build an MT engine at approximately 4 million words per hour. Therefore, a 5 million-word engine takes approximately one to one and a half hours to build. Compared with other MT providers in the industry, this is insanely fast.

This speed is possible because of our AWS cloud infrastructure. At the moment, we have 480 servers running the system. With such fast build times, our clients can retrain their engines more frequently, giving them higher levels of productivity and higher levels of quality output than most other systems. Read a client use case where speed had a positive impact on MT quality for eCommerce product descriptions (Netthandelen/Milengo case study).

How long does it take to accumulate that many words?

TOD: Most of our clients are able to deliver those words themselves; those who don’t have 5 million source words will normally upload what they have and select one of our stock engines to help them reach a higher word count.

When we look at building an engine for a client, we look at the number of source words, but the key number for us is the number of unique words in an engine. For instance, if I want to have a high-quality German engine in a narrow domain, it might consist of 5 million source words. More importantly, the unique word count in that engine is going to be close to, or slightly more than, a million unique words.

If I have a high unique word count, I know the engine is going to know how to translate German correctly. Therefore, we don’t look at one word count; we look at a number of different word counts to achieve a high-quality engine.
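
To make the two counts concrete, here is a minimal Python sketch of how total versus unique source words could be measured on a plain-text corpus. The whitespace tokenisation and the file name are illustrative assumptions, not KantanMT’s actual pipeline.

```python
# Minimal sketch: total vs. unique source words in a corpus.
# Whitespace tokenisation and the file name are assumptions for
# illustration; a production pipeline would use language-aware tokenisation.
from collections import Counter

def corpus_word_counts(path):
    """Return (total words, unique words) for a one-segment-per-line file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    return sum(counts.values()), len(counts)

total, unique = corpus_word_counts("source_corpus.de.txt")  # hypothetical file
print(f"{total:,} source words, {unique:,} unique words")
```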

Another factor to consider is the degree of inflection in the language, which is an indicator of how many words are needed: to educate and train the system, we need usage examples of all those inflected forms. Generally speaking, highly inflected languages require a lot more training data, so to build an engine for Hungarian, an incredibly inflected language, you will need in excess of 2-3 times the average word count to get workable, high-quality output.

What kind of additional monolingual data do you have?

TOD: There are 3 ways we can help clients source suitable, relevant and high-quality monolingual data.

  1. We have a library of training data stock engines on KantanMT.com, which all include monolingual data in a variety of domains (Medical, IT, Financial etc.).
  2. In addition to stock engines, most of our clients upload their own monolingual data as PDF, DOCX or plain text files, and we normalise that data. We have an automatic process in place to cleanse the data and convert it into a suitable format for machine translation/machine learning (see the sketch after this list).
  3. We also offer a spider service, where clients give us a list of domain-related URLs from which we can collect monolingual data. For example, we recently built a medical engine in Mexican Spanish for a client in the US, and we collected more than 150k medical terms from health service content, which provided a great boost to the quality and, more importantly, the fluency of the MT engine.
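
As referenced in point 2 above, here is a minimal sketch of what one pass of an automatic cleansing process might look like. The individual rules (Unicode normalisation, markup stripping, whitespace collapsing) are illustrative assumptions, not KantanMT’s actual process.

```python
# Minimal sketch of a monolingual cleansing pass; the rules shown are
# assumptions for illustration, not KantanMT's actual pipeline.
import re
import unicodedata

def cleanse_line(line):
    line = unicodedata.normalize("NFC", line)  # unify Unicode representations
    line = re.sub(r"<[^>]+>", " ", line)       # strip residual markup
    return re.sub(r"\s+", " ", line).strip()   # collapse whitespace

def cleanse(lines):
    for line in lines:
        clean = cleanse_line(line)
        if clean:                              # drop lines left empty
            yield clean
```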

Selçuk Özcan: At Transistent, we collect data from open-source projects and open-source data sets. First, we define filters to ensure we pull only relevant monolingual data from the open-source tools, which also involves spidering techniques. We then build a combined corpus from the monolingual data we collected, which is used for training the MT engine.

What is the difference between pre-normalisation and final normalisation?

SÖ: The normalisation process is related to the TMS (Translation Management System), CMS (Content Management System) and TM (Translation Memory) systems. Pre-normalisation is applied to the text extracted from your systems to assure that the job will be processed properly. Final normalisation is then applied to the MT output to ensure that content is successfully integrated into the systems.
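
A minimal sketch of this two-sided idea, assuming inline markup is the content to protect: pre-normalisation masks it with placeholders before the text goes to MT, and final normalisation restores it in the MT output. The placeholder scheme is an assumption for illustration.

```python
# Sketch: protect inline markup before MT, restore it afterwards.
# The __TAGn__ placeholder scheme is an assumption for this example.
import re

TAG = re.compile(r"<[^>]+>")  # inline markup from a TMS/CMS export

def pre_normalise(text):
    """Replace tags with numbered placeholders before sending text to MT."""
    tags = []
    def mask(match):
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"
    return TAG.sub(mask, text), tags

def final_normalise(mt_output, tags):
    """Re-insert the original tags into the MT output."""
    for i, tag in enumerate(tags):
        mt_output = mt_output.replace(f"__TAG{i}__", tag)
    return mt_output

masked, tags = pre_normalise("Click <b>Save</b> to continue.")
# ... machine translate `masked` here ...
restored = final_normalise(masked, tags)  # tags back in place
```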

Can pre-normalisation and final normalisation be applied to corpora from TMs?

SÖ: It is possible to implement normalisation rules to corpora from TM systems. You have to configure your rules depending on your TM tool. Each tool has its own identification and encoding features for tags, markups, non-translatable strings and attributes.
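
For illustration, such rules could be keyed by tool, since each tool encodes inline tags differently. The patterns below (TMX inline elements, XLIFF-style codes) are simplified assumptions, not a complete rule set for any particular tool.

```python
# Sketch: normalisation rules keyed by TM tool; the patterns are
# simplified assumptions, not complete rule sets.
import re

TOOL_RULES = {
    # TMX inline elements (<bpt>, <ept>, <ph>, <it>, <hi>)
    "tmx": [re.compile(r"</?(?:bpt|ept|ph|it|hi)\b[^>]*>")],
    # XLIFF-style inline codes (<g>...</g>, <x/>)
    "xliff": [re.compile(r"</?g\b[^>]*>|<x\b[^>]*/>")],
}

def normalise_for_tool(segment, tool):
    for pattern in TOOL_RULES[tool]:
        segment = pattern.sub(" ", segment)
    return re.sub(r"\s+", " ", segment).strip()
```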

How many words is considered too many in a long segment?

TOD: As part of our data cleansing policy, any data uploaded to a Kantan engine goes through 12 phases of data cleansing. Only segments that pass those 12 phases are included in the engine training. That may seem like a very harsh regime, but it is in place for a very good reason.

At KantanMT, the 3 things we look for in training data are:

  1. Quality
  2. Relevance
  3. Quantity

We make sure that all the data you upload is very clean from a structural and linguistic point of view before we include it in your engine. If the training data fails any of those 12 steps, it will be rejected. For example, one phase is to check for long segments. By default, any segments with more than 40 words are rejected. This can be changed depending on the language combination and domain, but the default is 40 words or 40 tokens.

SÖ: As Tony mentioned, it also depends on the language pair. Nevertheless, you may also want to define the threshold value according to the dynamics of your system, i.e. data, domain, required target quality and so on. We usually split segments at 40–45 words.
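
A minimal sketch of such a length check, with the 40-token default described above made configurable. Whitespace tokenisation stands in here for whatever tokeniser a real pipeline would use.

```python
# Sketch: reject bilingual segments longer than a configurable threshold.
# The 40-token default follows the behaviour described above.
def filter_long_segments(segments, max_tokens=40):
    """Split (source, target) pairs into kept and rejected lists."""
    kept, rejected = [], []
    for src, tgt in segments:
        if len(src.split()) > max_tokens or len(tgt.split()) > max_tokens:
            rejected.append((src, tgt))
        else:
            kept.append((src, tgt))
    return kept, rejected
```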

How long does it take to normalise the data?

SÖ: The time frame for normalising data depends on a number of factors, including the language pair, the differences between the linguistic structures you are working with, how clean the data is, and the source of the data. If you have lots of formulas or non-standard characters, it will take longer to normalise that data.

For Turkish, it might take 10-15 days on average to normalise 10 million words. Of course, this depends on the size of the team involved and the volume of data to be processed.

TOD: The time required to normalise data is very much data driven. The rule of thumb in the Kantan Professional Services Team is this: standard text consisting mostly of words, such as text from a book, online help or user interface text, where the predominant token is a word, is normalised very quickly, because there are no mixed tokens in the data set, only words.

However, if you have numerical data, scientific formulas or product specifications such as measurements with a lot of part numbers, there is a high diversity of individual tokens as opposed to simple words. This type of data takes a little longer to normalise because you have to instruct the engine how to handle those tokens, which you can do using the GENTRY programming language and Named Entity Recognition (NER) software.

We have GENTRY and NER built into KantanMT.com, so we can educate the engine to recognise those tokens. This is important because if the engine doesn’t recognise the data, it can’t handle it during the translation phase.

The more diverse the tokens in your input, the longer the normalisation process takes; conversely, the less diverse the tokens, the quicker the data can be processed. If it’s just words, the system can handle this automatically.

We use this rule of thumb when working with clients to estimate how long it will take to build their engines, as we need to be able to give them some sense of a schedule around building an actual MT engine.
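
As a rough illustration of that rule of thumb, a sketch like the one below could estimate the share of mixed tokens (numbers, measurements, part numbers) in a data set. The patterns are simplified assumptions and are unrelated to the actual GENTRY or NER rules.

```python
# Sketch: estimate token diversity as the share of non-word tokens.
# The letters-only pattern is an assumption for illustration.
import re

WORD = re.compile(r"^[^\W\d_]+$")  # letters only, in any script

def mixed_token_ratio(text):
    tokens = text.split()
    mixed = [t for t in tokens if not WORD.match(t)]
    return len(mixed) / len(tokens) if tokens else 0.0

print(mixed_token_ratio("Tighten bolt M8-25 to 12.5 Nm"))       # mixed tokens
print(mixed_token_ratio("Open the File menu and choose Save"))  # words only
```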

What volume of words would you suggest for a good Turkish engine?

SÖ: It makes no sense to work on a Turkish MT system if you do not have at least a million words of bilingual data and 3 million words of monolingual data. Even in this case, you will have to work more on analyses, testing procedures and rule sets. Ideally, you will have approximately 10 million words of bilingual data. It’s the basic equation of SMT engine training: the more data you have, the higher the quality you achieve.

How long does it take to build an engine for Turkish?

SÖ: It depends on the language pair and the field of expertise or domain. Things may be harder if you are working on a language pair with very different linguistic structures, such as English and Turkish. However, it’s not impossible to build a mature MT system for such a pair; you will just need to spend more time on it. Another parameter that affects the time required to reach a mature MT system is the quality of the data to be used for training it. It is hard to give a specific estimate without looking at the data, but in general it will probably take 2 to 6 months to have the intended production system.

View the Slide Deck

Watch the webinar recording:

To learn more about the KantanMT platform, contact us (demo@kantanmt.com) now for a platform demonstration.