Q&A: Tips for Preparing Training Data for High Quality Machine Translation

Machine Translation (MT) has experienced a surge in popularity. However, achieving the right level of quality output can be challenging, even for the most expert MT engineers.

In the webinar ‘Tips for Preparing Training Data for High Quality MT’, KantanMT’s Founder and Chief Architect Tony O’Dowd and Selçuk Özcan, Co-founder of Transistent Language Automation Services, discussed how best to prepare training data to build high quality Statistical Machine Translation (SMT) engines. Here are their answers from the Q&A session.

Reading time: 5 mins

When it comes to Machine Translation, we know that quantity does not always equal quality. In your opinion, how many words will it take to build a fully functional engine?

Tony O’Dowd: Great question! Based on the entire community of Kantan users today, we have more than 7600 engines on our system. Those engines range from very small all the way up to very large. The biggest engines, which are in the eCommerce domain, contain about a billion words each.

If we exclude all the billion-word MT engines so they don’t distort the results, the average size of a KantanMT engine today is approximately 5 million source words.

For example, if you look at our clients in the automotive industry, they have engines of around 5 million source words, which are producing very high quality MT output.

How long does it take to build an engine of that size?

TOD: Again, using KantanMT.com as an example: we can build an MT engine at approximately 4 million words per hour. Therefore, a 5 million-word engine takes approximately an hour to an hour and a half to build. Compared with other MT providers in the industry, this is insanely fast.

This speed is possible because of our AWS cloud infrastructure. At the moment, we have 480 servers running the system. With such fast build times, our clients can retrain their engines more frequently, giving them higher levels of productivity and higher levels of quality output than most other systems. Read a client use case where speed had a positive impact on MT quality for eCommerce product descriptions (Netthandelen/Milengo case study).

How long does it take to accumulate that many words?

TOD: Most of our clients are able to deliver those words themselves, but our clients who don’t have 5 million source words will normally upload what they have and select one of our stock engines to help them reach a higher word count.

When we look at building an engine for a client, we look at the number of source words, but the key figure for us is the number of unique words in the engine. For instance, a high quality German engine in a narrow domain might consist of 5 million source words; more importantly, the unique word count in that engine is going to be close to, or slightly above, a million unique words.

If I have a high unique word count, I know the engine is going to know how to translate German correctly. Therefore, we don’t look at a single word count; we look at a number of different word counts to achieve a high quality engine.
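
A quick way to sanity-check these figures on your own training data before uploading is to count total and unique source tokens. The snippet below is a minimal sketch, assuming one segment per line and simple whitespace tokenisation (a production pipeline would use a language-aware tokeniser); the file name is hypothetical:

```python
# Count total and unique source words in a plain-text training file.
# Assumes one segment per line and naive whitespace tokenisation.
from collections import Counter

def corpus_word_counts(path):
    counts = Counter()
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.lower().split()
            counts.update(tokens)
            total += len(tokens)
    return total, len(counts)

total_words, unique_words = corpus_word_counts("train.de")  # hypothetical file name
print(f"Source words: {total_words:,}  Unique words: {unique_words:,}")
```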

Another factor to consider is how inflected the language is, as this is an indicator of how many words are needed: to educate and train the system, we need more usage examples of those inflected forms. Generally speaking, highly inflected languages require a lot more training data, so to build an engine for Hungarian, which is a highly inflected language, you will need in excess of 2-3 times the average word count to get workable, high quality output.

What kind of additional monolingual data do you have?

TOD: There are 3 ways we can help clients obtain suitable, relevant and high quality monolingual data.

  1. We have a library of training data stock engines on KantanMT.com, which all include monolingual data in a variety of domains (Medical, IT, Financial etc.).
  2. In addition to stock engines, most of our clients upload their own monolingual data, either as PDF, DOCX or plain text files, and we normalise that data. We have an automatic process in place to cleanse the data and convert it into a format suitable for machine translation/machine learning (a minimal cleanup sketch follows this list).
  3. We also offer a spider service, where clients give us a list of domain-related URLs from which we can collect monolingual data. For example, we recently built a medical engine in Mexican Spanish for a client in the US, and we collected more than 150k medical terms from health service content, which gave a great boost to the quality and, more importantly, the fluency of the MT engine.
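
As referenced in point 2 above, here is a minimal illustration of that kind of automatic cleanup. The specific steps (Unicode NFC normalisation, control-character removal, whitespace collapsing, exact-duplicate filtering) and the file names are assumptions for the example, not KantanMT’s actual pipeline:

```python
# Minimal monolingual cleanup sketch: Unicode normalisation, control-character
# removal, whitespace collapsing and exact-duplicate filtering. A production
# pipeline would also handle PDF/DOCX extraction, encoding detection and
# sentence splitting.
import re
import unicodedata

def normalise_line(line):
    line = unicodedata.normalize("NFC", line)
    line = "".join(ch for ch in line if unicodedata.category(ch)[0] != "C")  # drop control chars
    return re.sub(r"\s+", " ", line).strip()  # collapse whitespace

def clean_file(src_path, dst_path):
    seen = set()
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for raw in src:
            line = normalise_line(raw)
            if line and line not in seen:  # skip empty lines and exact duplicates
                seen.add(line)
                dst.write(line + "\n")

clean_file("monolingual_raw.txt", "monolingual_clean.txt")  # hypothetical file names
```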

Selçuk Özcan: At Transistent, we collect data from open source projects and open source data. First, we define some filters to ensure that we have the relevant monolingual data from the open source tools, which also includes spidering techniques. We then create a total corpus with the monolingual data we collected, which is used for training the MT engine.

What is the difference between pre-normalisation and final normalisation?

SÖ: The normalisation process is related to the TMS (Translation Management System), CMS (Content Management System) and TM (Translation Memory) systems. Pre-normalisation is applied to the text extracted from your systems to ensure that the job will be processed properly. Final normalisation is then applied to the MT output to ensure that the content is successfully integrated back into those systems.
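
To make the two passes concrete, the sketch below masks CMS/TM placeholders before translation and restores them in the MT output afterwards. The regular expression and mask-token format are assumptions for the example, not any particular system’s implementation:

```python
# Illustrative pre-/final-normalisation round trip: mask placeholders so the MT
# engine treats them as opaque tokens, then restore them in the MT output.
import re

PLACEHOLDER = re.compile(r"\{\d+\}|%[sd]|<[^>]+>")  # e.g. {0}, %s, inline tags (assumed formats)

def pre_normalise(text):
    """Replace placeholders with stable mask tokens and remember the originals."""
    mapping = {}
    def mask(match):
        token = f"__PH{len(mapping)}__"
        mapping[token] = match.group(0)
        return token
    return PLACEHOLDER.sub(mask, text), mapping

def final_normalise(mt_output, mapping):
    """Put the original placeholders back into the translated text."""
    for token, original in mapping.items():
        mt_output = mt_output.replace(token, original)
    return mt_output

masked, mapping = pre_normalise("Click <b>Save</b> to store {0} items.")
print(masked)                            # Click __PH0__Save__PH1__ to store __PH2__ items.
print(final_normalise(masked, mapping))  # round-trips back to the original string
```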

Can pre-normalisation and final normalisation be applied to corpora from TMs?

SÖ: It is possible to apply normalisation rules to corpora from TM systems, but you have to configure your rules depending on your TM tool. Each tool has its own identification and encoding conventions for tags, markup, non-translatable strings and attributes.
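
As a simplified example of such tool-dependent rules, the snippet below strips TMX-style inline tags (bpt/ept/ph/it) from exported segments. The patterns are assumptions and would need to be adapted to the tags and encodings your TM tool actually produces:

```python
# Drop TMX-style inline tags together with their raw markup content, then
# tidy the remaining whitespace. Patterns are illustrative, not exhaustive.
import re

PAIRED_TAGS = re.compile(r"<(bpt|ept|ph|it)[^>]*>.*?</\1>", re.DOTALL)
SELF_CLOSING = re.compile(r"<[^>]+/>")

def strip_tm_markup(segment):
    segment = PAIRED_TAGS.sub(" ", segment)
    segment = SELF_CLOSING.sub(" ", segment)
    return re.sub(r"\s+", " ", segment).strip()

print(strip_tm_markup('Press <bpt i="1">&lt;b&gt;</bpt>OK<ept i="1">&lt;/b&gt;</ept> to continue.'))
# -> Press OK to continue.
```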

How many words is considered too many in a long segment?

TOD: As part of our data cleansing policy, any data uploaded to a Kantan engine goes through 12 phases of data cleansing. Only segments that pass those 12 phases are included in the engine training. That may seem like a very harsh regime, but it is in place for a very good reason.

At KantanMT, the 3 things we look for in training data are:

  1. Quality
  2. Relevance
  3. Quantity

We make sure that all the data you upload is very clean from a structural and linguistic point of view before we include it in your engine. If a segment fails any of those 12 steps, it is rejected. For example, one phase checks for long segments: by default, any segment with more than 40 words is rejected. This threshold can be changed depending on the language combination and domain, but the default is 40 words or 40 tokens.
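
To make the long-segment check concrete, here is a minimal filter in the spirit of the 40-word default described above (the threshold value and whitespace tokenisation are assumptions for the sketch):

```python
# Reject aligned segment pairs where either side exceeds the token limit.
MAX_TOKENS = 40  # default threshold described above; adjust per language/domain

def filter_long_segments(pairs, max_tokens=MAX_TOKENS):
    kept, rejected = [], []
    for src, tgt in pairs:
        if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
            kept.append((src, tgt))
        else:
            rejected.append((src, tgt))
    return kept, rejected

pairs = [("A short source segment.", "Ein kurzes Quellsegment."),
         ("word " * 50, "wort " * 50)]  # second pair exceeds the limit
kept, rejected = filter_long_segments(pairs)
print(len(kept), "kept,", len(rejected), "rejected")  # 1 kept, 1 rejected
```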

SÖ: As Tony mentioned, it also depends on the language pair. Nevertheless, you may also want to define the threshold value according to the dynamics of your system, i.e. the data, domain, required target quality and so on. We usually split segments at 40–45 words.

How long does it take to normalise the data?

SÖ: The time frame for normalising data depends on a number of factors, including the language pair and the differences between the linguistic structures you are working with, how clean the data is, and the source of the data. If you have lots of formulas or non-standard characters, it will take longer to normalise that data.

For Turkish, it might take 10–15 days on average to normalise around 10 million words. Of course, this depends on the size of the team involved and the volume of data to be processed.

TOD: The time required to normalise data is very much data driven. A rule of thumb in the Kantan Professional Services Team is that standard text consisting mostly of words – such as text from a book, online help or user interface text, where the predominant token is a word – is normalised very quickly, because there are no mixed tokens in the data set, only words.

However, if you have numerical data, scientific formulas and product specifications such as measurements with a lot of part numbers, there is a high diversity of individual tokens as opposed to simple words. This type of data takes a little longer to normalise because you have to instruct the engine how to handle those tokens, which you can do using the GENTRY programming language and Named Entity Recognition (NER) software.

We have GENTRY and NER built into KantanMT.com, so we can educate the engine to recognise those tokens. This is important because if the engine doesn’t recognise the data, it can’t handle it during the translation phase.

The more diverse the tokens in your input, the longer the normalisation process takes; conversely, the less diverse the tokens are, the quicker the data can be processed. If it’s just words, the system can handle this automatically.
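
A rough way to gauge how ‘mixed’ a data set is before normalisation is to measure the share of tokens that are not plain words (numbers, part codes, measurements and so on). The sketch below is purely illustrative; the classification pattern is an assumption:

```python
# Estimate the share of non-word tokens (digits, part numbers, mixed codes)
# as a crude proxy for normalisation effort.
import re

WORD = re.compile(r"^[^\W\d_]+$", re.UNICODE)  # tokens made of letters only

def non_word_share(lines):
    total = word_like = 0
    for line in lines:
        for token in line.split():
            total += 1
            if WORD.match(token):
                word_like += 1
    return 1 - (word_like / total) if total else 0.0

sample = ["Tighten bolt M8x40 to 25 Nm using part no. 4711-DE."]
print(f"Non-word token share: {non_word_share(sample):.0%}")  # 40% for this sample
```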

We use this rule of thumb when working with clients to estimate how long it will take to build their engines, as we need to be able to give them some sense of a schedule around building an actual MT engine.

What volume of words would you suggest for a good Turkish engine?

SÖ: It makes no sense to work on a Turkish MT system if you do not have at least a million words of bilingual data and 3 million words of monolingual data. Even in this case, you will have to work more on analyses, testing procedures and rule sets. Ideally, you will have approximately 10 million words of bilingual data. It’s the basic equation in SMT engine training: the more data you have, the higher the quality you achieve.

How long does it take to build an engine for Turkish?

SÖ: It depends on the language pair and the field of expertise or domain. Things may be harder if you are working on a language pair whose languages have very different linguistic structures, such as English and Turkish. However, it’s not impossible to build a mature MT system for such a pair; you will just need to spend longer on it. Another parameter that affects the time required to reach a mature MT system is the quality of the data to be used for the intended system. It is hard to give a specific time estimate without looking at the data, but in general it will probably take 2 to 6 months to have the intended production system.

View the Slide Deck

Watch the webinar recording:

To learn more about the KantanMT platform, contact us (demo@kantanmt.com) now for a platform demonstration.

Video: Machine Translation Success – Milengo and KantanMT

Machine translation applications have skyrocketed, and we as consumers demand content to be readily available in our native language. We make purchases online quickly and expect those purchases delivered to our doors regardless of language and shipping destination.

Common Sense Advisory identified that three quarters of online consumers prefer to buy in their own languages. This is significant for online businesses, and companies are aware that a localized product or service available online means a much larger customer pool, which in turn leads to more sales and a bigger return for stakeholders.

The global ecommerce market is growing at approximately 30% per year and is estimated to reach $2 trillion in sales in 2015 (Democratization of Ecommerce Report, BigCommerce).

There is one big ‘wall’ still standing between more sales revenues and happy customers, and that is ‘multilingual support’. Traditional multilingual support requires a heavy investment in translation and localization workflows, not to mention a plethora of specialists needed to provide linguistic support.

However, ‘Big data’, computing capabilities and the cloud are creating unique possibilities to avoid such heavy investments and companies that choose to embrace these new opportunities are reaping the rewards.

KantanMT’s Founder and Chief Architect Tony O’Dowd and Deepan Patel, Machine Translation Solutions Architect at Milengo Ltd., discuss the opportunities offered by implementing a cloud-based machine translation solution. They examine Milengo’s experience using KantanMT to optimise its translation supply chain, and illustrate with examples how the leading translation company uses KantanMT.com to achieve excellent results in ongoing MT projects for some of the world’s major companies.

Key Takeaways:

  1. Manage User Expectations: Clear communication with the client about the process, workflow and expected results will ensure trust and confidence in the project. Even without a pilot test, Milengo still managed to localize a web shop with 780,000 Danish words to Swedish in 17 days.
  2. Think to Scale: The localization process must always be scalable; each example – software documentation (Interactive Intelligence), ecommerce (Netthandelen) and automotive parts data – required an automated solution that could be scaled.
  3. Customise It: MT customisation can fulfil a wide variety of localization needs. Not only is it more cost efficient (Netthandelen achieved 62% cost savings), it also enables fast engine retraining and improves the engine’s ability to generate higher quality translations.

To learn how you can generate meaningful business intelligence that lets you manage and improve the ROI from Machine translation, contact us for a free consultation and/or personalised platform demonstration.

RTS Case Study

About RoundTable Studio:
RoundTable Studio is a leading provider of translation and localisation services for the Spanish and Brazilian Portuguese language markets. With production centres in Brazil and Argentina, and a small business and project management unit in Spain, RTS and its team of more than 50 linguists, project managers and technical experts serve a worldwide client base across many vertical markets including IT, business & finance, and manufacturing. The company’s quality ethos and high level of customer service are significant factors in their success and reputation within the industry, and their openness to new technology is proving to be an important factor in their increased competitiveness and company growth.

Situation:
Global collaboration and increased competition are formidable factors affecting both the translation industry and the industries it serves. Consequently, LSPs like RTS are beginning to search for scalable, cost effective solutions which will help them bridge the gap between meeting expectations of ever faster turnaround times and maintaining high quality standards at affordable rates.

Laura Grossi, Localisation Engineer at RTS, confesses that “sometimes there are just not enough linguists to carry out jobs” – a feeling that many LSPs in search of a capacity solution are echoing.

Solution:
RTS has been active in Machine Translation since 2005, when the company started working with one of its key clients on a focused Machine Translation initiative. This relatively early exposure led to the company building substantial technical and practical expertise, as well as a firm belief that Machine Translation has an important role to play in the future of the industry. In addition to collaborating on client-specific Machine Translation programs, RTS also realized the need to find a solution that would help it integrate Machine Translation selectively into its tools portfolio. However, finding a Machine Translation provider which would give them the freedom to manage and control their Machine Translation activity in a strategic manner proved to be a difficult task. RTS assessed and tested a number of Machine Translation solutions before being introduced to KantanMT in October 2012. The company was immediately impressed by the simplicity, flexibility, control and evaluation metrics that the KantanMT platform provided.

Using KantanMT quality evaluation metrics, including BLEU, TER and F-Measure, RTS was able to expand its knowledge and improve engine quality and subsequently output quality. Grossi notes that a favourite feature of the platform is the KantanWatch™ reporting function, a measurement tool allowing her to track engine quality over time, helping the team to become more adept at choosing training data to reduce set up times and increase productivity.
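
For readers who want to compute scores like these outside the platform, the open-source sacrebleu library provides standard BLEU and TER implementations (this is a general-purpose sketch with invented example sentences, not KantanMT’s own evaluation tooling):

```python
# Corpus-level BLEU and TER with sacrebleu (pip install sacrebleu).
from sacrebleu.metrics import BLEU, TER

hypotheses = ["the cat sat on the mat"]           # MT output, one string per segment
references = [["the cat is sitting on the mat"]]  # one reference stream of the same length

bleu = BLEU().corpus_score(hypotheses, references)
ter = TER().corpus_score(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  TER: {ter.score:.1f}")
```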

Measuring KantanMT using KantanWatch

Strategic and selective deployment of Machine Translation within its workflow is enabling RTS to improve the capacity and flexibility of its translation throughput. Combined with a focused investment in developing post-editing resources and capacity in order to maintain the same quality standards as with human translation, the company considers it a critical tool for improving efficiency and growing the business, as well as for offering its customers optimal overall value and service.

Grossi concludes that RTS “has increased its productivity on certain translation jobs significantly” and, by implementing KantanMT, “has increased capacity levels to take on translation jobs they otherwise would have had to turn down”.

Adopting KantanMT technology has helped RTS successfully create a foundation for ensuring its future business competitiveness.

Read how Matrix Communications AG introduced Machine Translation into their workflow with KantanMT.com >>