Why good data is so crucial in AI and Machine Translation
These days there is a lot of talk about data: big data, data mining, bad data and data sets are all concepts that have entered our everyday speech. We are surrounded by software products and applications that collect information and adapt their output to respond better to our needs. Data is the fuel that feeds the machine learning furnace, but how does it really affect a company’s return on investment?
AI and Machine Translation systems are built and trained with data. In Machine Translation, the closer this data is to the content that needs to be translated, the better the translation results will be. For example, if a company needs to translate user reviews, it would be a really bad idea to train the engine with data belonging to, for instance, the construction industry, unless these customer reviews had to do with construction.
To use a simple analogy, let’s say that you have a dog called K9. If you would like K9 to pick up the post for you every morning without lunging at the postman, you are going to have to teach it the fetch command. If instead of that, K9 learns to chase away strangers from your property, you will never see another letter again!
What constitutes good data
In general terms, machine translation engines need very large amounts of bilingual data. Bilingual data, sometimes called parallel corpora, are huge sets of terms, phrases and sentences aligned one by one with their translations. For instance, an English-speaking customer support provider that wants to reach the German market will need a parallel corpus in English and German.
The closer the bilingual training data is to the topic at hand, the better the translation will be. This is what the field calls in-domain data. In a real-life scenario, this could mean that if the automatic translation is good enough to be understood by the English-speaking agent, post-editing becomes unnecessary, saving the company time and money.
Building a machine translation system requires huge amounts of data, sometimes hundreds of millions of words; for that reason, it is sometimes impossible to train a system exclusively with the company’s own data. Never fear: to bridge the gap, providers like KantanAI offer a range of in-domain data sets belonging to different industries and verticals that can be used as a starting point to train the system, together with the company’s data.
Graphic with examples of industry verticals.
Let’s go back to our dog K9. K9 could have learned the wrong set of commands and ended up chasing the postman instead of retrieving the letter, but inconsistent or faulty training (in our case, inconsistent or faulty data) can be just as bad. For that reason, it is important to cleanse the data sets before using them to train a translation system.
Here are some recommended steps:
Remove inconsistencies: if for whatever reason there are inconsistent translations in the datasets, these should be removed or fixed. Sometimes the translation of a term or product name changes over time and appears in two different ways in the dataset; if that is the case, the old translation needs to be fixed.
Remove noise: excessively long sentences, translated segments that appear in the wrong language (yes, it happens!) and segments that only contain symbols or numbers introduce “noise” in the training process and lower the quality of the results.
Anonymise: in order to be GDPR compliant, all personal data will need to be removed from the datasets.
KantanAI’s production pipeline contains thorough cleaning steps that can deal successfully with all these types of issues.
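To make the steps above more concrete, here is a minimal Python sketch of what such a cleaning pass could look like. It is an illustration only and not KantanAI's pipeline: the function names, the MAX_WORDS threshold and the email-only anonymisation are all assumptions made for this example, and the check for untranslated segments is a crude stand-in for proper language identification.

```python
import re
from collections import defaultdict

# Hypothetical threshold: segments longer than this are treated as noise.
MAX_WORDS = 80

# Simple email pattern, used here as one example of personal data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_inconsistent(pairs):
    """Return source segments that have more than one distinct translation."""
    targets = defaultdict(set)
    for source, target in pairs:
        targets[source.strip().lower()].add(target.strip())
    return {src: tgts for src, tgts in targets.items() if len(tgts) > 1}


def is_noise(source, target):
    """Return True for segment pairs that are likely to hurt training."""
    for text in (source, target):
        # Excessively long segment.
        if len(text.split()) > MAX_WORDS:
            return True
        # No letters at all: only symbols, digits or whitespace.
        if not any(ch.isalpha() for ch in text):
            return True
    # Crude heuristic (an assumption): a target identical to its source
    # was probably never translated. Short legitimate segments such as
    # product names can trip this check.
    return source.strip() == target.strip()


def mask_emails(text):
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def clean(pairs):
    """Drop noisy pairs and mask email addresses in the remaining ones."""
    return [
        (mask_emails(source), mask_emails(target))
        for source, target in pairs
        if not is_noise(source, target)
    ]


pairs = [
    ("Reset your password", "Setzen Sie Ihr Passwort zurück"),
    ("Reset your password", "Passwort zurücksetzen"),
    ("12345 !!!", "12345 !!!"),
    ("Mail anna@example.com", "Mail an anna@example.com"),
]
cleaned = clean(pairs)
to_review = find_inconsistent(pairs)
```

In this sketch, inconsistent translations are only flagged for review rather than deleted, since deciding which translation is the current one usually needs a human or a terminology list.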
When bad data becomes good
Some systems created to translate user-generated content (UGC) need to be able to identify and translate content that might not be fully grammatically correct, or that might contain typos, abbreviations, emoticons and so on. Such systems, through a combination of specific training datasets and pre-production scripts, are able to translate this type of content successfully into the required languages.
Graphic with examples of user generated content.
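As a rough idea of what a pre-production script for user-generated content might do, the Python sketch below expands a few abbreviations and collapses repeated punctuation before the text reaches the engine. The ABBREVIATIONS table and the normalise function are hypothetical and exist only for this example; a real system would rely on much larger resources and on training data that already contains this kind of content.

```python
import re

# Hypothetical lookup table of abbreviations common in user reviews.
ABBREVIATIONS = {"u": "you", "thx": "thanks", "pls": "please", "gr8": "great"}

# Runs of the same "!" or "?" character, such as "!!!".
REPEATED_PUNCT_RE = re.compile(r"([!?])\1+")

# Whole words, including ones that contain digits, such as "gr8".
WORD_RE = re.compile(r"\b\w+\b")


def normalise(text):
    """Collapse repeated punctuation and expand known abbreviations."""
    text = REPEATED_PUNCT_RE.sub(r"\1", text)

    def expand(match):
        word = match.group(0)
        # Words that are not in the table are left untouched.
        return ABBREVIATIONS.get(word.lower(), word)

    return WORD_RE.sub(expand, text)


example = normalise("thx, gr8 service!!! :)")
```

Emoticons such as ":)" are deliberately left in place here, on the assumption that the engine has seen them during training and can carry them over to the translation.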
Good data sets can save you time and money, and an engine trained with them will produce high-quality in-domain translations that require little or no intervention.
Train your dog K9 with clear and consistent commands and you will never miss any post!