Building a KantanMT engine can be an enjoyable and rewarding experience. However, it takes some time and effort to gather the high-quality data needed to get the results you want. Studies have repeatedly shown that highly cleansed, domain-specific training data produces much higher-quality engines than generic, low-quality data.
Traditionally, Translation Memories were the only form of training data. Today, however, you can use a variety of resources to improve the quality of your engines, including glossaries, stock engines, and monolingual text files.
Stock Engines: If you are new to Machine Translation and don’t have a huge library of TMX files, or if you would like to build bigger KantanMT engines than your resources allow, stock engines are a good starting point. KantanMT stock engines are collections of highly cleansed bilingual training data that can be added to your client data to produce larger and more powerful engines.
There are over a hundred different stock engines to choose from on KantanMT.com. These include Medical, Legal, and Financial Engines.
Find a list of KantanMT Stock engines here >>
Translation Memory Files: These tend to be the best source of training data since the source and target texts are aligned. The optimal format for use with KantanMT is TMX (Translation Memory eXchange); however, text files can also be used.
*Choose Translation Memories within the same domain as the engine you are building.
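Before uploading a Translation Memory, it can help to check that its segments really are aligned. The sketch below is one possible way to do that with Python's standard library, not a KantanMT tool: the function name `read_tmx_pairs` and the sample data are invented for illustration, and it assumes a TMX 1.4-style file in which each translation unit (`tu`) holds one `tuv` per language.

```python
import xml.etree.ElementTree as ET

# TMX 1.4 marks the language of each <tuv> with xml:lang, which
# ElementTree exposes under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) segment pairs found in a TMX document string.

    Hypothetical helper for illustration only. Units with a missing or
    empty segment on either side are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang] = "".join(seg.itertext()).strip()
        src = segments.get(source_lang.lower())
        tgt = segments.get(target_lang.lower())
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Made-up sample: the second unit has an empty target segment.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Take one tablet daily.</seg></tuv>
      <tuv xml:lang="de"><seg>Nehmen Sie eine Tablette pro Tag.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Store below 25 degrees.</seg></tuv>
      <tuv xml:lang="de"><seg></seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX, "en", "de")
print(len(pairs), "usable segment pair(s)")
```

Counting how many units survive this check gives a rough idea of how much usable, aligned data a Translation Memory actually contains.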
Monolingual Translated Text Files: Monolingual text files are used to create language models for a KantanMT engine. Language models are used for word and phrase selection and have a direct impact on the fluency and recall of KantanMT engines.
*Upload translated monolingual training data alongside bilingual training data when building your KantanMT engines.
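To see why monolingual target-language text affects fluency, consider a toy bigram language model. This is a simplified sketch for intuition only, not how KantanMT builds its language models: the function names, the add-one smoothing, and the three-sentence corpus are all assumptions chosen for illustration.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and word pairs in monolingual target-language text."""
    contexts = Counter()   # how often each token starts a bigram
    bigrams = Counter()    # how often each (previous, word) pair occurs
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def bigram_score(sentence, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram probability of a candidate sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    score = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        score *= (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return score


# Made-up monolingual corpus in the target language.
corpus = [
    "take one tablet daily",
    "take two tablets daily",
    "store the tablets in a dry place",
]
contexts, bigrams, vocab_size = train_bigram_model(corpus)

fluent = bigram_score("take one tablet daily", contexts, bigrams, vocab_size)
scrambled = bigram_score("daily tablet one take", contexts, bigrams, vocab_size)
print(fluent > scrambled)
```

Both candidates contain the same words, but the model prefers the word order it has seen in the monolingual data. The more in-domain target text the model sees, the better it can make this kind of choice.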
Terminology Files: Terminology files or glossary files can also be used as training material. They ensure that your KantanMT engine uses the correct terminology of your clients, improving translation consistency and quality.
*Terminology files should be uploaded with your ‘client files’ and should be in TBX (TermBase eXchange) format.
Retraining:
Use your post-edited Machine Translation files to retrain and build even more powerful KantanMT engines.
For more information about preparing training data or engine retraining, please contact Kevin McCoy, MT Success Coach.