Overcome the Challenges of Building High Quality MT Engines with Sparse Data


Many of us involved with Machine Translation are familiar with the importance of using high quality parallel data to build and customize good quality MT engines. Building high quality MT engines with sparse data is a challenge faced not only by Language Service Providers (LSPs), but by any company with limited bilingual resources. A more economical alternative to creating large quantities of high quality bilingual data is to add monolingual data in the target language to an MT engine.

Statistical Machine Translation systems use algorithms to find the most probable translations, based on how often patterns occur in the training data, so it makes sense to use large volumes of bilingual training data. The best data for training MT engines is usually high quality bilingual data and glossaries, so it’s great if you have access to these language assets.
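For reference, the standard formulation behind statistical systems (textbook background, not anything specific to one product) makes this explicit: the decoder picks the target sentence e, for a source sentence f, that maximises the product of two models,

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e}\; \underbrace{P(f \mid e)}_{\text{translation model}} \cdot \underbrace{P(e)}_{\text{language model}}
```

The translation model P(f|e) has to be estimated from bilingual data, but the language model P(e) needs only target-language monolingual text, which is exactly the economy discussed in the rest of this article.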

But what happens when access to high quality parallel data is limited?

Bilingual data is costly and time-consuming to produce in large volumes, so the smart option is to turn to more economical language assets, and monolingual data is one of them. MT output fluency improves dramatically when monolingual data is used to train an engine, especially in cases where good quality bilingual data is a sparse language resource.

More economical…

Many companies lack the necessary resources to develop their own high quality in-domain parallel data. Monolingual data, however, is readily available in large volumes across different domains. This target language content can be found almost anywhere: websites, blogs, customer content and even company-specific documents created for internal use.

Companies with sparse parallel data can really leverage their available language assets by adding monolingual data, producing better quality engines with more fluent output. Even those with access to large volumes of bilingual data can still take advantage of monolingual data to improve target language fluency.

Target language monolingual data is introduced during the engine training process so the engine learns how to generate fluent output. The positive effects of including monolingual data in the training process have been proven both academically and commercially. In a study for TAUS, Natalia Korchagina confirmed that using monolingual data when training SMT engines considerably improved the BLEU score of a Russian-French translation system.

Korchagina’s study not only “proved the rule” that in-domain monolingual data improves engine quality; it also found that out-of-domain monolingual data improves quality, though to a lesser extent.

Monolingual data can be particularly useful for improving scores in morphologically rich languages like Czech, Finnish, German and Slovak, as these languages are often syntactically more complicated for Machine Translation.

Success with Monolingual Data…

KantanMT has had considerable success with clients using monolingual data to improve their engines’ quality. An engine trained with sparse bilingual data in the financial domain (still more data than in Korchagina’s study) showed a significant improvement in its overall quality metrics when financial monolingual data was added (a sketch of how such scores can be computed follows the list):

  • BLEU score showed approx. 40% improvement
  • F-Measure score showed approx. 12% improvement
  • TER (Translation Error Rate), where a lower score is better, saw a reduction of approx. 50%
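For readers who want to see how such scores are produced, here is a minimal sketch using NLTK’s BLEU implementation (not KantanMT’s internal scoring) to compare an engine before and after adding monolingual data. The segments and the resulting figure are purely illustrative.

```python
# Illustrative only: compare BLEU for two engine outputs against a reference.
from nltk.translate.bleu_score import corpus_bleu

def bleu(references, hypotheses):
    # NLTK expects, per hypothesis, a list of tokenised reference sentences
    return corpus_bleu([[r.split()] for r in references],
                       [h.split() for h in hypotheses])

references = ["the quarterly report was approved by the board"]
baseline   = ["the quarterly report was approve by board"]      # bilingual-only engine
augmented  = ["the quarterly report was approved by the board"] # + monolingual data

before, after = bleu(references, baseline), bleu(references, augmented)
print(f"BLEU improvement: {100 * (after - before) / before:.0f}%")
```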

The support team at KantanMT showed the client how to use monolingual data to their advantage, helping them get the most out of their engine and empowering them to improve and control the accuracy and fluency of their engines.

How will this Benefit LSPs…

Online shopping by users of what can be considered ‘lower density languages’, or languages with limited bilingual resources, is driving demand for multilingual website localization. Online shoppers prefer to make purchases in their own language, and more people are going online to shop as global internet access improves. Companies with an online presence and limited language resources are turning to LSPs to produce this multilingual content.

Most LSPs with access to vast amounts of high quality parallel data can still take advantage of monolingual data to help improve target language fluency, but LSPs building and training MT engines for uncommon language pairs, or any language pair with sparse bilingual data, will benefit the most.

To learn more about leveraging monolingual data to train your KantanMT engine, send the KantanMT team an email (info@kantanmt.com) and we can talk you through the process. Alternatively, check out our whitepaper on improving MT engine quality, available from our resources page.


Motivate Post-Editors

Post-editing is a necessary step in the Machine Translation workflow, but the role is still largely misunderstood. Language Service Providers (LSPs) are now experimenting more with best practices for post-editing in the workflow. The lack of consistent training, and a reluctance within the industry to accept the importance of the role, are linked to post-editors’ motivation. KantanMT looks at some of the more conventional attitudes towards motivation and their application to post-editing.

What is motivation and what studies have been done so far?

Understanding the concept of motivation has been a hot topic in many areas of organisation theory. Studies in the area really began to kick off with their application in the workplace, opening doors for pioneers to understand how employees could be motivated to do more work, and better work.

Motivation Pioneers

  • Abraham Maslow and his well-known ‘Hierarchy of Needs’ indicates that a person’s motivations are based on their position in the hierarchy pyramid.
  • Frederick Herzberg’s ‘Two Factor Theory’, or motivation-hygiene theory, suggests that professional activities like acknowledgement, achievement and work responsibility, the ‘job satisfiers’, have a positive effect on motivation.
  • Douglas McGregor took a black and white approach to motivation in his ‘Theory X and Theory Y’. He grouped employees into two categories: those who will only do the minimum and those who will push themselves.

As development of theories continued…

  • John Adair came up with the ‘fifty-fifty theory’. According to it, motivation is fifty percent the responsibility of the employee and fifty percent outside the employee’s control.

Even more recently, in 2010

  • Teresa Amabile and Steven Kramer carried out a study on the motivation levels of employees in a variety of settings. Their findings suggest ‘progress’ as the top performance motivator, identified from an analysis of approx. 12,000 diary entries and daily ratings of motivation and emotions from hundreds of study participants.

To understand post-editor motivation, we can combine the top performance motivator, progress, with the fifty-fifty theory.

Progress is a healthy motivator in the post-editing profession, and it can help Localization Project Managers understand and encourage post-editor satisfaction and motivation. But while progress can be deemed an external factor, if we apply Adair’s ‘fifty-fifty’ rule, post-editors are also at least fifty percent responsible for their own motivation.

Post-editing as a profession is still finding its feet. TAUS carried out a study in 2010 on the post-editing practices of global LSPs. The study showed that, while post-editing is becoming a standard activity in the translation workflow, it only accounts for a minor share of LSP business volume. This suggests that post-editors see their role as one of lesser importance because the industry views it as such.

This attitude is highlighted by the industry’s lack of standards for post-editing best practices. Without evaluation practices to train post-editors and improve the post-editing process, post-editors are not making progress, which quite naturally is demotivating.

How to motivate post-editors

The first step in motivating post-editors is to recognise their role as distinct from that of a translator. The best post-editors are those who are at least bilingual and have some form of linguistic training, like a translator. Linguistic training is a major asset for editing Machine Translated output.

TAUS offers a comparison of the translation process versus the post-editing process, highlighting the differences between the two.

Translation process of a Translator (TAUS 2010)
Translation process of a Post-editor (TAUS 2010)

One process is not more complicated than the other, only different. Translators translate internally, while post-editors make “snap editing decisions” based on client requirements. As LSPs recognise these differences, they can successfully motivate their post-editors by providing them with the most suitable support and work environment.

Progress as a Motivator

Translators make good post-editors: they have the linguistic ability to understand both the source and target texts, and if they enjoy editing or proofreading, the post-editing role will suit them. The right training is also important; properly trained post-editors will become more aware of potential improvements to the workflow.

These improvements or ideas can be a great boost to post-editor motivation; if implemented, the post-editor can take on more responsibility, which helps improve the translation workflow. For example, if the post-editor is made responsible for updating the language assets used to retrain a Machine Translation system, they can take ownership of the output quality rather than just post-editing Machine Translation output in isolation.

Fixing repetitive errors can be frustrating for anyone, not just post-editors. But if they are responsible for the output quality, understand the system and can control the rules used to reduce these repetitive errors, they will experience motivation through progress.

This is only the tip of the iceberg on what motivates post-editors; each post-editor is different, and how they feel about the role, whether it is just ‘another job’ or a major step in their career, all plays a part. The key is to provide proper training and foster an environment where post-editors can make progress by positively contributing to the role.

Translators often take pride in and ownership of their translations; post-editors should also have the opportunity to take pride in their work, as it is their skills and experience that make it ‘publishable’ or even ‘fit for purpose’ quality.

Repetitive errors like diacritic marks or capitalisation can be easily fixed using KantanMT’s Post-Editing Automation (PEX) rules. PEX rules allow repetitive errors in a Machine Translation engine’s output to be fixed with a ‘find and replace’ mechanism, and the rules can be checked on a sample of the text using the PEX Rule Editor.

The post-editor can correct repetitive errors during the post-editing process so the same errors don’t appear in future MT output, giving them responsibility for the Machine Translation engine’s quality.

Automatic Post-Editing

Post-Editing Machine Translation (PEMT) is an important and necessary step in the Machine Translation process. KantanMT is releasing a new, simple and easy to use PEX rule editor, which will make the post-editing process more efficient, saving time and costs, and sparing the post-editor’s sanity.

As we have discussed in earlier posts, PEMT is the process of reviewing and editing raw MT output to improve quality. The PEX rule editor is a tool that helps save time and cut costs: post-editors no longer have to manually correct the same repetitive mistakes in a translated text.

Post-editing can be divided into roughly two categories: light and full post-editing. ‘Light’ post-editing, also called ‘gist’, ‘rapid’ or ‘fast’ post-editing, focuses on transferring the most correct meaning without spending time correcting grammatical and stylistic errors. Textual standards like word order and coherence are less important in a light post-edit than in a more thorough ‘full’ or ‘conventional’ post-edit. Full post-edits require the correct meaning to be conveyed, correct grammar, accurate punctuation, and the correct transfer of any formatting such as tags or placeholders.

The client often dictates the type of post-editing required, whether it’s a full post-edit to reach ‘publishable quality’ similar to a human translation standard, or a light post-edit, which usually means ‘fit for purpose’. The engine’s quality also plays a part in the post-editing effort; using a high volume of in-domain training data during the build produces higher quality engines, which helps cut post-editing effort. Other factors such as language combination, domain and text type all contribute as well.

Examples of repetitive errors

Some users may experience the following errors in their MT output.

  • Capitalization
  • Punctuation mistakes, hyphenation, diacritic marks etc.
  • Words added/omitted
  • Formatting – trailing spaces

These errors can be fixed through pattern matching. Regular expressions, or ‘regex’, are special text strings that describe patterns; since these patterns need no linguistic analysis, they can be implemented easily across different language pairs. Regular expressions are also important components in developing PEX rules. KantanMT have a list of regular expressions used for both GENTRY Rule files (*.rul) and PEX post-edit files (*.pex).

Post-Editing Automation (PEX)

Repetitive errors can be fixed automatically by uploading PEX rule files. These rule files allow post-editors to spend less time correcting the same repetitive errors by automatically applying PEX constructs to translations generated from a KantanMT engine.

PEX works by applying “find and replace” rules. The rules are uploaded as a PEX file and applied while a translation job is being run.
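As an illustration of the mechanism (the actual PEX file syntax is not reproduced here), the sketch below applies a list of hypothetical search and replacement rules to a translated segment using regular expressions:

```python
import re

# A minimal sketch of find-and-replace post-editing rules; the rules are
# shown as (search pattern, replacement) pairs for illustration only.
pex_rules = [
    (r"[ \t]+$", ""),                        # strip trailing spaces
    (r" {2,}", " "),                         # collapse double spaces
    (r"(^|[.!?] )([a-z])",                   # capitalise sentence starts
     lambda m: m.group(1) + m.group(2).upper()),
]

def apply_pex_rules(segment: str) -> str:
    for pattern, replacement in pex_rules:
        segment = re.sub(pattern, replacement, segment)
    return segment

print(apply_pex_rules("the engine produced this.  it needs fixing "))
# -> "The engine produced this. It needs fixing"
```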

PEX Rule Editor

KantanMT have designed a simple way to create, test and upload post-editing rules to a client profile.

KantanMT PEX Rule Editor

The PEX Rule Editor, located in the ‘MyKantanMT’ menu, has an easy to use interface. Users can copy a sample of the translated text into the upper text box, ‘Test Content’, then input the rules to be applied in the ‘PEX Search Rules’ box and their corrections in the ‘PEX Replacement Rules’ box. Clicking ‘Test Rules’ instantly identifies any incorrect rules before they are uploaded to the profile.

The introduction of tools to assist in the post-editing process removes some of the more repetitive corrections for post-editors. The new PEX Rule Editor helps improve the PEMT workflow by ensuring all uploaded rule files are correct, leading to a more effective method for fixing repetitive errors.

Conference and Event Guide – December 2013

Things are winding down as we get closer to the end of the year, but there are still some great events and webinars coming up during December that we can look forward to.

Here are some recommendations from KantanMT to keep you busy in the lead up to the festive season.

Listings

Dec 02 – Dec 05, 2013
Event: IEEE CloudCom 2013, Bristol, United Kingdom

Held in association with Hewlett-Packard Laboratories (HP Labs), the conference is open to researchers, developers, users, students and practitioners from the fields of big data, systems architecture, services research, virtualization, security and high performance computing.


Dec 04, 2013
Event: LANGUAGES & BUSINESS Forum – Hotel InterContinental Berlin

The forum highlights key issues in language education, particularly in the workplace, and the new technologies that are becoming a key part of the process. The event will promote international networking and has four main themes: Corporate Training, Pre-Experience Learners, Intercultural Communication and Online Learning.


Dec 05, 2013
Webinar: Effective Post-Editing in Human and Machine Translation Workflows

Stephen Doherty and Federico Gaspari, CNGL (Centre for Next Generation Localisation), will give an overview of post-editing and different post-editing scenarios from ‘gist’ to ‘full’ post-edits. They will also give advice on different post-editing strategies and how they differ across Machine Translation systems.


Dec 07 – Dec 09, 2013
Event: 6th Language and Technology Conference, Poznan, Poland

The conference will address the challenges of Human Language Technologies (HLT) in computer science and linguistics. The event covers a wide range of topics including electronic language resources and tools, formalisation of natural languages, parsing and other forms of NL processing.


Dec 09 – Dec 13, 2013
Event: IEEE GLOBECOM 2013 – Power of Global Communications, Atlanta, Georgia USA

The conference, hosted by the second largest of the 38 IEEE technical societies, will focus on the latest advancements in broadband, wireless, multimedia, internet, image and voice communications. Topics related to localization are presented on 10th December and include Localization Schemes, Localization and Link Layer Issues, and Detection, Estimation and Localization.


Dec 10 – Dec 11, 2013
Event: Game QA & Localization 2013, San Francisco, California USA

This event brings together QA and Localisation Managers, Directors and VPs from game developers around the world to discuss key game localization industry challenges. The London event in June 2013 was a huge success, with more than 120 senior QA and localization professionals from developers, publishers and third-party suppliers of all sizes and platforms coming to learn, benchmark and network.


Dec 11 – Dec 15, 2013
Event: International Conference on Language and Translation, Thailand, Vietnam and Cambodia

The Association of Asian Translation Industry (AATI) is holding an International Conference on Language and Translation, or “Translator Day”, in three countries: Thailand on December 11, 2013, Vietnam on December 13, 2013, and Cambodia on December 15, 2013. The events provide translators, interpreters, translation agencies, foreign language centres, NGOs, FDI-financed enterprises and other translation purchasers with opportunities to meet.


Dec 12, 2013
Webinar: LSP Partnerships & Reseller Programs 16:00 GMT (11:00 EST/17:00 CET)

This webinar, hosted by GALA and presented by Terena Bell, covers how to open up new revenue streams by introducing reseller programs to current business models. It is aimed at world trade associations, language schools, and other non-translation companies wishing to offer their clients translation, interpreting, or localization services.


Dec 13 – Dec 14, 2013
Event: The Twelfth Workshop on Treebanks and Linguistic Theories (TLT12), Sofia, Bulgaria

The workshops, hosted by the BulTreeBank Group, serve to promote new and ongoing high-quality work related to syntactically-annotated corpora such as treebanks. Treebanks are important resources for Natural Language Processing applications including Machine Translation and information extraction. The workshops will focus on different aspects of treebanking: descriptive, theoretical, formal and computational.


Are you planning to go to any events during December? KantanMT would like to hear your thoughts on what makes a good event in the localization industry.

Crowdsourcing vs. Machine Translation

Crowdsourcing has become more popular with organizations and companies since the concept’s introduction in 2006, and has been adopted by companies using this new production model to improve their production capacity while keeping costs low. The web-based business model uses an open call format to reach a wide network of people willing to volunteer their services for free or for a limited reward, for any activity including translation. The application of translation crowdsourcing models has opened the door to increased demand for multilingual content.

Jeff Howe of Wired magazine defined crowdsourcing as:

“…the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call”.

Crowdsourcing costs equate to approx. 20% of a professional translation. Language Service Providers (LSPs) like Gengo and Moravia have realised the potential of crowdsourcing as part of a viable production model, combining it with professional translators and Machine Translation.

The crowdsourcing model is an effective method for translating the surge in User Generated Content (UGC). Erratic fluctuations in demand need a dynamic, flexible and scalable model. Crowdsourcing is definitely a feasible production model for translation services, but it still faces some considerable challenges.

Crowdsourcing Challenges

  • No specialist knowledge – crowdsourcing is difficult for technical texts that require specialised knowledge. It often involves breaking down a text into smaller sections to be sent to each volunteer. A volunteer may not be qualified in the domain, so they end up translating small sections of text, out of context, with limited subject knowledge, which leads to lower quality or mistranslations.
  • Quality – translation quality is difficult to manage and depends on the type of translation. There have been some innovative suggestions for measuring quality, including evaluation metrics such as BLEU and Meteor, but these are costly and time-consuming to implement and need a reference translation or ‘gold standard’ to benchmark against.
  • Security – crowd management can be a difficult task and the moderator must be able to vet participants and make sure that they follow the privacy rules associated with the platform. Sensitive information that requires translation should not be released to volunteers.
  • Emotional attachment – humans can become emotionally attached to their translations.
  • Terminology and writing style inconsistency – when the project is divided amongst a number of volunteers, the final version’s style needs to be edited and checked for inconsistencies.
  • Motivation – decisions on how to motivate volunteers and keep them motivated can be an ongoing challenge for moderators.

Improvements in the quality of Machine Translation have influenced crowdsourcing’s popularity, and the majority of MT post-editing and proofreading tasks fit nicely into crowdsourcing models. Content can be classified into ‘find-fix-verify’ phases and distributed easily among volunteers.

There are some advantages to be gained when pairing MT technology and collaborative crowdsourcing.

Combined MT/Crowdsourcing

Machine Translation will have a pivotal role to play within new translation models, which focus on translating large volumes of data in cost-effective and powerful production models. Merging both Machine Translation and crowdsourcing tasks will create not only fit-for-purpose, but also high quality translations.

  • Quality – as the overall quality of Machine Translation output improves, it is easier for crowdsourcing volunteers with less experience to generate better quality translations. This will in turn increase the demand for crowdsourcing models to be used within LSPs and organizations. MT quality metrics will also make post-editing tasks more straightforward and easier to delegate among volunteers based on their experience.
  • Training data – word alignment and engine evaluations can be done through crowd computing, and parallel corpora created by volunteers can be used to train and/or retrain existing SMT engines.
  • Security – customized Machine Translation engines are more secure when dealing with sensitive product or client information. General or publicly available information is more suited to crowdsourcing.
  • Terminology and writing style consistency – writing style and terminology can be controlled and updated through a straightforward process when using MT. This avoids the idiosyncrasies of volunteer writing styles. There is no risk of translator bias when using Machine Translation.
  • Speed – Statistical Machine Translation (SMT) engines can process translations quickly and efficiently. When there is a need for a high volume of content to be translated within a short period of time it is better to use Machine Translation. Output is guaranteed within a designated time and crowdsourcing post-editing tasks speeds up the production process before final checks are carried out by experienced translators or post-editors.
Use of crowdsourcing for software localization. Source: V. Muntes-Mulero and P. Paladini, CA Technologies, and M. Solé and J. Manzoor, Universitat Politècnica de Catalunya.

Last chance for a FREE TRIAL of KantanAnalytics™ for all members until November 30th 2013. KantanAnalytics will then be available on the Enterprise Plan.

MT Lingo

MT technology can be overwhelming for those new to the industry, and getting to grips with the jargon can be a daunting task even for the most industry-savvy gurus. KantanMT put together a list of some acronyms, popular buzzwords and numeronyms (abbreviations that use numbers), so that you can keep up with the MT professionals or just brush up on your tech vocabulary.

Numeronyms:

  • L10n – Localization/Localisation is the process of adapting and translating a product or service so that it is culturally acceptable for a specific country or region.
  • I18n – Internationalization/Internationalisation is a process implemented in the planning stages of a product or application to ensure the infrastructure (coding) suits future translations or localizations. Common internationalization preparations for software products involve supporting international character sets like Unicode, or ensuring there is enough space in the User Interface (UI) for text to be translated from languages like English, with single-byte character codes, into the multiple-byte character codes used in Chinese and Japanese Kanji (see the short illustration after this list).
  • G11n – Globalization/Globalisation refers to the internationalization and localization preparations for products and services to be released in global markets. It usually incorporates ‘sim-ship’, or simultaneous shipment, to different regions.
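As a quick aside to the I18n entry above, the difference between single-byte and multiple-byte character codes is easy to see in any language with Unicode support; this Python snippet compares the UTF-8 byte counts of an English and a Japanese string:

```python
# Latin characters occupy one byte each in UTF-8, while Japanese characters
# (and many other scripts) need three or more, so UI space budgets differ.
for text in ["Hello", "こんにちは"]:
    print(f"{text!r}: {len(text)} characters, {len(text.encode('utf-8'))} bytes")
# 'Hello': 5 characters, 5 bytes
# 'こんにちは': 5 characters, 15 bytes
```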

Acronyms:

  • MT – Machine Translation or Automated Translation is translation carried out by a computer. A piece of natural language text, like English, is translated by computer software into another language, like French. Cloud MT is Machine Translation hosted on the cloud. There are different types of MT systems available.
  • RBMT – Rule-Based Machine Translation systems use a list of syntactic, grammatical and translation rules to generate the most appropriate translations.
  • SMT – Statistical Machine Translation systems are data driven and have a statistical modelling architecture using algorithms to find the most probable match between source and target segments.
  • API – Application Programming Interface is an interface that allows communication and interoperability between two applications or software programs.
  • LSP – Language Service Provider sometimes referred to as a Localization Service Provider, is a service provider that carries out the translation and localization of different types of content for specific countries or locales.
  • TM – Translation Memory is a database of aligned source and target translations called segments. Segments can be words, sentences or paragraphs. TMs can be integrated with CAT tools and they help speed up the translation process. TM files can be used as training data to train SMT engines.
  • SPE – Statistical Post-editing is when Machine Translation output that has been post-edited is re-used as training data and fed back into the SMT engine or used to train a new engine.

Popular Buzzwords:

  • Normalization is the checking and cleaning up of a Translation Memory so it can be included as training data for an SMT engine. Things to identify and correct are tags, mistranslations, sentence mismatches and stylistic features like upper and lower case inconsistencies (a sketch of such checks follows this list).
  • CAT tools – Computer-Aided Translation/Computer-Assisted Translation tools are used by humans to support the translation process by managing MT, TMs and glossaries.
  • Glossaries are vocabulary lists of specialised terminology, usually specific to an industry or organisation. These files can be uploaded as additional training data to an SMT engine.
  • Bilingual corpus/ Bi-text database is a large text document with source and target languages. If the corpus is aligned it can be used as training data for an SMT engine.
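To make the normalization entry above more concrete, here is a minimal sketch of the kind of checks involved; the tag pattern and length-ratio thresholds are illustrative assumptions, not a prescribed KantanMT procedure:

```python
import re

# Keep only segment pairs that pass basic normalization checks.
TAG = re.compile(r"<[^>]+>|\{\d+\}")  # inline tags / numbered placeholders

def keep_pair(source: str, target: str) -> bool:
    if not source.strip() or not target.strip():
        return False                                   # empty segment
    if sorted(TAG.findall(source)) != sorted(TAG.findall(target)):
        return False                                   # tag mismatch
    ratio = len(source) / max(len(target), 1)
    return 0.5 <= ratio <= 2.0                         # length-ratio outlier

pairs = [("Click <b>Save</b>.", "Cliquez sur <b>Enregistrer</b>."),
         ("Click <b>Save</b>.", "Cliquez sur Enregistrer.")]
clean = [p for p in pairs if keep_pair(*p)]            # keeps only the first
```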

If you know any new terms or interesting words you heard from your experience in the language and localization industry, KantanMT would love to hear about them, just pop them into the comment box below.

Interview: Working on KantanMT – a Developer’s Perspective

Eduardo Shanahan, CNGL

Eduardo Shanahan, a Senior Software Engineer at CNGL, spent time working on KantanMT during its early days. KantanMT asked Eduardo what it was like to work with Founder and Chief Architect Tony O’Dowd and the rest of the team developing the KantanMT product.

What was your initial impression, when you joined DLab in DCU?

This past year was a different kind of adventure. After more than two decades working with Microsoft products like Visual Studio, it was a big change moving to Dublin City University (DCU) to be part of the Design and Innovation Lab, or DLab as we call it. The work in DLab consists of transforming code written by researchers into industrial quality products.

One of the first changes was to get a Mac and start deploying code on Linux, with no Visual Studio or even Mono. Instead I worked mostly with Python and NodeJS, and piles of shell scripts. Linux and Python were not new to me, but they did take some adjusting to.

This was a completely new environment and a new experience, and I was working in a whole new area. Back then, my relationship with Artificial Intelligence (AI) was informal to say the least, and I wasn’t even aware that something like Statistical Machine Translation (SMT) existed.

How did you get involved with working on KantanMT?

Starting out, I was working on a variety of different projects simultaneously. A few months in, though, I started working full time with a couple of researchers creating new functionality for Tony and his KantanMT product, which is based on open source Moses technology. Moses uses aligned source and target texts from parallel corpora to train an SMT system. Once the system is trained, search algorithms are applied to find the most suitable translation matches. This translation model can be applied to any language pair.

What were your goals working on the KantanMT project?

Tony is doing a great job, deploying it on Amazon Web Services and creating a set of tools to streamline the operations for end users. His request to CNGL was to provide more advanced insight into the translation quality produced by Moses.

To accomplish this, the task was mapped to two successive projects with different researchers on each. The pace was very intense; we wanted state of the art results that showed up in the applications. Sandipan Dandapat, Assistant Professor in the Department of Computer Science and Engineering, IIT Guwahati, and Aswarth Dara, Research Assistant at CNGL, DCU, worked on adding real value to the KantanMT product during those long weeks, while I was rewriting their code time after time until it passed all the tests and then some. Our hard work paid off when KantanWatch™ and KantanAnalytics™ were born.

Each attempt to deliver was an experience in itself; Tony was quick to detect any inconsistencies and wanted to be extra sure about understanding all the details and steps in the research and implementation.

In your opinion was the work a success?

The end result is something that has made me proud. The mix between being a scientist and having a real product to implement is a very good combination. The guys at DCU have done a great job on the product base, and DLab is a fantastic research and work environment. The no nonsense attitude from Tony’s side created a very interesting situation, and it’s something that we can really celebrate after a year of hard work.

The CNGL Centre for Global Intelligent Content

The CNGL Centre for Global Intelligent Content (Dublin City University, Ireland) is supported by Science Foundation Ireland. Through its academic-industry collaborative research it has not only driven standards in content and localization service integration, but is also pioneering advancements in Machine Translation through the development of disruptive and cutting edge processing technologies. These technologies are revolutionising global content value chains across a number of different industries.

The CNGL research centre draws its talent and expertise from a combined 150 researchers from Trinity College Dublin, Dublin City University, University College Dublin and University of Limerick. The centre also works closely with industry partners to produce disruptive technologies that will have a positive impact both socially and economically.

KantanMT allows users to build a customised translation engine with training data specific to their needs, and continues to offer a 14 day free trial to new members.

Pricing PEMT 2


Segment-by-segment Machine Translation Quality Estimation (QE) scores are reforming current Language Service Provider (LSP) business models.

Pricing Machine Translation is one of the most widely debated topics within the translation and localization industries. Many agree that there is no ‘black and white’ approach, because a number of variables must always be taken into consideration when costing a project. Industry experts agree that levels of post-editing effort and payment should be calculated through a fair and easily replicated formula. This transparency is the goal KantanMT had in mind during the development of KantanAnalytics™, a “game-changing” technology in the localization industry.

New Business Model

The two greatest challenges facing Localization Project Managers are how to cost and how to schedule Machine Translation projects. Experienced PMs can quickly gauge how long a project will take to complete, but there is still an element of guesswork and contingency planning involved, and this is intensified when you add Machine Translation. Although not a new technology, its practical application in a business environment is still in its infancy.

Powerful Machine Translation engines can be easily integrated into an LSP workflow. Measuring Machine Translation quality on a segment-by-segment basis and calculating post-editing effort on those segments allows LSPs to create more streamlined business models.

Studies have shown that post-editing Machine Translation can be more productive than translating a document from scratch. This is especially true when translators or post-editors have broad technical or subject knowledge of the text’s domain; in these cases they can capitalise on their knowledge with higher post-editing productivity.

So, how should a Machine Translation pricing model look?

The development of a technology that can evaluate a translation on a segment-by-segment basis and assign an accurate QE score to a Machine Translated text is critical for the successful integration of this technology into a project’s workflow.

The segment-by-segment breakdown and ‘fuzzy match’ percentage scoring system ensured the commercialisation of Translation Memories in LSP workflows. This system has been adopted as an industry standard for pricing translation jobs where translation memories or Computer Aided Translation (CAT) tools can be implemented. The next natural evolution is to create a similar tiered ‘fuzzy’ matching system for Machine Translation.

Segment level QE technology is now available: Machine Translated segments are assigned percentage match values, similar to translation memory match values, and post-editing costs can be assigned in the same way as translation memory matches. The match value also gives a clear indication of how long a project should take to post-edit, based on the quality of the match and the post-editor’s skills and experience.
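As a concrete illustration, a tiered model might map segment match values to per-word rates in the same way translation memory bands do. The bands and rates below are invented for the example; real figures would be negotiated per project and language pair:

```python
# Illustrative tiered pricing based on segment-level QE scores.
RATE_BANDS = [          # (minimum QE match %, rate per word in EUR)
    (95, 0.02),         # near-perfect: light review
    (85, 0.05),         # good match: light post-edit
    (75, 0.08),         # usable match: full post-edit
    (0,  0.12),         # below threshold: treat as new translation
]                       # bands must stay in descending threshold order

def segment_cost(qe_score: float, word_count: int) -> float:
    for threshold, rate in RATE_BANDS:
        if qe_score >= threshold:
            return word_count * rate
    return 0.0  # unreachable; the 0-threshold band catches everything

segments = [(97, 12), (88, 20), (60, 15)]  # (QE score, words)
total = sum(segment_cost(score, words) for score, words in segments)
print(f"Project cost: EUR {total:.2f}")    # 0.24 + 1.00 + 1.80 = EUR 3.04
```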

How can we trust the quality score?

The Machine Translation engine’s quality is based on the quality of the training data used to build it. The engine’s quality can be monitored with BLEU score, F-Measure and TER scoring. These automatic evaluation metrics indicate the engine’s quality and, combined with the ‘fuzzy’ match score, give a more accurate picture of how post-editing effort is calculated and how projects should be priced. There are a number of variables that dictate how to create and implement a pricing model.

Variables to be considered when creating a pricing model

The challenge in measuring PEMT stems from a number of variables, which need to be considered by PMs when creating a pricing model:

  • Intended purpose – does the text require a light, fast or full post-edit?
  • Language pair and direction – Romance languages tend to provide better MT output
  • Quality of the MT system – better quality, domain specific engines produce better results
  • Post-editing effort – the degree of editing required, from minor edits to a full retranslation
  • Post-editor skill and experience – post-editors with extensive domain expertise post-edit more productively

Traditional Models

To overcome these challenges PMs traditionally opted for hourly or daily rates. However, hourly rates do not provide enough transparency or cost breakdown and can make a project difficult to schedule. These rates must also take into consideration the translator or post-editor’s productivity and language pair.

Rates are usually calculated based on the translator or post-editor’s average post-editing speed within the specified domain. Day rates can be a good cost indicator for PMs, based on the post-editor’s capabilities and experience, but again the cost breakdown is not completely transparent. Difficulties usually occur when a post-editor comes across a part of the text that requires more time or effort to post-edit; productivity then automatically drops.

Opinions in the translation community differ, and pricing PEMT depends on the post-editing circumstances. Some posters on the Proz.com forum suggest pricing PEMT at 30-50% of the full rate, or similar to editing a human translation. Others suggest the output of a Machine Translation system should be priced around the same as a ‘fuzzy’ match of 50-74% from a translation memory. These are broad, subjective figures which do not take the variables into consideration.

Calculating the Machine Translated text on a segment-by-segment basis allows PMs to estimate post-editing effort based on the quality of customised Machine Translation engines. PMs can then use these calculations to build an accurate pricing model for the project, which incorporates all relevant variables. It also makes it possible to distribute post-editing work evenly across translators and post-editors, making the most efficient use of their skills. Calculating post-editing effort also benefits scheduling and project turnaround times.


KantanAnalytics™ is a segment-by-segment quality estimation scoring technology which, when applied to a Machine Translated text, will generate a quality score for each segment, similar to the fuzzy match scoring system used in translation memories.

Sign up for a free trial to experience KantanAnalytics until November 30th 2013. After that, KantanAnalytics will be available on the Enterprise Plan; to sign up or upgrade to this plan, please email KantanMT’s Success Coach, Kevin McCoy (kevinmcc@kantanmt.com).

KantanMT and MemSource Cloud Connector

Tony O’Dowd, KantanMT’s Founder and Chief Architect

Cloud technology and web-based applications have made a significant impact on the localization industry, levelling the playing field between large and small Language Service Providers (LSPs). LSPs who leverage cloud technology can be more competitive. The ‘content explosion’ has also driven the need for on-demand translation services, and taking advantage of cloud technology is the most strategic option for translating large volumes of content securely and in real time.

David Canek, CEO of MemSource Technologies

Software integration plays an important role in achieving a centralised localization management structure. The MemSource Cloud connector developed to integrate with KantanMT will ensure greater control and productivity in localization and translation workflows.

To mark the connector’s release, I caught up with David Canek, CEO of MemSource Technologies, and Tony O’Dowd, KantanMT’s Founder and Chief Architect, to get their thoughts on the impact of cloud software integration in the localization industry.

MemSource recently developed a connector to integrate KantanMT and MemSource Cloud, can you explain how the connector works and what this will mean for its users?

[David] Yes, we have developed a connector that lets all of our 10 thousand users very easily select KantanMT as their preferred MT engine for their MemSource Cloud translation projects. The connector is part of our 3.8 release, available from 3 November 2013. The KantanMT integration supports all of our Machine Translation features, including our post-editing features, specifically the post-editing analysis.

[Tony] The team at MemSource have developed a straightforward mechanism to integrate Machine Translation services into their cloud platform. The MemSource community of LSPs and professional translators can easily select KantanMT as their preferred Machine Translation engine. Integration between both platforms using this new KantanMT connector will boost translation productivity, reduce project costs and improve project margins for the MemSource community.

This partnership is a great example of synergy between two related businesses within the translation industry. How do you think integration will create value for clients and the industry?

[David] Machine Translation has become an integral part of the human translation process and so we found it a logical step to integrate an innovative player in the Machine Translation scene, such as KantanMT.

[Tony] KantanMT combines the speed and accuracy of traditional Translation Memory with the speed and cost-advantages of Machine Translation into a single seamless platform. The current economic climate indicates the localization industry can be certain of only two things – that margin erosion and price compression will continue to put pressure on LSPs to operate with higher levels of efficiency while lowering overall costs.

MemSource and KantanMT customers will benefit from achieving economies of scale when they integrate Machine Translation directly into their existing translation workflows. KantanMT scales effortlessly with business demands and growth, and KantanMT members will benefit from increased profitability as greater volumes of client data are processed. This helps LSPs achieve higher levels of operational efficiency while also delivering cost savings to their customers.

There is a lot of buzz around “moving to the cloud” in the tech world, particularly for translation and localization services. As a supplier of both cloud and server translation technology, have you noticed any preference for one over the other, which do your clients prefer and why?

[David] Our clients, just like us, prefer the cloud version of any technology, including MemSource technology. Therefore, we really focus on providing MemSource Cloud, and it is only a question of time before we discontinue offering MemSource in the server option.

[Tony] Progressive companies cannot ignore the financial and operational efficiencies the cloud delivers. The cloud helps organisations achieve economies of scale through reduced capital costs, which are often associated with the investment in and maintenance of a technology infrastructure. Combine this with new pricing models, like lower monthly subscription fees replacing large upfront software license fees, and operating on the cloud ensures a competitive business. This is even more so in the localization industry, where the translation of ‘big data’ from the content explosion has increased the need for on-demand localization and translation services. The cloud’s multi-tenant architecture offers LSPs a flexible solution for efficiently managing large volumes of data.

In your opinion, what will the integration of these technologies mean for the future translation industry in the short and longer terms?

[David] Machine Translation has become mainstream technology and will soon have the same importance as Translation Memory in the localization industry. We have believed in this vision right from the start of developing MemSource. This is why we have pioneered the post-editing analysis and other features in MemSource that bring Machine Translation to the forefront and seamlessly integrate it with existing technologies such as Translation Memory.

[Tony] In the short-term, the technology with the greatest impact in the translation industry will be the availability of high speed, on-demand Machine Translation services. It will be used as a tool to boost translator productivity, reduce project costs and improve margins. In using the KantanMT connector, LSPs can integrate Machine Translation into their translation workflows quickly and easily, immediately offering improved services to their clients.

Over the longer-term, like MemSource, KantanMT believes there will be a continuous push to blend Machine Translation and traditional Translation Memory systems into one seamless service. At KantanMT, we’ve made significant progress on this vision by fusing traditional Translation Memory with advanced Machine Translation into the KantanMT platform, and also through the recent development of predictive segment quality estimation technology called KantanAnalytics™.


Thank you, to both David and Tony who gave up time from their busy schedules to be interviewed.


There are still a couple of weeks left to take advantage of the KantanAnalytics™ feature. KantanAnalytics™ is available for ALL KantanMT members until 30th November. When the offer ends it will become an Enterprise Plan only feature.

For more information about the KantanMT Enterprise plan please contact Aidan (aidanc@kantanmt.com)

Training Data

Building a KantanMT Engine: Training Data

When the decision is made to incorporate a KantanMT engine into a translation model, the next obvious and most difficult question is: what should be used to train the engine? This is often followed by: what are the optimum training data requirements to yield a highly productive engine, and how should the training data be curated?

The engine’s target domain and objectives should be clearly mapped out ahead of the build. If the documents are for a specific client or domain then the relevant in-domain training data should be used to build the engine. This also ensures the best possible translation results.

KantanMT recommends a minimum of 2 million training words for each domain specific engine. Higher quantities of in-domain “unique words” will also improve the potential for building an “intelligent” engine.

The quality of the engine is based on the language or translation assets used to build it. Studies by TAUS have shown that quality is more important than quantity: “intelligently selected training data” generated higher BLEU scores than an engine built with more generic data. The studies also indicated that a proactive approach to customising or adapting the engine with translation assets led to better quality results.

Translation assets are the best source of suitable training data for building KantanMT engines, they include:

Stock Training Data: KantanMT stock engines are collections of highly cleansed bilingual training data sets. Quality is ensured, as each data set shows the source corpora and the approximate number of words used to create each stock engine. These can be added to client data to produce much larger and more powerful engines. There are over a hundred different stock engines to choose from, including industry specific sets such as IT, Legal, Medical and Finance. Find a list of KantanMT Stock engines here >>

Stock engines are a good starting point if you have limited TMX (Translation Memory Exchange) files in the required domain, or if you would simply like to build bigger KantanMT engines.

Translation Memory Files: These are the best source of high quality training data, since both source and target texts are aligned. Translation Memories used for previous translations in a similar domain will also have been verified for quality, which guarantees the engine’s quality will be representative of the Translation Memory’s quality. As the old expression in the translation industry goes, “garbage in, garbage out”: good quality Translation Memory files will yield a good quality Machine Translation engine. The TMX file format is optimal for use with KantanMT; however, text files can also be used.
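For those preparing their own data, aligned segments can be pulled out of a TMX file with a few lines of code. This is a minimal sketch using Python’s standard library; production TMX files may need namespace handling, older-dialect attributes and more robust validation:

```python
import xml.etree.ElementTree as ET

# TMX 1.4 marks language on <tuv> with xml:lang, which ElementTree
# exposes under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(path, src_lang="en", tgt_lang="fr"):
    pairs = []
    for tu in ET.parse(path).iter("tu"):       # one <tu> per segment pair
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang[:2]] = seg.text.strip()   # "en-US" -> "en"
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```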

Monolingual Translated Text Files: Monolingual text files are used to create language models for a KantanMT engine. Language models are used for word and phrase selection and have a direct impact on the fluency and recall of KantanMT engines. Translated monolingual training data should be uploaded alongside bilingual training data when building KantanMT engines.
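A toy example shows why monolingual text drives fluency. The sketch below estimates bigram probabilities from a tiny monolingual corpus and scores two candidate outputs; real language models are higher-order and smoothed, but the principle is the same:

```python
from collections import Counter

# Build bigram counts from monolingual sentences.
monolingual = ["the report was approved", "the report was published"]

bigrams, unigrams = Counter(), Counter()
for sentence in monolingual:
    words = ["<s>"] + sentence.split()
    bigrams.update(zip(words, words[1:]))
    unigrams.update(words[:-1])

def fluency(candidate: str) -> float:
    # Product of conditional bigram probabilities P(word | previous word)
    words = ["<s>"] + candidate.split()
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigrams[(prev, word)] / max(unigrams[prev], 1)
    return score

print(fluency("the report was approved"))  # higher: seen word order
print(fluency("report the approved was"))  # zero: unseen word order
```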

Glossary Files: Terminology or glossary files can also be used as training material. Including a glossary improves terminology consistency and translation quality. Terminology files are uploaded with your ‘files to be translated’ and should be in the TBX file format.

KantanISR™: Instant Segment Retraining technology allows users to input edited segments via the KantanISR editor. The segments then become training data and are stored in the KantanISR cache. The new segments are incorporated into the engine without the need for a rebuild. As corrected data is included, the engine improves in quality, becoming an even more powerful and productive KantanMT engine.

KantanISR editor

Building your KantanMT engine can be a very rewarding process. While some time is needed to gather the best data for a domain specific engine, there are many ways to enhance your engine that require little effort.

For more information about preparing training data or engine re-training, please contact Kevin McCoy, KantanMT Success Coach.