Post-Editing Machine Translation

Statistical Machine Translation (SMT) has many uses, from the translation of User Generated Content (UGC) to technical documents, manuals and digital content. While some use cases may need only a ‘gist’ translation without post-editing, others will need a light or full human post-edit, depending on the usage scenario and the funding available.

Post-editing is the process of ‘fixing’ Machine Translation output to bring it closer to a human translation standard. This is, of course, a very different process from carrying out a full human translation from scratch, which is why it’s important to give full training to staff who will carry out this task.

Training will make sure that post-editors fully understand what is expected of them when asked to complete one of the many types of post-editing task. Research (Vasconcellos, 1986a:145) suggests that post-editing is a honed skill which takes time to develop, so remember that your translators may need some time to reach their peak post-editing productivity levels. KantanMT works with many companies whose post-editors work at a rate of over 7,000 words per day, compared with an average of 2,000 words per day for full human translation.

Types of Training: The Translation Automation User Society (TAUS) is now holding online training courses for post-editors.

Post-editing Levels

Post-editing quality levels vary greatly and depend largely on the client or end-user. It’s important to get an exact understanding of user expectations and to manage these expectations throughout the project.

Typically, users of Machine Translation will ask for one of the following types of post-editing:

  • Light post-editing
  • Full post-editing

The following diagram gives a general outline of what is involved in both light and full post-editing. Remember, however, that the effort required to meet a given level of quality will be determined by the output quality your engine is able to produce.

[Diagram: light vs. full post-editing of machine translation]

Generally, MT users carry out productivity tests before they begin a project. These determine the effectiveness of MT for the language pair in a particular domain, and the post-editors’ ability to edit the output with a high level of productivity. Productivity tests will help you determine the potential Return on Investment of MT and the turnaround time for projects. It is also a good idea to carry out productivity tests periodically to understand how your MT engine is developing and improving. (Source: TAUS)
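The arithmetic behind such a productivity test can be sketched in a few lines of Python. The project size, day rate and currency below are hypothetical; only the 7,000 vs. 2,000 words-per-day rates come from the figures quoted above:

```python
# Hypothetical productivity-test arithmetic: compare the turnaround and cost
# of full human translation vs. MT plus post-editing for one project.

def project_costs(words, ht_words_per_day, pe_words_per_day, day_rate):
    """Return (days, cost) pairs for human translation and for post-editing."""
    ht_days = words / ht_words_per_day
    pe_days = words / pe_words_per_day
    return (ht_days, ht_days * day_rate), (pe_days, pe_days * day_rate)

# 100,000-word project, with an assumed linguist day rate of 300 (currency units).
(ht_days, ht_cost), (pe_days, pe_cost) = project_costs(
    100_000, ht_words_per_day=2_000, pe_words_per_day=7_000, day_rate=300)

print(f"Human translation: {ht_days:.0f} days, cost {ht_cost:.0f}")
print(f"Post-editing:      {pe_days:.1f} days, cost {pe_cost:.0f}")
print(f"Saving: {100 * (1 - pe_cost / ht_cost):.0f}%")
```

A real productivity test would of course measure the post-editing rate empirically for your own engine and post-editors rather than assume it.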

You might also develop a tailored approach to suit your company’s needs; however, the above diagram offers some useful guidelines to start with. Please note that a well-trained MT engine can produce near-human translations, and a light touch-up might be all that is required. It’s important to examine the quality of the output with post-editors before setting productivity goals and post-editing quality levels.


Post-Editor Skills

In recent years, post-editing skills have become much more of an asset and sometimes a requirement for translators working in the language industry. Machine Translation has grown considerably in popularity and the demand for post-editing services has grown in line with this. TechNavio predicted that the market for Machine Translation will grow at a compound annual growth rate (CAGR) of 18.05% until 2016, and the report attributes a large part of this rise to “the rapidly increasing content volume”.

While the task of post-editing is markedly different from human translation, the skill set needed is almost on par.

According to Johnson and Whitelock (1987), post-editors should be:

  • Be expert in the subject area, the text type and the contrastive language.
  • Have a perfect command of the target language.

It is also widely accepted that post-editors who have a favourable perception of Machine Translation perform better at post-editing tasks than those who do not look favourably on MT.

How to improve Machine Translation output quality

Pre-editing

Pre-editing is the process of adjusting text before it is sent for Machine Translation. This includes fixing spelling errors, formatting the document correctly and tagging text elements that must not be translated. Using a pre-processing tool like KantanMT’s GENTRY can save a lot of time by automating the correction of repetitive errors throughout the source text.
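The idea of tagging non-translatable elements can be illustrated with a small placeholder scheme. This is only a sketch of the general technique, not how GENTRY actually works; the patterns and placeholder format are invented for illustration:

```python
import re

# Minimal sketch of "protecting" non-translatable elements before MT:
# replace URLs, email addresses and product codes with numbered placeholders
# that the MT engine passes through untouched, then restore them afterwards.
DO_NOT_TRANSLATE = re.compile(r"https?://\S+|\b[\w.]+@[\w.]+\b|\b[A-Z]{2,}-\d+\b")

def protect(text):
    slots = []
    def repl(match):
        slots.append(match.group(0))
        return f"__DNT{len(slots) - 1}__"
    return DO_NOT_TRANSLATE.sub(repl, text), slots

def restore(text, slots):
    for i, original in enumerate(slots):
        text = text.replace(f"__DNT{i}__", original)
    return text

masked, slots = protect("See KMT-42 at https://example.com or mail info@kantanmt.com")
print(masked)  # each protected token is now a numbered __DNT*__ placeholder
```

Restoring with `restore(masked, slots)` returns the original string once the surrounding text has been translated.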

More pre-editing Steps:

Writing Clear and Concise Sentences: Shorter, unambiguous segments (sentences) are processed much more effectively by MT engines. Also, when pre-editing or writing for MT, make sure that each sentence is grammatically complete: it begins with a capital letter, has at least one main clause, and ends with punctuation.
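The completeness rules above (capital letter, final punctuation) are easy to check mechanically; detecting a main clause would need real parsing, so this minimal sketch covers only the surface checks:

```python
# Minimal pre-editing check: flag segments that do not start with a capital
# letter or do not end with sentence-final punctuation.
def incomplete_segments(segments):
    flagged = []
    for i, seg in enumerate(segments):
        seg = seg.strip()
        if not seg:
            continue  # skip empty segments
        if not seg[0].isupper() or seg[-1] not in ".!?":
            flagged.append((i, seg))
    return flagged

segments = [
    "The printer supports duplex printing.",
    "to change the toner cartridge",      # no capital, no final punctuation
    "Press the power button.",
]
for idx, seg in incomplete_segments(segments):
    print(f"segment {idx} needs pre-editing: {seg!r}")
```

A check like this can be run over a whole source file before translation to produce a worklist for the pre-editor.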

Using the Active Voice: MT engines work impressively on text that is clear and unambiguous; that’s why using the active voice, which cuts out vagueness and ambiguity, can result in much better MT output.

There are many pre-editing steps you can carry out to produce better MT output. Also, keep in mind writing styles when developing content for Machine Translation to cut the amount of pre-editing required. Get tips on writing for MT here.

For more information about any of KantanMT’s post-editing automation tools, please contact: Gina Lawlor, Customer Relationship Manager (ginal@kantanmt.com).

Overcome Challenges of building High Quality MT Engines with Sparse Data

Many of us involved with Machine Translation are familiar with the importance of using high quality parallel data to build and customize good quality MT engines. Building high quality MT engines with sparse data is a challenge faced not only by Language Service Providers (LSPs), but by any company with limited bilingual resources. A more economical alternative to creating large quantities of high quality bilingual data is to add monolingual data in the target language to an MT engine.

Statistical Machine Translation systems use algorithms to find the most probable translations, based on how often patterns occur in the training data, so it makes sense to use large volumes of bilingual training data. The best data to use for training MT engines is usually high quality bilingual data and glossaries, so it’s great if you have access to these language assets.
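At its simplest, “finding the most probable translation based on how often patterns occur” means taking relative frequencies over phrase pairs seen in the training data. The toy counts below are invented purely to illustrate the principle:

```python
from collections import Counter

# Toy phrase-translation model: estimate p(target | source) by relative
# frequency over (source, target) phrase pairs observed in training data,
# then pick the most probable translation for a source phrase.
pairs = [
    ("bank", "banque"), ("bank", "banque"), ("bank", "rive"),
    ("account", "compte"), ("account", "compte"),
]

counts = Counter(pairs)                     # how often each pair was seen
totals = Counter(src for src, _ in pairs)   # how often each source was seen

def p(tgt, src):
    return counts[(src, tgt)] / totals[src]

def best_translation(src):
    candidates = [tgt for s, tgt in counts if s == src]
    return max(candidates, key=lambda tgt: p(tgt, src))

print(p("banque", "bank"))       # 2 of 3 observations of "bank"
print(best_translation("bank"))
```

A real SMT system combines many such probability models (phrase tables, reordering, and a language model trained on monolingual data) rather than this single table, which is exactly why target-language monolingual data matters for fluency.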

But what happens when access to high quality parallel data is limited?

Bilingual data is costly and time-consuming to produce in large volumes, so the smart option is to turn to more economical language assets, and monolingual data is one of them. MT output fluency improves dramatically by using monolingual data to train an engine, especially in cases where good quality bilingual data is a sparse language resource.

More economical…

Many companies lack the necessary resources to develop their own high quality in-domain parallel data. But monolingual data is readily available in large volumes across different domains. This target language content can be found almost anywhere: websites, blogs, customer content and even company-specific documents created for internal use.

Companies with sparse parallel data can really leverage their available language assets with monolingual data to produce better quality engines, producing more fluent output. Even those with access to large volumes of bilingual data can still take advantage of using monolingual data to improve target language fluency.

Target language monolingual data is introduced during the engine training process so the engine learns how to generate fluent output. The positive effects of including monolingual data in the training process have been proven both academically and commercially.  In a study for TAUS, Natalia Korchagina confirmed that using monolingual data when training SMT engines considerably improved the BLEU score for a Russian-French translation system.

Natalia’s study not only “proved the rule” that in-domain monolingual data improves engine quality; she also identified that out-of-domain monolingual data improves quality too, but to a lesser extent.

Monolingual data can be particularly useful for improving scores in morphologically rich languages like Czech, Finnish, German and Slovak, as these languages are often syntactically more complicated for Machine Translation.

Success with Monolingual Data…

KantanMT has had considerable success with clients using monolingual data to improve their engines’ quality. An engine trained with sparse bilingual data (still greater than the amount of data in Korchagina’s study) in the financial domain showed a significant improvement in its overall quality metrics when financial monolingual data was added:

  • BLEU score showed approx. 40% improvement
  • F-Measure score showed approx. 12% improvement
  • TER (Translation Error Rate), where a lower score is better, saw a reduction of approx. 50%

The support team at KantanMT showed the client how to use monolingual data to their advantage, getting the most out of their engine, and empowering the client to improve and control the accuracy and fluency of their engines.

How will this Benefit LSPs…

Online shopping by users of what can be considered ‘lower density languages’ or languages with limited bilingual resources is driving demand for multilingual website localization. Online shoppers prefer to make purchases in their own language, and more people are going online to shop as global internet capabilities improve. Companies with an online presence and limited language resources are turning to LSPs to produce this multilingual content.

Most LSPs with access to vast amounts of high quality parallel data can still take advantage of monolingual data to help improve target language fluency. But LSPs building and training MT engines for uncommon language pairs or any language pair with sparse bilingual data will benefit the most by using monolingual data.

To learn more about leveraging monolingual data to train your KantanMT engine, send the KantanMT Team an email (info@kantanmt.com) and we can talk you through the process; alternatively, check out our whitepaper on improving MT engine quality, available from our resources page.

Quality Estimation Tool for MT: KantanAnalytics™

KantanMT recently announced the forthcoming release of KantanAnalytics™, a tool that provides segment-level quality analysis for Machine Translation output. KantanMT has developed this new technology in partnership with the CNGL Centre for Global Intelligent Content, which is also based at Dublin City University.

KantanAnalytics
KantanAnalytics measures the quality of the translations generated by KantanMT engines, providing a quality score for each segment translated through a KantanMT engine. This means that Language Service Providers (LSPs) will be able to:

  • accurately identify segments that require the most post-editing effort
  • accurately identify segments that match the client’s quality standards
  • better predict project completion times
  • offer more accurate pricing to their clients and set a price during the early stages of the project
  • build secure commercial Machine Translation frameworks
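As an illustration of how segment-level scores might feed into such a workflow, the sketch below routes segments into post-editing bands by score. The 0-100 scale, the band names and the thresholds are hypothetical, not KantanAnalytics’ actual scoring:

```python
# Hypothetical triage of MT segments by a per-segment quality score:
# high-scoring segments need little or no work, low-scoring ones need
# full post-editing. Scale and thresholds are illustrative only.
def triage(segments, high=85, low=50):
    """segments: list of (text, score) pairs. Returns segments per band."""
    bands = {"publish_as_is": [], "light_post_edit": [], "full_post_edit": []}
    for text, score in segments:
        if score >= high:
            bands["publish_as_is"].append(text)
        elif score >= low:
            bands["light_post_edit"].append(text)
        else:
            bands["full_post_edit"].append(text)
    return bands

scored = [("Segment A", 92), ("Segment B", 61), ("Segment C", 34)]
bands = triage(scored)
print({band: len(texts) for band, texts in bands.items()})
```

Counting words per band in this way is also the basis for the more accurate pricing and turnaround estimates mentioned above.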

KantanAnalytics is being rolled out to a sample of KantanMT members this month, July 2013. It will be made available to all members of the KantanMT platform in September 2013.

CNGL

The CNGL Centre for Global Intelligent Content
CNGL was established in 2007 as a collaborative academia-industry research centre aiming to break new ground in digital intelligent content and to “revolutionise the global content value chain for enterprises, communities, and individuals” (CNGL, 2013).

CNGL says that it intends to “pioneer development of advanced content processing technologies for content creation, multilingual discovery, translation and localization, personalisation, and multimodal interaction across global markets”. It adds that “these technologies will revolutionise the integration and unification of multilingual, multi-modal and multimedia content and interactions, and drive innovation across the global content value chain” (CNGL, 2013).

The body has received over €43 million in funding from Science Foundation Ireland (SFI) and key industry partners. Research for the KantanAnalytics project was co-funded by SFI in association with Enterprise Ireland.

CNGL has researchers at Trinity College Dublin, University College Dublin, University of Limerick, and Dublin City University. These researchers produce the aforementioned technologies in association with industry partners. Aside from KantanMT, CNGL has also entered partnerships with Microsoft, Intel, and Symantec to name but a few.

KantanAnalytics is the latest milestone in the partnership between KantanMT and CNGL and it will help to redefine current Machine Translation business models.

Please feel free to comment on this post or any previous ones; we’d love to hear from you!

If you would like to find out more about KantanMT and KantanAnalytics, visit KantanMT.com.
