Crowdsourcing vs. Machine Translation

Crowdsourcing has become more popular with both organizations and companies since the concept was introduced in 2006, and has been adopted by companies that use this new production model to improve their production capacity while keeping costs low. This web-based business model uses an open call format to reach a wide network of people willing to volunteer their services for free or for a limited reward, for any activity including translation. The application of translation crowdsourcing models has opened the door for increased demand for multilingual content.

Jeff Howe of Wired magazine defined crowdsourcing as:

“…the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call”.

Crowdsourcing costs equate to approximately 20% of the cost of a professional translation. Language Service Providers (LSPs) like Gengo and Moravia have realised the potential of crowdsourcing as part of a viable production model, which they are combining with professional translators and Machine Translation.

The crowdsourcing model is an effective method for translating the surge in User Generated Content (UGC). Erratic fluctuations in demand need a dynamic, flexible and scalable model. Crowdsourcing is definitely a feasible production model for translation services, but it still faces some considerable challenges.

Crowdsourcing Challenges

  • No specialist knowledge – crowdsourcing is difficult for technical texts that require specialised knowledge. It often involves breaking down a text to be translated into smaller sections to be sent to each volunteer. A volunteer may not be qualified in the relevant domain, and so ends up translating small sections of text, out of context, with limited subject knowledge, which leads to lower quality or mistranslations.
  • Quality – translation quality is difficult to manage, and is dependent on the type of translation. There have been some innovative suggestions for measuring quality, including evaluation metrics such as BLEU and Meteor, but these are costly and time consuming to implement and need a reference translation or ‘gold standard’ to benchmark against.
  • Security – crowd management can be a difficult task and the moderator must be able to vet participants and make sure that they follow the privacy rules associated with the platform. Sensitive information that requires translation should not be released to volunteers.
  • Emotional attachment – humans can become emotionally attached to their translations.
  • Terminology and writing style inconsistency – when the project is divided amongst a number of volunteers, the final version’s style needs to be edited and checked for inconsistencies.
  • Motivation – decisions on how to motivate volunteers and keep them motivated can be an ongoing challenge for moderators.

Improvements in the quality of Machine Translation have had an influence on crowdsourcing popularity and the majority of MT post-editing and proofreading tasks fit into crowdsourcing models nicely. Content can be classified into ‘find-fix-verify’ phases and distributed easily among volunteers.

There are some advantages to be gained when pairing MT technology and collaborative crowdsourcing.

Combined MT/Crowdsourcing

Machine Translation will have a pivotal role to play within new translation models, which focus on translating large volumes of data in cost-effective and powerful production models. Merging both Machine Translation and crowdsourcing tasks will create not only fit-for-purpose, but also high quality translations.

  • Quality – as the overall quality of Machine Translation output improves, it is easier for crowdsourcing volunteers with less experience to generate better quality translations. This will in turn increase the demand for crowdsourcing models to be used within LSPs and organizations. MT quality metrics will also make post-editing tasks more straightforward and easier to delegate among volunteers based on their experience.
  • Training data – word alignment and engine evaluations can be done through crowd computing, and parallel corpora created by volunteers can be used to train and/or retrain existing SMT engines.
  • Security – customized Machine Translation engines are more secure when dealing with sensitive product or client information. General or publicly available information is more suited to crowdsourcing.
  • Terminology and writing style consistency – writing style and terminology can be controlled and updated through a straightforward process when using MT. This avoids the idiosyncrasies of volunteer writing styles. There is no risk of translator bias when using Machine Translation.
  • Speed – Statistical Machine Translation (SMT) engines can process translations quickly and efficiently. When there is a need for a high volume of content to be translated within a short period of time it is better to use Machine Translation. Output is guaranteed within a designated time and crowdsourcing post-editing tasks speeds up the production process before final checks are carried out by experienced translators or post-editors.
Use of crowdsourcing for software localization. Source: V. Muntes-Mulero and P. Paladini, CA Technologies and M. Solé and J. Manzoor, Universitat Politècnica de Catalunya.

Last chance for a FREE TRIAL for KantanAnalytics™ for all members until November 30th 2013. KantanAnalytics will be available on the Enterprise Plan.

Pricing PEMT 2

Segment-by-segment Machine Translation Quality Estimation (QE) scores are reforming current Language Service Provider (LSP) business models.

Pricing Machine Translation is one of the most widely debated topics within the translation and localization industries. Many agree that there is no ‘black and white’ approach, because a number of variables must always be taken into consideration when costing a project. Industry experts are in agreement that levels of post-editing effort and payment should be calculated through a fair and easily replicated formula. This transparency is the goal KantanMT had in mind during the development of KantanAnalytics™, a “game-changing” technology in the localization industry.

New Business Model

The two greatest challenges facing Localization Project Managers are how to cost and how to schedule Machine Translation projects. Experienced PMs can quickly gauge how long a project will take to complete, but there is still an element of guesswork and contingency planning involved. This is intensified when you add Machine Translation. Although it is not a new technology, its practical application in a business environment is still in its infancy.

Powerful Machine Translation engines can be easily integrated into an LSP workflow. Measuring Machine Translation quality on a segment-by-segment basis and calculating post-editing effort on those segments allows LSPs to create more streamlined business models.

Studies have shown post-editing Machine Translation can be more productive than translating a document from scratch. This is especially true when translators or post-editors have a broad technical or subject knowledge of the text’s domain. In these cases they can capitalise on their knowledge with higher post-editing productivity.

So, how should a Machine Translation pricing model look?

The development of a technology that can evaluate a translation on a segment-by-segment basis and assign an accurate QE score to a Machine Translated text is critical for the successful integration of this technology into a project’s workflow.

The segment-by-segment breakdown and ‘fuzzy match’ percentage scoring system ensured the commercialisation of Translation Memories into LSP workflows. This system has been adopted as an industry standard for pricing translation jobs where translation memories or Computer Aided Translation (CAT) tools can be implemented. The next natural evolution is to create a similar tiered ‘fuzzy’ matching system for Machine Translation.

Segment-level QE technology is now available in which Machine Translated segments are assigned percentage match values, similar to translation memory match values. Post-editing costs can then be assigned in the same way that translation memory matches are costed. The match value also gives a clear indication of how long a project should take to post-edit, based on the quality of the match and the post-editor’s skills and experience.
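
To make the idea concrete, here is a minimal Python sketch of tiered pricing driven by segment-level QE scores. The bands, the rate fractions and the names `RATE_BANDS`, `Segment`, `rate_fraction` and `post_editing_cost` are assumptions made for illustration only; they are not KantanAnalytics values or an industry rate card.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical bands: (minimum QE score in percent, fraction of the full per-word rate).
# Illustrative numbers only, listed from the highest band to the lowest.
RATE_BANDS = [
    (95, 0.25),
    (85, 0.40),
    (75, 0.60),
    (50, 0.80),
    (0, 1.00),
]


@dataclass
class Segment:
    text: str
    qe_score: float  # estimated quality, 0-100, read like a TM match percentage


def rate_fraction(qe_score: float) -> float:
    """Return the fraction of the full rate that applies to a QE score."""
    for minimum, fraction in RATE_BANDS:
        if qe_score >= minimum:
            return fraction
    raise ValueError(f"QE score must not be negative, got {qe_score}")


def post_editing_cost(segments: Sequence[Segment], full_rate_per_word: float) -> float:
    """Sum the cost of every segment, charging each word at its band's rate."""
    total = 0.0
    for segment in segments:
        words = len(segment.text.split())
        total += words * full_rate_per_word * rate_fraction(segment.qe_score)
    return total


segments = [
    Segment("Click the Save button to store your changes", 97),
    Segment("The engine is retrained with the post-edited segments", 62),
]
print(round(post_editing_cost(segments, full_rate_per_word=0.10), 2))
```

Because every segment carries its own score, the same table can also be used to explain a quote line by line, which is the transparency that hourly rates lack.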

How can we trust the quality score?

The Machine Translation engine’s quality is based on the quality of the training data used to build the engine. The engine’s quality can be monitored with BLEU, F-measure and TER scores. These automatic evaluation metrics indicate the engine’s quality and, combined with the ‘fuzzy’ match score, can be used to get a more accurate picture of post-editing effort and of how projects should be priced. There are a number of variables that dictate how to create and implement a pricing model.
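
As a rough illustration of how an automatic metric compares MT output with a reference translation, the sketch below computes a simplified TER-style score: the number of word-level edits divided by the number of reference words. Full TER also counts block shifts, which this sketch leaves out, and the function names `word_edit_distance` and `simple_ter` are our own.

```python
def word_edit_distance(hypothesis, reference):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def simple_ter(hypothesis: str, reference: str) -> float:
    """Edits needed to turn the MT output into the reference, per reference word.

    Simplification: real TER also allows phrase shifts as a single edit.
    """
    hyp_words = hypothesis.split()
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp_words, ref_words) / len(ref_words)


mt_output = "the engine translate the file quickly"
reference = "the engine translates the files quickly"
print(round(simple_ter(mt_output, reference), 3))  # two substitutions over six words
```

A lower score means fewer edits separate the output from the reference, so it points to less post-editing effort.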

Variables to be considered when creating a pricing model

The challenge in measuring PEMT stems from a number of variables, which need to be considered by PMs when creating a pricing model:

  • Intended purpose – does the text require a light, fast or full post-edit?
  • Language pair and direction – Romance languages tend to provide better MT output
  • Quality of the MT system – better quality, domain specific engines produce better results
  • Post-editing effort – degree of editing required – minor edits or full retranslate
  • Post-editor skill and experience – post-editors with extensive domain expertise

Traditional Models

To overcome these challenges PMs traditionally opted for hourly or daily rates. However, hourly rates do not provide enough transparency or cost breakdown and can make a project difficult to schedule. These rates must also be calculated to take into consideration the translator’s or post-editor’s productivity and the language pair.

Rates are usually calculated based on the translator’s or post-editor’s average post-editing speed within the specified domain. Day rates can be a good cost indicator for PMs based on the post-editor’s capabilities and experience, but again the cost breakdown is not completely transparent. Difficulties usually occur when a post-editor comes across a part of the text that requires more time or effort to post-edit, at which point productivity drops.

Opinions in the translation community differ, and PEMT pricing depends on the post-editing circumstances. Some posters on the Proz.com forum suggest that PEMT is priced at 30-50%, or similar to editing a human translation. Others suggest that the output of a Machine Translation system is priced around the same as a ‘fuzzy’ match of 50-74% from a translation memory. These are broad, subjective figures which do not take variables into consideration.

Calculation of the Machine Translated text on a segment-by-segment basis allows PMs to calculate post-editing effort based on the quality of customised Machine Translation engines. PMs can then use these calculations to build an accurate pricing model for the project, which incorporates all relevant variables. It also makes it possible to distribute post-editing work evenly across translators and post-editors making the most efficient use of their skills. Benefits to calculating post-editing effort are also seen in scheduling and project turnaround times.
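
The sketch below shows one way segment-level effort estimates could be used to spread work evenly: sort segments by estimated effort and always hand the next one to the post-editor with the lightest load so far. The effort formula (word count scaled by how far the QE score is from 100) and the names `estimated_effort` and `distribute` are assumptions for illustration, not a KantanMT feature.

```python
import heapq


def estimated_effort(word_count: int, qe_score: float) -> float:
    """Hypothetical effort model: more words and a lower QE score mean more work."""
    return word_count * (1.0 - qe_score / 100.0)


def distribute(segments, post_editors):
    """Greedily assign the largest remaining segment to the least-loaded post-editor.

    segments: iterable of (segment_id, word_count, qe_score) tuples.
    Returns (assigned, loads): segment ids and total effort per post-editor.
    """
    heap = [(0.0, name) for name in post_editors]
    heapq.heapify(heap)
    assigned = {name: [] for name in post_editors}
    ordered = sorted(segments, key=lambda s: estimated_effort(s[1], s[2]), reverse=True)
    for segment_id, word_count, qe_score in ordered:
        load, name = heapq.heappop(heap)
        assigned[name].append(segment_id)
        heapq.heappush(heap, (load + estimated_effort(word_count, qe_score), name))
    loads = {name: load for load, name in heap}
    return assigned, loads


segments = [("s1", 20, 50), ("s2", 10, 90), ("s3", 30, 80), ("s4", 12, 75)]
assigned, loads = distribute(segments, ["Ana", "Ben"])
print(assigned)
print(loads)
```

In this small example one post-editor receives the single hardest segment and the other receives the three easier ones, so both end up with roughly the same estimated effort. A real schedule would also weigh each post-editor's domain expertise, which this sketch ignores.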


KantanAnalytics™ is a segment-by-segment quality estimation scoring technology, which when applied to a Machine Translated text will generate a quality score for each segment, similar to the fuzzy match scoring system used in translation memories.

Sign up for a free trial to experience KantanAnalytics until November 30th 2013. KantanAnalytics will be available on the Enterprise Plan. To sign up or upgrade to this plan, please email KantanMT’s Success Coach, Kevin McCoy (kevinmcc@kantanmt.com).

Training Data

Building a KantanMT Engine: Training Data

When the decision is made to incorporate a KantanMT engine into a translation model, the next obvious and most difficult question to answer is: what should be used to train the engine? This is often followed by: what are the optimum training data requirements to yield a highly productive engine? And how will I curate my training data?

The engine’s target domain and objectives should be clearly mapped out ahead of the build. If the documents are for a specific client or domain then the relevant in-domain training data should be used to build the engine. This also ensures the best possible translation results.

KantanMT recommends a minimum of 2 million training words for each domain specific engine. Higher quantities of in-domain “unique words” will also improve the potential for building an “intelligent” engine.

The quality of the engine is based on the language or translation assets used to build the engine. Studies by TAUS have shown that quality is more important than quantity. “Intelligently selected training data” generated higher BLEU scores than an engine built with more generic data. The studies also indicated that a proactive approach to customising or adapting the engine with translation assets led to better quality results.

Translation assets are the best source of suitable training data for building KantanMT engines. They include:

Stock Training Data: KantanMT stock engines are collections of highly cleansed bi-lingual training data sets. Quality is ensured as each data set shows the source corpora and approximate number of words used to create each stock engine. These can be added to client data to produce much larger and more powerful engines. There are over a hundred different stock engines to choose from, including industry specific sets such as IT, Legal, Medical and Finance. Find a list of KantanMT Stock engines here >>

Stock engines are a good starting point if you have limited TMX (Translation Memory Exchange) files in the required domain, or if you would simply like to build bigger KantanMT engines.

Translation Memory Files: This is the best source of high quality training data since both source and target texts are aligned. Translation Memories used for previous translations in a similar domain will also have been verified for quality, so the engine’s quality will be representative of the Translation Memory quality. As the old expression goes, “garbage in, garbage out”: good quality Translation Memory files will yield a good quality Machine Translation engine. The TMX file format is the optimal format for use with KantanMT; however, text files can also be used.

Monolingual Translated Text Files: Monolingual text files are used to create language models for a KantanMT engine. Language models are used for word and phrase selection and have a direct impact on the fluency and recall of KantanMT engines. Translated monolingual training data should be uploaded alongside bi-lingual training data when building KantanMT engines.

Glossary Files: Terminology or glossary files can also be used as training material. Including a glossary improves terminology consistency and translation quality. Terminology files are uploaded with your ‘files to be translated’ and should also be in a TBX file format.

KantanISR™: Instant segment retraining technology allows users to input edited segments via the KantanISR editor. The segments then become training data and are stored in the KantanISR cache. The new segments are incorporated into the engine, avoiding the need to rebuild. As corrected data is included, the engine will improve in quality becoming an even more powerful and productive KantanMT engine.
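
For readers who want to inspect a Translation Memory before uploading it, here is a minimal sketch that reads aligned segment pairs from a TMX document with Python's standard library and drops pairs that are empty, duplicated or suspiciously different in length. The sample data, the length-ratio threshold and the function names `read_pairs` and `clean_pairs` are assumptions for illustration; this is not how KantanMT itself processes uploads, and it only handles the `xml:lang` attribute used by TMX 1.4.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence" adminlang="en"
          o-tmf="none" creationtool="example" creationtoolversion="1"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Close the window.</seg></tuv>
      <tuv xml:lang="fr"><seg></seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) text pairs from every translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        by_lang = {}
        for variant in unit.iter("tuv"):
            seg = variant.find("seg")
            text = "".join(seg.itertext()).strip() if seg is not None else ""
            by_lang[variant.get(XML_LANG, "").lower()] = text
        if source_lang in by_lang and target_lang in by_lang:
            pairs.append((by_lang[source_lang], by_lang[target_lang]))
    return pairs


def clean_pairs(pairs, max_length_ratio=3.0):
    """Drop empty, duplicate and badly length-mismatched pairs."""
    seen = set()
    kept = []
    for source, target in pairs:
        if not source or not target:
            continue  # one side is empty
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue  # lengths differ so much that the pair may be misaligned
        if (source, target) in seen:
            continue  # exact duplicate of a pair already kept
        seen.add((source, target))
        kept.append((source, target))
    return kept


pairs = read_pairs(SAMPLE_TMX, "en", "fr")
print(len(pairs), "pairs read,", len(clean_pairs(pairs)), "kept")
```

The length-ratio check is only a crude proxy for alignment quality, so a suitable threshold will vary by language pair.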

Building your KantanMT engine can be a very rewarding process. While some time is needed to gather the best data for a domain specific engine, there are many ways to enhance your engine that require little effort.

For more information about preparing training data or engine re-training, please contact Kevin McCoy, KantanMT Success Coach.

PEMT Standards

In this blog series, we are discussing the area of post-editing. In our earlier posts, ‘The Rise of PEMT’ and ‘Cutting PEMT Times’, we discussed the meaning of automated post-editing, why its popularity is growing among Language Service Providers (LSPs), and how you can cut your post-editing times.

Machine Translated text can be post-edited to different quality levels. This post is based on post-editing guidelines that have been developed by TAUS with, among others, KantanMT’s partners DCU and CNGL. A link to these guidelines is available at the end of this post.

Post-editing to an understandable level
An understandable level of post-editing is a standard by which the main content of the message is correct and understandable for the user. However, the document’s readability may not be perfect and there may be a number of styling errors. Correct styling is not essential as long as the main message content is understandable.

Follow these rules to post-edit a translated text to an understandable level

  • Ensure that the meaning of the translated text is the same as the source text and that it is understandable to the user
  • Read through the document to make sure that there is no missing or excess information
  • Because the translation is part of the localization process, make sure that the content is not offensive or culturally insensitive
  • Correct basic spelling errors
  • Errors that only affect the styling of the document do not need to be changed. For example, there is no need to correct the following sentence: “Kantanmt is cloud based statistical machine translator platform”. Note: The stylistically correct version is “KantanMT is a cloud-based Statistical Machine Translation platform”
  • Remember that the fewer post-edits there are the better – use as much of the original Machine Translation output as possible
  • Don’t restructure sentences to improve the flow if the meaning is comprehensible

Post-editing to a quality standard similar to human translation
TAUS defines this level as being, “comprehensible (i.e. an end-user perfectly understands the content of the message), correct (i.e. it communicates the same meaning as the source text), stylistically fine, though the style may not be as good as that achieved by a native-speaking human translator. Syntax is normal, grammar and punctuation are correct”

Follow these rules to post-edit a translated text to this standard

  • Ensure that content is grammatically complete and structured logically, and that the meaning of the message is clear to the user
  • Check the translation of terms that are essential to the document and make sure that any untranslated terms have been requested to stay as such by the client
  • Read through the document to make sure that there is no missing or excess information
  • Because the translation is part of the localization process, make sure that the content is not offensive or culturally insensitive
  • Remember that the fewer post-edits there are the better – use as much of the original MT output as possible
  • Correct spelling errors and make sure that the document is correctly punctuated and well formatted

And that’s it! For errors such as misspellings or formatting mistakes, you can use KantanMT’s PEX technology to find and correct any repetitive errors throughout a document. This will help to speed up post-editing times while reducing post-editing costs.

TAUS Machine Translation Post-Editing Guidelines

You can find out more about KantanMT by visiting KantanMT.com and signing up to our free 14 day trial.

Cutting PEMT Times

In our last post, The Rise of PEMT, we discussed what automated post-editing means and why it is becoming more and more popular among Language Service Providers (LSPs). One of the most important things to remember about the post-editing process is that the less of it, the better.

In this post, we are going to look at some of the ways that you can keep your post-editing times to a minimum. This post is based on post-editing guidelines that have been developed by TAUS with, among others, KantanMT’s partners DCU and CNGL. A link to these guidelines is available at the end of this post.

7 steps to reducing your post-editing times

1. Train your KantanMT engine to improve translations
The quality of a KantanMT engine’s output increases as it is re-trained. This means running high quality training data through it and re-training using post-edited translations. The more you train your KantanMT engine with good training data, the more accurate your engine’s output will be. All of this means less post-editing time.

2. Make sure your training data is high quality
This rule stems directly from the previous point; a KantanMT engine’s accuracy will not improve if it is trained with poor quality data. Poor quality training data can be diagnosed by a number of factors such as poor writing style, unaligned segments, and data that is not specific to the client’s domain. Keep your training data clean and well-written.

3. Writing style/Pre-editing
It is very important to make sure that pre-translated documents are well written and grammatically correct. That means you should avoid misspellings and ambiguities, and make sure that sentences are grammatically complete. A Machine Translation engine does not correct writing errors, so make sure that these mistakes are corrected before the source text is translated. See our previous blogs, Style Guides in MT and How to Write for MT, for more information on this topic.

4. Terminology management
Ensure that terminology management is integrated “across source text authoring, Machine Translation and TM systems” (TAUS). Terminology management means defining terms and their rules of usage, and implementing these definitions and rules throughout a document. This safeguards a consistent level of accuracy and legibility across translation outputs.

5. Set realistic timelines
Make sure that you assess the quality of raw Machine Translation output before agreeing upon a price and the size of the translation order. Naturally, the poorer the output, the more post-editing time will be required.

6. Decide upon a quality standard of post-editing
For some clients, an understandable document is all that is required. This means that stylistic issues are generally ignored but the meaning of the document is still accurately conveyed. For many clients, however, the content must be perfect, and this requires a degree of post-editing that also incorporates corrections to stylistic issues. O’Brien et al., quoting Allen, say that the standard of post-editing output is determined by:

•    “User Requirements
•    Volume
•    Quality Expectations
•    Turn-Around Time
•    Perishability
•    Text Function”

Remember to agree upon a post-editing standard with your client. The lower the expected standard of output, the less time consuming the post-editing process should be.

7. Use KantanMT’s Post-Editing Automation technology (PEX)
In our last post, The Rise of PEMT, we discussed the benefits for post-editors in using automated post-editing within their workflow. Here is a quick reminder:

A document has been translated by a KantanMT engine but there is a word that begins with a lower case letter which should begin with a capital letter. This mistake has been repeated throughout the document several hundred times. Rather than a post-editor having to manually find and correct each occurrence of this error, KantanMT’s PEX technology can find and correct the mistake with its rule system. You can find out more about PEX by clicking here.

This means that post-editors can save time and turn their attention to fixing more complex stylistic errors. All of this results in faster project completion times and lower costs.
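
To illustrate the general idea of rule-based find-and-replace, here is a short sketch using Python's standard `re` module. It is not PEX and does not use PEX rule syntax; the rule list and the helper name `apply_rules` are hypothetical.

```python
import re

# Hypothetical rules as (compiled pattern, replacement) pairs, applied in order.
# Plain Python regular expressions, not the syntax of KantanMT's PEX rule files.
RULES = [
    (re.compile(r"\bkantanmt\b"), "KantanMT"),  # restore the product name's capitals
    (re.compile(r" {2,}"), " "),                # collapse runs of spaces
]


def apply_rules(text, rules=RULES):
    """Apply every rule to the text and count the replacements made."""
    total = 0
    for pattern, replacement in rules:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total


raw = "kantanmt builds engines.  Sign in to kantanmt today."
fixed, changes = apply_rules(raw)
print(fixed)
print(changes, "replacements")
```

One rule fixes every occurrence in a single pass, which is where the time saving over manual correction comes from. Rules this broad should still be reviewed, since a pattern can match text that was already correct in context.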

In our next post, we will look at guidelines to achieving both understandable post-editing output and high quality post-editing output.

TAUS Machine Translation Post-Editing Guidelines

The Rise of PEMT

More companies want multilingual content produced cheaply and quickly by Language Service Providers (LSPs); Machine Translation is becoming a more popular choice as a result.

TechNavio predicted that the market for Machine Translation will grow at a compound annual growth rate (CAGR) of 18.05% until 2016, and the report attributes a large part of this rise to “the rapidly increasing content volume”. Of course, while Machine Translation may help to cut costs and turnaround times, its success is ultimately judged on whether it can produce not only correct translations but also content that meets the quality standards of each individual client.

This places the spotlight firmly on the post-editing stage of the Machine Translation process. In this post, we are going to examine the Machine Translation post-editing stage and discuss how automatic post-editing can be incorporated into it.

What is Machine Translation post-editing?
Jeff Allen says the purpose of the post-editing stage, or more specifically the post-editor, is to “edit, modify, and/or correct pre-translated text that has been processed by an MT system from a source language into (a) target language(s)”. The most important thing to take from this is that post-editing is not the same as translation.

The fundamental aim of the post-editing process is to make Machine Translation output understandable or stylistically appropriate (depending on client requirements). Automatic post-editing is when computer technology is used to complete parts of the post-editing process.

Does this mean some stages of the post-editing process can be completely automated?
Not exactly. Automated post-editing is not an entirely mechanised process whereby a machine parses and corrects a document without human intervention. Humans must still proofread translation output and make sure that each client’s standards are met. However, post-editing technologies can automate a number of steps that would have previously required manual intervention and multiple edits by the post-editor.

As Bartolomé Mesa-Lao of Copenhagen Business School in Denmark says, the fewer edits required, the better a post-editor’s productivity. This is one of the main reasons why, in an age where companies want multilingual user content on-demand, post-editing technologies are becoming increasingly important to LSPs. If we take an example of using KantanMT’s post-editing technologies as part of the post-editing process, we can see how it works:

A document has been translated by a KantanMT engine but there is a word that begins with a lower case letter which should begin with a capital letter. This mistake has been repeated throughout the document several hundred times. Rather than a post-editor having to manually find and correct each occurrence of this error, KantanMT’s PEX technology can find and correct the mistake using its “find and replace” rules. This means that post-editors can save time and turn their attention to fixing more complex stylistic errors. All of this results in faster project completion times and lower costs.

In our next post, we will look at some of the best practices you can use to make sure that you keep your post-editing times to a minimum.

You can find out more about Machine Translation and KantanMT by going to KantanMT.com and signing up to our free 14 day trial.