Tips for Training Post-editors

A good quality Machine Translation engine relies on the quality of the bilingual data used to train it. For most MT users, this bilingual data is either translated by humans or fully post-edited MT output. In both cases, the quality of the data will influence the engine's quality.
Selçuk Özcan, Transistent's Co-founder, discusses the differences and gives some tips for successful post-editing. Özcan has given KantanMT permission to publish his blog post, which was originally published in Dragosfer and on the GALA Blog website.

We have entered a new age, and a new technology has come into play: Machine Translation (MT). It's globally accepted that MT systems dramatically increase productivity, but it's a hard struggle to integrate this technology into your production process. Apart from handling the engine building and optimization procedures, you have to transform your traditional workflow.


The traditional roles of the linguists (translators, editors, reviewers etc.) are reconstructed and converge to find a suitable place in this new, innovative workflow. The emerging task is called 'post-editing' and the linguists assigned to it are called 'post-editors'. You may want to recruit some willing linguists for this role, or persuade your staff to adopt a different point of view. But whatever the case may be, some training sessions are a must.

What should training sessions cover?

1. Basic concepts of MT systems

Post-editors should have a notion of the dynamics of MT systems. It is important to focus on the type of system that is used (RBMT/SMT/Hybrid). For the widely used SMT systems, post-editors need to know:

  • how the systems behave
  • the functions of the Translation Model and Language Model*
  • input (given set of data) and output (raw MT output) relationship
  • what changes in different domains

* It's not a must to give detailed information about these topics, but touching on them will make a difference in determining the level of the candidates' technical backgrounds. Some of the candidates may be included in the testing team.
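
If you want to make the Translation Model / Language Model distinction concrete in a training session, a toy sketch along these lines may help. This is a minimal illustration in Python with invented probabilities; a real SMT decoder such as Moses searches over vastly larger candidate sets:

```python
import math

# Toy illustration of how an SMT decoder ranks candidate translations:
# the Translation Model (TM) scores faithfulness to the source, the
# Language Model (LM) scores fluency in the target language. All
# probabilities below are invented for illustration.
candidates = {
    "the engine is training": {"tm": 0.04, "lm": 0.0100},
    "engine the training is": {"tm": 0.05, "lm": 0.0001},
}

def score(tm_prob, lm_prob, tm_weight=1.0, lm_weight=1.0):
    # Log-linear combination, the form used in phrase-based SMT.
    return tm_weight * math.log(tm_prob) + lm_weight * math.log(lm_prob)

best = max(candidates, key=lambda c: score(candidates[c]["tm"], candidates[c]["lm"]))
print(best)  # the fluent candidate wins, thanks to the language model
```

Even a toy like this shows why a candidate that is slightly less faithful to the source can still win: the language model rewards fluent target-language word order.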

2. The characteristics of raw MT output

Post-editors should know the factors affecting MT output. The difference between working with fuzzy TM matches and working with SMT output also has to be covered during a proper training session. Let's try to figure out what should be conveyed:

  • The MT process is not the 'T' of the TEP (Translation, Editing, Proofreading) workflow, and raw MT output is not the target text expected from that 'T' step.
  • In the earlier stages of an SMT engine, the output quality varies depending on the project's dynamics, and errors are not identical. As the system improves, the quality level becomes more even and consistent within the same domain.
  • There may be some word or phrase gaps in the system's pattern mappings. (Detecting these gaps is one of the main responsibilities of the testing team, but a successful post-editor must be informed about the possible gaps.)

3. Quality issues

This topic has two aspects: defining the required target (end product) quality, and evaluating and estimating the output quality. The first gives you the final destination; the second tells you where you are.

The required quality level is defined according to the project requirements, but it mostly depends on the target audience and the intended use of the target text. This seems similar to the procedure in the TEP workflow. However, it's slightly different: the engine improvement plan should also be considered while defining the target quality level. Basically, this parameter is classified into two groups: publishable and understandable quality.

The evaluation and estimation aspect is a little more complicated. The most challenging factor is standardizing measurement metrics. Besides, the tools and systems used to evaluate and estimate the quality level have some more complex features. If you successfully establish your quality system, the adversities become easier to cope with.

It's the post-editors' duty to apprehend the dynamics of MT quality evaluation, and the distinction between MT and HT quality evaluation procedures. Thus, they are supposed to be aware of the expected error patterns. It will be more convenient to utilize error categorization with your well-trained staff (QE staff and post-editors).

4. Post-editing Technique

The fourth and last topic is the key to success. It covers the appropriate methods and principles, as well as the perspective post-editors are expected to acquire. The post-editing technique is formed using the materials prepared for the previous topics and the data obtained from the above mentioned procedures, and it is defined separately for almost every individual customized engine.

The core rule for this topic is that the post-editing technique, as a concept, must be clearly differentiated from traditional editing and/or review technique(s). Post-editors should be capable of:

  • reading and analyzing the source text, raw MT output and categorized and/or annotated errors as a whole.
  • making changes where necessary.
  • considering the post-edited data as part of the data set to be used in engine improvement, and performing his/her work accordingly.
  • applying the rules defined for the quality expectation levels.

As briefly described in topic #3, the distance between the measured output quality and the required target quality may be seen as the post-edit distance. It roughly defines the post-editor's tolerance and the extent to which he/she will perform his work. The other criterion allowing us to define the technique and the performance is the target quality group: if the target text is expected to be of publishable quality, it's called a full post-edit; otherwise, a light post-edit. Light and full post-editing techniques can be briefly defined as above, but the distinction is not always so clear. Besides, the concepts of under- and over-editing should be added to the above mentioned issues. You may want to include some more details about these concepts in the post-editor training sessions; enriching the training materials with some examples would be a great idea!
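
To make the notion of post-edit distance tangible for trainees, here is a minimal sketch that counts word-level edits between raw MT output and its post-edited version. Production metrics such as TER are more sophisticated, but the underlying idea is the same:

```python
def levenshtein(a, b):
    # Word-level edit distance between raw MT output and its post-edit.
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (wa != wb)))    # substitution
        prev = curr
    return prev[-1]

raw = "the engine quality depends of the data".split()
post = "the engine's quality depends on the data".split()
edits = levenshtein(raw, post)
print(f"post-edit distance: {edits} edits, "
      f"{edits / max(len(raw), len(post)):.0%} of the longer segment")
```

A segment that needs two word substitutions out of seven words carries far less post-editing effort than one that must be retranslated, and tracking this ratio over a project is one simple way to watch the gap between measured and required quality shrink.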

About Selçuk Özcan

Selçuk Özcan has more than 5 years’ experience in the language industry and is a co-founder of Transistent Language Automation Services. He holds degrees in Mechanical Engineering and Translation Studies and has a keen interest in linguistics, NLP, language automation procedures, agile management and technology integration. Selçuk is mainly responsible for building high quality production models including Quality Estimation and deploying the ‘train the trainers’ model. He also teaches Computer-aided Translation and Total Quality Management at the Istanbul Yeni Yuzyil University, Translation & Interpreting Department.


Read More about KantanMT’s Partnership with Transistent in the official News Release, or if you are interested in joining the KantanMT Partner Program, contact Louise (info@kantanmt.com) for more details on how to get involved. 


Language Industry Interview: A Chat with Deepan Patel from Milengo

KantanMT caught up with Milengo's Machine Translation Solutions Architect, Deepan Patel, earlier this week for a quick chat about his experience using machine translation. Next month, Deepan will join Tony O'Dowd in a free live webinar to talk about how Milengo maximized its ROI for machine translation.

KantanMT: Can you tell me a little about yourself and, how you got involved in the industry?

Deepan Patel: To be honest, I sort of fell into the localization industry, but I am certainly very glad that I did! I am a Modern Languages graduate from the University of Oxford, which provided a very traditional approach to translation, certainly a million miles away from the realities of life in the localization industry.

I moved to Berlin after graduating in late 2008, and within a year I was fortunate enough to be accepted on a trainee program by my current employer Milengo Ltd, a language services provider founded in 2005. The first project I ever worked on involved the customization of statistical machine translation (SMT) engines for a customer wishing to test the long-term viability of incorporating machine translation and post-editing into their localization operations.

It was a tremendous experience for both myself and Milengo; it was really that initial project that laid the foundations for the MT-related services we now offer. The main focus of my work at Milengo relates to testing and deploying customized machine translation and post-editing workflows for clients requiring a completely outsourced MT solution.

KMT: How has MT affected or changed your business models at Milengo?

DP: I believe that having machine translation and post-editing as part of our service spectrum has lent us a significant competitive advantage. This was very apparent in September last year when we were approached by an eCommerce company with quite a formidable challenge: namely, they had 19 days in which to launch a new web shop for Sweden and around 780,000 words that needed to be localized from Danish into Swedish. And of course they had a very tight budget!

Through the experience we have gained running large-scale machine translation and post-editing projects over the years, we were able to confidently provide a compelling MT-based workflow solution which fell within our client's budget and would deliver high-quality translated content before their launch date. When giving their reasons for choosing us for that project, they said our confidence in stating that we could deliver on time was the main factor. Without our experience with machine translation, we would not have been able to win that project – it is as simple as that. We were able to deliver high-quality localized content within budget and before the initial deadline. And now we enjoy regular work from this client, localizing all the updates to their product descriptions across three language pairs.

So in essence, MT has enabled us to win those large-scale projects where customer budgets are limited, turnaround time is crucial but quality expectations are high, that we may not have stood a favourable chance of winning previously.

KMT: How do you use machine translation for your clients?

DP: When answering this question I must take pains to emphasize that our MT service offerings always involve post-editing. For one of our clients within the IT domain, we localize the online help for their software products across five language pairs using customized engines built from their own language assets. The requirement there is to deliver high-quality localized content at a significant cost reduction compared to a human-only translation model. For this particular customer we have achieved cost savings of between 27% and 40%, depending on the language pair.

For another of our clients, within the automotive sector, we have built custom MT systems across three language pairs to provide a cost-effective but high-quality localization solution for their huge volume of parts data. The initial challenge presented to us was to localize around 300,000 words of this data within a fairly tight timeframe – though not as challenging as our eCommerce client! We were first able to demonstrate the viability of customized machine translation and post-editing for this type of content via our free Machine Translation and Post-editing (MT-PE) feasibility study, after which point we deployed our workflow solution for their three requested target languages. Again for this customer, we have achieved cost savings of between 25% and 40% when compared to the traditional translation model and are enjoying continued business from them.

The third main scenario where we apply MT-PE is for our eCommerce client that I mentioned in my response to your previous question. They add new products to their web shop on a weekly basis and their very repetitive product descriptions need to be localized as soon as possible, so the content can go “live” on the different language sites. Together with this customer we are now focusing on automating as much of the project process as possible with regard to transfer of content via API connectors and using our customized MT systems as a fully-integrated part of their localization project workflow.

For all of these clients, we have been able to offer tiered-pricing packages based on the premise that the more content that we post-edit and feed back into their MT systems during re-training cycles, the better the system will perform on future projects. Consequently we can offer lower rates for localization at defined intervals. Really it is all about being able to demonstrate the long-term cost-savings possible with a customized MT-PE solution.

KMT: What advice can you give to translation buyers, interested in implementing a machine translation workflow strategy?

DP: Well, firstly I would encourage translation buyers to evaluate whether they have the time, budget and, most importantly, the relevant personnel within their organization to develop a custom MT solution, or whether it would make sense to turn to external help in the form of MT tech providers like KantanMT, or LSPs such as Milengo, who would additionally be able to provide post-editing solutions.

I would also encourage translation buyers to evaluate how MT can be applied in different usage scenarios. For example, it would certainly be worth investigating MT-PE for large-volume, highly repetitive content (user manuals, support documentation, catalogue data) where you can achieve significant cost savings and quicker turnaround without compromising on language quality (with excellent post-editors, of course). Another worthwhile scenario for MT would be if your company produces a lot of short life-cycle or customer support content which needs to be available in the languages of your customers as quickly as possible, and where transfer of meaning takes precedence over linguistic quality.

Thirdly I would ask the respective translation buyer to examine the state and volume of any language assets that they can use for customizing MT systems. Do you have enough of a training corpus to build MT systems which produce good quality MT output? Have your language assets been maintained well enough to ensure as much consistency in translation as possible? Remember that an MT system will only ever be as good as the material you use to train it. Again here external help may be useful in terms of applying data cleaning and normalization to the training corpus before you get round to building your MT systems.

Finally, I would always advise prospective translation buyers to consider the wider benefits of incorporating MT into their localization practices. The more you use your custom MT systems and the more post-edited content you incorporate into system re-training cycles, the better your systems will perform. This of course leads to greater productivity benefits and reduced costs for localization, which in turn means that you can free up more of your budget to localize content that was previously considered cost-prohibitive.

Thank you Deepan, for taking time out of your busy schedule to take part in this interview, and we look forward to hearing more from you in KantanMT's upcoming partner webinar. The webinar, Maximizing ROI for Machine Translation, will be held on Wed, Mar 11, 2015, 3:00 PM – 4:00 PM GMT.

Register for Webinar

Translation Quality: How to Deal with It?

KantanMT started the New Year on a high note with the addition of the Turkish Language Service Provider, Transistent, to the KantanMT Preferred MT Supplier partner program.

Selçuk Özcan, Transistent’s Co-founder has given KantanMT permission to publish his blog post on Translation Quality. This post was originally published in Dragosfer and the Transistent Blog.


The word quality has several meanings, one of them being "a high level of value or excellence" according to Merriam-Webster's dictionary. How should one deal with this idea of "excellence" when the issue at hand is translation quality? What is needed, it seems, is a more pragmatic and objective answer to that question.

This brings us to the question "how could an approach be objective?" Certainly, the issue should be assessed through empirical findings. But how? We are basically in need of an assessment procedure with standardized metrics. Here we encounter another issue: the standardization of translation quality. From this point on, we need to associate these concepts with the context itself in order to make them clear.

[Figure: the three sets of factors affecting translation quality – source-text monolingual issues, target-text monolingual issues, and bilingual issues]

As is widely known, three sets of factors affect the quality of the translation process in general. Basically, analyzing the source text's monolingual issues, the target text's monolingual issues and the bilingual issues defines the quality of the work done. Nevertheless, the procedure should be based on the requirements of the domain, the audience and the linguistic structure of both languages (source and target); and in each step, this key question should be considered: 'Does the TT serve the intended purpose?'

We still have not dealt with the standardization and quality of acceptable TTs. The concept of "acceptable translation" has always been discussed throughout the history of translation studies, and no one is able to precisely explain the requirements. However, a further study on dynamic QA models needs to go into details. There are various QA approaches and models. For most of them, an acceptable translation falls somewhere between bad and good quality, depending on the domain and target audience. The quality level is measured through translation error rates and metrics developed to assess MT output (BLEU, F-Measure and TER), and there are four commonly accepted quality levels: bad, acceptable, good and excellent.
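
For illustration, here is how such scores can be computed with the open-source sacrebleu package (assuming that package; MT platforms also expose their own scoring). The mapping of BLEU onto the four quality levels at the end is purely illustrative, not a standard:

```python
# pip install sacrebleu -- a widely used implementation of BLEU and TER
import sacrebleu

hypotheses = ["The engine produced a raw translation of the sentence."]
references = [["The engine produced a raw translation of this sentence."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # higher is better
ter = sacrebleu.corpus_ter(hypotheses, references)    # lower is better
print(f"BLEU: {bleu.score:.1f}  TER: {ter.score:.1f}")

# An illustrative (not standard) mapping of BLEU onto the four levels:
level = ("excellent" if bleu.score > 60 else
         "good" if bleu.score > 40 else
         "acceptable" if bleu.score > 20 else "bad")
print(level)
```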

The formula is simple: a TT containing more errors is considered to be of worse quality. However, the errors should be correlated with the context and many other factors, such as importance for the client, expectations of the audience and so on. These factors define an error's severity as minor, major or critical. A robust QA model should be based upon accurate error categorization so that reliable results may be obtained.
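
As a sketch of how error categorization and severity can feed a quality score, assuming invented severity weights and an invented pass threshold (real QA models define their own):

```python
# Illustrative severity weights; LISA-derived QA models (discussed
# below) define their own categories and weights.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def quality_score(errors, word_count, threshold=25.0):
    """Weighted error points per 1,000 words; lower is better."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    score = penalty / word_count * 1000
    return score, score <= threshold

errors = [("terminology", "major"),
          ("punctuation", "minor"),
          ("mistranslation", "critical")]
score, passed = quality_score(errors, word_count=1200)
print(f"{score:.1f} points per 1,000 words -> pass: {passed}")
```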

We have tried to briefly describe the concept of QA modeling. Now, let's see what goes on in practice. There are three publicly available QA models which have inspired many software developers in their QA tool development processes. One of them is the LISA (Localization Industry Standards Association) QA Model. The LISA Model is very well known in the localization and translation industry, and many company-specific QA models have been derived from it.

The second is the J2450 standard that was generated by the SAE (Society of Automotive Engineers), and the last one is the EN 15038 standard, approved by CEN (Comité Européen de Normalisation) in 2006. All of the above mentioned models are static QA models. One should create his/her own framework in compliance with the demands of the projects. Nowadays, many institutions (e.g., the EU Commission and TAUS) have been working on dynamic QA models. These models enable creating different metrics for different translation/localization projects.


Machine Translation Technology and Internet Security


KantanMT is delighted to republish, with permission, a post on machine translation technology and internet security that was recently written by Joseph Wojowski. Joseph Wojowski is the Director of Operations at Foreign Credits and Chief Technology Officer at Morningstar Global Translations LLC.


An issue that seems to have been brought up once in the industry and never addressed again concerns the data collection methods used by Microsoft, Google, Yahoo!, Skype, and Apple, as well as the revelations of PRISM data collection from those same companies, thanks to Edward Snowden. More and more, the industry appears to be moving closer to full Machine Translation integration and usage, and with interesting, if alarming, findings being reported on Machine Translation's usage when integrated into translation environments, the fact remains that Google Translate, Microsoft Bing Translator, and other publicly available machine translation interfaces and APIs store every single word, phrase, segment, and sentence that is sent to them.

Terms and Conditions

What exactly are you agreeing to when you send translation segments through the Google Translate or Bing Translator website or API?

1 – Google Terms and Conditions

Essentially, in using Google's services, you are agreeing to permit them to store each segment and use it to create more accurate translations in the future; they can also publish, display, and distribute the content.

“When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.” (Google Terms of Service – 14 April 2014, accessed on 8 December 2014)

Oh, and did I mention that in using the service, the user is bearing all liability for "LOST PROFITS, REVENUES, OR DATA, FINANCIAL LOSSES OR INDIRECT, SPECIAL, CONSEQUENTIAL, EXEMPLARY, OR PUNITIVE DAMAGES." (Google Terms of Service – 14 April 2014, accessed on 8 December 2014)

So if it is discovered that a client’s confidential content is also located on Google’s servers because of a negligent translator, that translator is liable for losses and Google relinquishes liability for distributing what should have been kept confidential.

Alright, that’s a lot of legal wording, not the best news, and a lot to take in if this is the first time you’re hearing about this. What about Microsoft Bing Translator?

2 – Microsoft Services Agreement (correction made to content – see below)

In writing their services agreement, Microsoft got very tricky. They start out positively by stating that you own your own content.

“Except for material that we license to you that may be incorporated into your own content (such as clip art), we do not claim ownership of the content you provide on the services. Your content remains your content, and you are responsible for it. We do not control, verify, pay for, or endorse the content that you and others make available on the services.” (Microsoft Services Agreement – effective 19 October 2012, accessed on 8 December 2014)

Bing! Bing! Bing! Bing! Bing! We have a winner! Right? Hold your horses, don’t install the Bing API yet. It continues on in stating,

"When you transmit or upload Content to the Services, you're giving Microsoft the worldwide right, without charge, to use Content as necessary: to provide the Services to you, to protect you, and to improve Microsoft products and services." (Microsoft Services Agreement – effective 19 October 2012, accessed on 8 December 2014)

So again with Bing, while they originally state that you own the content you submit to their services, they also state that in doing so, you are giving them the right to use the information as they see fit and (more specifically) to improve the translation engine.

How do these terms affect the translation industry, then?

The problem arises whenever translators are working with documents that contain confidential or restricted-access information. Aside from their use of webmail hosted by Microsoft, Google, Apple, etc. – which also poses a problem with confidentiality – the contents of documents that are sent through free, public machine translation engines, whether through the website or the API, leak the information the translator agreed to keep confidential in the Non-Disclosure Agreement (if established) with the LSP: a clear and blatant breach of confidentiality.

But I’m a professional translator and have been for years, I don’t use MT and no self-respecting professional translator would.

Well, yes and no; a conflict arises from that mode of thinking. In theory, yes, a professional translator should know better than to blindly use Machine Translation because of its inaccurate and often unusable output. A professional translator, however, should also recognize that with advancements in MT technology, Machine Translation can be a very powerful tool in the translator's toolbox and can, at times, greatly aid in the translation of certain documents.

The current state of MT use echoes the latter more than the former. In 2013, research conducted by Common Sense Advisory found that 64% of the 239 people who responded to the survey reported that colleagues frequently use free Machine Translation engines; 62% of those sampled were concerned about free MT usage.

In the November/December 2014 issue of the ATA Chronicle, Jost Zetzsche relayed information on how users were using the cloud-based translation tool MemSource. Of particular interest are the Machine Translation numbers relayed to him by David Canek, Founder of MemSource: 46.2% of its roughly 30,000 users (about 13,860 translators) were using Machine Translation; of those, 98% were using the Google Translate API or a variant of the Bing Translator API. And of still greater alarm, a large percentage of Bing Translator users chose to employ the "Microsoft with Feedback" option, which sends the finalized target segment back to Microsoft (a financially appealing option, since when it is selected, use of the API costs nothing).

As you can imagine, while I was reading that article, I was yelling at all 13.9 thousand of them through the magazine. How many of them were using Google or Bing MT with documents that should not have been sent to either Google or Microsoft? How many of these users knew to shut off the API for such documents – how many did?

There’s no way to be certain how much confidential information may have been leaked due to translator negligence, in the best scenario perhaps none, but it’s clear that the potential is very great.

On the other hand, in creating a tool as dynamic and ever-changing as a machine translation engine, the only way to train it and make it better is to use it – a sentiment that is echoed throughout the industry by developers of MT tools, and something that can be seen in the output of Google Translate over the past several years.

So what options are there for me to have an MT solution for my customers without risking a breach in confidentiality?

There are numerous non-public MT engines available – including Apertium, a developing open-source MT platform – however, none of them are as widely used (and therefore, as well-trained) as Google Translate or Bing Translator (yes, I realize that I just spent over 1,000 words talking about the risk involved in using Google Translate or Bing Translator).

So, is there another way? How can you gain the leverage of arguably the best-trained MT Engines available while keeping confidential information confidential?

There are companies that have foreseen this problem and addressed it. Without pitching any particular product, here's how it works: a layer acts as an MT API, but before any segments are sent across your firewall to Google, it replaces all names, proper nouns, locations, positions, and numbers with an independent, anonymous token or placeholder. After the translated segment has returned from Google and is safely within the confines of your firewall, the potentially confidential material then replaces the tokens, leaving you with the MT-translated segment. On top of that, it also allows for customized tokenization rules to further anonymize sensitive data such as formulae, terminology, processes, etc.
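
Here is a minimal sketch of that placeholder idea, using two simplistic regex patterns for illustration; a production implementation would rely on proper named-entity recognition and client-specific tokenization rules:

```python
import re

# Sensitive spans are swapped for neutral tokens before a segment
# crosses the firewall, and restored after the translated segment
# comes back. The patterns below are deliberately simplistic.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("NUM", re.compile(r"\b\d[\d,.]*\b")),
]

def anonymize(segment):
    mapping = {}
    for label, pattern in PATTERNS:
        def sub(match, label=label):
            token = f"__{label}{len(mapping)}__"
            mapping[token] = match.group(0)
            return token
        segment = pattern.sub(sub, segment)
    return segment, mapping

def restore(segment, mapping):
    for token, original in mapping.items():
        segment = segment.replace(token, original)
    return segment

masked, mapping = anonymize("Invoice 4711 was sent to jane.doe@example.com")
# masked -> "Invoice __NUM1__ was sent to __EMAIL0__"
# ... send `masked` to the public MT API, get the translation back ...
print(restore(masked, mapping))
```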

While the purpose of this article was not to prevent translators from using MT, it is intended to get translators thinking about its use and increase awareness of the inherent risks and solution options available.

— Correction —

As I have been informed, the information in the original post is not as exact as it could be; there is a Microsoft Translator Privacy Agreement that more specifically addresses use of the Microsoft Translator. Apparently, with Translator, they take a sample of no more than 10% of "randomly selected, non-consecutive sentences from the text" submitted. Unused text is deleted within 48 hours after the translation is provided.

If the user subscribes to a data subscription with a maximum of 250 million characters per month (also available at levels of 500 million, 635 million, and one billion), he or she is then able to opt out of logging.

There is also the Microsoft Translator Hub, which allows the user to personalize the translation engine, where "The Hub retains and uses submitted documents in full in order to provide your personalized translation system and to improve the Translator service." And it should be noted that, "After you remove a document from your Hub account we may continue to use it for improving the Translator service."

***

So let's analyze this development. Up to 10% of the submitted text is sampled, and unused text is deleted within 48 hours after the translation is provided. The sampled text is still potentially from a sensitive document and still warrants awareness of the issue.

If you use the Translator Hub, it uses the full document to train the engine, and even after you remove the document from your Hub, they may continue to use it to improve the Translator service.

Now break out the calculators and slide rules, kids, it’s time to do some math.

In order to opt out of logging, you need to purchase a data subscription of 250 million characters per month or more (the 250 million character level costs $2,055.00/month). If every word were 50 characters, that would be 5 million words per month (where a month is 31 days), and a post-editor would have to process 161,290 words per day (working every single day of this 31-day month). It's physically impossible for a post-editor to process 161,290 words in a day; even 161,290 words in a month would be demanding (working 8 hours a day for 20 days a month, that monthly total works out to 8,064.5 words per day). So we can safely assume that no freelance translator can afford to buy in at the 250 million character/month level, especially when even in the busiest month a single translator comes nowhere near being able to edit the number of words necessary to make it a financially sound expense.
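
For anyone who wants to check the arithmetic, under the same assumptions:

```python
chars_per_month = 250_000_000   # minimum subscription level that allows opting out
chars_per_word = 50             # the assumption used above
days_per_month = 31

words_per_month = chars_per_month / chars_per_word
print(words_per_month)                   # 5,000,000 words per month
print(words_per_month / days_per_month)  # ~161,290 words every single day
print(161_290 / 20)                      # 8,064.5 words/day if 161,290 were a *monthly* total
```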

In the end, I still come to the same conclusion, we need to be more cognizant of what we send through free, public, and semi-public Machine Translation engines and educate ourselves on the risks associated with their use and the safer, more secure solutions available when working with confidential or restricted-access information.

The KantanMT team would like to thank Joseph Wojowski for allowing us to republish his very interesting and topical post on machine translation security. You can view the original post here.

KantanMT Security: Key to Translation Success

At KantanMT, security, integrity and the privacy of our customers’ data is a top priority. We believe this is vital to their business operations and to our own success. Therefore, we use a multilayered approach to protect and encrypt this information. The KantanMT Data Privacy statement ensures that no client data is re-published, re-tasked or re-purposed and will also be fully encrypted during storage and transmission.

Read more about the KantanMT Data Privacy Infrastructure (PDF Download)

For more information about our security infrastructure please contact the KantanMT Sales Team (sales@kantanmt.com).

Language Industry Interview: KantanMT speaks with Maxim Khalilov, bmmt Technical Lead

This year, both KantanMT and its preferred Machine Translation supplier, bmmt, a progressive Language Service Provider with an MT focus, exhibited side by side at the tekom Trade Fair and tcworld conference in Stuttgart, Germany.

As a member of the KantanMT preferred partner program, bmmt works closely with KantanMT to provide MT services to its clients, which include major players in the automotive industry. KantanMT was able to catch up with Maxim Khalilov, technical lead and ‘MT guru’ to find out more about his take on the industry and what advice he could give to translation buyers planning to invest in MT.

KantanMT: Can you tell me a little about yourself and, how you got involved in the industry?

Maxim Khalilov: It was a long and exciting journey. Many years ago, I graduated from the Technical University in Russia with a major in computer science and economics. After graduating, I worked as a researcher for a couple of years in the sustainable energy field. But even then I knew I wanted to come back to the IT industry.

In 2005, I started a PhD at the Universitat Politecnica de Catalunya (UPC) with a focus on Statistical Machine Translation, which was a very new topic back then. By 2009, after successfully defending my thesis, I moved to Amsterdam, where I worked as a post-doctoral researcher at the University of Amsterdam and later as an R&D manager at TAUS.

Since February 2014, I’ve been a team lead at bmmt GmbH, which is a German LSP with strong focus on machine translation.

I think my previous experience helped me to develop a deep understanding of the MT industry from both academic and technical perspectives. It also gave me a combination of research and management experience in industry and academia, which I am applying by building a successful MT business at bmmt.

KMT: As a successful entrepreneur, what were the three greatest industry challenges you faced this year?

MK: This year has been a challenging one for us from both technical and management perspectives. We started to build an MT infrastructure around MOSES practically from scratch. MOSES was developed by academia and for academic use, and because of this we immediately noticed that many industrial challenges had not yet been addressed by MOSES developers.

The first challenge we faced was that the standard solution does not offer a solid tag-processing mechanism – we had to invest in customizing the MOSES code to make it compatible with what we wanted to achieve.

The second challenge we faced is one that many players in the MT market are constantly talking about: the lack of reliable, quick and cheap quality evaluation metrics. BLEU-like scores unfortunately are not always applicable to real-world projects. Even if they are useful when comparing different iterations of the same engine, they are not useful for cross-language or cross-client comparison.

Interestingly, the third problem is psychological in nature: post-editors are not always happy to post-edit MT output, for many reasons, including of course the quality of the MT. However, in many situations the problem is that MT post-editing requires a different skillset compared with 'normal' translation, and it will take time before translators adapt fully to post-editing tasks.

KMT: Do you believe MT has a say in the future, and what is your view on its development in global markets?

MK: Of course, MT will have a big say in the future of language services. We can see now that the MT market is expanding quickly as more and more companies adopt a combined TM-MT-PE framework as their primary localization solution.

“At the same time, users should not forget that MT has its clear niche”

I don't think a machine will ever be able to translate poetry, for example, but at the same time it does not need to – MT has proved to be more than useful for the translation of technical documentation, marketing material and other content which represents more than 90% of translators' daily load worldwide.

Looking at the near future, I see that the integration of MT and other cross-language technologies with Big Data technologies will open new horizons for Big Data, making it a truly global technology.

KMT: How has MT affected or changed your business models?

MK: Our business model is built around MT; it allows us to deliver translations to our customers quicker and cheaper than without MT, while at the same time preserving the same level of quality and guaranteeing data security. We not only position MT as a competitive advantage when it comes to translation, but also as a base technology for future services. My personal belief, which is shared by other bmmt employees, is that MT is a key technology that will make our world different – a world where translation is available on demand, when and where consumers need it, at a fair price and at its expected quality.

KMT: What advice can you give to translation buyers, interested in machine translation?

MK: MT is still a relatively new technology, but at the same time there are already a number of best practices available for new and existing players in the MT market. In my opinion, the four key points for translation buyers to remember when thinking about adopting machine translation are:

  1. Don't mix it up with TM – While TMs mostly support human translators by storing previously translated segments, MT translates complete sentences automatically; the main difference lies in the new words and phrases that are not stored in a TM database (see the sketch after this list).
  2. There is more than one way to use MT – MT is flexible; it can be a productivity tool that enables translators to deliver translations faster with the same quality as in the standard translation framework. Or MT can be used for 'gisting' without any post-editing at all – something that many translation buyers forget about, but which can be useful in many business scenarios. A good example of this type of scenario is the integration of MT into chat widgets for real-time translation.
  3. Don't worry about quality – Quality Assurance is always included in the translation pipeline, and we, like many other LSPs, guarantee a desired level of quality for all translations, independently of how they were produced.
  4. Think about time and cost – MT enables quicker and cheaper translation delivery.
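
To make point 1 concrete, here is a minimal sketch of how a combined TM-MT workflow might route segments, assuming an illustrative 85% fuzzy-match threshold and difflib's similarity ratio as a stand-in for a real fuzzy-match algorithm:

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> approved target.
TM = {"Press the start button.": "Drücken Sie die Starttaste."}

def fuzzy(a, b):
    # Stand-in for a real fuzzy-match score (TM tools use their own
    # algorithms); returns a 0-100 similarity percentage.
    return SequenceMatcher(None, a, b).ratio() * 100

def translate(segment, mt_engine, threshold=85):
    best = max(TM, key=lambda s: fuzzy(segment, s), default=None)
    if best and fuzzy(segment, best) >= threshold:
        return TM[best], "TM"        # reuse the stored human translation
    return mt_engine(segment), "MT"  # new material goes to the engine

for seg in ["Press the start button.", "Configure the network adapter."]:
    target, origin = translate(seg, mt_engine=lambda s: "<raw MT output>")
    print(origin, "->", target)
```

The previously translated sentence comes straight out of the TM, while the unseen sentence falls through to the MT engine for translation and post-editing.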

A big ‘thank you’ to Maxim for taking time out of his busy schedule to take part in this interview, and we look forward to hearing more from Maxim during the KantanMT/bmmt joint webinar ‘5 Challenges of Scaling Localization Workflows for the 21st Century’ on Thursday November 20th (4pm GMT, 5pm CET and 8am PST).


Register here for the webinar or to receive a copy of the recording. If you have any questions about the services offered from either bmmt or KantanMT please contact:

Peggy Linder, bmmt (peggy.lindner@bmmt.eu)

Louise Irwin, KantanMT (louisei@kantanmt.com)

Post-Editing Machine Translation

Statistical Machine Translation (SMT) has many uses – from the translation of User Generated Content (UGC) to Technical Documents, to Manuals and Digital Content. While some use cases may only need a ‘gist’ translation without post-editing, others will need a light to full human post-edit, depending on the usage scenario and the funding available.

Post-editing is the process of 'fixing' Machine Translation output to bring it closer to a human translation standard. This, of course, is a very different process from carrying out a full human translation from scratch, which is why it's important to provide full training for staff who will carry out this task.

Training will make sure that post-editors fully understand what is expected of them when asked to complete one of the many post-editing type tasks. Research (Vasconcellos – 1986a:145) suggests that post-editing is a honed skill which takes time to develop, so remember that your translators may need some time to reach their greatest post-editing productivity levels. KantanMT works with many companies whose staff post-edit at a rate of over 7,000 words per day, compared to an average of 2,000 words per day for full human translation.

Types of Training: The Translation Automation User Society (TAUS) is now holding online training courses for post-editors.


Post-editing Levels

Post-editing quality levels vary greatly and will depend largely on the client or end-user. It's important to get an exact understanding of user expectations and manage these expectations throughout the project.

Typically, users of Machine Translation will ask for one of the following types of post-editing:

  • Light post-editing
  • Full post-editing

The following diagram gives a general outline of what is involved in both light and full post-editing. Remember, however, that the effort needed to meet certain levels of quality will be determined by the output quality your engine is able to produce.

[Diagram: typical tasks involved in light vs. full post-editing]

Generally, MT users carry out productivity tests before they begin a project. This determines the effectiveness of MT for the language pair in a particular domain, and their post-editors' ability to edit the output with a high level of productivity. Productivity tests will help you determine the potential Return on Investment of MT and the turnaround time for projects. It is also a good idea to carry out productivity tests periodically to understand how your MT engine is developing and improving. (Source: TAUS)
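
The arithmetic behind such a test is straightforward; here is a sketch with invented figures:

```python
# Invented figures from a hypothetical productivity test.
baseline = {"words": 12_000, "hours": 48}   # translating from scratch
post_edit = {"words": 21_000, "hours": 42}  # post-editing MT output

rate_ht = baseline["words"] / baseline["hours"]     # 250 words/hour
rate_pe = post_edit["words"] / post_edit["hours"]   # 500 words/hour
gain = rate_pe / rate_ht - 1

print(f"HT: {rate_ht:.0f} w/h, PE: {rate_pe:.0f} w/h, gain: {gain:.0%}")
```

Running the same comparison per language pair and per domain shows where MT pays off and where it does not, which feeds directly into ROI and turnaround estimates.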

You might also develop a tailored approach to suit your company's needs; however, the above diagram offers some useful guidelines to start with. Please note that a well-trained MT engine can produce near-human translations, and a light touch-up might be all that is required. It's important to examine the quality of the output with post-editors before setting productivity goals and post-editing quality levels.


Post-Editor Skills

In recent years, post-editing skills have become much more of an asset and sometimes a requirement for translators working in the language industry. Machine Translation has grown considerably in popularity and the demand for post-editing services has grown in line with this. TechNavio predicted that the market for Machine Translation will grow at a compound annual growth rate (CAGR) of 18.05% until 2016, and the report attributes a large part of this rise to “the rapidly increasing content volume”.

While the task of post-editing is markedly different from human translation, the skill set needed is almost on par.

According to Johnson and Whitelock (1987), post-editors should:

  • be expert in the subject area, the text type and the contrastive language.
  • have a perfect command of the target language.

It is also widely accepted that post-editors who have a favourable perception of Machine Translation perform better at post-editing tasks than those who do not look favourably on MT.

How to improve Machine Translation output quality

Pre-editing

Pre-editing is the process of adjusting text before it has been Machine Translated. This includes fixing spelling errors, formatting the document correctly and tagging text elements that must not be translated. Using a pre-processing tool like KantanMT’s GENTRY can save a lot of time by automating the correction of repetitive errors throughout the source text.

More pre-editing Steps:

Writing Clear and Concise Sentences: Shorter, unambiguous segments (sentences) are processed much more effectively by MT engines. Also, when pre-editing or writing for MT, make sure that each sentence is grammatically complete (begins with a capital letter, has at least one main clause, and has ending punctuation).

Using the Active Voice: MT engines work impressively on text that is clear and unambiguous; that's why using the active voice, which cuts out vagueness and ambiguity, can result in much better MT output.

There are many pre-editing steps you can carry out to produce better MT output. Also, keep in mind writing styles when developing content for Machine Translation to cut the amount of pre-editing required. Get tips on writing for MT here.
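
As an illustration of the kind of rule-based pass a pre-processing tool automates, here is a sketch with a few invented rules (a tool like GENTRY automates this sort of repetitive correction; its actual rule syntax is not shown here):

```python
import re

# Illustrative pre-editing passes in the spirit of the steps above:
# normalize whitespace, fix a recurring typo, protect non-translatable
# elements, and ensure ending punctuation. All rules are invented.
NO_TRANSLATE = re.compile(r"(\{[A-Z_]+\}|https?://\S+)")  # placeholders, URLs

def pre_edit(text):
    text = re.sub(r"\s+", " ", text).strip()           # normalize whitespace
    text = text.replace("recieve", "receive")          # fix a recurring typo
    text = NO_TRANSLATE.sub(r"<keep>\1</keep>", text)  # protect non-translatables
    if text and text[-1] not in ".!?":
        text += "."                                    # ensure ending punctuation
    return text

print(pre_edit("Click {SAVE_BUTTON} to  recieve the file from https://example.com"))
```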

For more information about any of KantanMT’s post-editing automation tools, please contact: Gina Lawlor, Customer Relationship Manager (ginal@kantanmt.com).

KantanMT – 2013 Year in Review

KantanMT had an exciting year as it transitioned from a publicly funded business idea into a commercial enterprise that was officially launched in June 2013. The KantanMT team are delighted to have surpassed expectations by developing and refining cutting-edge technologies that make Machine Translation easier to understand and use.

Here are some of the highlights for 2013, as KantanMT looks back on an exceptional year.

Strong Customer Focus…

The year started on a high note, with the opening of a second office in Galway, Ireland, and KantanMT kept the forward momentum going as the year progressed. The Galway office is focused on customer service, product education and Customer Relationship Management (CRM), and is home to Aidan Collins, User Engagement Manager; Kevin McCoy, Customer Relationship Manager and MT Success Coach; and Gina Lawlor, Customer Relationship Co-ordinator.

KantanMT officially launched the KantanMT Statistical Machine Translation (SMT) platform as a commercial entity in June 2013. The platform was tested pre-launch by both industry and academic professionals, and was presented at the European OPTIMALE (Optimizing Professional Translator Training in a Multilingual Europe) workshop in Brussels. OPTIMALE is an academic network of 70 partners from 32 European countries, and the organization aims to promote professional translator training as the translation industry merges with the internet and translation automation.

The KantanMT Community…

The KantanMT members' community now includes top-tier Language Service Providers (LSPs), multinationals and smaller organizations. In 2013, the community grew from 400 members in January to 3,400 registered members in December, and in response to this growth, KantanMT introduced two partner programs with the objective of improving the Machine Translation ecosystem.

These are the Developer Partner Program, which supports organizations interested in developing integrated technology solutions, and the Preferred Supplier of MT Program, dedicated to strengthening the use of MT technology in the global translation supply chain. KantanMT's Preferred Suppliers of MT are: [partner logos]

KantanMT’s Progress…

To date, the most popular target languages on the KantanMT platform are French, Spanish and Brazilian Portuguese. Members have uploaded more than 67 billion training words and built approximately 7,000 customized KantanMT engines that have translated more than 500 million words.

As usage of the platform increased, KantanMT focused on developing new technologies to improve the translation process, including a mobile application for iOS and Android that allows users to get access to their KantanMT engines on the go.

KantanMT’s Core Technologies from 2013…

KantanMT have been kept busy continuously developing and releasing new technologies to help clients build robust business models to integrate Machine Translation into existing workflows.

  • KantanAnalytics™ – segment-level Quality Estimation (QE) analysis that presents a percentage 'fuzzy match' score for KantanMT translations, providing a straightforward method for costing and scheduling translation projects.
  • BuildAnalytics™ – a QE feature designed to measure the suitability of the uploaded training data. The technology generates a segment-level percentage score on a sample of the uploaded training data.
  • KantanWatch™ – makes monitoring the performance of KantanMT engines more transparent.
  • TotalRecall™ – combines TM and MT technology: TM matches with a 'fuzzy match' score of less than 85% are automatically put through the customized MT engine, giving users the benefits of both technologies.
  • KantanISR™ – Instant Segment Retraining technology that allows members near-instantaneous correction and retraining of their KantanMT engines.
  • PEX Rule Editor – an advanced pattern-matching technology that allows members to correct repetitive errors, making for a smoother post-editing process by reducing post-editing effort, cost and time (see the sketch after this list).
  • Kantan API – critical for the development of software connectors and the smooth integration of KantanMT into existing translation workflows. The success of the MemoQ connector led to the development of subsequent connectors for MemSource and XTM.
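
As an illustration of the general idea behind pattern-based correction of repetitive errors, here is a sketch (the rules below are invented examples, not actual PEX syntax):

```python
import re

# Invented find-and-replace rules in the spirit of pattern-based
# post-editing: each rule is a regex plus a replacement, applied to
# raw MT output before the post-editor sees it.
RULES = [
    (re.compile(r"\bdepends of\b"), "depends on"),  # wrong preposition
    (re.compile(r"\s+([,.;:!?])"), r"\1"),          # space before punctuation
    (re.compile(r"\bSofware\b"), "Software"),       # recurring typo
]

def apply_rules(segment):
    for pattern, replacement in RULES:
        segment = pattern.sub(replacement, segment)
    return segment

print(apply_rules("The Sofware depends of the settings ."))
# -> "The Software depends on the settings."
```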

KantanMT sourced and cleaned a range of bi-directional, domain-specific stock engines that consist of approximately six million words across the legal, medical and financial domains and made them available to its members. KantanMT also developed support for Traditional and Simplified Chinese, Japanese, Thai and Croatian during 2013.

Recognition as Business Innovators…

KantanMT received awards for business innovation and entrepreneurship throughout the year. Founder and Chief Architect, Tony O’Dowd was presented with the ICT Commercialization award in September.

In October, KantanMT was shortlisted for the PITCH start-up competition and participated in the ALPHA Program for start-ups at Dublin’s Web Summit, the largest tech conference in Europe. Earlier in the year KantanMT was also shortlisted for the Vodafone Start-up of the Year awards.

KantanMT were silver sponsors at the annual 2013 ASLIB Conference, which adopted the theme 'Translating and the Computer' and took place in London in November. In October, Tony O'Dowd presented at the TAUS Machine Translation Showcase at Localization World in Silicon Valley.

KantanMT have recently published a white paper introducing its cornerstone Quality Estimation technology, KantanAnalytics, and how this technology provides solutions to the biggest industry challenges facing widespread adoption of Machine Translation.

KantanAnalytics White Paper, December 2013

For more information on how to introduce Machine Translation into your translation workflow contact Niamh Lacy (niamhl@kantanmt.com).