Machine Translation Technology and Internet Security

Joseph Wojowski

KantanMT is delighted to republish, with permission, a post on machine translation technology and internet security written by Joseph Wojowski, Director of Operations at Foreign Credits and Chief Technology Officer at Morningstar Global Translations LLC.

Machine Translation Technology and Internet Security

An issue that was raised once in the industry and never addressed again is the set of data collection methods used by Microsoft, Google, Yahoo!, Skype, and Apple, along with the revelations, thanks to Edward Snowden, of PRISM data collection from those same companies. The industry appears to be moving ever closer to full machine translation integration and usage, and with interesting, if alarming, findings being reported on machine translation usage when integrated into translation environments, the fact remains that Google Translate, Microsoft Bing Translator, and other publicly available machine translation interfaces and APIs store every single word, phrase, segment, and sentence that is sent to them.

Terms and Conditions

What exactly are you agreeing to when you send translation segments through the Google Translate or Bing Translator website or API?

1 – Google Terms and Conditions

Essentially, in using Google’s services, you agree to let Google store the segments you submit and use them to create more accurate translations in the future; Google may also publish, display, and distribute that content.

“When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.” (Google Terms of Service – 14 April 2014, accessed on 8 December 2014)

Oh, and did I mention that in using the service, the user is bearing all liability for “LOST PROFITS, REVENUES, OR DATA, FINANCIAL LOSSES OR INDIRECT, SPECIAL, CONSEQUENTIAL, EXEMPLARY, OR PUNITIVE DAMAGES.” (Google Terms of Service – 14 April 2014, accessed on 8 December 2014)

So if it is discovered that a client’s confidential content is also located on Google’s servers because of a negligent translator, that translator is liable for losses and Google relinquishes liability for distributing what should have been kept confidential.

Alright, that’s a lot of legal wording, not the best news, and a lot to take in if this is the first time you’re hearing about this. What about Microsoft Bing Translator?

2 – Microsoft Services Agreement (correction made to content – see below)

In writing their services agreement, Microsoft got very tricky. They start out positively by stating that you own your own content.

“Except for material that we license to you that may be incorporated into your own content (such as clip art), we do not claim ownership of the content you provide on the services. Your content remains your content, and you are responsible for it. We do not control, verify, pay for, or endorse the content that you and others make available on the services.” (Microsoft Services Agreement – effective 19 October 2012, accessed on 8 December 2014)

Bing! Bing! Bing! Bing! Bing! We have a winner! Right? Hold your horses, don’t install the Bing API yet. It continues on in stating,

“When you transmit or upload Content to the Services, you’re giving Microsoft the worldwide right, without charge, to use Content as necessary: to provide the Services to you, to protect you, and to improve Microsoft products and services.” (Microsoft Services Agreement – effective 19 October 2012, accessed on 8 December 2014)

So again with Bing, while they originally state that you own the content you submit to their services, they also state that in doing so, you are giving them the right to use the information as they see fit and (more specifically) to improve the translation engine.

How do these terms affect the translation industry, then?

The problem arises whenever translators work with documents containing confidential or restricted-access information. Setting aside the use of webmail hosted by Microsoft, Google, Apple, etc. – which poses its own confidentiality problem – the contents of documents sent through free, public machine translation engines, whether via the website or the API, leak the very information the translator agreed to keep confidential under any Non-Disclosure Agreement with the LSP: a clear and blatant breach of confidentiality.

But I’m a professional translator and have been for years; I don’t use MT, and no self-respecting professional translator would.

Well, yes and no; a conflict arises from that mode of thinking. In theory, yes, a professional translator should know better than to blindly use machine translation because of its inaccurate and often unusable output. A professional translator, however, should also recognize that with advancements in MT technology, machine translation can be a very powerful tool in the translator’s toolbox and can, at times, greatly aid in the translation of certain documents.

The current state of MT use echoes the latter more than the former. In 2013 research conducted by Common Sense Advisory, 64% of the 239 survey respondents reported that colleagues frequently use free machine translation engines; 62% of those sampled were concerned about free MT usage.

In the November/December 2014 issue of the ATA Chronicle, Jost Zetzsche relayed information on how users were using the cloud-based translation tool MemSource. Of particular interest are the machine translation numbers relayed to him by David Canek, Founder of MemSource. 46.2% of its roughly 30,000 users (about 13,860 translators) were using machine translation; of those, 98% were using Google Translate or a variant of the Bing Translator API. More alarming still, a large percentage of Bing Translator users chose the “Microsoft with Feedback” option, which sends the finalized target segment back to Microsoft (a financially appealing option, since use of the API then costs nothing).

As you can imagine, while I was reading that article, I was yelling at all 13.9 thousand of them through the magazine. How many of them were using Google or Bing MT with documents that should not have been sent to either Google or Microsoft? How many of these users knew to shut off the API for such documents – how many did?

There’s no way to be certain how much confidential information may have been leaked through translator negligence – in the best case, perhaps none – but the potential is clearly very great.

On the other hand, in creating a tool as dynamic and ever-changing as a machine translation engine, the only way to train it and make it better is to use it – a sentiment echoed throughout the industry by developers of MT tools, and something visible in the output of Google Translate over the past several years.

So what options are there for me to have an MT solution for my customers without risking a breach in confidentiality?

There are numerous non-public MT engines available – including Apertium, a developing open-source MT platform – but none of them is as widely used (and therefore as well-trained) as Google Translate or Bing Translator (yes, I realize I just spent over 1,000 words discussing the risk involved in using Google Translate or Bing Translator).

So, is there another way? How can you gain the leverage of arguably the best-trained MT Engines available while keeping confidential information confidential?

Some companies have foreseen this problem and addressed it. Without pitching any particular product, here is how such a solution works: it acts as an MT API, but before any segment is sent across your firewall to Google, it replaces all names, proper nouns, locations, positions, and numbers with an independent, anonymous token or placeholder. After the translated segment returns from Google and is safely within the confines of your firewall, the potentially confidential material replaces the tokens, leaving you with the machine-translated segment. On top of that, such systems also allow customized tokenization rules to further anonymize sensitive data such as formulae, terminology, processes, etc.
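The round trip can be sketched in a few lines. This is a minimal, hypothetical illustration of the tokenize-translate-restore idea, not any vendor’s actual API; the function names and the digit-free placeholder format are assumptions made for the example.

```python
import re

# Hypothetical sketch of a tokenizing MT proxy: sensitive strings are swapped
# for opaque, digit-free placeholders before a segment crosses the firewall,
# then restored once the translated segment comes back.
def anonymize(segment, sensitive_terms):
    tokens = {}
    for i, term in enumerate(sensitive_terms):
        placeholder = f"__TOK{chr(65 + i)}__"   # __TOKA__, __TOKB__, ...
        tokens[placeholder] = term
        segment = segment.replace(term, placeholder)
    # Mask numbers too, since figures can themselves be confidential.
    for i, number in enumerate(re.findall(r"\d[\d,.]*\d|\d", segment)):
        placeholder = f"__NUM{chr(65 + i)}__"
        tokens[placeholder] = number
        segment = segment.replace(number, placeholder, 1)
    return segment, tokens

def deanonymize(translated, tokens):
    # Restore the confidential material once back inside the firewall.
    for placeholder, original in tokens.items():
        translated = translated.replace(placeholder, original)
    return translated

masked, tokens = anonymize("Acme Corp owes $4,500 to John Smith.",
                           ["Acme Corp", "John Smith"])
# masked is now "__TOKA__ owes $__NUMA__ to __TOKB__.", which can be sent to a
# public MT engine without exposing the names or the amount.
```

A real implementation would also have to keep placeholders stable across the MT engine’s reordering of the sentence, which is why such products treat tokenization rules as configurable.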

The purpose of this article is not to stop translators from using MT; it is intended to get translators thinking about how they use it and to raise awareness of the inherent risks and of the safer solutions available.

— Correction —

As I have since been informed, the information in the original post is not as exact as it could be: there is a Microsoft Translator Privacy Agreement that more specifically addresses use of Microsoft Translator. With Translator, Microsoft takes a sample of no more than 10% of “randomly selected, non-consecutive sentences from the text” submitted. Unused text is deleted within 48 hours after the translation is provided.

If the user subscribes to a data subscription with a maximum of 250 million characters per month (also available at levels of 500 million, 635 million, and one billion), he or she can then opt out of logging.

There is also the Microsoft Translator Hub, which allows the user to personalize the translation engine, where “The Hub retains and uses submitted documents in full in order to provide your personalized translation system and to improve the Translator service.” And it should be noted that, “After you remove a document from your Hub account we may continue to use it for improving the Translator service.”

***

So let’s analyze this development. No more than 10% of the submitted text is sampled, and unused text is deleted within 48 hours of the translation being provided. The sampled text may still come from a sensitive document, so the issue still warrants awareness.

If you use the Translator Hub, it uses the full document to train the engine, and even after you remove the document from your Hub account, Microsoft may continue to use it to improve the Translator service.

Now break out the calculators and slide rules, kids, it’s time to do some math.

In order to opt out of logging, you need to purchase a data subscription of 250 million characters per month or more (the 250 million character level costs $2,055.00/month). Even at a deliberately generous 50 characters per word, that is 5 million words per month; over a 31-day month, a post-editor would have to process 161,290 words every single day. That is physically impossible: even treating a single day’s quota of 161,290 words as a *monthly* volume (working 8 hours a day, 20 days a month), it still comes to 8,064.5 words per working day. So we can safely assume that no freelance translator can afford to buy in at the 250 million character/month level, when even in the busiest month a single translator comes nowhere near editing the volume needed to make it a financially sound expense.
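The arithmetic above can be checked in a few lines (the 50-characters-per-word figure is the author’s deliberately generous assumption; a realistic average would make the numbers even more lopsided):

```python
# Reproducing the article's arithmetic on the 250-million-character tier.
chars_per_month = 250_000_000
chars_per_word = 50                                  # generous assumption
words_per_month = chars_per_month // chars_per_word  # 5,000,000 words

words_per_day = words_per_month / 31     # pace required every day of the month
# Even treating that single-day quota as a monthly volume over a realistic
# schedule (8 hours a day, 20 working days), the pace is still punishing:
realistic_pace = words_per_day / 20

print(round(words_per_day))       # 161290
print(round(realistic_pace, 1))   # 8064.5
```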

In the end, I still come to the same conclusion: we need to be more cognizant of what we send through free, public, and semi-public machine translation engines, and educate ourselves on the risks associated with their use and on the safer, more secure solutions available when working with confidential or restricted-access information.

The KantanMT team would like to thank Joseph Wojowski for allowing us to republish his very interesting and topical post on machine translation security. You can view the original post here.

KantanMT Security – Key to Translation Success

At KantanMT, security, integrity and the privacy of our customers’ data is a top priority. We believe this is vital to their business operations and to our own success. Therefore, we use a multilayered approach to protect and encrypt this information. The KantanMT Data Privacy statement ensures that no client data is re-published, re-tasked or re-purposed and will also be fully encrypted during storage and transmission.

Read more about the KantanMT Data Privacy Infrastructure (PDF Download)

For more information about our security infrastructure please contact the KantanMT Sales Team (sales@kantanmt.com).

Language Industry Interview: KantanMT speaks with Maxim Khalilov, bmmt Technical Lead

This year, both KantanMT and its preferred Machine Translation supplier, bmmt, a progressive Language Service Provider with an MT focus, exhibited side by side at the tekom Trade Fair and tcworld conference in Stuttgart, Germany.

As a member of the KantanMT preferred partner program, bmmt works closely with KantanMT to provide MT services to its clients, which include major players in the automotive industry. KantanMT was able to catch up with Maxim Khalilov, technical lead and ‘MT guru’ to find out more about his take on the industry and what advice he could give to translation buyers planning to invest in MT.

KantanMT: Can you tell me a little about yourself and, how you got involved in the industry?

Maxim Khalilov: It was a long and exciting journey. Many years ago, I graduated from the Technical University in Russia with a major in computer science and economics. After graduating, I worked as a researcher for a couple of years in the sustainable energy field. But even then, I knew I wanted to come back to the IT industry.

In 2005, I started a PhD at Universitat Politecnica de Catalunya (UPC) with a focus on Statistical Machine Translation, which was a very new topic back then. In 2009, after successfully defending my thesis, I moved to Amsterdam, where I worked as a post-doctoral researcher at the University of Amsterdam and later as an R&D manager at TAUS.

Since February 2014, I’ve been a team lead at bmmt GmbH, a German LSP with a strong focus on machine translation.

I think my previous experience helped me develop a deep understanding of the MT industry from both academic and technical perspectives. It also gave me a combination of research and management experience in industry and academia, which I am now applying to building a successful MT business at bmmt.

KMT: As a successful entrepreneur, what were the three greatest industry challenges you faced this year?

MK: This year has been a challenging one for us from both technical and management perspectives. We started to build an MT infrastructure around MOSES practically from scratch. MOSES was developed by academia and for academic use, and because of this we immediately noticed that many industrial challenges had not yet been addressed by MOSES developers.

The first challenge we faced was that the standard solution does not offer a solid tag processing mechanism – we had to invest into a customization of the MOSES code to make it compatible with what we wanted to achieve.

The second challenge we faced was that many players in the MT market are constantly talking about the lack of reliable, quick and cheap quality evaluation metrics. BLEU-like scores unfortunately are not always applicable for real world projects. Even if they are useful when comparing different iterations of the same engines, they are not useful for cross language or cross client comparison.

Interestingly, the third problem is psychological in nature: post-editors are not always happy to post-edit MT output, for many reasons, including of course the quality of the MT. In many situations, however, the problem is that MT post-editing requires a different skill set from ‘normal’ translation, and it will take time before translators adapt fully to post-editing tasks.

KMT: Do you believe MT has a say in the future, and what is your view on its development in global markets?

MK: Of course, MT will have a big say in the future of language services. We can already see the MT market expanding quickly as more and more companies adopt a combined TM-MT-PE framework as their primary localization solution.

“At the same time, users should not forget that MT has its clear niche”

I don’t think a machine will ever be able to translate poetry, for example, but then it does not need to – MT has proved more than useful for translating technical documentation, marketing material and other content, which represents more than 90% of the daily translator workload worldwide.

Looking at the near future, I see the integration of MT and other cross-language technologies with Big Data technologies opening new horizons for Big Data, making it a truly global technology.

KMT: How has MT affected or changed your business models?

MK: Our business model is built around MT; it allows us to deliver translations to our customers quicker and cheaper than we could without MT, while preserving the same level of quality and guaranteeing data security. We not only position MT as a competitive advantage when it comes to translation, but also as a base technology for future services. My personal belief, which is shared by other bmmt employees, is that MT is a key technology that will make our world different – a world where translation is available on demand, when and where consumers need it, at a fair price and at its expected quality.

KMT: What advice can you give to translation buyers, interested in machine translation?

MK: MT is still a relatively new technology, but at the same time there is already a number of best practices available for new and existing players in the MT market. In my opinion, the four key points for translation buyers to remember when thinking about adopting machine translation are:

  1. Don’t mix it up with TM – While TMs mostly support human translators by storing previously translated segments, MT automatically translates complete sentences; the main difference lies in the new words and phrases that are not stored in a TM database.
  2. There is more than one way to use MT – MT is flexible: it can be a productivity tool that enables translators to deliver translations faster with the same quality as in the standard translation framework, or it can be used for ‘gisting’ without any post-editing at all – something many translation buyers forget about, but which can be useful in many business scenarios. A good example of this type of scenario is the integration of MT into chat widgets for real-time translation.
  3. Don’t worry about quality – Quality Assurance is always included in the translation pipeline, and we, like many other LSPs, guarantee a desired level of quality for all translations, independently of how they were produced.
  4. Think about time and cost – MT enables quicker and cheaper translation delivery than would otherwise be possible.

A big ‘thank you’ to Maxim for taking time out of his busy schedule to take part in this interview, and we look forward to hearing more from Maxim during the KantanMT/bmmt joint webinar ‘5 Challenges of Scaling Localization Workflows for the 21st Century’ on Thursday November 20th (4pm GMT, 5pm CET and 8am PST).


Register here for the webinar or to receive a copy of the recording. If you have any questions about the services offered from either bmmt or KantanMT please contact:

Peggy Linder, bmmt (peggy.lindner@bmmt.eu)

Louise Irwin, KantanMT (louisei@kantanmt.com)

Post-Editing Machine Translation

Statistical Machine Translation (SMT) has many uses – from the translation of User Generated Content (UGC) to technical documents, manuals and digital content. While some use cases may only need a ‘gist’ translation without post-editing, others will need a light to full human post-edit, depending on the usage scenario and the funding available.

Post-editing is the process of ‘fixing’ Machine Translation output to bring it closer to a human translation standard. This, of course, is a very different process from carrying out a full human translation from scratch, which is why it’s important to give full training to staff who will carry out this task.

Training will make sure that post-editors fully understand what is expected of them when asked to complete one of the many post-editing type tasks. Research (Vasconcellos – 1986a:145) suggests that post-editing is a honed skill which takes time to develop, so remember your translators may need some time to reach their greatest post-editing productivity levels. KantanMT works with many companies who are post-editing at a rate over 7,000 words per day, compared to an average of 2,000 per day for full human translation.

Types of Training: The Translation Automation User Society (TAUS) is now holding online training courses for post-editors.


Post-editing Levels

Post-editing quality levels vary greatly and depend largely on the client or end user. It’s important to get an exact understanding of user expectations and to manage those expectations throughout the project.

Typically, users of Machine Translation will ask for one of the following types of post-editing:

  • Light post-editing
  • Full post-editing

The following diagram gives a general outline of what is involved in both light and full post-editing. Remember, however, that the effort needed to meet certain quality levels will be determined by the output quality your engine is able to produce.

(Diagram: post-editing machine translation)

Generally, MT users carry out productivity tests before they begin a project. These determine the effectiveness of MT for the language pair in a particular domain, and their post-editors’ ability to edit the output with a high level of productivity. Productivity tests will help you determine the potential return on investment of MT and the turnaround time for projects. It is also a good idea to carry out productivity tests periodically to understand how your MT engine is developing and improving. (Source: TAUS)
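As a back-of-the-envelope illustration of what such a test can tell you about turnaround, here is a sketch using the daily throughput figures quoted earlier in this post (~7,000 words/day for post-editing versus ~2,000 words/day for full human translation); the project size is an illustrative assumption:

```python
# Turnaround comparison using this post's quoted throughput figures:
# ~7,000 words/day post-editing vs ~2,000 words/day full human translation.
project_words = 100_000                # illustrative project size
pe_days = project_words / 7_000        # post-editing turnaround (~14.3 days)
ht_days = project_words / 2_000        # human translation turnaround (50 days)
speedup = ht_days / pe_days            # 3.5x faster with MT + post-editing
```

Real productivity tests replace these assumed rates with measured ones per language pair and domain, which is what makes the ROI estimate credible.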

You might also develop a tailored approach to suit your company’s needs; however, the above diagram offers some useful guidelines to start with. Please note that a well-trained MT engine can produce near-human translations, and a light touch-up might be all that is required. It’s important to examine the quality of the output with post-editors before setting productivity goals and post-editing quality levels.


Post-Editor Skills

In recent years, post-editing skills have become much more of an asset and sometimes a requirement for translators working in the language industry. Machine Translation has grown considerably in popularity and the demand for post-editing services has grown in line with this. TechNavio predicted that the market for Machine Translation will grow at a compound annual growth rate (CAGR) of 18.05% until 2016, and the report attributes a large part of this rise to “the rapidly increasing content volume”.

While the task of post-editing is markedly different from human translation, the skill set needed is almost on par.

According to Johnson and Whitelock (1987), post-editors should be:

  • Be expert in the subject area, the text type and the contrastive language.
  • Have a perfect command of the target language.

It is also widely accepted that post-editors who have a favourable perception of Machine Translation perform better at post-editing tasks than those who do not look favourably on MT.

How to improve Machine Translation output quality

Pre-editing

Pre-editing is the process of adjusting text before it has been Machine Translated. This includes fixing spelling errors, formatting the document correctly and tagging text elements that must not be translated. Using a pre-processing tool like KantanMT’s GENTRY can save a lot of time by automating the correction of repetitive errors throughout the source text.

More pre-editing Steps:

Writing Clear and Concise Sentences: Shorter unambiguous segments (sentences) are processed much more effectively by MT engines. Also, when pre-editing or writing for MT, make sure that each sentence is grammatically complete (begins with a capital letter, has at least one main clause, and has an ending punctuation).

Using the Active Voice: MT engines work impressively on text that is clear and unambiguous; that’s why using the active voice, which cuts out vagueness and ambiguity, can result in much better MT output.

There are many pre-editing steps you can carry out to produce better MT output. Also, keep in mind writing styles when developing content for Machine Translation to cut the amount of pre-editing required. Get tips on writing for MT here.
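The pre-editing steps above can be sketched as a simple normalization pass. This is an illustrative example only; the rules and the `<nt>` tag convention are assumptions for the sketch, not KantanMT’s GENTRY API:

```python
import re

# Illustrative pre-editing pass: normalize repetitive source errors and
# protect non-translatable elements before text is sent to the MT engine.
def pre_edit(text):
    text = re.sub(r"[ \t]{2,}", " ", text)          # collapse repeated spaces
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)    # drop space before punctuation
    # Wrap product-code-like tokens so the engine leaves them untranslated.
    text = re.sub(r"\b[A-Z]{2,}-\d+\b", r"<nt>\g<0></nt>", text)
    if text and not text.endswith((".", "!", "?")):
        text += "."                                 # ensure ending punctuation
    return text

pre_edit("Install  module KMT-2000 before use .")
# -> "Install module <nt>KMT-2000</nt> before use."
```

In practice these rules would be tuned to the repetitive errors actually found in your source content, which is exactly the automation a pre-processing tool provides.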

For more information about any of KantanMT’s post-editing automation tools, please contact: Gina Lawlor, Customer Relationship Manager (ginal@kantanmt.com).

KantanMT – 2013 Year in Review

KantanMT had an exciting year as it transitioned from a publicly funded business idea into a commercial enterprise, officially launched in June 2013. The KantanMT team is delighted to have surpassed expectations by developing and refining cutting-edge technologies that make Machine Translation easier to understand and use.

Here are some of the highlights for 2013, as KantanMT looks back on an exceptional year.

Strong Customer Focus…

The year started on a high note, with the opening of a second office in Galway, Ireland, and KantanMT kept the forward momentum going as the year progressed. The Galway office is focused on customer service, product education and Customer Relationship Management (CRM), and is home to Aidan Collins, User Engagement Manager, Kevin McCoy, Customer Relationship Manager and MT Success Coach, and Gina Lawlor, Customer Relationship co-ordinator.

KantanMT officially launched the KantanMT Statistical Machine Translation (SMT) platform as a commercial entity in June 2013. The platform was tested pre-launch by both industry and academic professionals, and was presented at the European OPTIMALE (Optimizing Professional Translator Training in a Multilingual Europe) workshop in Brussels. OPTIMALE is an academic network of 70 partners from 32 European countries, and the organization aims to promote professional translator training as the translation industry merges with the internet and translation automation.

The KantanMT Community…

The KantanMT members’ community now includes top-tier Language Service Providers (LSPs), multinationals and smaller organizations. In 2013, the community grew from 400 members in January to 3,400 registered members in December; in response to this growth, KantanMT introduced two partner programs with the objective of improving the Machine Translation ecosystem.

The Developer Partner Program supports organizations interested in developing integrated technology solutions, while the Preferred Supplier of MT Program is dedicated to strengthening the use of MT technology in the global translation supply chain.

KantanMT’s Progress…

To date, the most popular target languages on the KantanMT platform are French, Spanish and Brazilian Portuguese. Members have uploaded more than 67 billion training words and built approximately 7,000 customized KantanMT engines that have translated more than 500 million words.

As usage of the platform increased, KantanMT focused on developing new technologies to improve the translation process, including a mobile application for iOS and Android that allows users to get access to their KantanMT engines on the go.

KantanMT’s Core Technologies from 2013…

KantanMT have been kept busy continuously developing and releasing new technologies to help clients build robust business models to integrate Machine Translation into existing workflows.

  • KantanAnalytics™ – segment level Quality Estimation (QE) analysis as a percentage ‘fuzzy match’ score on KantanMT translations, provides a straightforward method for costing and scheduling translation projects.
  • BuildAnalytics™ – QE feature designed to measure the suitability of the uploaded training data. The technology generates a segment level percentage score on a sample of the uploaded training data.
  • KantanWatch™ – makes monitoring the performance of KantanMT engines more transparent.
  • TotalRecall™ – combines TM and MT technology; TM matches with a ‘fuzzy match’ score of less than 85% are automatically put through the customized MT engine, giving users the benefits of both technologies.
  • KantanISR™ – Instant Segment Retraining technology that allows members near-instantaneous correction and retraining of their KantanMT engines.
  • PEX Rule Editor – an advanced pattern-matching technology that allows members to correct repetitive errors, smoothing the post-editing process by reducing post-editing effort, cost and time.
  • Kantan API – critical for the development of software connectors and smooth integration of KantanMT into existing translation workflows. The success of the MemoQ connector led to the development of subsequent connectors for MemSource and XTM.
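The TM/MT routing described for TotalRecall™ reduces to a simple threshold rule. The sketch below is a hypothetical illustration of that rule only; `tm_lookup` and `translate_with_mt` are assumed stand-ins, not part of the Kantan API:

```python
# TM matches at or above an 85% fuzzy-match score are reused directly;
# weaker matches fall through to the customized MT engine.
FUZZY_THRESHOLD = 85

def route_segment(source, tm_lookup, translate_with_mt):
    match = tm_lookup(source)               # returns (target, score) or None
    if match is not None and match[1] >= FUZZY_THRESHOLD:
        return match[0]                     # strong TM match: reuse it
    return translate_with_mt(source)        # otherwise, machine-translate
```

The appeal of this design is that high-confidence TM leverage is never discarded, while low-confidence matches, which would cost more to repair than to retranslate, go straight to the engine.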

KantanMT sourced and cleaned a range of bidirectional, domain-specific stock engines consisting of approximately six million words across the legal, medical and financial domains, and made them available to its members. KantanMT also added support for Traditional and Simplified Chinese, Japanese, Thai and Croatian during 2013.

Recognition as Business Innovators…

KantanMT received awards for business innovation and entrepreneurship throughout the year. Founder and Chief Architect, Tony O’Dowd was presented with the ICT Commercialization award in September.

In October, KantanMT was shortlisted for the PITCH start-up competition and participated in the ALPHA Program for start-ups at Dublin’s Web Summit, the largest tech conference in Europe. Earlier in the year KantanMT was also shortlisted for the Vodafone Start-up of the Year awards.

KantanMT was a silver sponsor of the annual 2013 ASLIB ‘Translating and the Computer’ conference, which took place in London in November, and in October Tony O’Dowd presented at the TAUS Machine Translation Showcase at Localization World in Silicon Valley.

KantanMT recently published a white paper introducing its cornerstone Quality Estimation technology, KantanAnalytics, and showing how this technology addresses the biggest industry challenges facing widespread adoption of Machine Translation.

KantanAnalytics WhitePaper December 2013

For more information on how to introduce Machine Translation into your translation workflow contact Niamh Lacy (niamhl@kantanmt.com).

Motivate Post-Editors

Post-editing is a necessary step in the Machine Translation workflow, but the role is still largely misunderstood. Language Service Providers (LSPs) are now experimenting more with best practices for post-editing in the workflow. The lack of consistent training, and a reluctance within the industry to accept the importance of the role, are linked to post-editors’ motivation. KantanMT looks at some of the more conventional attitudes towards motivation and their application to post-editing.

What is motivation and what studies have been done so far?

Understanding the concept of motivation has been a hot topic in many areas of organisation theory. Studies in the area really began to kick off with their application in the workplace, opening doors for pioneers to understand how employees could be motivated to do more work, and do better work.

Motivation Pioneers

  • Abraham Maslow's well-known 'Hierarchy of Needs' indicates that a person's motivations are based on their position in the hierarchy pyramid.
  • Frederick Herzberg's 'Two-Factor Theory', also known as the motivation-hygiene theory, suggests that 'job satisfiers' such as professional acknowledgement, achievement and work responsibility have a positive effect on motivation.
  • Douglas McGregor took a black-and-white approach to motivation in his 'Theory X and Theory Y', grouping employees into two categories: those who will only do the minimum and those who will push themselves.

As development of theories continued…

  • John Adair came up with the 'fifty-fifty theory'. According to it, motivation is fifty percent the responsibility of the employee and fifty percent outside the employee's control.

Even more recently, in 2010

  • Teresa Amabile and Steven Kramer carried out a study on the motivation levels of employees in a variety of settings. Their findings identified 'progress' as the top performance motivator, based on an analysis of approximately 12,000 diary entries and daily ratings of motivation and emotion from hundreds of study participants.

To understand post-editor motivation, we can combine the top performance motivator, progress, with the fifty-fifty theory.

Progress is a healthy motivator in the post-editing profession, and it can help Localization Project Managers understand and encourage post-editor satisfaction and motivation. But while progress can be deemed an external factor, if we apply Adair's 'fifty-fifty' rule, post-editors are also at least fifty percent responsible for their own motivation.

Post-editing as a profession is still finding its feet. TAUS carried out a study in 2010 on the post-editing practices of global LSPs. It showed that, while post-editing is becoming a standard activity in the translation workflow, it still accounts for only a minor share of LSP business volume. This suggests that post-editors see their role as one of lesser importance because the industry treats it as such.

This attitude in the industry is highlighted by the lack of industry standards for post-editing best practices. Without evaluation practices to train post-editors and improve the post-editing process, post-editors are not making progress. This quite naturally is demotivating for the post-editor.

How to motivate post-editors

The first step in motivating post-editors is to recognise their role as distinct from that of a translator. The best post-editors are those who are at least bilingual, ideally with some form of linguistic training, like a translator. Linguistic training is a major asset for editing Machine Translated output.

TAUS offers a comparison of the translation process and the post-editing process, highlighting the differences between the two.

Translation process of a Translator (TAUS 2010)
Translation process of a Post-editor (TAUS 2010)

One process is not more complicated than the other, only different. Translators translate internally, while post-editors make "snap editing decisions" based on client requirements. When LSPs recognise these differences, they can successfully motivate their post-editors by providing them with the most suitable support and work environment.

Progress as a Motivator

Translators make good post-editors: they have the linguistic ability to understand both the source and target texts, and if they enjoy editing or proofreading, the post-editing role will suit them. The right training is also important; properly trained post-editors become more aware of potential improvements to the workflow.

These improvements or ideas can be a great boost to post-editor motivation; if they are implemented, the post-editor can take on more responsibility, which helps improve the translation workflow. For example, if the post-editor is made responsible for updating the language assets used to retrain a Machine Translation system, they can take ownership of the output quality rather than just post-editing Machine Translation output in isolation.

Fixing repetitive errors can be frustrating for anyone, not just post-editors. But if they are responsible for the output quality, understand the system, and can control the rules used to reduce those repetitive errors, they will experience motivation through progress.

This is only the tip of the iceberg when it comes to what motivates post-editors. Each post-editor is different, and how they feel about the role, whether it is just 'another job' or a major step in their career, also plays a part. The key is to provide proper training and foster an environment where post-editors can make progress by positively contributing to the role.

Translators often take pride in and ownership of their translations; post-editors should also have the opportunity to take pride in their work, as it is their skills and experience that make the output 'publishable' or even 'fit for purpose' quality.

Repetitive errors, such as those involving diacritic marks or capitalisation, can be easily fixed using KantanMT's Post-Editing Automation (PEX) rules. PEX rules allow repetitive errors in Machine Translation output to be fixed with a 'find and replace' tool, and the rules can be checked against a sample of the text using the PEX Rule Editor.

The post-editor can correct repetitive errors during the post-editing process so the same errors don't appear in future MT output, giving them responsibility over the Machine Translation engine's quality.

Automatic Post-Editing

Post-Editing Machine Translation (PEMT) is an important and necessary step in the Machine Translation process. KantanMT is releasing a new, simple and easy-to-use PEX rule editor, which will make the post-editing process more efficient, saving time and costs (and the post-editor's sanity).

As we have discussed in earlier posts, PEMT is the process of reviewing and editing raw MT output to improve quality. The PEX rule editor is a tool that can help to save time and cut costs. It helps post-editors, since they no longer have to manually correct the same repetitive mistakes in a translated text.

Post-editing can be divided into roughly two categories: light and full post-editing. 'Light' post-editing, also called 'gist', 'rapid' or 'fast' post-editing, focuses on transferring the most correct meaning without spending time correcting grammatical and stylistic errors. Textual standards like word order and coherence are less important in a light post-edit than in a more thorough 'full' or 'conventional' post-edit. A full post-edit requires the correct meaning to be conveyed, correct grammar, accurate punctuation, and the correct transfer of any formatting such as tags or placeholders.

The client often dictates the type of post-editing required: a full post-edit to bring the output up to 'publishable quality', similar to a human translation standard, or a light post-edit, which usually means 'fit for purpose'. The engine's quality also plays a part in the post-editing effort; using a high volume of in-domain training data during the build produces higher quality engines, which helps cut post-editing effort. Other factors such as language combination, domain and text type also contribute.

Examples of repetitive errors

Some users may experience the following errors in their MT output.

  • Capitalization
  • Punctuation mistakes, hyphenation, diacritic marks etc.
  • Words added/omitted
  • Formatting – trailing spaces

PEX rules are built on pattern matching with regular expressions. Regular expressions, or 'regex', are special text strings that describe search patterns. Because these patterns require no linguistic analysis, they can be implemented easily across different language pairs, and they are important components in developing PEX rules. KantanMT has a list of regular expressions used for both GENTRY rule files (*.rul) and PEX post-edit files (*.pex).

Post-Editing Automation (PEX)

Repetitive errors can be fixed automatically by uploading PEX rule files. These rule files allow post-editors to spend less time correcting the same repetitive errors by automatically applying PEX constructs to translations generated from a KantanMT engine.

PEX works by incorporating “find and replace” rules. The rules are uploaded as a PEX file and applied while a translation job is being run.
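In essence, a rule set of this kind is an ordered list of search-and-replace operations applied to each translated segment. A minimal sketch in Python of that mechanism (the rules and function names here are illustrative, not KantanMT's actual PEX file format):

```python
import re

# Hypothetical PEX-style rules: each entry is a (search pattern, replacement)
# pair targeting one repetitive MT error like those listed above.
PEX_RULES = [
    (r"[ \t]+$", ""),                    # formatting: strip trailing spaces
    (r"\bteh\b", "the"),                 # a recurring typo
    (r"(?<=[.!?] )([a-z])",              # capitalisation after a sentence break
     lambda m: m.group(1).upper()),
]

def apply_pex_rules(segment: str) -> str:
    """Apply each find-and-replace rule, in order, to one translated segment."""
    for pattern, replacement in PEX_RULES:
        segment = re.sub(pattern, replacement, segment)
    return segment

print(apply_pex_rules("teh cat sat. it was warm.  "))
# → "the cat sat. It was warm."
```

Because the rules run in order, a later rule can build on the output of an earlier one, which is why testing them against sample content before uploading matters.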

PEX Rule Editor

KantanMT has designed a simple way to create, test and upload post-editing rules to a client profile.

KantanMT Pex Rule Editor

The PEX Rule Editor, located in the 'MyKantanMT' menu, has an easy-to-use interface. Users can copy a sample of the translated text into the upper text box, 'Test Content', then input the rules to be applied in the 'PEX Search Rules' box and their corrections in the 'PEX Replacement Rules' box. The user can test the new rules by clicking 'test rules' and instantly identify any incorrect rules before they are uploaded to the profile.

The introduction of tools to assist the post-editing process removes some of the more repetitive corrections for post-editors. The new PEX Rule Editor improves the PEMT workflow by ensuring all uploaded rule files are correct, leading to a more effective method for fixing repetitive errors.

Conference and Event Guide – December 2013

Things are winding down as we get closer to the end of the year, but there are still some great events and webinars coming up during December to look forward to.

Here are some recommendations from KantanMT to keep you busy in the lead up to the festive season.

Listings

Dec 02 – Dec 05, 2013
Event: IEEE CloudCom 2013, Bristol, United Kingdom

Held in association with Hewlett-Packard Laboratories (HP Labs), the conference is open to researchers, developers, users, students and practitioners from the fields of big data, systems architecture, services research, virtualization, security and high performance computing.


Dec 04, 2013
Event: LANGUAGES & BUSINESS Forum – Hotel InterContinental Berlin

The forum highlights key issues in language education, particularly in the workplace, and the new technologies that are becoming a key part of the process. The event will promote international networking and has four main themes: Corporate Training, Pre-Experience Learners, Intercultural Communication and Online Learning.


Dec 05, 2013
Webinar: Effective Post-Editing in Human and Machine Translation Workflows

Stephen Doherty and Federico Gaspari, CNGL (Centre for Next Generation Localisation) will give an overview of post-editing and different post-editing scenarios from ‘gist’ to ‘full’ post-edits. They will also give advice on different post-editing strategies and how they differ for Machine Translation systems.


Dec 07 – Dec 09, 2013
Event: 6th Language and Technology Conference, Poznan, Poland

The conference will address the challenges of Human Language Technologies (HLT) in computer science and linguistics. The event covers a wide range of topics including electronic language resources and tools, formalisation of natural languages, parsing and other forms of NL processing.


Dec 09 – Dec 13, 2013
Event: IEEE GLOBECOM 2013 – Power of Global Communications, Atlanta, Georgia USA

The conference, run by the second largest of the 38 IEEE technical societies, will focus on the latest advancements in broadband, wireless, multimedia, internet, image and voice communications. Topics relating to localization are presented on 10 December and include Localization Schemes; Localization and Link Layer Issues; and Detection, Estimation and Localization.


Dec 10 – Dec 11, 2013
Event: Game QA & Localization 2013, San Francisco, California USA

This event brings together QA and Localisation Managers, Directors and VPs from game developers around the world to discuss key game localization industry challenges. The event in London in June 2013 was a huge success, with more than 120 senior QA and localization professionals from developers, publishers and third-party suppliers of all sizes and platforms coming to learn, benchmark and network.


Dec 11 – Dec 15, 2013
Event: International Conference on Language and Translation, Thailand, Vietnam and Cambodia

The Association of Asian Translation Industry (AATI) is holding an International Conference on Language and Translation, or "Translator Day", in three countries: Thailand on December 11, 2013, Vietnam on December 13, 2013, and Cambodia on December 15, 2013. The events provide translators, interpreters, translation agencies, foreign language centres, NGOs, FDI-financed enterprises and other translation purchasers with opportunities to meet.


Dec 12, 2013
Webinar: LSP Partnerships & Reseller Programs 16:00 GMT (11:00 EST/17:00 CET)

This webinar, hosted by GALA and presented by Terena Bell, covers how to open up new revenue streams by introducing reseller programs into current business models. The webinar is aimed at world trade associations, language schools, and other non-translation companies wishing to offer their clients translation, interpreting, or localization services.


Dec 13 – Dec 14 2013
Event: The Twelfth Workshop on Treebanks and Linguistic Theories (TLT12), Sofia (Bulgaria)

The workshops, hosted by the BulTreeBank Group, serve to promote new and ongoing high-quality work related to syntactically annotated corpora such as treebanks. Treebanks are important resources for Natural Language Processing applications, including Machine Translation and information extraction. The workshops will focus on different aspects of treebanking: descriptive, theoretical, formal and computational.


Are you planning to go to any events during December? KantanMT would like to hear about your thoughts on what makes a good event in the localization industry.

Crowdsourcing vs. Machine Translation

Crowdsourcing has become more popular with both organizations and companies since the concept's introduction in 2006, and has been adopted by companies using this production model to improve their capacity while keeping costs low. This web-based business model uses an open call format to reach a wide network of people willing to volunteer their services for free or for a limited reward, for any activity including translation. The application of crowdsourcing models to translation has opened the door to increased demand for multilingual content.

Jeff Howe of Wired magazine defined crowdsourcing as:

“…the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call”.

Crowdsourcing costs equate to approximately 20% of the cost of a professional translation. Language Service Providers (LSPs) like Gengo and Moravia have realised the potential of crowdsourcing as part of a viable production model, combining it with professional translators and Machine Translation.

The crowdsourcing model is an effective method for translating the surge in User Generated Content (UGC). Erratic fluctuations in demand need a dynamic, flexible and scalable model. Crowdsourcing is a feasible production model for translation services, but it still faces some considerable challenges.

Crowdsourcing Challenges

  • No specialist knowledge – crowdsourcing is difficult for technical texts that require specialised knowledge. It often involves breaking a text down into smaller sections to be sent to individual volunteers. A volunteer may not be qualified in the domain, and so ends up translating small sections of text out of context with limited subject knowledge, which leads to lower quality or mistranslations.
  • Quality – translation quality is difficult to manage and depends on the type of translation. There have been some innovative suggestions for measuring quality, including automatic evaluation metrics such as BLEU and METEOR, but these are costly and time-consuming to implement and need a reference translation, or 'gold standard', to benchmark against.
  • Security – crowd management can be a difficult task and the moderator must be able to vet participants and make sure that they follow the privacy rules associated with the platform. Sensitive information that requires translation should not be released to volunteers.
  • Emotional attachment – humans can become emotionally attached to their translations.
  • Terminology and writing style inconsistency – when the project is divided amongst a number of volunteers, the final version’s style needs to be edited and checked for inconsistencies.
  • Motivation – decisions on how to motivate volunteers and keep them motivated can be an ongoing challenge for moderators.
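Metrics such as BLEU, mentioned above, score a candidate translation by counting how many of its n-grams also appear in a reference translation. A toy illustration of the core idea in Python, reduced to clipped unigram precision only (real BLEU also combines higher-order n-grams and applies a brevity penalty):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the basic building block of BLEU.

    Each candidate word counts as a match at most as many times as it
    appears in the reference (the 'clipping' step).
    """
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(count, ref_counts[word])
                  for word, count in Counter(cand).items())
    return matched / len(cand) if cand else 0.0

score = unigram_precision("the cat is on the mat", "the cat sat on the mat")
print(round(score, 2))  # → 0.83
```

Even this simplified version shows why a reference translation is essential: without a 'gold standard' to count matches against, the metric has nothing to score.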

Improvements in the quality of Machine Translation have influenced crowdsourcing's popularity, and the majority of MT post-editing and proofreading tasks fit nicely into crowdsourcing models. Content can be classified into 'find-fix-verify' phases and distributed easily among volunteers.

There are some advantages to be gained when pairing MT technology and collaborative crowdsourcing.

Combined MT/Crowdsourcing

Machine Translation will have a pivotal role to play within new translation models, which focus on translating large volumes of data in cost-effective and powerful production models. Merging both Machine Translation and crowdsourcing tasks will create not only fit-for-purpose, but also high quality translations.

  • Quality – as the overall quality of Machine Translation output improves, it is easier for crowdsourcing volunteers with less experience to generate better quality translations. This will in turn increase the demand for crowdsourcing models to be used within LSPs and organizations. MT quality metrics will also make post-editing tasks more straightforward and easier to delegate among volunteers based on their experience.
  • Training data – word alignment and engine evaluations can be done through crowd computing, and parallel corpora created by volunteers can be used to train and/or retrain existing SMT engines.
  • Security – customized Machine Translation engines are more secure when dealing with sensitive product or client information. General or publicly available information is more suited to crowdsourcing.
  • Terminology and writing style consistency – writing style and terminology can be controlled and updated through a straightforward process when using MT. This avoids the idiosyncrasies of volunteer writing styles. There is no risk of translator bias when using Machine Translation.
  • Speed – Statistical Machine Translation (SMT) engines can process translations quickly and efficiently. When there is a need for a high volume of content to be translated within a short period of time it is better to use Machine Translation. Output is guaranteed within a designated time and crowdsourcing post-editing tasks speeds up the production process before final checks are carried out by experienced translators or post-editors.
Use of crowdsourcing for software localization. Source: V. Muntes-Mulero and P. Paladini, CA Technologies, and M. Solé and J. Manzoor, Universitat Politècnica de Catalunya.

Last chance for a FREE TRIAL of KantanAnalytics™ for all members until November 30th 2013. KantanAnalytics will be available on the Enterprise Plan.

MT Lingo

MT technology can be overwhelming for those new to the industry, and getting to grips with the jargon can be a daunting task even for the most industry-savvy gurus. KantanMT has put together a list of acronyms, popular buzzwords and numeronyms (abbreviations that use numbers), so that you can keep up with the MT professionals or just brush up on your tech vocabulary.

Numeronyms:

  • L10n – Localization/Localisation is the process of adapting and translating a product or service so that it is culturally acceptable for a specific country or region.
  • I18n – Internationalization/Internationalisation is a process implemented in the planning stages of a product or application to ensure the infrastructure (code) supports future translations or localizations. Common internationalization preparations for software products include supporting international character sets like Unicode, and ensuring there is enough space in the User Interface (UI) for text translated from languages like English, with single-byte character codes, into the multiple-byte character codes used in Chinese and Japanese Kanji.
  • G11n – Globalization/Globalisation refers to the internationalization and localization preparations needed for products and services to be released in global markets. It usually incorporates 'sim-ship', or simultaneous shipment to different regions.
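The single-byte versus multi-byte point above is easy to demonstrate: the same number of characters can occupy very different numbers of bytes once encoded, which is why UI layouts and fixed-size buffers need internationalization planning. A quick Python illustration:

```python
# Character count vs. encoded byte count for English and Japanese text.
english = "cat"
japanese = "猫"  # 'cat' as a single Japanese Kanji character

print(len(english), len(english.encode("utf-8")))    # → 3 3
print(len(japanese), len(japanese.encode("utf-8")))  # → 1 3
```

One Kanji character takes three bytes in UTF-8, so a string that fits an English label may overflow a buffer or layout sized in bytes rather than characters.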

Acronyms:

  • MT – Machine Translation, or Automated Translation, is translation carried out by a computer. A piece of natural language text, for example in English, is translated by computer software into another language, such as French. Cloud MT is Machine Translation hosted in the cloud. There are different types of MT systems available.
  • RBMT – a Rule-Based Machine Translation system uses a list of syntactic, grammatical and translation rules to generate the most appropriate translations.
  • SMT – Statistical Machine Translation systems are data driven, with a statistical modelling architecture that uses algorithms to find the most probable match between source and target segments.
  • API – Application Programming Interface is an interface that allows communication and interoperability between two applications or software programs.
  • LSP – Language Service Provider sometimes referred to as a Localization Service Provider, is a service provider that carries out the translation and localization of different types of content for specific countries or locales.
  • TM – Translation Memory is a database of aligned source and target translations called segments. Segments can be words, sentences or paragraphs. TMs can be integrated with CAT tools and they help speed up the translation process. TM files can be used as training data to train SMT engines.
  • SPE – Statistical Post-editing is when Machine Translation output that has been post-edited is re-used as training data and fed back into the SMT engine or used to train a new engine.
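A Translation Memory lookup, described above, is at its simplest a similarity search over stored source segments. A minimal sketch using Python's standard difflib (a real TM system uses indexed fuzzy matching and segment-level metadata; the TM contents and the 0.75 threshold here are illustrative):

```python
from difflib import SequenceMatcher

# Hypothetical TM: aligned source -> target segments.
tm = {
    "Click the Save button.": "Cliquez sur le bouton Enregistrer.",
    "The file could not be opened.": "Le fichier n'a pas pu être ouvert.",
}

def tm_lookup(source: str, threshold: float = 0.75):
    """Return (source, target, score) for the best fuzzy match, or None."""
    best_score, best_pair = 0.0, None
    for src, tgt in tm.items():
        score = SequenceMatcher(None, source.lower(), src.lower()).ratio()
        if score > best_score:
            best_score, best_pair = score, (src, tgt, score)
    return best_pair if best_score >= threshold else None

match = tm_lookup("Click the Save button")
print(match)
```

A near-identical query returns the stored pair with a high similarity score, while an unrelated sentence falls below the threshold and returns nothing, which mirrors how CAT tools surface fuzzy matches only above a configured match percentage.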

Popular Buzzwords:

  • Normalization is the process of checking and cleaning up a Translation Memory so it can be included as training data for an SMT engine. Things to identify and correct include tags, mistranslations, sentence mismatches and stylistic features like upper/lower-case inconsistencies.
  • CAT tools – Computer-Aided Translation (or Computer-Assisted Translation) tools are used by translators to support the translation process by managing MT, TMs and glossaries.
  • Glossaries are vocabulary lists of specialised terminology, usually specific to an industry or organisation. These files can be uploaded as additional training data to an SMT engine.
  • Bilingual corpus/bi-text database – a large text collection containing source and target language text. If the corpus is aligned, it can be used as training data for an SMT engine.
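The normalization step described above can be sketched as a small cleaning pass over aligned segment pairs. A hedged Python example (the cleaning rules shown, stripping inline tags, collapsing whitespace, dropping pairs that become empty, are a simplified assumption; real TM normalization also handles mistranslations and sentence mismatches):

```python
import re

def normalize_segment(text: str) -> str:
    """Clean one TM segment for use as SMT training data."""
    text = re.sub(r"<[^>]+>", "", text)       # strip inline markup tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

def normalize_tm(pairs):
    """Keep only aligned pairs where both sides survive cleaning."""
    cleaned = ((normalize_segment(s), normalize_segment(t)) for s, t in pairs)
    return [(s, t) for s, t in cleaned if s and t]

pairs = [("<b>Hello   world</b>", "Bonjour  le monde"), ("<br/>", "")]
print(normalize_tm(pairs))  # → [('Hello world', 'Bonjour le monde')]
```

Pairs that are pure markup, or whose translation is empty, are dropped entirely, since feeding them to an SMT engine would only add noise to the training data.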

If you know any new terms or interesting words you heard from your experience in the language and localization industry, KantanMT would love to hear about them, just pop them into the comment box below.

Interview: Working on KantanMT – a Developers Perspective

Eduardo Shanahan, CNGL

Eduardo Shanahan, a Senior Software Engineer at CNGL spent time working on KantanMT during its early days. KantanMT asked Eduardo to talk about what it was like to work with Founder and Chief Architect, Tony O’Dowd and the rest of the team developing the KantanMT product.

What was your initial impression when you joined DLab in DCU?

This past year was a different kind of adventure. After more than two decades working with Microsoft products like Visual Studio, it was a big change to move to Dublin City University (DCU) and become part of the Design and Innovation Lab, or DLab as we call it. The work in DLab consists of transforming code written by researchers into industrial-quality products.

One of the first changes was getting a Mac and starting to deploy code on Linux, with no Visual Studio or even Mono. Instead I worked mostly with Python and NodeJS, and piles of shell scripts. Linux and Python were not new to me, but they did take some adjusting to.

This was a completely new environment and a new experience, and I was working in a whole new area. Back then, my relationship with Artificial Intelligence (AI) was informal to say the least, and I wasn’t even aware that something like Statistical Machine Translation (SMT) existed.

How did you get involved with working on KantanMT?

Starting out, I was working on a variety of different projects simultaneously. A few months in, though, I started working full time with a couple of researchers creating new functionality for Tony and his KantanMT product, which is based on the open source Moses technology. Moses uses aligned source and target texts from parallel corpora to train an SMT system. Once the system is trained, search algorithms are applied to find the most suitable translation matches. This translation model can be applied to any language pair.

What were your goals working on the KantanMT project?

Tony is doing a great job, deploying it on Amazon Web Services and creating a set of tools to streamline operations for end users. His request to CNGL was to provide more advanced insight into the translation quality produced by Moses.

To accomplish this, the task was mapped to two successive projects with different researchers on each. The pace was very intense; we wanted state-of-the-art results that showed up in the applications. Sandipan Dandapat, Assistant Professor in the Department of Computer Science and Engineering, IIT Guwahati, and Aswarth Dara, Research Assistant at CNGL, DCU, worked on adding real value to the KantanMT product during those long weeks, while I rewrote their code time after time until it passed all the tests and then some. Our hard work paid off when KantanWatch™ and KantanAnalytics™ were born.

Each attempt to deliver was an experience in itself; Tony was quick to detect any inconsistencies and wanted to be extra sure about understanding all the details and steps of the research and implementation.

In your opinion was the work a success?

The end result is something that has made me proud. The mix of being a scientist and having a real product to implement is a very good combination. The guys at DCU have done a great job on the product base, and DLab is a fantastic research and work environment. The no-nonsense attitude from Tony's side created a very interesting situation, and it's something we can really celebrate after a year of hard work.

The CNGL Centre for Global Intelligent Content

The CNGL Centre for Global Intelligent Content (Dublin City University, Ireland) is supported by Science Foundation Ireland. Through its academic-industry collaborative research, it has not only driven standards in content and localization service integration, but is also pioneering advancements in Machine Translation through the development of disruptive, cutting-edge processing technologies. These technologies are revolutionising global content value chains across a number of industries.

The CNGL research centre draws its talent and expertise from a combined 150 researchers from Trinity College Dublin, Dublin City University, University College Dublin and University of Limerick. The centre also works closely with industry partners to produce disruptive technologies that will have a positive impact both socially and economically.

KantanMT allows users to build a customised translation engine with training data specific to their needs. KantanMT is continuing to offer a 14-day free trial to new members.