Sue’s Top Tips for Building MT Engines

I'm new to machine translation and one of the things I've been doing at KantanMT is learning how to refine training data with a view to building stock engines.

Stock engines are the optional training data provided by KantanMT to improve the performance of your customized MT engine. In this post I’m going to describe the process of building an engine and refining the training data.

The building process on the platform is quite simple. From your dashboard on the website, select 'My Client Profiles', where you will find two profiles that have already been set up: a default profile and a sample profile, both of which let you run translation jobs straight away.

To create your own customized profile, select 'New' at the top of the left-most column. This launches the Client Profile Wizard. Enter the name of your new engine; try to make this something meaningful, or follow a consistent naming convention for your profiles. This makes it easier to recognize which profile is which when you have more than one.

When you select 'Next' you will be asked to specify the source and target languages from drop-down menus. The wizard lets you distinguish between different variants of the same language, for example Canadian English or US English. Let's say we're translating from Canadian English to Canadian French. If you're not sure which variant you need, have a quick look at the training data, which will give you the language codes.

The next step gives you the option to select a stock engine from a drop-down menu. The stock engines are grouped according to their business area or domain.

You will see a summary of your choices; if you're happy with them, select 'Create'. Your new engine will be shown in the list of your client profiles. However, while you have created your engine, you haven't yet built it.

KantanMT Stock Engine Training data
Stock training data available for social and conversational domains on the KantanMT platform.


Building Your Engine

Selecting your profile from the list will make it the current active engine. By selecting the Training Data tab you can easily upload any additional training data using the drag-and-drop function. Then select the 'Build' option to begin building your engine.

It’s always a good idea to supply as much useful training data as possible. This ‘educates’ the engine in the way your organization typically translates text.

Once the build job has been submitted, you can monitor its progress in the ‘My Jobs’ page.

When the job is completed, a BuildAnalytics™ report is generated. This can be accessed by clicking on the database icon to the left of the profile name. BuildAnalytics gives you feedback on the strength of your engine using industry-standard scores, as well as details about your engine's word count. The tabs across the page give you access to more detail.

The summary tab lets you see the average BLEU, F-Measure and TER scores for the engine, and the pie charts show you a summary of the percentage scores for all segments. For more detail, select the respective tabs and use the data to investigate individual segments.

KantanMT BuildAnalytics Feature
KantanBuildAnalytics provides a granular analysis of your MT engine.


A Rejects Report is created for every file of Training Data uploaded. You can use this to determine why some of your data is not being used, and improve the uptake rate of your data.

Gap Analysis gives you an effective way to improve your engine with relevant glossary or noise lists, which you can upload to future engine builds. By adding these terminology files in either TBX (TermBase eXchange) or XLSX (Microsoft Excel spreadsheet) format you will quickly improve the engine's performance.

The Timeline tab shows you the evolution of your engine over its lifetime. This feature lets you compare statistics with previous builds and track all the data you have uploaded. On a couple of occasions, I used the archive feature to revert to a previous build when the engine-building process was not going according to plan.

KantanMT Timeline
KantanMT Timeline lets you view your engine's entire build history.


Improving Your Engine

A great way to improve your engine's performance is to analyze the Rejects Report for the files with a higher rejection rate. Once you understand the reasons segments are rejected, you can begin to address them. For example, error 104 is caused by a mismatch in placeholder counts. This can be something as simple as the source language using the % sign where the target language uses the word 'percent'. In this case a preprocessor rule can be created to fix the problem.

KantanMT Rejects Report Error 104
A detailed rejects report shows you the errors in your MT engine.

The PEX Rule Editor is accessed from the KantanMT drop-down menu. It lets you try out your preprocessor rules and see the effect they have on the data. I would suggest copying and pasting directly from the Rejects Report into the test area and applying your PEX rule there, to ensure you're precisely targeting the data concerned. You can get instant feedback using this tool.

Once you're happy with the way the rules work on the rejected data, it's useful to analyze the rest of the data to see what effect the rules will have. You want to avoid a situation where a rule resolves 10 rejects but creates 20 more. Once the rules are refined, copy them to the appropriate files (source.ppx, target.ppx) and upload them with the training data. Remember that the rules will run against the content in the order they are specified.
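
If you're curious what this looks like in practice, here is a small Python sketch of the kind of ordered find-and-replace preprocessing a PEX rule performs. The patterns and the %/'percent' example are purely illustrative; the actual PEX syntax, and how the platform applies source.ppx and target.ppx, may differ.

```python
import re

# Ordered list of (pattern, replacement) preprocessing rules.
# Order matters: each rule runs on the output of the previous one,
# just as rules in source.ppx / target.ppx run in the order they are listed.
SOURCE_RULES = [
    (re.compile(r"(\d+)\s*%"), r"\1 percent"),  # normalize "50%" -> "50 percent"
    (re.compile(r"\s{2,}"), " "),               # collapse runs of spaces
]

def preprocess(segment: str, rules) -> str:
    for pattern, replacement in rules:
        segment = pattern.sub(replacement, segment)
    return segment.strip()

print(preprocess("The discount is  50% on all items.", SOURCE_RULES))
# -> "The discount is 50 percent on all items."
```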

When you rebuild the engine, the rules will be incorporated and will hopefully improve the scores.

Sue’s 3 Tips for Successfully Building MT Engines

  1. Name your profiles clearly – When you are using a number of profiles simultaneously, knowing what each one is (language pair/domain) will make it much easier as you progress through the building process.
  2. Take advantage of BuildAnalytics – Use the insights and Gap Analysis features to give you tips on improving your engine. Listening to these tips can really help speed up the engine refinement process.
  3. The PEX Rule Editor is your friend – Don't be afraid to try out creating and using new PEX rules; if things go south you can always go back to previous versions of your engine.

My internship at KantanMT.com really opened my eyes to the world of language services and machine translation. Before joining the team I knew nothing about MT or the mechanics behind building engines. This was a great experience, and being part of such a smoothly run development team was an added bonus that I will take with me when I return to ITB to finish my course.

About Sue McDermott

Sue is currently studying for a Diploma in Computer Science at ITB (Institute of Technology Blanchardstown) and joined KantanMT.com on a three-month internship. She has a degree in English Literature and a background in business systems, and has also been a full-time mum for the last 17 years.

Email info@kantanmt.com if you have any questions or want more information on the KantanMT platform.

Language Industry Interview: KantanMT speaks with Maxim Khalilov, bmmt Technical Lead

This year, both KantanMT and its preferred Machine Translation supplier, bmmt, a progressive Language Service Provider with an MT focus, exhibited side by side at the tekom Trade Fair and tcworld conference in Stuttgart, Germany.

As a member of the KantanMT preferred partner program, bmmt works closely with KantanMT to provide MT services to its clients, which include major players in the automotive industry. KantanMT was able to catch up with Maxim Khalilov, technical lead and ‘MT guru’ to find out more about his take on the industry and what advice he could give to translation buyers planning to invest in MT.

KantanMT: Can you tell me a little about yourself and how you got involved in the industry?

Maxim Khalilov: It was a long and exciting journey. Many years ago, I graduated from the Technical University in Russia with a major in computer science and economics. After graduating, I worked as a researcher for a couple of years in the sustainable energy field. But even then I knew I wanted to come back to the IT industry.

In 2005, I started a PhD at Universitat Politecnica de Catalunya (UPC) with a focus on Statistical Machine Translation, which was a very new topic back then. In 2009, after successfully defending my thesis, I moved to Amsterdam, where I worked as a post-doctoral researcher at the University of Amsterdam and later as an R&D manager at TAUS.

Since February 2014, I've been a team lead at bmmt GmbH, which is a German LSP with a strong focus on machine translation.

I think my previous experience helped me to develop a deep understanding of the MT industry from both academic and technical perspectives.  It also gave me a combination of research and management experience in industry and academia, which I am applying by building a successful MT business at bmmt.

KMT: As a successful entrepreneur, what were the three greatest industry challenges you faced this year?

MK: This year has been a challenging one for us from both technical and management perspectives. We started to build an MT infrastructure around MOSES practically from scratch. MOSES was developed by academia and for academic use, and because of this we immediately noticed that many industrial challenges had not yet been addressed by MOSES developers.

The first challenge we faced was that the standard solution does not offer a solid tag-processing mechanism – we had to invest in customizing the MOSES code to make it compatible with what we wanted to achieve.

The second challenge we faced was that many players in the MT market are constantly talking about the lack of reliable, quick and cheap quality evaluation metrics. BLEU-like scores unfortunately are not always applicable to real-world projects. Even if they are useful when comparing different iterations of the same engine, they are not useful for cross-language or cross-client comparison.

Interestingly, the third problem is psychological in nature: post-editors are not always happy to post-edit MT output, for many reasons, including of course the quality of the MT. However, in many situations the problem is that MT post-editing requires a different skillset compared with 'normal' translation, and it will take time before translators adapt fully to post-editing tasks.

KMT: Do you believe MT has a say in the future, and what is your view on its development in global markets?

MK: Of course, MT will have a big say in the future of language services. We can see now that the MT market is expanding quickly as more and more companies adopt a combined TM-MT-PE framework as their primary localization solution.

“At the same time, users should not forget that MT has its clear niche”

I don't think a machine will ever be able to translate poetry, for example, but at the same time it does not need to – MT has proved to be more than useful for the translation of technical documentation, marketing material and other content which represents more than 90% of the daily translator workload worldwide.

Looking at the near future, I see that the integration of MT and other cross-language technologies with Big Data will open new horizons for Big Data, making it a truly global technology.

KMT: How has MT affected or changed your business models?

MK: Our business model is built around MT; it allows us to deliver translations to our customers faster and more cheaply than without MT, while at the same time preserving the same level of quality and guaranteeing data security. We not only position MT as a competitive advantage when it comes to translation, but also as a base technology for future services. My personal belief, which is shared by other bmmt employees, is that MT is a key technology that will make our world different – a world where translation is available on demand, when and where consumers need it, at a fair price and at the expected quality.

KMT: What advice can you give to translation buyers, interested in machine translation?

MK: MT is still a relatively new technology, but at the same time there are already a number of best practices available for new and existing players in the MT market. In my opinion, the four key points for translation buyers to remember when thinking about adopting machine translation are:

  1. Don't mix it up with TM – While a TM mostly supports human translators by storing previously translated segments, MT automatically translates complete sentences; the main difference lies in the new words and phrases, which are not stored in a TM database.
  2. There is more than one way to use MT – MT is flexible: it can be a productivity tool that enables translators to deliver translations faster with the same quality as in the standard translation framework. Or MT can be used for 'gisting' without any post-editing at all – something that many translation buyers forget about, but which can be useful in many business scenarios. A good example of this type of scenario is the integration of MT into chat widgets for real-time translation.
  3. Don't worry about quality – Quality Assurance is always included in the translation pipeline, and we, like many other LSPs, guarantee a desired level of quality for all translations, independently of how they were produced.
  4. Think about time and cost – MT enables quicker and cheaper translation delivery.

A big ‘thank you’ to Maxim for taking time out of his busy schedule to take part in this interview, and we look forward to hearing more from Maxim during the KantanMT/bmmt joint webinar ‘5 Challenges of Scaling Localization Workflows for the 21st Century’ on Thursday November 20th (4pm GMT, 5pm CET and 8am PST).

KantanMT Industry Webinar: 5 Challenges of Scaling Localization for the 21st Century

Register here for the webinar or to receive a copy of the recording. If you have any questions about the services offered by either bmmt or KantanMT please contact:

Peggy Lindner, bmmt (peggy.lindner@bmmt.eu)

Louise Irwin, KantanMT (louisei@kantanmt.com)

Scalability or Quality – Can we have both?

The 'quality debate' is old news, and the conversation, which is now heavily influenced by 'big data' and 'cloud computing', has moved on. Instead, it is focusing on the ability to scale translation jobs quickly and efficiently to meet real-time demands.

Translation buyers expect a system or workflow that provides high-quality, fit-for-purpose translations. Because of this, Language Service Providers (LSPs) have worked tirelessly, perfecting their systems and orchestrating the use of Translation Memories (TM) within well-managed workflows, alongside the professionalization of the translator industry – quality is now a given in the buyer's eyes.

What is the translation buyers’ biggest challenge?

The translation buyer's biggest challenge now is scale – scaling their processes, workflows and supply chains. Of course, the caveat is that they want scale without jeopardizing quality! They need systems that are responsive, transparent and able to scale gracefully in step with their corporate growth and language expansion strategy.

Scale with quality! One without the other is as useless as a wind-farm without wind!

What makes machine translation better than other processes? Looking past the obvious automation of the localization workflow, the one thing MT offers above all other translation methods is the ability to combine automation and scalability.

KantanMT recognizes this and has developed a number of key technologies to accelerate the speed of on-demand MT engines without compromising quality.

  • KantanAutoScale™ is an additional divide-and-conquer feature that lets KantanMT users distribute their translation jobs across multiple servers running in the cloud (see the sketch after this list).
  • Engine Optimization technology means KantanMT engines now operate 5-10 times faster, reducing the amount of memory and CPU power needed, so MT jobs can be processed faster and more efficiently when using features like KantanAutoScale.
  • API optimization – KantanMT engineers went back to basics, reviewing and refining the system, which enabled users to achieve performance improvements of 50-100% in translation speed. This means translation jobs that previously took five hours can now be completed in less than one hour.
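
To illustrate the divide-and-conquer idea behind KantanAutoScale, here is a conceptual Python sketch, not the actual implementation: the translate_chunk function and engine_url parameter are hypothetical stand-ins for a real engine call.

```python
from concurrent.futures import ThreadPoolExecutor

def translate_chunk(chunk, engine_url):
    """Hypothetical call to a cloud MT engine; swap in the real client or API here."""
    return [f"<translated>{segment}</translated>" for segment in chunk]

def translate_document(segments, engine_url, workers=4, chunk_size=100):
    # Divide the job into chunks, translate them on parallel workers,
    # then stitch the results back together in the original order.
    chunks = [segments[i:i + chunk_size] for i in range(0, len(segments), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        translated_chunks = pool.map(lambda c: translate_chunk(c, engine_url), chunks)
    return [segment for chunk in translated_chunks for segment in chunk]
```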

Scalability is the key to advancement in machine translation, and considering the speed at which people are creating and digesting content, we need to be able to provide true MT scalability across all language pairs and all content.

KantanMT's Tony O'Dowd and bmmt's Maxim Khalilov will discuss the scalability challenge and more in a free webinar for translation buyers, '5 Challenges of Scaling Localization Workflows in the 21st Century', on Thursday November 20th at 4pm GMT, 5pm CET, 8am PST.

KantanMT and bmmt webinar presenters Tony O'Dowd and Maxim Khalilov

To hear more about optimizing or improving the scalability of your engine please contact Louise Irwin (louisei@kantanmt.com).

5 Reasons to Read the TAUS Review

Earlier this month, TAUS, a well-known industry think tank and resource centre for the language services industry, launched its quarterly publication, the TAUS Review. The new magazine comes with a mission:

“Making translation technology more prominent and mainstream throughout the globe to break language barriers and improve worldwide communication.”

KantanMT TAUS Review

KantanMT identified five key reasons that make the review an invaluable asset to any translation and localization professional, and it's thanks to these reasons that KantanMT will distribute the TAUS Review right here on the KantanMT blog.

1. Global Translation Industry news 

TAUS has mobilized writers from across the globe (Africa, the Americas, Asia and Europe) to discuss different trends and technologies in the language services industry. These articles can become a great reference tool for those interested in how language technologies are advancing. In this issue, Andrew Joscelyne reports from Europe, Brian McConnell gives updates from the Americas, Asian trends are covered by Mike Tian-Jian Jiang, and Amlaku Eshetie reports from the southern hemisphere: Africa.

2. Research and Reports 

Recent research in MT is pretty exciting stuff. Those who consider themselves language industry veterans, like Luigi Muzii, remember a time when machine translation predictions were overestimated. But what was once an unrealistic assumption is now changing, as "neural networks and big data" bring a new frontier to natural language processing. Luigi Muzii gives an overview of the 'research perspective', highlighting current trends in research and linking to some interesting ACL award-winning papers, which introduce MT decoders that do not need linguistic resources.

3. Unique Insights

TAUS Review offers unique insights into the translation industry by incorporating use cases and perspectives from four different personas: the researcher, the journalist, the translator and the language expert, each with their own views and opinions on the importance of global communication and breaking down language barriers. In this issue, Jost Zetzsche, Nicholas Ostler, Lane Greene, and Luigi Muzii share their perspectives.

KantanMT especially enjoyed Jost Zetzsche's view of making "machine translation translator-centric", where the translator is at the centre of the MT workflow. One of the examples he lists for making this possible, "dynamic improvements in MT systems", is available to KantanMT clients.

4. Language Technology Community 

The opinions and thoughts that come from each contributor are neatly wrapped in one accessible place, and, coupled with the directory of distributors, events and webinars, they make a very useful resource for any small business or language technology enthusiast. Keep an eye out for some very interesting post-editing and MT quality webinars planned for November.

5. It’s Free! 

Holding true to the concept of sharing information and making translation technology more prominent and mainstream throughout the globe, the review is available quarterly and completely free for its readers, making it accessible to anyone, anywhere regardless of their budget.

Scroll to the end of the page to find the TAUS Review on the KantanMT Blog.

SMT Quality Challenge

One of the biggest challenges when customizing Statistical Machine Translation (SMT) is improving the engine after its initial development. While you can build a baseline engine using existing Translation Memories (TM), terminology and monolingual training data assets – the real challenge is going beyond this, and achieving even higher levels of quality. More importantly, how can you do this rapidly with minimum cost and effort? A proactive approach to measuring the quality of your training data will greatly assist in doing this.

Kantan BuildAnalytics™ is a new technology that addresses this head-on and helps SMT developers to build engines that are production ready, fast!

What is Kantan BuildAnalytics?

Kantan BuildAnalytics brings a new level of transparency to the SMT building and training process, and KantanMT users can now build higher-performing engines for each domain, resulting in fewer post-editing requirements.

How it works…

When you build a KantanMT engine, some of your training data is automatically extracted and kept to one side. This is called a Reference Data Set, and it contains both source and target texts. After a KantanMT engine is built, this Reference Data Set is used to calculate a series of automated quality scores – including BLEU (Bilingual Evaluation Understudy), F-Measure and TER.
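
As a rough sketch of the general idea (not KantanMT's internal process, and the holdout ratio here is an arbitrary assumption), holding back a Reference Data Set from the training data can look like this:

```python
import random

def split_reference_set(segment_pairs, holdout_ratio=0.02, seed=42):
    """Hold back a small sample of (source, target) pairs as a reference data set."""
    pairs = list(segment_pairs)
    random.Random(seed).shuffle(pairs)
    cut = max(1, int(len(pairs) * holdout_ratio))
    reference_set, training_set = pairs[:cut], pairs[cut:]
    return training_set, reference_set
```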

This Reference Data Set is also used to perform a Gap Analysis. Gap Analysis is a quick way to determine any missing words in the engine’s phrase-tables. I’ll come back to this later and demonstrate how Gap Analysis can improve the quality performance of KantanMT engines.

But for now, let’s focus on the automated quality scores of BLEU, F-Measure and TER.

BuildAnalytics uses the KantanMT data visualization library to graphically display the distribution of these automated scores based on the Reference Data Set. Since an automated score is calculated for each text segment within the Reference Data Set, we get a detailed view of how a KantanMT engine is performing and how it is likely to generate translated output.

By analysing these scores and the Gap Analysis results, and examining the translated output, users of KantanMT are producing higher quality engines because their training data choices are more strategic and refined.

F-Measure

Let's look at F-Measure first, as this is the most straightforward to understand and visualize. F-Measure scores show how precise a KantanMT engine is when retrieving words, and how many words it can retrieve or recall during translation. This is why it is commonly referred to as a Recall and Precision measurement. By combining these two measurements into a single score, it provides a good indicator of the engine's performance and its ability to translate content.
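
Here is a minimal word-level Python sketch of how precision, recall and the combined F-Measure can be computed against a reference translation. It is a simplification; the exact formula KantanMT uses may differ.

```python
from collections import Counter

def f_measure(hypothesis: str, reference: str) -> float:
    """Word-level F1: the harmonic mean of precision and recall against a reference."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())      # words retrieved correctly
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())  # how precise the retrieved words are
    recall = overlap / sum(ref.values())     # how many reference words were recalled
    return 2 * precision * recall / (precision + recall)

print(f_measure("the cat sat on mat", "the cat sat on the mat"))  # ~0.91
```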

KantanMT F-Measure
KantanMT engine F-Measure score distribution

However, while your KantanMT engine may have a high F-Measure score, this doesn't mean that these words are recalled in the correct translated order. We need another metric to give us an indication of how well the engine translated the text, and BLEU is one of the most recognized automated metrics for estimating a text's fluency.

BLEU

BLEU is an automatic evaluation metric, well known in both industry and academia, which calculates an estimation of text fluency. Fluency is a measure of the correspondence between a KantanMT engine's output and that of a professional translator.

Since the Reference Data Set consists of source texts and their human-translated equivalents, created by a professional translator, a BLEU score can be calculated by comparing the output of a KantanMT engine to this Reference Data Set.
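
For illustration, the open-source sacrebleu library is one common way to compute a corpus-level BLEU score against such a reference set; KantanMT's own implementation may differ.

```python
import sacrebleu  # pip install sacrebleu

# MT output for each segment of the held-out reference data set...
hypotheses = ["the engine translated this sentence", "a second translated segment"]
# ...and the professional human translations they are compared against.
references = ["the engine translated this sentence", "a second human translated segment"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")  # 0-100; higher means closer to the human reference
```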

KantanMT BLEU score
KantanMT engine BLEU score distribution

In practice, BLEU achieves a high correlation with human judgement of quality and remains one of the most popular automated metrics in use today.

TER

TER stands for Translation Error Rate and is used to estimate the amount of post-editing required to transform a generated translation into its human translation equivalent. In simple terms, it is a count of the insertions, deletions and substitutions (and, in the full metric, shifts) required to transform a segment so that it matches its human translation equivalent, normalized by the length of that reference.

KantanMT TER score
KantanMT engine TER score distribution

So the lower this score, the less transformation is required, which means less post-editing is required too.
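
A simplified Python sketch of the idea, word-level edit distance divided by the reference length, looks like this (real TER also accounts for block shifts, which are ignored here):

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Simplified TER: word-level edit distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(hyp)][len(ref)] / max(len(ref), 1)

print(simple_ter("the cat sat on mat", "the cat sat on the mat"))  # ~0.17
```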

Working with Kantan BuildAnalytics™

BuildAnalytics is a really great way to see all these automated scores in action. It uses KantanMT data visualization technology to graphically present these scores, helping developers of KantanMT engines to fine-tune their training data and maximize their engine’s quality performance.

Let’s take a closer look at how this data visualization can be used to gain insights into an engine and determine if it is a high or low performing engine, and what steps we can take to improve it.

Here are the summary distribution graphs for an engine that contains approx. 3.2 million words. It's a small engine within a technical domain. Its overall scores are:

KantanMT BuildAnalytics Graph

These Summary Graphs show the distribution of scores for each automated metric, grouped into bands (e.g. <40%, 40-54%, etc.). This is very helpful in determining the scores' overall distribution and how the KantanMT engine is likely to be performing.

Here are the detailed distribution graphs for each automated score:

KantanMT distribution graphs

By reviewing both the Summary Graphs and the more detailed Distribution Graphs we can make some observations about how this engine would most likely perform. My observations are included as part of the commentary in the table above.

It's important to point out that no individual score gives an absolute measure of how a KantanMT engine will perform. We need to take a holistic view, reviewing all automated scores together to get a general sense of the engine's performance.

Using Kantan BuildAnalytics, users can get a good sense of how a KantanMT engine will perform in a production environment, and with a little practice and experimentation they can use this knowledge to build higher-performing MT engines.

Gap Analysis

I mentioned this concept earlier in the post, so let's take a closer look at this really helpful new feature. Gap Analysis determines how many untranslated words remain in the generated translations. These missing words, or 'Gaps', can quickly be identified and filled by introducing the most relevant training data to your KantanMT engine and re-training it.

The Gap Analysis feature not only lists the gaps, it also presents suitable training data, which can be post-edited and resubmitted as training data to improve the engine's overall performance. This makes filling the gaps just that little bit easier!
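
Conceptually, a gap is simply a source word the engine has never seen in its training data. A minimal Python sketch of that idea (KantanMT's actual Gap Analysis is more sophisticated than this):

```python
def find_gaps(incoming_segments, training_segments):
    """Return source words in the incoming segments that never appear in the training data."""
    known_vocab = {word.lower() for seg in training_segments for word in seg.split()}
    gaps = {word for seg in incoming_segments for word in seg.split()
            if word.lower() not in known_vocab}
    return sorted(gaps)

training = ["the engine translates technical manuals", "update the firmware settings"]
incoming = ["the engine translates the turbine manual"]
print(find_gaps(incoming, training))  # ['manual', 'turbine'] -> candidates for glossary entries
```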

One more (very important) thing…

Most quality improvements for SMT systems will be created by fine-tuning terminology and filling data gaps. Post-editing raw MT output and a focus on minimizing data gaps will significantly improve the quality performance of your KantanMT engines. This cannot be done without the involvement of professional translators. They have the skills, knowledge and linguistic expertise to finesse terminology, identify gaps and choose better training data. While BuildAnalytics helps SMT developers get engines ready for production, ultimately it's the professional translator who should have the final say in how production-ready an engine truly is!

To get the most from your Machine Translation engine, always keep in mind:

  • Measuring and improving training data – high quality training data is the first step to building a successful Machine Translation engine.
  • Take a holistic approach to evaluating performance – automatic evaluation metrics can give a good indicator of how your KantanMT engine will perform, but metrics alone are insufficient for measuring post-editing effort.

Kantan BuildAnalytics is available to Enterprise members of KantanMT, but you can also experience this quality estimation and measurement software by signing up for a free trial on KantanMT.com.

KantanMT – 2013 Year in Review

KantanMT had an exciting year as it transitioned from a publicly funded business idea into a commercial enterprise that was officially launched in June 2013. The KantanMT team are delighted to have surpassed expectations by developing and refining cutting-edge technologies that make Machine Translation easier to understand and use.

Here are some of the highlights for 2013, as KantanMT looks back on an exceptional year.

Strong Customer Focus…

The year started on a high note, with the opening of a second office in Galway, Ireland, and KantanMT kept the forward momentum going as the year progressed. The Galway office is focused on customer service, product education and Customer Relationship Management (CRM), and is home to Aidan Collins, User Engagement Manager, Kevin McCoy, Customer Relationship Manager and MT Success Coach, and Gina Lawlor, Customer Relationship co-ordinator.

KantanMT officially launched the KantanMT Statistical Machine Translation (SMT) platform as a commercial entity in June 2013. The platform was tested pre-launch by both industry and academic professionals, and was presented at the European OPTIMALE (Optimizing Professional Translator Training in a Multilingual Europe) workshop in Brussels. OPTIMALE is an academic network of 70 partners from 32 European countries, and the organization aims to promote professional translator training as the translation industry merges with the internet and translation automation.

The KantanMT Community…

The KantanMT members' community now includes top-tier Language Service Providers (LSPs), multinationals and smaller organizations. In 2013, the community grew from 400 members in January to 3,400 registered members in December, and in response to this growth KantanMT introduced two partner programs with the objective of improving the Machine Translation ecosystem.

These are the Developer Partner Program, which supports organizations interested in developing integrated technology solutions, and the Preferred Supplier of MT Program, dedicated to strengthening the use of MT technology in the global translation supply chain. KantanMT's Preferred Suppliers of MT are:

KantanMT’s Progress…

To date, the most popular target languages on the KantanMT platform are French, Spanish and Brazilian Portuguese. Members have uploaded more than 67 billion training words and built approx. 7,000 customized KantanMT engines that translated more than 500 million words.

As usage of the platform increased, KantanMT focused on developing new technologies to improve the translation process, including a mobile application for iOS and Android that allows users to access their KantanMT engines on the go.

KantanMT’s Core Technologies from 2013…

KantanMT have been kept busy continuously developing and releasing new technologies to help clients build robust business models to integrate Machine Translation into existing workflows.

  • KantanAnalytics™ – segment-level Quality Estimation (QE) analysis that assigns a percentage 'fuzzy match' score to KantanMT translations, providing a straightforward method for costing and scheduling translation projects.
  • BuildAnalytics™ – a QE feature designed to measure the suitability of the uploaded training data. The technology generates a segment-level percentage score on a sample of the uploaded training data.
  • KantanWatch™ – makes monitoring the performance of KantanMT engines more transparent.
  • TotalRecall™ – combines TM and MT technology; TM matches with a 'fuzzy match' score of less than 85% are automatically put through the customized MT engine, giving users the benefits of both technologies.
  • KantanISR™ – Instant Segment Retraining technology that allows members near-instantaneous correction and retraining of their KantanMT engines.
  • PEX Rule Editor – an advanced pattern-matching technology that allows members to correct repetitive errors, making for a smoother post-editing process by reducing post-editing effort, cost and time.
  • Kantan API – critical for the development of software connectors and smooth integration of KantanMT into existing translation workflows. The success of the MemoQ connector led to the development of subsequent connectors for MemSource and XTM.

KantanMT sourced and cleaned a range of bi-directional, domain-specific stock engines, consisting of approx. six million words across the legal, medical and financial domains, and made them available to its members. KantanMT also developed support for Traditional and Simplified Chinese, Japanese, Thai and Croatian during 2013.

Recognition as Business Innovators…

KantanMT received awards for business innovation and entrepreneurship throughout the year. Founder and Chief Architect Tony O'Dowd was presented with the ICT Commercialization award in September.

In October, KantanMT was shortlisted for the PITCH start-up competition and participated in the ALPHA Program for start-ups at Dublin’s Web Summit, the largest tech conference in Europe. Earlier in the year KantanMT was also shortlisted for the Vodafone Start-up of the Year awards.

KantanMT were silver sponsors at the annual 2013 ASLIB conference, which adopted the theme 'Translating and the Computer' and took place in London in November. In October, Tony O'Dowd presented at the TAUS Machine Translation Showcase at Localization World in Silicon Valley.

KantanMT have recently published a white paper introducing its cornerstone Quality Estimation technology, KantanAnalytics, and how this technology provides solutions to the biggest industry challenges facing widespread adoption of Machine Translation.

KantanAnalytics WhitePaper December 2013

For more information on how to introduce Machine Translation into your translation workflow contact Niamh Lacy (niamhl@kantanmt.com).

Overcome Challenges of building High Quality MT Engines with Sparse Data

KantanMT Whitepaper Improving your MT

Many of us involved with Machine Translation are familiar with the importance of using high-quality parallel data to build and customize good-quality MT engines. Building high-quality MT engines with sparse data is a challenge faced not only by Language Service Providers (LSPs), but by any company with limited bilingual resources. A more economical alternative to creating large quantities of high-quality bilingual data is to add monolingual data in the target language to an MT engine.

Statistical Machine Translation systems use algorithms to find the most probable translations, based on how often patterns occur in the training data, so it makes sense to use large volumes of bilingual training data. The best data to use for training MT engines is usually high quality bilingual data and glossaries, so it’s great if you have access to these language assets.

But what happens when access to high quality parallel data is limited?

Bilingual data is costly and time-consuming to produce in large volumes, so the smart option is to come up with more economical language assets, and monolingual data is one of those economical assets. MT output fluency improves dramatically when monolingual data is used to train an engine, especially in cases where good-quality bilingual data is a sparse language resource.

More economical…

Many companies lack the necessary resources to develop their own high-quality in-domain parallel data. Monolingual data, however, is readily available in large volumes across different domains. This target-language content can be found almost anywhere: websites, blogs, customer content and even company-specific documents created for internal use.

Companies with sparse parallel data can really leverage their available language assets with monolingual data to produce better-quality engines with more fluent output. Even those with access to large volumes of bilingual data can still take advantage of monolingual data to improve target-language fluency.

Target-language monolingual data is introduced during the engine training process so the engine learns how to generate fluent output. The positive effects of including monolingual data in the training process have been proven both academically and commercially. In a study for TAUS, Natalia Korchagina confirmed that using monolingual data when training SMT engines considerably improved the BLEU score for a Russian-French translation system.
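
In a typical SMT pipeline the monolingual target text feeds the language model, which scores how fluent candidate outputs are. The toy bigram model below illustrates the principle in Python; production systems use dedicated language-modelling tools with proper smoothing, and this is not KantanMT's implementation.

```python
import math
from collections import Counter

def train_bigram_lm(monolingual_sentences):
    """Count unigrams and bigrams in target-language monolingual text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in monolingual_sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def fluency_score(sentence, unigrams, bigrams, vocab_size=10000):
    """Log-probability of a candidate output under the bigram model (add-one smoothing)."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(tokens[:-1], tokens[1:])
    )

mono = ["the report is available in french", "the report was published today"]
uni, bi = train_bigram_lm(mono)
# The more fluent word order scores higher (less negative) under the target-language model.
print(fluency_score("the report is available", uni, bi))
print(fluency_score("report the available is", uni, bi))
```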

Natalia's study not only 'proved the rule' that in-domain monolingual data improves engine quality; she also identified that out-of-domain monolingual data improves quality too, but to a lesser extent.

Monolingual data can be particularly useful for improving scores in morphologically rich languages like Czech, Finnish, German and Slovak, as these languages are often syntactically more complicated for Machine Translation.

Success with Monolingual Data…

KantanMT has had considerable success with clients using monolingual data to improve their engines' quality. An engine trained with sparse bilingual data (though still more data than in Korchagina's study) in the financial domain showed a significant improvement in its overall quality metrics when financial monolingual data was added to the engine:

  • BLEU score showed approx. 40% improvement
  • F-Measure score showed approx. 12% improvement
  • TER (Translation Error Rate), where a lower score is better, saw a reduction of approx. 50%

The support team at KantanMT showed the client how to use monolingual data to their advantage, getting the most out of their engine, and empowering the client to improve and control the accuracy and fluency of their engines.

How will this Benefit LSPs…

Online shopping by users of what can be considered ‘lower density languages’ or languages with limited bilingual resources is driving demand for multilingual website localization. Online shoppers prefer to make purchases in their own language, and more people are going online to shop as global internet capabilities improve. Companies with an online presence and limited language resources are turning to LSPs to produce this multilingual content.

Most LSPs with access to vast amounts of high quality parallel data can still take advantage of monolingual data to help improve target language fluency. But LSPs building and training MT engines for uncommon language pairs or any language pair with sparse bilingual data will benefit the most by using monolingual data.

To learn more about leveraging monolingual data to train your KantanMT engine, send the KantanMT team an email (info@kantanmt.com) and we can talk you through the process. Alternatively, check out our whitepaper on improving MT engine quality, available from our resources page.


#T9n and the Computer

The 35th ASLIB conference opens today, Thursday 28th November, and runs for two days in Paddington, London. The annual 'Translating and the Computer' conference serves to highlight the importance of technology within the translation industry and to showcase new technologies available to localization professionals.

KantanMT was keen to have a look at how technology has shaped the translation industry throughout history so we took a look at some of the translation technology milestones over the last 50 years.

The computer has had a long history, so it's no surprise that developments in computer technology greatly affect how we communicate. Machine Translation research dates back to the early 1940s, although its development stalled because of negative feedback regarding the accuracy of early MT output. The ALPAC (Automatic Language Processing Advisory Committee) report, published in 1966, prompted researchers to look for alternative methods to automate the translation process.

1970’s

In terms of modern development, the real evolution of 'translation and the computer' began in the 1970s, when more universities started carrying out research and development on automated translation. At this point, the European Coal and Steel Community in Luxembourg and the Federal Armed Forces Translation Agency in Mannheim, Germany were already making use of text-related glossaries and automatic dictionaries. It was also around this time that translators started to come together to form translation companies/language service providers, which not only translated but also took on project management roles to control the entire translation process.

Developing CAT tools

1980’s

Translation technology research gained momentum during the early 1980s as commercial content production increased. Companies in Japan, Canada and Europe that were distributing multilingual content to their customers now needed a more efficient translation process. At this time, translation technology companies began developing and launching Computer Assisted Translation (CAT) technology.

Dutch company INK was one of the first to release desktop translation tools for translators. These tools, originally called INK text tools, sparked more research into the area. Trados, a German translation company, started reselling INK text tools, and this led to the research and development of the TED translation editor, an initial version of the translator's workbench.

1990’s

The 1990s were an exciting time for the translation industry. Translation activities that were previously kept separate from computer software development were now being carried out together in what was termed localization. The interest in localizing for new markets led to translation companies and language service providers merging both technology and translation services, becoming Localization Service Providers.

Trados launched its CAT tools in 1990, with MultiTerm for terminology management, and the Translation Memory (TM) software Translator's Workbench in 1994. ATRIL in Madrid launched a TM system in 1993, and STAR (Software, Translation, Artwork, Recording) also released Transit, a TM system, in 1994. The 'fuzzy match' feature was also developed at this time and quickly became a standard feature of TM.

Increasingly, translators started taking advantage of CAT tools to translate more productively. This led to downward pressure on prices, making translation services more competitive.

The Future…

As we move forward, technology continues to influence translation. Global internet diffusion has increased the level of global communication and has changed how we communicate. We can now communicate in real time, on any device and through any medium. Technology will continue to develop, becoming faster and more adaptive to multi-language users, and demand for real-time translation will drive further developments in automated translation solutions.

Find out more about KantanMT’s Quality Estimation Technology, KantanAnalytics.

MT Lingo

MT technology can be overwhelming for those new to the industry, and getting to grips with the jargon can be a daunting task even for some of the most industry-savvy gurus. KantanMT put together a list of some acronyms, popular buzzwords and numeronyms (abbreviations that use numbers), so that you can keep up with the MT professionals or just brush up on your tech vocabulary.

Numeronyms:

  • L10n – Localization/Localisation is the process of adapting and translating a product or service so that it is culturally acceptable for a specific country or region.
  • I18n – Internationalization/Internationalisation is a process implemented in the planning stages of a product or application; it ensures the infrastructure (coding) suits future translations or localizations. Some of the more common Internationalization preparations for software products involve supporting international character sets like Unicode, or ensuring there is enough space in the User Interface (UI) for text translated from languages like English, which uses single-byte character codes, into the multiple-byte character codes used in Chinese and Japanese Kanji.
  • G11n – Globalization/Globalisation refers to the internationalization and localization preparations for products and services to be released in global markets. It usually incorporates 'sim-ship', or simultaneous shipment, to different regions.

Acronyms:

  • MT – Machine Translation or Automated Translation is translation carried out by a computer. A piece of natural-language text in a language like English is translated by computer software into another language like French. Cloud MT is Machine Translation based in the cloud. There are different types of MT systems available.
  • RBMT – Rule-Based Machine Translation systems use a list of syntactic, grammatical and translation rules to generate the most appropriate translations.
  • SMT – Statistical Machine Translation systems are data-driven and have a statistical modelling architecture, using algorithms to find the most probable match between source and target segments.
  • API – Application Programming Interface is an interface that allows communication and interoperability between two applications or software programs.
  • LSP – Language Service Provider sometimes referred to as a Localization Service Provider, is a service provider that carries out the translation and localization of different types of content for specific countries or locales.
  • TM – Translation Memory is a database of aligned source and target translations called segments. Segments can be words, sentences or paragraphs. TMs can be integrated with CAT tools and they help speed up the translation process. TM files can be used as training data to train SMT engines.
  • SPE – Statistical Post-editing is when Machine Translation output that has been post-edited is re-used as training data and fed back into the SMT engine or used to train a new engine.

Popular Buzzwords:

  • Normalization is the checking and cleaning up of a Translation Memory so it can be included as training data for an SMT engine. Things to identify and correct include tags, mistranslations, sentence mismatches and stylistic features like upper- and lower-case inconsistencies (see the sketch after this list).
  • CAT tools – Computer-Aided Translation tools/ Computer-assisted Translation tools are used by humans to support the translation process by managing MT, TM and glossaries.
  • Glossaries are vocabulary lists of specialised terminology, usually specific to an industry or organisation. These files can be uploaded as additional training data to an SMT engine.
  • Bilingual corpus/ Bi-text database is a large text document with source and target languages. If the corpus is aligned it can be used as training data for an SMT engine.
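
To make the normalization idea above concrete, here is a small Python sketch of the kind of clean-up a TM segment pair might go through before being used as SMT training data. The tag stripping and the misalignment threshold are illustrative choices, not a prescribed standard.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # inline markup left over from the source files

def normalize_pair(source: str, target: str):
    """Clean one TM segment pair; return None for pairs that look misaligned."""
    source = re.sub(r"\s+", " ", TAG_PATTERN.sub("", source)).strip()
    target = re.sub(r"\s+", " ", TAG_PATTERN.sub("", target)).strip()
    if not source or not target:
        return None
    # Crude misalignment check: wildly different lengths suggest a sentence mismatch.
    ratio = len(source.split()) / max(len(target.split()), 1)
    if ratio > 3 or ratio < 1 / 3:
        return None
    return source, target

print(normalize_pair("<b>Press</b> the  button", "Appuyez sur le bouton"))
# -> ('Press the button', 'Appuyez sur le bouton')
```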

If you know any new terms or interesting words you have heard from your experience in the language and localization industry, KantanMT would love to hear about them – just pop them into the comment box below.

Pricing PEMT 2

KantanMT blog, Pricing PEMT

Segment-by-segment Machine Translation Quality Estimation (QE) scores are reforming current Language Service Provider (LSP) business models.

Pricing Machine Translation is one of the most widely debated topics within the translation and localization industries. Many agree that there is no ‘black and white’ approach, because a number of variables must always be taken into consideration when costing a project. Industry experts are in agreement that levels of post-editing effort and payment should be calculated through a fair and easily replicated formula. This transparency is the goal KantanMT had in mind during the development of KantanAnalytics™, a “game-changing” technology in the localization industry.

New Business Model

The two greatest challenges facing Localization Project Managers are how to cost and how to schedule Machine Translation projects. Experienced PMs can quickly gauge how long a project will take to complete, but there is still an element of guesswork and contingency planning involved. This is intensified when you add Machine Translation. Although MT is not a new technology, its practical application in a business environment is still in its infancy.

Powerful Machine Translation engines can be easily integrated into an LSP workflow. Measuring Machine Translation quality on a segment-by-segment basis and calculating post-editing effort on those segments allows LSPs to create more streamlined business models.

Studies have shown post-editing Machine Translation can be more productive than translating a document from scratch. This is especially true when translators or post-editors have a broad technical or subject knowledge of the text’s domain. In these cases they can capitalise on their knowledge with higher post-editing productivity.

So, how should a Machine Translation pricing model look?

The development of a technology that can evaluate a translation on a segment-by-segment basis and assign an accurate QE score to a Machine Translated text is critical for the successful integration of this technology into a project’s workflow.

The segment-by-segment breakdown and 'fuzzy match' percentage scoring system ensured the commercialisation of Translation Memories into LSP workflows. This system has been adopted as an industry standard for pricing translation jobs where translation memories or Computer Aided Translation (CAT) tools can be implemented. The next natural evolution is to create a similar tiered 'fuzzy' matching system for Machine Translation.

Segment-level QE technology is now available, where Machine Translated segments are assigned percentage match values similar to translation memory match values. Post-editing costs, similar to the costing of translation memory matches, can then be assigned. The match value also gives a clear indication of how long a project should take to post-edit, based on the quality of the match and the post-editor's skills and experience.
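
As a simple illustration of how such a tiered model could be applied in practice, here is a Python sketch; the band thresholds and per-word rates are hypothetical examples, not KantanMT or industry figures.

```python
# Hypothetical per-word rates (EUR) for each QE 'fuzzy match' band.
RATE_BANDS = [
    (85, 0.03),  # 85-100%: light post-editing
    (70, 0.06),  # 70-84%:  medium post-editing
    (50, 0.09),  # 50-69%:  heavy post-editing
    (0,  0.12),  # below 50%: effectively retranslation
]

def segment_rate(qe_score: float) -> float:
    for threshold, rate in RATE_BANDS:
        if qe_score >= threshold:
            return rate
    return RATE_BANDS[-1][1]

def project_cost(segments):
    """segments: list of (word_count, qe_score) pairs for the machine-translated text."""
    return sum(words * segment_rate(score) for words, score in segments)

print(project_cost([(12, 92), (20, 74), (8, 40)]))  # 12*0.03 + 20*0.06 + 8*0.12 = 2.52
```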

How can we trust the quality score?

The Machine Translation engine's quality is based on the quality of the training data used to build the engine. The engine's quality can be monitored with BLEU, F-Measure and TER scores. These automatic evaluation metrics indicate the engine's quality and, combined with the 'fuzzy' match score, can be used to build a more accurate picture of how post-editing effort is calculated and how projects should be priced. There are a number of variables that dictate how to create and implement a pricing model.

Variables to be considered when creating a pricing model

The challenge in measuring PEMT stems from a number of variables, which need to be considered by PMs when creating a pricing model:

  • Intended purpose – does the text require a light, fast or full post-edit?
  • Language pair and direction – Roman languages tend to provide better MT output
  • Quality of the MT system – better quality, domain specific engines produce better results
  • Post-editing effort – degree of editing required – minor edits or full retranslate
  • Post-editor skill and experience – post-editors with extensive domain expertise

Traditional Models

To overcome these challenges, PMs have traditionally opted for hourly or daily rates. However, hourly rates do not provide enough transparency or cost breakdown and can make a project difficult to schedule. These rates must also be calculated to take into consideration the translator's or post-editor's productivity and language pair.

Rates are usually calculated based on the translator's or post-editor's average post-editing speed within the specified domain. Day rates can be a good cost indicator for PMs, based on the post-editor's capabilities and experience, but again the cost breakdown is not completely transparent. Difficulties usually occur when a post-editor comes across a part of the text that requires more time or effort to post-edit; productivity then automatically drops.

As an example of the differing opinions in the translation community, pricing PEMT is dependent on the post-editing circumstances. Some posters on the Proz.com forum suggest that PEMT should be priced at 30-50% of, or similar to, the rate for editing a human translation. Others suggest the output of a Machine Translation system should be priced around the same as a 'fuzzy' match of 50-74% from a translation memory. These are broad, subjective figures which do not take variables into consideration.

Scoring the Machine Translated text on a segment-by-segment basis allows PMs to calculate post-editing effort based on the quality of customised Machine Translation engines. PMs can then use these calculations to build an accurate pricing model for the project which incorporates all relevant variables. It also makes it possible to distribute post-editing work evenly across translators and post-editors, making the most efficient use of their skills. The benefits of calculating post-editing effort are also seen in scheduling and project turnaround times.


KantanAnalytics™ is a segment-by-segment quality estimation scoring technology which, when applied to a Machine Translated text, will generate a quality score for each segment, similar to the fuzzy match scoring system used in translation memories.

Sign up for a free trial to experience KantanAnalytics until November 30th 2013. KantanAnalytics will be available on the Enterprise Plan; to sign up or upgrade to this plan, please email KantanMT's Success Coach, Kevin McCoy (kevinmcc@kantanmt.com).