Giulia Mattoni, an Italian Translation Technology student from DCU, talks about her experience using Machine Translation to evaluate player support content localization. Giulia's fascinating perspective illustrates why this area needs further research, and how she used KantanMT to evaluate MT and post-editing for this type of content.
For our fourth post in the '5 Questions' series, we are very excited to introduce you to Louise Faherty, Technical Project Manager of the Professional Services team at KantanMT. This series of interviews aims to give you a deeper insight into the people at KantanMT.
KantanMT.com was used in the course ‘Machine Translation and Post-editing,’ which was taught for the first time in the ‘Degree in Modern Languages Applied to Translation’ in UAH. English and Spanish were the main languages used during this course.
The KantanPEX Rule Editor enables KantanMT members to reduce the amount of manual post-editing required for a particular translation by creating, testing and deploying post-editing automation rules on their Machine Translation engines (client profiles).
The editor allows users to evaluate the output of a PEX (Post-Editing Automation) rule on a sample of translated content without needing to upload it to a client profile and run translation jobs. Users can enter up to three pairs of search and replace rules, which are run in descending order on their content.
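As a sketch of the mechanics (the rule patterns and names below are illustrative, not KantanMT's actual PEX format), ordered search-and-replace rules can be modelled as regular expressions applied in sequence, so each rule sees the output of the previous one:

```python
import re

# Hypothetical PEX-style rules: each (search, replace) pair is a regex
# applied in order, so later rules operate on earlier rules' output.
pex_rules = [
    (r"\bcolor\b", "colour"),   # enforce en-GB spelling
    (r"\s+([.,;:])", r"\1"),    # remove stray space before punctuation
    (r"(\d)\s*%", r"\1 %"),     # normalise spacing before the percent sign
]

def apply_pex_rules(text, rules):
    """Run each search/replace rule in descending order on the MT output."""
    for search, replace in rules:
        text = re.sub(search, replace, text)
    return text

raw_mt = "The color setting is 50% by default ."
print(apply_pex_rules(raw_mt, pex_rules))
# -> "The colour setting is 50 % by default."
```

Because the rules run in order, rule ordering matters: here the punctuation clean-up runs before the percent-spacing rule deliberately re-inserts a space.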
How to use the KantanMT PEX Rule Editor
Log into your KantanMT account using your email and password.
You will be directed to the ‘Client Profiles’ tab in the ‘My Client Profiles’ page. The last profile you were working on will be ‘Active’ and marked in bold.
To use the ‘PEX-Rule Editor’ with a profile other than the ‘Active’ profile, click on the new profile name to select that profile for use with the ‘Kantan PEX-Rule editor’.
Then click the ‘KantanMT’ tab and select ‘PEX Editor’ from the drop-down menu.
You will be directed to the ‘PEX Editor’ page.
Type the content you wish to test in the 'Test Content' box.
Type the content you wish to search for in the ‘PEX Search Rules’ box.
Type what you want the replacement to be in the ‘PEX Replacement Rules’ box and click on the ‘Test PEX Rules’ button to test the PEX-Rules.
The results of your PEX-Rules will now appear in the ‘Output’ box.
Give the rules you have created a name by typing in the ‘Rule Name’ box.
Select the profile you wish to apply the rule(s) to and then click the 'Upload Rule' button.
The KantanMT PEX editor helps reduce the amount of manual post-editing required for a particular translation, thereby reducing project turnaround times and costs. For additional information on PEX rules and the Kantan PEX-Rule editor, please click on the links below. For more details about KantanMT localization products and ways of improving productivity and efficiency, please contact us at firstname.lastname@example.org.
It's a fact: entering new markets is the key to increasing profits, and the first item on any company's internationalization checklist should be making sure it communicates product information in a way its target customers can understand.
Leading on from the 2006 research, CSA's updated survey in 2014 was based on a sample of three thousand global respondents, and it reinforced the earlier results by showing that 55% of respondents only buy from websites in their native language. This figure jumped dramatically to 80% in cases where the buyer's English language ability was limited.
When it comes to selling internationally, tapping into new revenue streams demands translated content. But, what happens when you have thousands of product descriptions that need to be localized into a plethora of languages?
This is where the fun begins for localization teams with well-established traditional translation workflows in place. Their existing method seems fine…but when it’s time to scale up, this is when cracks in the process begin to appear.
The translation workflow works best when it matches the scale and velocity of the content created, whether that is product descriptions, manuals or online help documentation.
The challenging part:
How do you translate product descriptions with velocity and at scale?
We have heard a great deal of arguments for and against machine translation, and one of the best-known arguments against it is "the quality is rubbish; sentences translated by machine translation are garbled and incomprehensible". We in the language technology field hear this frequently and often shudder in disbelief at how these conclusions have been reached.
Generic or free machine translation systems in most cases do not produce great results. Expecting such a system to produce publishable-quality MT output, or using it as a benchmark for all MT systems, is akin to extracting blood from a stone. Achieving good MT output takes time, care and the ability to customise the MT system properly.
Any company that is serious about breaking into international markets should also be serious about its MT strategy. It should consider a customised MT solution that is tailored to its needs, rather than simply going for a cheap and/or supposedly free option.
Why is MT customisation so important?
Statistical machine translation is based on machine learning and pattern recognition. Multi-word phrases, or n-grams as they are known, are identified with probability algorithms that select the most probable translation match. Generic or free MT systems have typically been built on a broad mix of content styles and types, which makes it much harder for the MT system to identify the most likely, or even relevant, matches.
When the MT system is customised specifically for content from a single domain, such as product descriptions for a specific category (e.g. home and garden, fashion or electronic devices), the syntax, style and phraseology in the training data mean that when an MT match is generated, there is a higher probability it will be close to the desired output, resulting in a much more accurate translation.
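A toy illustration of this effect (simplified far beyond a real SMT phrase table; the words and counts are invented): in mixed generic data the most frequent match for an ambiguous term may be wrong for your domain, while domain-specific counts push the right choice to the top.

```python
from collections import Counter

# Invented phrase counts: how often each (source, target) pairing was
# seen in the training data. Relative frequency stands in for the
# translation probability p(target | source).
generic_counts = Counter({("jumper", "sweater"): 3, ("jumper", "horse"): 4})
fashion_counts = Counter({("jumper", "sweater"): 40, ("jumper", "horse"): 1})

def best_translation(source_word, counts):
    """Pick the target with the highest relative frequency for the source."""
    candidates = {tgt: n for (src, tgt), n in counts.items() if src == source_word}
    total = sum(candidates.values())
    return max(candidates, key=lambda t: candidates[t] / total)

print(best_translation("jumper", generic_counts))  # -> "horse" (mixed equestrian data wins)
print(best_translation("jumper", fashion_counts))  # -> "sweater" (fashion domain wins)
```

The same statistical machinery produces opposite answers depending solely on what data the engine was trained on, which is the essence of the customisation argument.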
How important is saving costs?
Of course Machine Translation can save costs – if done properly, significant savings can be made. But saving costs is often not the end goal of implementing a serious MT strategy. The real gains come from increasing productivity without compromising quality. Why translate 2,000 words a day when you can machine translate and post-edit 8,000 words with no loss of quality? It really can be done! See an example first hand (Netthandelen's case study PDF download).
When it comes to eCommerce and selling hundreds of products online, the words to be translated are counted in billions, not thousands. Without MT, traditional localization budgets would become more and more expensive, so MT is really the only practical solution. But if MT is considered a way to save money by cutting corners, it is doomed to fail from the outset.
It will fail because it is not sustainable: the effort and cost required to fix bad-quality MT output are too great, and if fixing is neglected and the content published as is, the result will be angry customers who shop elsewhere – and they will, as the choice available now is greater than ever before!
- Generic free MT will not generate the same quality as customised MT
- Investing in a robust MT strategy will save time, costs and headaches in the long run
- Keep the focus on communicating with customers in their language, and your eCommerce business will thrive
Email email@example.com if you have questions or want to learn more about how Machine Translation works for product descriptions.
We have entered a new age, and a new technology has come into play: Machine Translation (MT). It's globally accepted that MT systems dramatically increase productivity, but it's a hard struggle to integrate this technology into your production process. Apart from handling the engine building and optimizing procedures, you have to transform your traditional workflow:
The traditional roles of the linguists (translators, editors, reviewers etc.) are reconstructed and converged to find a suitable place in this new, innovative workflow. The emerging role is called 'post-editing' and the linguists assigned to this role are called 'post-editors'. You may want to recruit some willing linguists for this role, or persuade your staff to adopt a different point of view. Whatever the case may be, some training sessions are a must.
What is covered in training sessions?
1. Basic concepts of MT systems
Post-editors should have a notion of the dynamics of MT systems. It is important to focus on the system that is utilized (RBMT/SMT/Hybrid). For the widely used SMT systems, they need to know:
- how the systems behave
- the functions of the Translation Model and Language Model*
- input (given set of data) and output (raw MT output) relationship
- what changes in different domains
* It is not essential to give detailed information about these topics, but touching on them will help determine the level of technical background among candidates. Some candidates may be included in the testing team.
2. The characteristics of raw MT output
Post-editors should know the factors affecting MT output. The difference between working with fuzzy-match TM systems and with SMT systems also has to be covered during a proper training session. Let's outline what should be conveyed:
- The MT process is not the 'T' of the TEP (Translate-Edit-Proof) workflow, and raw MT output is not the target text expected as the output of the 'T' step.
- In the earlier stages of SMT engines, output quality varies depending on the project's dynamics, and errors are not identical. As the system improves, the quality level becomes more even and consistent within the same domain.
- There may be some word or phrase gaps in the system's pattern mappings. (Detecting these gaps is one of the main responsibilities of the testing team, but a successful post-editor must be informed about the possible gaps.)
3. Quality issues
This topic has two aspects: defining the required target (end-product) quality, and evaluating and estimating output quality. The first gives you the final destination; the second tells you where you are.
The required quality level is defined according to the project requirements, but it mostly depends on the target audience and the intended use of the target text. This seems similar to the procedure in the TEP workflow; however, it is slightly different, as the engine improvement plan should also be considered while defining the target quality level. Basically, this parameter is classified into two groups: publishable and understandable quality.
The evaluation and estimation aspect is a little more complicated. The most challenging factor is standardizing measurement metrics. Besides, the tools and systems used to evaluate and estimate the quality level have some more complex features. If you successfully establish your quality system, adversities become easier to cope with.
It is the post-editors' duty to apprehend the dynamics of MT quality evaluation, and the distinction between MT and HT quality evaluation procedures. They are thus expected to be aware of the likely error patterns. Error categorization will be easier to apply with well-trained staff (QE staff and post-editors).
4. Post-editing Technique
The fourth and last topic is the key to success. It covers the appropriate method and principles, as well as the perspective post-editors usually acquire. The post-editing technique is formed using the materials prepared for the previous topics and the data obtained from the above-mentioned procedures, and it is defined separately for almost every individual customized engine.
The core rule for this topic is that the post-editing technique, as a concept, must be clearly differentiated from traditional editing and/or review techniques. Post-editors should be capable of:
- reading and analyzing the source text, raw MT output and categorized and/or annotated errors as a whole.
- making changes where necessary.
- considering the post-edited data as part of the data set to be used in engine improvement, and performing their work accordingly.
- applying the rules defined for the quality expectation levels.
As briefly described in topic #3, the distance between the measured output quality and the required target quality may be seen as the post-edit distance. It roughly defines the post-editor's tolerance and the extent of the work to be performed. Another criterion allowing us to define the technique and the performance is the target quality group: if the target text is expected to be of publishable quality, it is called a full post-edit; otherwise, a light post-edit. Light and full post-editing can be briefly defined as above, but the distinction is not always so clear. Besides, the concepts of under- and over-editing are closely related to the issues mentioned above. You may want to include more detail about these concepts in the post-editor training sessions; enriching the training materials with some examples would be a great idea!
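One concrete, if simplified, way to quantify the post-edit distance described above (an illustrative assumption, not a metric the author prescribes) is word-level Levenshtein distance between the raw MT output and the post-edited text: the fewer edits needed, the lighter the post-edit.

```python
def edit_distance(mt_tokens, pe_tokens):
    """Word-level Levenshtein distance: the minimum number of insertions,
    deletions and substitutions that turn raw MT output into the
    post-edited text."""
    m, n = len(mt_tokens), len(pe_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if mt_tokens[i - 1] == pe_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

raw = "the cat sat in mat".split()
edited = "the cat sat on the mat".split()
print(edit_distance(raw, edited))  # -> 2 (one substitution, one insertion)
```

A light post-edit would leave a small distance relative to segment length; a full post-edit typically produces a larger one.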
About Selçuk Özcan
Selçuk Özcan has more than 5 years’ experience in the language industry and is a co-founder of Transistent Language Automation Services. He holds degrees in Mechanical Engineering and Translation Studies and has a keen interest in linguistics, NLP, language automation procedures, agile management and technology integration. Selçuk is mainly responsible for building high quality production models including Quality Estimation and deploying the ‘train the trainers’ model. He also teaches Computer-aided Translation and Total Quality Management at the Istanbul Yeni Yuzyil University, Translation & Interpreting Department.
Read More about KantanMT’s Partnership with Transistent in the official News Release, or if you are interested in joining the KantanMT Partner Program, contact Louise (firstname.lastname@example.org) for more details on how to get involved.
Statistical Machine Translation (SMT) has many uses – from the translation of User Generated Content (UGC) to Technical Documents, to Manuals and Digital Content. While some use cases may only need a ‘gist’ translation without post-editing, others will need a light to full human post-edit, depending on the usage scenario and the funding available.
Post-editing is the process of 'fixing' Machine Translation output to bring it closer to a human translation standard. This, of course, is a very different process from carrying out a full human translation from scratch, and that's why it's important to give full training to staff who will carry out this task.
Training will make sure that post-editors fully understand what is expected of them when asked to complete one of the many post-editing tasks. Research (Vasconcellos, 1986a:145) suggests that post-editing is a honed skill which takes time to develop, so remember that your translators may need some time to reach their greatest post-editing productivity levels. KantanMT works with many companies who post-edit at a rate of over 7,000 words per day, compared to an average of 2,000 words per day for full human translation.
Types of Training: The Translation Automation User Society (TAUS) is now holding online training courses for post-editors.
Post-editing quality levels vary greatly and will depend largely on the client or end user. It's important to get an exact understanding of user expectations and manage these expectations throughout the project.
Typically, users of Machine Translation will ask for one of the following types of post-editing:
- Light post-editing
- Full post-editing
The following diagram gives a general outline of what is involved in both light and full post-editing. Remember, however, that the effort required to meet certain levels of quality will be determined by the output quality your engine is able to produce.
Generally, MT users carry out productivity tests before they begin a project. This determines the effectiveness of MT for the language pair in a particular domain, and the post-editors' ability to edit the output with a high level of productivity. Productivity tests will help you determine the potential return on investment of MT and the turnaround time for projects. It is also a good idea to carry out productivity tests periodically to understand how your MT engine is developing and improving. (Source: TAUS)
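As a back-of-the-envelope sketch of what such a productivity test feeds into (all figures here are illustrative, not from any real KantanMT project), the measured throughputs translate directly into turnaround estimates:

```python
# Hypothetical figures from a productivity test.
ht_words_per_day = 2000   # full human translation throughput
pe_words_per_day = 7000   # MT + post-editing throughput
project_words = 500_000   # total project word count

ht_days = project_words / ht_words_per_day     # 250 days
pe_days = project_words / pe_words_per_day     # ~71 days
speedup = pe_words_per_day / ht_words_per_day  # 3.5x

print(f"Human translation: {ht_days:.0f} days")
print(f"MT + post-editing: {pe_days:.0f} days ({speedup:.1f}x faster)")
```

Running the same calculation with each periodic test makes the engine's improvement (or stagnation) visible in business terms rather than metric scores alone.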
You might also develop a tailored approach to suit your company's needs; however, the above diagram offers some useful guidelines to start with. Please note that a well-trained MT engine can produce near-human translations, and a light touch-up might be all that is required. It's important to examine the quality of the output with post-editors before setting productivity goals and post-editing quality levels.
In recent years, post-editing skills have become much more of an asset and sometimes a requirement for translators working in the language industry. Machine Translation has grown considerably in popularity and the demand for post-editing services has grown in line with this. TechNavio predicted that the market for Machine Translation will grow at a compound annual growth rate (CAGR) of 18.05% until 2016, and the report attributes a large part of this rise to “the rapidly increasing content volume”.
While the task of post-editing is markedly different from human translation, the skill set needed is almost on par.
According to Johnson and Whitelock (1987), post-editors should be:
- be expert in the subject area, the text type and the contrastive language
- have a perfect command of the target language
It is also widely accepted that post-editors who have a favourable perception of Machine Translation perform better at post-editing tasks than those who do not look favourably on MT.
How to improve Machine Translation output quality
Pre-editing is the process of adjusting text before it is Machine Translated. This includes fixing spelling errors, formatting the document correctly and tagging text elements that must not be translated. Using a pre-processing tool like KantanMT's GENTRY can save a lot of time by automating the correction of repetitive errors throughout the source text.
More pre-editing Steps:
Writing Clear and Concise Sentences: Shorter, unambiguous segments (sentences) are processed much more effectively by MT engines. Also, when pre-editing or writing for MT, make sure that each sentence is grammatically complete (begins with a capital letter, has at least one main clause, and has ending punctuation).
Using the Active Voice: MT engines work impressively on text that is clear and unambiguous, that’s why using the active voice, which cuts out vagueness and ambiguity can result in much better MT output.
There are many pre-editing steps you can carry out to produce better MT output. Also, keep writing styles in mind when developing content for Machine Translation to cut the amount of pre-editing required. Get tips on writing for MT here.
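A minimal pre-editing pass along these lines might look like the following sketch (this is illustrative Python, not KantanMT's GENTRY tool; the placeholder pattern and function name are assumptions):

```python
import re

# Tokens that must not be translated, e.g. software placeholders.
# The pattern is a hypothetical example; real projects define their own.
PLACEHOLDER = re.compile(r"\{\d+\}|%s|%d")

def pre_edit(segment):
    """Normalise whitespace, record protected tokens, and flag segments
    that are not grammatically complete sentences (no ending punctuation)."""
    segment = re.sub(r"\s+", " ", segment).strip()  # collapse repeated spaces
    protected = PLACEHOLDER.findall(segment)        # tokens to tag as do-not-translate
    complete = segment[:1].isupper() and segment[-1] in ".!?"
    return segment, protected, complete

seg, tokens, ok = pre_edit("  Press  {0} to save %s  ")
print(seg)     # "Press {0} to save %s"
print(tokens)  # ["{0}", "%s"]
print(ok)      # False: the segment lacks ending punctuation
```

Flagged segments can then be fixed before translation, so the MT engine only ever sees clean, complete sentences.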
For more information about any of KantanMT’s post-editing automation tools, please contact: Gina Lawlor, Customer Relationship Manager (email@example.com).
Many of us involved with Machine Translation are familiar with the importance of using high-quality parallel data to build and customize good-quality MT engines. Building high-quality MT engines with sparse data is a challenge faced not only by Language Service Providers (LSPs), but by any company with limited bilingual resources. A more economical alternative to creating large quantities of high-quality bilingual data is adding monolingual data in the target language to an MT engine.
Statistical Machine Translation systems use algorithms to find the most probable translations, based on how often patterns occur in the training data, so it makes sense to use large volumes of bilingual training data. The best data to use for training MT engines is usually high quality bilingual data and glossaries, so it’s great if you have access to these language assets.
But what happens when access to high quality parallel data is limited?
Bilingual data is costly and time-consuming to produce in large volumes, so the smart option is to turn to more economical language assets, and monolingual data is one of them. MT output fluency improves dramatically by using monolingual data to train an engine, especially in cases where good-quality bilingual data is a sparse resource.
Many companies lack the necessary resources to develop their own high-quality in-domain parallel data. But monolingual data is readily available in large volumes across different domains. This target-language content can be found anywhere: websites, blogs, customer content and even company-specific documents created for internal use.
Companies with sparse parallel data can really leverage their available language assets with monolingual data to produce better quality engines, producing more fluent output. Even those with access to large volumes of bilingual data can still take advantage of using monolingual data to improve target language fluency.
Target-language monolingual data is introduced during the engine training process so the engine learns how to generate fluent output. The positive effects of including monolingual data in the training process have been proven both academically and commercially. In a study for TAUS, Natalia Korchagina confirmed that using monolingual data when training SMT engines considerably improved the BLEU score for a Russian-French translation system.
Natalia's study not only "proved the rule" that in-domain monolingual data improves engine quality; she also identified that out-of-domain monolingual data improves quality too, but to a lesser extent.
Monolingual data can be particularly useful for improving scores in morphologically rich languages like Czech, Finnish, German and Slovak, as these languages are often syntactically more complicated for Machine Translation.
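The mechanism behind these gains can be sketched with a toy bigram language model (drastically simplified; production SMT systems use far larger n-gram models): monolingual target-language text alone is enough to teach the system which word orders are fluent.

```python
from collections import Counter

# Invented target-language monolingual corpus in a financial domain.
monolingual = [
    "interest rates rose sharply",
    "the bank raised interest rates",
    "rates rose again",
]

bigrams, unigrams = Counter(), Counter()
for sentence in monolingual:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def fluency(sentence):
    """Product of add-one-smoothed bigram probabilities p(b | a)."""
    tokens = sentence.split()
    vocab = len(unigrams)
    score = 1.0
    for a, b in zip(tokens, tokens[1:]):
        score *= (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
    return score

# During decoding, the language model steers the engine toward the
# candidate word order that the monolingual data says is more fluent:
print(fluency("interest rates rose") > fluency("interest rose rates"))  # True
```

No bilingual data was needed for this component, which is why adding monolingual data is such an economical way to improve output fluency.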
Success with Monolingual Data…
KantanMT has had considerable success with clients using monolingual data to improve their engines' quality. An engine trained with sparse bilingual data (still greater than the amount of data in Korchagina's study) in the financial domain showed a significant improvement in its overall quality metrics when financial monolingual data was added:
- BLEU score showed approx. 40% improvement
- F-Measure score showed approx. 12% improvement
- TER (Translation Error Rate), where a lower score is better, saw a reduction of approx. 50%
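For readers unfamiliar with these metrics, here is a much-simplified sketch of how a BLEU-style score is computed (sentence-level, n-grams up to bigrams, a single reference; real evaluations like the one above use corpus-level BLEU with up to 4-grams):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(zip(*[tokens[i:] for i in range(n)]))

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c & r).values())          # clipped n-gram matches
        precisions.append(overlap / max(sum(c.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages too-short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(simple_bleu("a cat sat on the mat", "the cat sat on the mat") < 1.0)  # True
```

A 40% BLEU improvement therefore means the engine's output shares substantially more n-grams with the human reference translations than before.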
The support team at KantanMT showed the client how to use monolingual data to their advantage, getting the most out of their engine, and empowering the client to improve and control the accuracy and fluency of their engines.
How will this Benefit LSPs…
Online shopping by users of what can be considered ‘lower density languages’ or languages with limited bilingual resources is driving demand for multilingual website localization. Online shoppers prefer to make purchases in their own language, and more people are going online to shop as global internet capabilities improve. Companies with an online presence and limited language resources are turning to LSPs to produce this multilingual content.
Most LSPs with access to vast amounts of high quality parallel data can still take advantage of monolingual data to help improve target language fluency. But LSPs building and training MT engines for uncommon language pairs or any language pair with sparse bilingual data will benefit the most by using monolingual data.
To learn more about leveraging monolingual data to train your KantanMT engine, send the KantanMT team an email (firstname.lastname@example.org) and we can talk you through the process; alternatively, check out our whitepaper on improving MT engine quality, available from our resources page.
KantanMT recently announced the forthcoming release of KantanAnalytics™, a tool that provides segment level quality analysis for Machine Translation output. KantanMT has developed this new technology in partnership with the CNGL Centre for Global Intelligent Content, which is also based at Dublin City University.
KantanAnalytics measures the quality of the translations generated by KantanMT engines. The measurement provides a quality score for each segment translated through a KantanMT engine. This means that Language Service Providers (LSPs) will be able to:
- accurately identify segments that require the most post-editing effort
- accurately identify segments that match the client’s quality standards
- better predict project completion times
- offer more accurate pricing to their clients and set a price during the early stages of the project
- build secure commercial Machine Translation frameworks
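A hypothetical sketch of how an LSP might act on such segment-level scores (the score scale, thresholds, field names and segments below are invented for illustration, not KantanAnalytics' actual output format):

```python
# Invented segment-level quality scores on a 0-100 scale.
segments = [
    {"text": "Press OK to continue.",          "score": 92},
    {"text": "The widget frobnicates gladly.", "score": 41},
    {"text": "Save your changes first.",       "score": 78},
]

PUBLISH_THRESHOLD = 85   # assumed: meets the client's standard as-is
LIGHT_PE_THRESHOLD = 60  # assumed: needs only a light post-edit

def route(segment):
    """Direct each segment to the cheapest workflow its score allows."""
    if segment["score"] >= PUBLISH_THRESHOLD:
        return "publish"
    if segment["score"] >= LIGHT_PE_THRESHOLD:
        return "light post-edit"
    return "full post-edit"

for s in segments:
    print(f"{s['score']:3d}  {route(s):>15}  {s['text']}")
```

Pricing and turnaround estimates then follow from the counts in each bucket, which is how segment-level scoring supports the more accurate quotes listed above.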
KantanAnalytics is being rolled out to a sample of KantanMT members this month, July 2013. It will be made available to all members of the KantanMT platform in September 2013.
The CNGL Centre for Global Intelligent Content
CNGL was established in 2007 as a collaborative academia-industry research centre aiming to break new ground in digital intelligent content and to “revolutionise the global content value chain for enterprises, communities, and individuals” (CNGL, 2013).
CNGL says that it intends to "pioneer development of advanced content processing technologies for content creation, multilingual discovery, translation and localization, personalisation, and multimodal interaction across global markets". It adds that "these technologies will revolutionise the integration and unification of multilingual, multi-modal and multimedia content and interactions, and drive innovation across the global content value chain" (CNGL, 2013).
The body has received over €43 million in funding from Science Foundation Ireland (SFI) and key industry partners. Research for the KantanAnalytics project was co-funded by the SFI in association with Enterprise Ireland.
CNGL has researchers at Trinity College Dublin, University College Dublin, University of Limerick, and Dublin City University. These researchers produce the aforementioned technologies in association with industry partners. Aside from KantanMT, CNGL has also entered partnerships with Microsoft, Intel, and Symantec to name but a few.
KantanAnalytics is the latest milestone in the partnership between KantanMT and CNGL and it will help to redefine current Machine Translation business models.
Please feel free to comment on this post or any previous ones – we'd love to hear from you!
If you would like to find out more about KantanMT and KantanMT Analytics, visit KantanMT.com.
Featured Image Source: http://lovebeingretired.com/2011/04/21/a-lifetime-in-review/magnifying-glass-2/