Interview: Working on KantanMT – a Developer's Perspective

Eduardo Shanahan, CNGL

Eduardo Shanahan, a Senior Software Engineer at CNGL, spent time working on KantanMT during its early days. KantanMT asked Eduardo to talk about what it was like to work with Founder and Chief Architect Tony O’Dowd and the rest of the team developing the KantanMT product.

What was your initial impression, when you joined DLab in DCU?

This past year was a different kind of adventure. After more than two decades working with Microsoft products like Visual Studio, it was a big change to move to Dublin City University (DCU) and become part of the Design and Innovation Lab, or DLab as we call it. The work in DLab consists of transforming code written by researchers into industrial-quality products.

One of the first changes was to get a Mac and start deploying code on Linux, with no Visual Studio or even Mono. Instead, I worked mostly with Python and NodeJS, and piles of shell scripts. Linux and Python were not new to me, but they did take some adjusting to.

It was a completely new environment, a new experience, and a whole new area of work. Back then, my relationship with Artificial Intelligence (AI) was informal to say the least, and I wasn’t even aware that something like Statistical Machine Translation (SMT) existed.

How did you get involved with working on KantanMT?

Starting out, I was working on a variety of different projects simultaneously. A few months in, though, I started working full time with a couple of researchers, creating new functionality for Tony and his KantanMT product, which is based on the open source Moses technology. Moses uses aligned source and target texts from parallel corpora to train an SMT system. Once the system is trained, search algorithms are applied to find the most suitable translation matches. This translation model can be applied to any language pair.
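
For readers unfamiliar with SMT, the sketch below illustrates the noisy-channel scoring idea that phrase-based systems such as Moses are built on. The phrase table and language model here are toy stand-ins (a real system estimates these probabilities from millions of aligned sentence pairs), and none of this is Moses or KantanMT code.

    import math
    from itertools import product

    # Toy phrase table: source phrase -> [(target phrase, P(target | source))].
    # A real SMT system estimates these from aligned parallel corpora.
    PHRASE_TABLE = {
        "das haus": [("the house", 0.7), ("the home", 0.3)],
        "ist klein": [("is small", 0.8), ("is little", 0.2)],
    }

    # Toy bigram language model over target words: P(word | previous word).
    BIGRAM_LM = {
        ("<s>", "the"): 0.4, ("the", "house"): 0.3, ("the", "home"): 0.1,
        ("house", "is"): 0.2, ("home", "is"): 0.2,
        ("is", "small"): 0.3, ("is", "little"): 0.05,
    }

    def lm_logprob(words, floor=1e-4):
        """Log-probability of a word sequence under the toy bigram model."""
        score, prev = 0.0, "<s>"
        for word in words:
            score += math.log(BIGRAM_LM.get((prev, word), floor))
            prev = word
        return score

    def translate(source_phrases):
        """Search for the candidate maximising translation-model score
        plus language-model score (monotone decoding, no reordering)."""
        best, best_score = None, float("-inf")
        options = [PHRASE_TABLE[phrase] for phrase in source_phrases]
        for combo in product(*options):
            target = " ".join(phrase for phrase, _ in combo)
            score = (sum(math.log(p) for _, p in combo)
                     + lm_logprob(target.split()))
            if score > best_score:
                best, best_score = target, score
        return best

    print(translate(["das haus", "ist klein"]))  # -> "the house is small"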

What were your goals working on the KantanMT project?

Tony is doing a great job, deploying KantanMT on Amazon Web Services and creating a set of tools to streamline operations for end users. His request to CNGL was to provide more advanced insight into the translation quality produced by Moses.

To accomplish this, the task was mapped to two successive projects, with different researchers on each. The pace was very intense: we wanted state-of-the-art results that showed up in the applications. Sandipan Dandapat, Assistant Professor in the Department of Computer Science and Engineering, IIT Guwahati, and Aswarth Dara, Research Assistant at CNGL, DCU, worked on adding real value to the KantanMT product during those long weeks, while I rewrote their code time after time until it passed all the tests and then some. Our hard work paid off when KantanWatch™ and KantanAnalytics™ were born.

Each attempt to deliver was an experience in itself. Tony was quick to detect any inconsistencies and wanted to be extra sure he understood all the details and steps of the research and implementation.

In your opinion was the work a success?

The end result is something that has made me proud. The mix of being a scientist and having a real product to implement is a very good combination. The guys at DCU have done a great job on the product base, and DLab is a fantastic research and work environment. The no-nonsense attitude from Tony’s side created a very interesting situation, and it’s something we can really celebrate after a year of hard work.

The CNGL Centre for Global Intelligent Content

The CNGL Centre for Global Intelligent Content (Dublin City University, Ireland) is supported by Science Foundation Ireland. Through its academic-industry collaborative research, it has not only driven standards in content and localization service integration, but is also pioneering advancements in Machine Translation through the development of disruptive, cutting-edge processing technologies. These technologies are revolutionising global content value chains across a number of different industries.

The CNGL research centre draws its talent and expertise from a combined 150 researchers from Trinity College Dublin, Dublin City University, University College Dublin and University of Limerick. The centre also works closely with industry partners to produce disruptive technologies that will have a positive impact both socially and economically.

KantanMT allows users to build a customised translation engine with training data specific to their needs. KantanMT are continuing to offer a 14-day free trial to new members.

KantanMT and MemSource Cloud Connector

Tony O’Dowd, KantanMT’s Founder and Chief Architect

Cloud technology and web-based applications have made a significant impact on the localization industry, levelling the playing field between large and small Language Service Providers (LSPs). LSPs who leverage cloud technology can be more competitive. The ‘content explosion’ has also driven the need for on-demand translation services, and taking advantage of cloud technology is the most strategic option for translating large volumes of content securely and in real time.

David Canek, CEO of MemSource Technologies

Software integration plays an important role in achieving a centralised localization management structure. The MemSource Cloud connector developed to integrate with KantanMT will ensure greater control and productivity in localization and translation workflows.

To acknowledge the connector’s release, I caught up with David Canek, CEO of MemSource Technologies, and Tony O’Dowd, KantanMT’s Founder and Chief Architect, to get their thoughts on the impact of cloud software integration in the localization industry.

MemSource recently developed a connector to integrate KantanMT and MemSource Cloud, can you explain how the connector works and what this will mean for its users?

[David] Yes, we have developed a connector that lets all of our 10,000 users very easily select KantanMT as their preferred MT engine for their MemSource Cloud translation projects. The connector is part of our 3.8 release, available from 3 November 2013. The KantanMT integration supports all of our Machine Translation features, including our post-editing features, specifically the post-editing analysis.

[Tony] The team at MemSource have developed a straightforward mechanism to integrate Machine Translation services into their cloud platform. The MemSource community of LSPs and professional translators can easily select KantanMT as their preferred Machine Translation engine. Integration between both platforms using this new KantanMT connector will boost translation productivity, reduce project costs and improve project margins for the MemSource community.

This partnership is a great example of synergy between two related businesses within the translation industry. How do you think integration will create value for clients and the industry?

[David] Machine Translation has become an integral part of the human translation process and so we found it a logical step to integrate an innovative player in the Machine Translation scene, such as KantanMT.

[Tony] KantanMT combines the accuracy of traditional Translation Memory with the speed and cost advantages of Machine Translation in a single seamless platform. The current economic climate indicates the localization industry can be certain of only two things – that margin erosion and price compression will continue to put pressure on LSPs to operate with higher levels of efficiency while lowering overall costs.

MemSource and KantanMT customers will benefit from economies of scale when they integrate Machine Translation directly into their existing translation workflows. KantanMT scales effortlessly with business demands and growth, and KantanMT members will benefit from increased profitability as greater volumes of client data are processed. This helps LSPs achieve higher levels of operational efficiency while also delivering cost savings to their customers.

There is a lot of buzz around “moving to the cloud” in the tech world, particularly for translation and localization services. As a supplier of both cloud and server translation technology, have you noticed any preference for one over the other? Which do your clients prefer, and why?

[David] Our clients, just like us, prefer the cloud version of any technology, including MemSource technology. Therefore, we really focus on providing MemSource Cloud, and it is only a question of time before we discontinue offering MemSource as a server option.

[Tony] Progressive companies cannot ignore the financial and operational efficiencies the cloud delivers. The cloud helps organisations achieve economies of scale through reduced capital costs, which are often associated with the investment in and maintenance of a technology infrastructure. Combine this with new pricing models, such as lower monthly subscription fees replacing large upfront software license fees, and operating on the cloud ensures a competitive business. This is even more so in the localization industry, where the translation of ‘big data’ from the content explosion has increased the need for on-demand localization and translation services. The cloud’s multi-tenant architecture offers LSPs a flexible solution for efficiently managing large volumes of data.

In your opinion, what will the integration of these technologies mean for the future translation industry in the short and longer terms?

[David] Machine Translation has become mainstream technology and will soon have the same importance as Translation Memory in the localization industry. We have believed in this vision right from the start of developing MemSource. This is why we have pioneered the post-editing analysis and other features in MemSource that bring Machine Translation to the forefront and seamlessly integrate it with existing technologies such as Translation Memory.

[Tony] In the short term, the technology with the greatest impact on the translation industry will be the availability of high-speed, on-demand Machine Translation services. It will be used as a tool to boost translator productivity, reduce project costs and improve margins. Using the KantanMT connector, LSPs can integrate Machine Translation into their translation workflows quickly and easily, immediately offering improved services to their clients.

Over the longer term, like MemSource, KantanMT believes there will be a continuous push to blend Machine Translation and traditional Translation Memory systems into one seamless service. At KantanMT, we’ve made significant progress on this vision by fusing traditional Translation Memory with advanced Machine Translation in the KantanMT platform, and also through the recent development of predictive segment-quality estimation technology called KantanAnalytics™.


Thank you to both David and Tony, who took time out of their busy schedules to be interviewed.


There are still a couple of weeks left to take advantage of the KantanAnalytics™ feature. KantanAnalytics™ is available to ALL KantanMT members until 30th November. When the offer ends, it will become an Enterprise Plan-only feature.

For more information about the KantanMT Enterprise Plan, please contact Aidan (aidanc@kantanmt.com).

MT Quality Estimation – KantanAnalytics™


The newest addition to the KantanMT technology portfolio is KantanAnalytics™. KantanAnalytics, which has been co-developed with the CNGL Centre for Global Intelligent Content (Dublin City University, Ireland), assigns a quality estimation score to each automated translation generated by a KantanMT engine. Expressed as a percentage, this score predicts the rating a human translator would likely assign to the utility of the translation. KantanAnalytics helps Project Managers predict the cost and schedule of Machine Translation projects and creates new business model opportunities for the localization industry.

The commercialisation of Translation Memory technology in the early 1990s revolutionised the localization industry and led to increased productivity and translation performance. It also provided a new pricing model for the industry – one based on the type of translation memory match (referred to as a ‘fuzzy match’). This pricing structure, which was tied to the fuzzy-match score, became an industry standard and an invaluable tool Project Managers could use to provide an accurate cost analysis for translation projects. It was also used to predict the time needed to complete a project.

The use of KantanAnalytics technology means Project Managers can apply a similar pricing structure when calculating the cost of Machine Translation or Post-Edited Machine Translation (PEMT) projects. Currently, Project Managers and translators use fixed charges, such as hourly rates or flat per-word rates, for Machine Translation and PEMT. This method lacks precision and transparency, and it is not a sufficient cost calculation method to drive the wide-scale adoption of Machine Translation.

What this means for KantanMT Members

KantanMT Enterprise Members can use a two-pronged approach to measure Machine Translation quality. Using KantanWatch, BLEU, TER and F-Measure scores show the engine’s overall quality level during the training or development stage; KantanAnalytics is then used to analyse the quality of each segment generated by a KantanMT engine.
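
For a sense of what a score like BLEU actually computes, here is a simplified sentence-level version: clipped n-gram precisions combined with a brevity penalty, the textbook formulation with add-one smoothing. It is illustrative only, not KantanWatch’s implementation.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, reference, max_n=4):
        """Sentence-level BLEU: geometric mean of clipped n-gram precisions,
        scaled by a brevity penalty for short candidates."""
        cand, ref = candidate.split(), reference.split()
        log_precisions = []
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total = max(len(cand) - n + 1, 0)
            # add-one smoothing so one missing n-gram order doesn't zero the score
            log_precisions.append(math.log((overlap + 1) / (total + 1)))
        brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
        return brevity * math.exp(sum(log_precisions) / max_n)

    print(round(bleu("the house is small", "the house is quite small"), 3))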

Using KantanAnalytics reports, which are akin to ‘fuzzy-match’ reports, Project Managers can determine the number of segments and the quality of each, and then estimate how long a project will take to complete and what it should cost.

This quality estimation score is expressed as a percentage – the higher the score, the better the quality and, consequently, the less post-editing effort required.

KantanAnalytics can be quickly deployed by Project Managers, and Enterprise Members can implement a tiered pricing model on Machine Translation jobs similar to that used for Translation Memory jobs. This is an excellent fit within existing business models, fusing two important industry technologies: Machine Translation and Translation Memory.
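
To make the tiered model concrete, here is a sketch of how a quote might be built from a KantanAnalytics-style segment report. The QE bands and per-word rates below are entirely hypothetical, not KantanMT pricing; the point is the structure, which mirrors the fuzzy-match discount grids long used for Translation Memory.

    # Hypothetical rate bands: (minimum QE score %, price per word in EUR).
    RATE_BANDS = [
        (95, 0.02),  # near-publishable MT output: light review
        (85, 0.04),  # light post-editing
        (70, 0.07),  # heavier post-editing
        (0,  0.10),  # low-quality output: priced close to full translation
    ]

    def segment_rate(qe_score):
        """Map a segment's quality estimation score to a per-word rate."""
        for threshold, rate in RATE_BANDS:
            if qe_score >= threshold:
                return rate

    def project_quote(segments):
        """Price a job from (word_count, qe_score) pairs, much as a PM
        would price a fuzzy-match report."""
        return sum(words * segment_rate(score) for words, score in segments)

    job = [(12, 96), (30, 82), (8, 55)]  # three segments from a QE report
    print(f"EUR {project_quote(job):.2f}")  # 12*0.02 + 30*0.07 + 8*0.10 = 3.14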

KantanAnalytics creates the framework for more accurate and more efficient cost management and deployment of Machine Translation throughout the localization industry.

KantanAnalytics User Interface (UI)

KantanAnalytics Report

Here is a quick look at the new KantanAnalytics interface. The KantanAnalytics report can be viewed in the Project Dashboard on KantanMT or downloaded as a Microsoft Excel file. The report is generated by clicking the graph icon located in the job status column.

The report results are shown when the report is expanded; to expand the report, click on ‘summary’ or ‘file name’. The results are represented in three graphs at the top of the report. In the screenshot below, Total Recall technology shows that 76% of the file for translation generated matches of 85% or higher. The second graph shows that 24% of the document had matches of less than 85%. The third graph then shows the quality estimation scores in 10% increments. This data is also listed below the graphs in numerical form.

KantanAnalytics Dashboard

KantanAnalytics will be available to Enterprise Members of the KantanMT.com platform from 30th October. To sign up for the Enterprise Plan or to upgrade to this plan, please email KantanMT’s Success Coach, Kevin McCoy (kevinmcc@kantanmt.com).

A Truly Global Internet


The internet became truly multilingual yesterday, as the Internet Corporation for Assigned Names and Numbers (ICANN) announced the release of four new generic top-level domains (gTLDs). These gTLDs are internet domain names in language-specific scripts, and the four new suffixes represent some of the world’s most widely spoken languages. Their selection for release by ICANN was a strategic decision.

After Latin script, Chinese is the second most widely used script, with approximately 1,340 million users; Arabic holds the number three position with 380 million users; and Cyrillic is number five, used by approximately 250 million people. The four domain names released yesterday are:

  1. 游戏 (game) – Chinese

  2. شبكة (web) – Arabic

  3. Онлайн (online) – Cyrillic

  4. Сайт (site) – Cyrillic

The president of ICANN’s Generic Domains Division, Akram Atallah, indicated this was just the start of a “global society” coming together. The purpose of the New Generic Top-Level Domain Program is to create a “globally-inclusive Internet”, improving ecommerce and internet globalisation.

Ripples will be felt in the localization industry with increased demand for real-time translation of user-generated content (UGC). Translation technologies are constantly being developed, adapted to markets and fine-tuned. A leading example of this is the development of Machine Translation, and these improvements are best seen in the quality assessment (QA) of Machine Translation output.

Machine Translation quality has been subject to scrutiny for decades, but this too is changing. Commercial use of Machine Translation is growing, especially in certain industries. Computational capabilities and the availability of vast amounts of multilingual and monolingual training data have played a significant role in the adoption rate of Machine Translation in both the public and private sectors.

Next week, KantanMT will release a technology that addresses the challenge of Machine Translation quality estimation (QE). KantanAnalytics is a revolutionary product that carries out quality analysis at segment level.

Increased demand for real-time, high-quality translated content will be seen in the near future as internationalised domain names (IDNs) bring people and communities together. This is one of the first steps in expanding the current set of 22 English-language-dominated domain names with a further 1,400 new multilingual names.

IDNs are domain names registered in non-Latin, non-ASCII scripts, such as Chinese. IDNs are already available as second-level domains and as country code top-level domains (ccTLDs) tied to specific countries. For example, in Ireland a ccTLD ends in “.ie”. These are different from gTLDs, which belong to a core group of established domain names such as .com, .net and .org.

Watch out for the KantanAnalytics release next week. KantanMT are continuing to offer a 14-day free trial to new members. Click here >>

Quality Estimation Tool for MT: KantanAnalytics™

KantanMT recently announced the forthcoming release of KantanAnalytics™, a tool that provides segment-level quality analysis for Machine Translation output. KantanMT has developed this new technology in partnership with the CNGL Centre for Global Intelligent Content, which is also based at Dublin City University.

KantanAnalytics
KantanAnalytics measures the quality of the translations generated by KantanMT engines. The measurement provides a quality score for each segment translated through a KantanMT engine. This means that Language Service Providers (LSPs) will be able to:

  • accurately identify segments that require the most post-editing effort
  • accurately identify segments that match the client’s quality standards
  • better predict project completion times
  • offer more accurate pricing to their clients and set a price during the early stages of the project
  • build secure commercial Machine Translation frameworks

KantanAnalytics is being rolled out to a sample of KantanMT members this month, July 2013. It will be made available to all members of the KantanMT platform in September 2013.

CNGL

The CNGL Centre for Global Intelligent Content
CNGL was established in 2007 as a collaborative academia-industry research centre aiming to break new ground in digital intelligent content and to “revolutionise the global content value chain for enterprises, communities, and individuals” (CNGL, 2013).

CNGL says that it intends to “pioneer development of advanced content processing technologies for content creation, multilingual discovery, translation and localization, personalisation, and multimodal interaction across global markets”. It adds that “these technologies will revolutionise the integration and unification of multilingual, multi-modal and multimedia content and interactions, and drive innovation across the global content value chain” (CNGL, 2013).

The body has received over €43 million in funding from Science Foundation Ireland (SFI) and key industry partners. Research for the KantanAnalytics project was co-funded by the SFI in association with Enterprise Ireland.

CNGL has researchers at Trinity College Dublin, University College Dublin, University of Limerick, and Dublin City University. These researchers produce the aforementioned technologies in association with industry partners. Aside from KantanMT, CNGL has also entered partnerships with Microsoft, Intel, and Symantec to name but a few.

KantanAnalytics is the latest milestone in the partnership between KantanMT and CNGL and it will help to redefine current Machine Translation business models.

Please feel free to comment on this post or any previous ones. We’d love to hear from you!

If you would like to find out more about KantanMT and KantanAnalytics, visit KantanMT.com.

Featured Image Source: http://lovebeingretired.com/2011/04/21/a-lifetime-in-review/magnifying-glass-2/

How to Price Machine Translation Post-Editing

So far in this KantanMT blog series on Machine Translation post-editing, we have looked at automated post-editing, why it is becoming popular within the localization industry, how you can reduce your post-editing times, and the steps you can take to achieve both understandable or ‘fit for purpose’ and close-to-human post-editing standards. In this post we are going to focus on perhaps one of the most difficult issues in providing a post-editing service: pricing.

What’s the problem?
The problem, put simply, is that there is no set way for Language Service Providers (LSPs) to price post-editing projects for their clients. That’s because LSPs must contend with a range of variables in the post-editing process, each of which can affect the final cost. Lorena Guerra, writing in 2003, sums up one of the main issues: “Whereas Human Translation is mainly based on the unit ‘word’ as a cost base, in the case of post-editing, as outlined by Spalink et al., the cost base ‘word’ is much harder to justify”. LSPs cannot charge for post-editing a “word” when their post-editors may have just corrected a letter or perhaps even a broader stylistic problem. There are also other items to consider; here are just a few:

  • The time it takes to complete the post-editing process
  • The post-editing standards required by the client
  • The number of segments requiring higher post-editing quality compared to those requiring a lower post-editing standard
  • Varying segment lengths
  • The quality of the raw Machine Translation output
  • Varying degrees of post-editing effort required for different language pairs

LSPs and their clients must not only set a price, but also agree on how that price is reached. Establishing a pricing framework that considers all parties is imperative.

Pricing Machine Translation Post-Editing
So how can Language Service Providers develop appropriate frameworks for pricing Machine Translation post-editing? TAUS has recently published a public consultation entitled “Best Practice Guidelines for Pricing MT Post Editing”, which features guidelines to help solve this problem. Let’s take a look at the key points. Note: these TAUS guidelines are preliminary and subject to review while the public consultation is ongoing.

1. Things to Always Remember
TAUS says that no matter what kind of framework you use for pricing Machine Translation post-editing, there are certain things to always keep in mind.

Set a price up-front
Ensure that your framework can provide an estimation of the cost of post-editing a text at the outset; re-evaluate prices when you evaluate or roll out a new version of an engine.

Involve all parties
When building your pricing framework, include all parties involved in your Machine Translation process. This is to ensure that everyone agrees “that the pricing model reflects the effort involved”.

Take the content to be post-edited into account
Consider the variables outlined earlier in this post such as post-editing different language pairs and post-editing to various quality standards. All of these factors need to be assessed as part of your pricing framework.


2. Building a Pricing Model
TAUS recommends combining a number of approaches to build your pricing framework. These are Automated Quality Score (e.g. TER, BLEU, F-Measure), Human Assessment, and Productivity Assessment. TAUS adds that “Productivity Assessment should always be used” regardless of what approach is taken.

Automated quality scores
There are a number of automated measurement tools that can be used in combination; KantanMT currently deploys BLEU, TER, and F-Measure.
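
As an example of the simplest of these metrics, the snippet below computes a token-level F-measure: the harmonic mean of the precision and recall of the MT output’s words against a reference translation. This is the standard textbook definition, not KantanMT’s internal code.

    from collections import Counter

    def f_measure(candidate, reference):
        """Harmonic mean of word precision and recall against a reference."""
        cand = Counter(candidate.split())
        ref = Counter(reference.split())
        overlap = sum((cand & ref).values())      # words found in both
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())  # share of MT words that match
        recall = overlap / sum(ref.values())      # share of reference words found
        return 2 * precision * recall / (precision + recall)

    print(round(f_measure("the house is small", "the house is quite small"), 3))
    # -> 0.889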

Human assessment
This involves steps such as human post-editors checking both the quality of raw Machine Translation output and post-edited content.

Post-editing productivity assessment
TAUS defines this as “calculating the difference in speed between translating from scratch and post-editing Machine Translation output”. Speeds may change when you deploy a new engine, so each time a “new ‘production’ ready engine” is rolled out, make sure you perform new productivity assessments.
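
A back-of-the-envelope version of that calculation might look like the sketch below; the throughput figures are made-up sample measurements, not real benchmarks.

    def words_per_hour(word_count, minutes):
        return word_count / (minutes / 60)

    # Sample measurements: the same 2,000-word text translated two ways.
    scratch_speed = words_per_hour(2000, 480)    # from scratch: 8 hours
    postedit_speed = words_per_hour(2000, 300)   # post-editing MT: 5 hours

    gain = (postedit_speed - scratch_speed) / scratch_speed
    print(f"From scratch:  {scratch_speed:.0f} words/hour")
    print(f"Post-editing:  {postedit_speed:.0f} words/hour")
    print(f"Productivity gain: {gain:.0%}")      # 60% in this sample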

To find out more about developing a Machine Translation post-editing pricing framework, check out TAUS’s public consultation “Best Practice Guidelines for MT Post-Editing”. Note: the public consultation on these preliminary guidelines closes Tuesday July 30th 2013 and the official guidelines will be published on Tuesday August 6th 2013.

KantanAnalytics
This week, KantanMT announced the forthcoming release of KantanAnalytics™. This technology, which has been developed in partnership with the CNGL Centre for Global Intelligent Content, provides segment-level quality analysis for Machine Translation output.

By attaining a quality score for each segment of a Machine Translated document, post-editors can accurately identify segments that require the most post-editing time and those which already meet the client’s quality standards. This will help KantanMT members to calculate post-editing effort and price.

That brings us to the end of our blog series on Machine Translation post-editing. We hope you have enjoyed taking this “post-editing adventure” with us and can put the advice in this series to good use. Please feel free to comment on this post or any previous ones. We’d love to hear from you.

If you want to find out more about KantanMT and KantanAnalytics, visit KantanMT.com or mail info@kantanmt.com.