My TY Work Experience at KantanMT

St. Joseph's Navan

Amy Barter, a Transition Year student from St. Joseph's Mercy Navan, spent this week learning about Machine Translation while on her TY Work Experience, 14 – 18 November 2016. We were delighted to have her in the office and really appreciated all her great help. She even had some time to blog about her experience.

Sue’s Top Tips for Building MT Engines

I’m new to machine translation, and one of the things I’ve been doing at KantanMT is learning how to refine training data with a view to building stock engines.

Stock engines are the optional training data provided by KantanMT to improve the performance of your customized MT engine. In this post I’m going to describe the process of building an engine and refining the training data.

The building process on the platform is quite simple. From your dashboard on the website, select “My Client Profiles”, where you will find two profiles that have already been set up: a default profile and a sample profile. Both let you run translation jobs straight away.

To create your own customized profile, select ‘New’ at the top of the left-most column. This launches the Client Profile Wizard. Enter the name of your new engine; try to make this something meaningful, or use an easily recognizable standard for naming your profiles. This makes it easier to tell your profiles apart when you have more than one.

When you select ‘next’, you will be asked to specify the source and target languages from drop-down menus. The wizard lets you distinguish between different variants of the same language, for example Canadian English or US English. Let’s say we’re translating from Canadian English to Canadian French. If you’re not sure which variant you need, have a quick look at the training data, which will give you the language codes.

The next step gives you an option to select a stock engine from a drop-down menu. The stock engines are grouped according to their business area or domain.

You will see a summary of your choices; if you’re happy with them, select ‘create’. Your new engine will be shown in the list of your client profiles. However, while you have created your engine, you haven’t yet built it.

KantanMT Stock Engine Training data
Stock training data available for social and conversational domains on the KantanMT platform.

 

Building Your Engine

Selecting your profile from the list will make it the current active engine.  By selecting the Training Data tab you can upload any additional training data easily by using the drag and drop function. Then select the ‘Build’ option to begin building your engine.

It’s always a good idea to supply as much useful training data as possible. This ‘educates’ the engine in the way your organization typically translates text.

Once the build job has been submitted, you can monitor its progress in the ‘My Jobs’ page.

When the job is completed, the BuildAnalytics™ feature is created. This can be accessed by clicking on the database icon to the left of the profile name. BuildAnalytics will give you feedback on the strength of your engine using industry-standard scores, as well as details about your engine’s word count. The tabs across the page will give you access to more detail.

The summary tab lets you see the average BLEU, F-Measure and TER scores for the engine, and the pie charts show you a summary of the percentage scores for all segments. For more detail, select the respective tabs and use the data to investigate individual segments.
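To give a feel for what two of these scores measure, here is a minimal sketch in Python. It is not how KantanMT calculates its scores: `simple_ter` is a simplified TER that counts word-level edits per reference word and ignores the shift operation of full TER, and `unigram_f_measure` is a plain single-word F-Measure. The function names and the example sentences are made up for illustration.

```python
from collections import Counter


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Edits per reference word; lower is better. Shifts are not modelled."""
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / len(ref)


def unigram_f_measure(hypothesis, reference):
    """Harmonic mean of unigram precision and recall; higher is better."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"   # human translation
hypothesis = "the cat sat on a mat"    # engine output
print(simple_ter(hypothesis, reference))
print(unigram_f_measure(hypothesis, reference))
```

One wrong word out of six gives a low edit rate and a high F-Measure, which is the pattern you hope to see across most segments of a healthy engine.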

KantanMT BuildAnalytics Feature
KantanBuildAnalytics provides a granular analysis of your MT engine.

 

A Rejects Report is created for every file of Training Data uploaded. You can use this to determine why some of your data is not being used, and improve the uptake rate of your data.

Gap analysis gives you an effective way to improve your engine with relevant glossary or noise lists, which you can upload to future engine builds. By adding these terminology files in either TBX (TermBase eXchange) or XLSX (Microsoft Excel Spreadsheet) formats you will quickly improve the engine’s performance.

The Timeline tab shows you the evolution of your engine over its lifetime. This feature lets you compare the statistics with previous builds, and track all the data you have uploaded. On a couple of occasions, I used the archive feature to revert to a previous build when the engine building process was not going according to plan.

KantanMT Timeline
KantanMT Timeline lets you view your entire engine’s build history.

 

Improving Your Engine

A great way to improve your engine’s performance is to analyze the rejects report for the files with a higher rejection rate. Once you understand the reasons segments are rejected, you can begin to address them. For example, an error 104 is caused by a difference in placeholder counts. This can be something as simple as the source language using the % sign where the target language uses the word ‘percent’. In this case a preprocessor rule can be created to fix the problem.

KantanMT Rejects Report Error 104
A detailed rejects report shows you the errors in your MT engine.
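The placeholder mismatch described above can be sketched in a few lines of Python. This is only an illustration: the `PLACEHOLDER` pattern is my own guess at what might count as a placeholder, the function names are hypothetical, and the platform's actual check behind error 104 may work differently.

```python
import re

# Hypothetical placeholder pattern: %s / %d tokens, {0}-style tokens, or a bare % sign.
PLACEHOLDER = re.compile(r"%[sd]|\{\d+\}|%")


def placeholder_count(segment):
    """Count the placeholder-like tokens in one segment."""
    return len(PLACEHOLDER.findall(segment))


def has_placeholder_mismatch(source, target):
    """True when source and target contain a different number of placeholders."""
    return placeholder_count(source) != placeholder_count(target)


source = "Sales rose by 5% last year."
target = "Les ventes ont augmenté de 5 pour cent l'an dernier."
print(has_placeholder_mismatch(source, target))

# A preprocessing step that spells out the percent sign on the source side,
# so both sides end up with the same placeholder count.
normalized = re.sub(r"(\d)\s*%", r"\1 percent", source)
print(normalized)
print(has_placeholder_mismatch(normalized, target))
```

Spelling out the sign on the source side is just one way to bring the counts back in line; the right fix depends on how your organization wants the text translated.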

A PEX rule editor is accessed from the KantanMT drop-down menu. This lets you try out your preprocessor rules and see the effect that they have on the data. I would suggest directly copying and pasting from the rejects report to the test area and applying your PEX rule to ensure you’re precisely targeting the data concerned. You can get instant feedback using this tool.

Once you’re happy with the way the rules work on the rejected data, it’s useful to analyze the rest of the data to see what effect the rules will have. You want to avoid a situation where using a rule resolves 10 rejects, but creates 20 more. Once the rules are refined, copy them to the appropriate files (source.ppx, target.ppx) and upload them with the training data. Remember that the rules will run against the content in the order they are specified.
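Both points, rule order and checking for newly created rejects, can be tried out away from the platform with a rough sketch like the one below. The rules here are plain Python regular expressions in a list; this is not the PEX or .ppx file syntax, and the `percent_mismatch` check and sample segments are invented for the example.

```python
import re

# Hypothetical rules as (pattern, replacement) pairs, applied top to bottom.
# This is NOT the PEX file syntax, just an illustration of ordered rules.
RULES = [
    (r"%", " percent"),   # spell out the percent sign
    (r"\s{2,}", " "),     # then collapse any double spaces the first rule left
]


def apply_rules(segment, rules):
    """Run every rule against the segment in the order given."""
    for pattern, replacement in rules:
        segment = re.sub(pattern, replacement, segment)
    return segment


def percent_mismatch(source, target):
    """Stand-in reject check: the two sides disagree on the number of % signs."""
    return source.count("%") != target.count("%")


def count_rejects(pairs):
    return sum(1 for source, target in pairs if percent_mismatch(source, target))


# Order matters: the reversed list leaves a double space behind.
print(repr(apply_rules("Up 5 % today", RULES)))
print(repr(apply_rules("Up 5 % today", RULES[::-1])))

pairs = [
    ("Up 5% today", "En hausse de 5 pour cent"),  # rejected before the rule
    ("Up 7% today", "En hausse de 7 %"),          # accepted before the rule
    ("No numbers here", "Pas de chiffres ici"),
]
before = count_rejects(pairs)
after = count_rejects([(apply_rules(s, RULES), t) for s, t in pairs])
print(before, after)
```

In this toy data the rule fixes one segment and breaks another, so the reject count does not improve at all. That is exactly the kind of side effect worth catching before you upload the rules.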

When you rebuild the engine they will be incorporated, and hopefully improve the scores.

Sue’s 3 Tips for Successfully Building MT Engines

  1. Name your profiles clearly – When you are using a number of profiles simultaneously, knowing what each one is (language pair/domain) will make it much easier as you progress through the building process.
  2. Take advantage of BuildAnalytics – Use the insights and Gap analysis features to give you tips on improving your engine. Listening to these tips can really help speed up the engine refinement process.
  3. The PEX Rule Editor is your friend – Don’t be afraid to try out creating and using new PEX rules. If things go south, you can always go back to previous versions of your engine.

My internship at KantanMT.com really opened my eyes to the world of language services and machine translation. Before joining the team I knew nothing about MT or the mechanics behind building engines. This was a great experience, and being part of such a smoothly run development team was an added bonus that I will take with me when I return to ITB to finish my course.

About Sue McDermott

Sue is currently studying for a Diploma in Computer Science from ITB (Institute of Technology Blanchardstown). Sue joined KantanMT.com on a three-month internship. She has a degree in English Literature and a background in business systems, and has also been a full-time mum for the last 17 years.

Email info@kantanmt.com if you have any questions or want more information on the KantanMT platform.

Translation Technology Conferences and Events for 2014

2014 has arrived – and there is no better way to get the ball rolling than by planning what events to attend. Over the next twelve months there is a vast selection of conferences, unconferences, workshops, roundtables, webinars and other events planned around the world.

It was hard to narrow the list of everything going on, so KantanMT tried to focus on events that were related to Machine Translation and the Natural Language Processing (NLP) industry, localization, translation technologies and post-editing. Some of the events are more academic, while others are more business orientated.

Unconferences and Conferences…

We added some ‘unconferences’ to the list; these are the opposite of conferences. Unconferences are peer-to-peer interactions on topics chosen by participants at the beginning of a session, unlike more formal conferences. Because participants choose the topics, it is much easier to promote an open discussion, and unconferences are a good way for industry professionals to get together in an informal setting and share their own challenges and solutions.

Localization World, one of the biggest industry conferences, has had a great response from holding unconferences alongside its traditional conferences, and the Association of Language Companies (ALC) also endorses the value of unconferences. The next ALC unconference will be held in the early part of February.

Hopefully, this list will be a useful resource in deciding what events and conferences to visit during 2014. You may have registered for some of these events already; if not, now is the time to start filling in your calendar. If you know of a relevant conference or event we missed, please add it to the comment section at the bottom of this post.

2014 Listings

January

Jan 8, 2014 (17:00-18:00 CET)

Webinar: TAUS Translation Technology Showcase – XTRF and Kilgray’s memoQ

Tomasz Mróz, XTRF Operations Director will present usage scenarios on integrating XTRF technology into the translation workflows, TM integration and faster project turnaround times. István Lengyel, CEO of Kilgray will also be presenting on memoQ, a cloud-based translation technology platform for translation management.


Jan 9, 2014

Webinar:  TAUS Dynamic Quality Framework Users Call

The users call is a bi-monthly webinar where TAUS members discuss solutions for measuring Machine Translation quality. Some of the participants include Autodesk, CA Technologies, Cisco, Dell, Digital Linguistics, eBay, EMC and Google. To register for the webinar, members can email memberservices@taus.net


Jan 15, 2014

Webinar: The Convergence Era: Translation as A Utility (The Content Wrangler, TAUS)

This webinar, hosted by BrightTalk is a discussion by Jaap van der Meer (TAUS) and Scott Abel (The Content Wrangler) on how translation has become a necessary part of everyday life, the same way as electricity, water and the internet have become indispensable.


Jan 16, 2014

Meeting/Webinar: L20n: Next Generation Localization Framework for the Web, The International Multilingual Computing User Group (IMUG), San José, California USA

Zbigniew Braniecki, Software Engineer, Mozilla Corporation will speak about L20n, a new localization framework that isolates localization and enables translators to give naturally expressive translations for even the most complex user interfaces. Mozilla is investing in moving its products – Firefox, Firefox OS, and Firefox for Android – to this new architecture.


Jan 23, 2014

Unconference: Localization Unconference, Achievers Head office Toronto, Canada

This unconference is an all-day event starting at 09:30 and will cover internationalization and localization topics. It is organized by Jenny Reid, Localization Project Manager, BlackBerry; Oleksandr Pysaryuk, Localization Manager, Achievers; and Richard Sikes, Principal Consultant, Localization Flow Technologies.


Jan 30, 2014 (11:00 EST/17:00 CET)

Webinar: Integrating Your Content Platform, Globalization and Localization Association

Anders Holt, European Director and Robert Timms, Technical Director at translate plus will present a webinar on integrating content management platforms (CMS, DMS, PIM or e-procurement systems) into the translation workflow. They will discuss the integration methods available and how to get the best results and benefits from integration.


Jan 30-31, 2014

Conference: 2014 CRITT – WCRE Conference, Translation in transition: between cognition, computing and technology, Copenhagen Business School (CBS), Frederiksberg, Denmark

This academic conference presents research from the Centre for Research and Innovation in Translation and Translation Technology (CRITT). The program covers a variety of topics, including translation and cognitive processes, translation and translation theory, observations about Machine Translation, and translation and post-editing.


February

Feb 5, 2014 (17:00-18:00 CET)

Webinar: TAUS Translation Technology Showcase – Ontram and Across Language Server v6

Christian Weih, Chief Sales Officer from Across Systems presents a TMS platform that integrates all aspects of the translation workflow.


Feb 6-8, 2014

Unconference: ALC Unconference, (Association of Language Companies), Palm Beach Gardens, Florida USA

The Unconference is geared towards language company owners and senior members of staff who get together without any formal presentation structure for more intimate brainstorming and discussion sessions in a casual and relaxed environment.


Feb 6, 2014 (11:00 EST/17:00 CET)

Webinar: Maximizing Translation Efficiency: Best QA Practices for Large Multi-channel Publishing Projects

Jose Sermeno, Product Evangelist at MadCap Software and Peter Argondizzo, Translation and Localization PM at MadTranslations discuss QA best practices that will make projects more efficient.


Feb 24-26, 2014

Conference: ‘Localization in a Shifting Global Economy’ Localization World, Bangkok Thailand

The first of three Localization World conferences of 2014, Localization World is the leading conference for international business, translation and localization providing opportunities for networking and information exchange.


Feb 26-28, 2014

Conference, workshops:  ICC (Intelligent Content Conference) 2014, San José, California USA

ICC focuses on the creation and management of content in different languages on any device. Topics will include content strategy, content marketing, content engineering, structured content, ebooks, mobile, apps, adaptive content, automated translation, terminology management, big data and analytics.


Feb 27, 2014 (11:00 EST/17:00 CET)

Webinar: GALA Translation Project Management with memoQ Server Training session

Daniel Zielinski will explain how the memoQ server can be used for managing translation projects effectively. See the different types of projects and workflows supported, and learn how to set up, prepare, monitor and complete a translation project with the memoQ server.


Feb 27 – Mar 1, 2014

Conference: memoQfest Americas, Kilgray Translation Technologies, Los Angeles, California USA

This three day event is hosted by Kilgray Translation Technologies and is aimed at freelance language professionals, LSPs and corporate translation users. The conference gives an overview of translation technology and how it can be integrated into businesses.


March

Mar 3-6, 2014

Conference: WritersUA, the conference for Software User Assistance, Palm Springs, California USA

This conference is for those involved in creating user assistance content. There will be a variety of presentations focused on developing content strategies, key technologies and tools that are used to create well-designed interfaces, technical communications and support information.


Mar 5, 2014 (17:00-18:00 CET)

Webinar: TAUS Translation Technology Showcase – Safaba and KantanMT

The theme of this webinar is the application and influence of MT technologies on global business. Tony O’Dowd, Founder and Chief Architect, presents the KantanMT.com cloud-based platform, introducing some of the KantanMT technologies and usage cases, including KantanWatch, KantanISR, KantanAnalytics, TotalRecall, PEX and GENTRY.

Udi Hershkovich, Vice President of Business Development at Safaba will discuss key business imperatives for businesses and how Enterprise MT removes the language barriers that face global businesses.


Mar 13-14, 2014

Conference: International Conference on Translation and Accessibility in Video Games and Virtual Worlds at Universitat Autònoma de Barcelona, Spain

The conference is a meeting point for academics, professionals and students involved in the game localization industry. The conference aims to foster the interdisciplinary debate in these fields, combine them as academic areas of research and contribute to the development of best practices.


Mar 17-21, 2014

Conference: Game Localization Summit at GDC, IGDA Game Localization SIG, San Francisco, California USA

The Game Localization Summit at GDC is supported and organized by the IGDA Game Localization SIG. It is aimed at helping localization professionals, as well as the entire community of game developers and publishers, understand how to plan and execute game localization and culturalization as a part of the development cycle. There are other GDC conferences planned for Europe and China later in the year.


Mar 23-26, 2014

Conference: GALA 2014, Globalization and Localization Association (GALA), Istanbul, Turkey

The annual GALA conference brings together localization industry professionals for networking opportunities and peer-to-peer learning of the latest technologies and emerging trends in localization, language and translation technology.


Mar 28-29, 2014

Conference: The Translation and Localization Conference, Localize.pl, TexteM, KOMTE, Warsaw, Poland

This is an annual international event focusing on the latest technologies and localization industry trends. The conference is suited to LSPs and freelance translators, and covers technical communication and implications for the translation industry. Topics include big data vs. the translation industry; CAT tools, MT, cloud computing, project management and the human factor; and recruitment and training.


April

Apr 2, 2014 (17:00-18:00 CET)

Webinar: Translation Technology Showcase, TAUS – tauyou and Pangeanic

Diego Bartolome, CEO tauyou will discuss the ‘Big Data’ approach to SMT and the importance of clean data on output quality.


Apr 10-11, 2014

Event: TAUS Executive Forum, Oracle Japan, Tokyo, Japan

The executive forum consists of two days of meetings for buyers and providers of language services and technologies. It is an open exchange about language business innovation and translation technology with the theme ‘translation as a utility’. Topics to be covered include translation data, MT showcases, DQF evaluation, translation customer support and integration with CRM systems.


Apr 13-15, 2014

Conference: MadWorld 2014, MadCap Software, Inc., San Diego, California USA

Designed to cater for technical writers, documentation managers and content strategists. This is the top conference for technical communication and content strategy.


Apr 25, 2014

Conference: TCeurope Colloquium, Conseil des Rédacteurs Techniques, Aix-en-Provence, France

Conference themes include the essential core skills of a technical communicator, accessibility and usability, technical communication and social media, multi‐authoring and international teamwork, and training technical authors in the internet age.


Apr 26-30, 2014

Conference: EACL-2014, European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden

Available to all ACL members, the conference covers research in computational linguistics, psycholinguistics, speech, information retrieval, multimodal language processing and language issues in emerging domains such as bioinformatics and social media. Workshops and tutorials are held on Saturday and Sunday, April 26th-27th, while the main conference runs from Monday to Wednesday, April 28th-30th.


May

May 7, 2014 (17:00-18:00 CET)

Webinar: Translation Technology Showcase, TAUS – TaaS and Interverbum

TaaS and Interverbum present in this month’s Translation Technology Showcase by TAUS.


May 7-9, 2014

Conference: memoQfest International, Kilgray Translation Technologies, Budapest, Hungary

This conference aims to set up a forum where companies, LSPs and translators can discuss workflows and best practices that relate to memoQ or translation technology in general. Attendees will discuss industry trends, attend workshops and exchange information with translators, LSPs, and translation end users.


May 7-8, 2014

Workshop: Making the Multilingual Web Work, MultilingualWeb, Madrid, Spain

The workshop is supported by the LIDER project and aims to survey and share information about best practices and standards for promoting multilingualism on the web.


May 8-9, 2014

Conference: Intelligent Content – Life Sciences and Healthcare, the Rockley Group, the Content Wrangler, San Francisco, California USA

The event will showcase examples, standards, methods, strategies and tools needed to help pharmaceutical companies, medical device manufacturers, and healthcare firms deliver the right information, in the right language, on any device. Conference topics include mhealth, ehealth, digital health, personalized healthcare content and advanced translation technologies.


May 17-18, 2014

Conference: UTIC 2014, Ukrainian Translation Industry Conference, Kiev, Ukraine

Translators, managers, educators and software developers get together for networking opportunities and to discuss future industry trends.


May 18-21, 2014

Conference: Technical Communication Summit 2014, Society for Technical Communication, Phoenix, Arizona USA

The Technical Communication Summit is a source of learning for professional technical communicators giving training on the latest communication techniques, publishing technologies and business trends in the industry.


May 18-21, 2014

Conference: ALC 2014 Annual Conference, Association of Language Companies, Palm Springs, California USA

This conference is a networking event for anyone doing business with LSPs, combining educational content and networking.


May 23, 2014

Roundtable: TAUS Translation Automation Roundtable, TAUS, Moscow, Russia

Hosted by ABBYY Language Services, this roundtable is a meeting for buyers and providers of translation services. The participants will get a good insight into MT technology, customization, implementation requirements and business cases.


May 26-31, 2014

Conference: LREC 2014, the European Language Resource Association, Reykjavík, Iceland

LREC is focused on Language Resources (LRs) and Evaluation for Language Technologies (LT). The aim of LREC is to give an overview of LR and LTs, emerging trends and the exchange of information.


June

June 2-3, 2014

Event: TAUS Industry Leaders Forum 2014, Clontarf Castle Hotel, Dublin

The theme for this meeting is ‘convergence’ with industry leaders discussing best practices, possible common approaches and shared services to optimize translation efficiencies through a series of short presentations.


Jun 3-4, 2014

Workshop: Localization Project Management Certification – The Localization Institute, Clarion Hotel, Dublin, Ireland

As part of the LPM Certification Program, this two-day project management training workshop will be held alongside Localization World. There is an eight-week self-study part that must be completed before the workshop. It is open to Localization Project Managers with at least three years’ project management experience. Early bird and group registration discounts are available.


Jun 4-6, 2014

Conference: Localization World Dublin, Localization World Ltd., Dublin, Ireland

The second Localization World conference of 2014 will be held in Dublin with the theme of “disruptive innovation” and how this impacts the localization industry and the role of translators. Topics covered at the conference will include advanced localization management, global business, localization core competencies and technology.


Jun 5-6, 2014

Conference: UA Europe 2014, UA Europe, Kraków, Poland

In association with Writers UA, the UA Europe technical communication conference focuses on software user assistance and online Help, and provides information on the latest industry trends, technical developments, and best practice in software UA.


Jun 16-18, 2014

Conference: EAMT 2014, European Association for Machine Translation, Dubrovnik, Croatia – 17th Annual Conference of the European Association for Machine Translation

The conference is aimed at anyone interested in MT and translation-related tools and resources. Topics will include MT in multilingual public service (eGovernment etc.), MT for the web, MT embedded in other services, MT evaluation techniques and evaluation results, and more.


August

Aug 23-29, 2014

Conference: COLING 2014, International Committee for Computational Linguistics, Dublin, Ireland

The biennial COLING conference is one of the premier Natural Language Processing conferences in the world. The conference will include full papers, oral presentations, poster presentations, demonstrations, tutorials, and workshops on a variety of technical areas on natural language and computation.


September

Sep 25-26, 2014

Workshop: IATIS Regional Workshop, Translator and Interpreter Training, Serbia

This conference is aimed at promoting translator training, and will address training in areas such as field/domain specialization, technical skills (including pre-/post-editing of MT), revision skills and management skills (soft skills).


October

Oct 4-5, 2014

Conference: MedTranslate 2014, GxP Language Services, Freiburg im Breisgau, Germany


Oct 6-7, 2014

Workshop: Localization Project Management Certification, the Localization Institute, Seattle, Washington USA

As part of the LPM Certification Program, this two-day project management training workshop will be held alongside Localization World.


Oct 19, 2014

Unconference: Localization World Unconference, Seattle

The agenda will be set in the first session, followed by 3-4 break-out sessions on topics the group chooses together. Attendees can submit topics for consideration at VistaTEC’s booth from Wednesday, October 17th.


Oct 27-28, 2014

Conference: TAUS User Conference, TAUS, Vancouver, Canada

The TAUS Annual Conference 2014 will be co-located with the Localization World Conference taking place in the Convention Centre, Vancouver, BC, Canada.


Oct 29-31, 2014

Conference: Localization World Vancouver, Localization World Ltd., Vancouver, Canada

Localization World provides an opportunity for the exchange of information in the language and translation services and technologies market.


November

Nov 3-5, 2014

Conference: 38th Internationalization & Unicode Conference (IUC38), Object Management Group, Santa Clara, California USA

The conference is for internationalization experts, tools vendors, software implementers, and business and program managers who want to discuss the best methods for doing business in international markets. The conference will feature subject areas; cloud computing, upgrading to HTML5, integrating with social networking software, and implementing mobile apps.


Nov 5-8, 2014

Conference: 55th ATA Conference, American Translators Association, Sheraton Hotel Chicago, Illinois USA

A networking event for translators, project managers and industry professionals. The aim of the conference is to promote the professional development of translators and interpreters.


Nov 11-13, 2014

Conference:  tcworld – tekom, Stuttgart, Germany

The technical communication conference and trade fair examines different aspects of localization, internationalization and globalization. It is the largest technical communication, authoring and IT management conference in the world and participating companies offer industrial, software and services for technical communication.


December

Dec 8-12, 2014

Conference: IEEE GLOBECOM, Austin Texas USA

The conference, the second largest of the 38 IEEE communications societies, will focus on the latest advancements in broadband, wireless, multimedia, internet, image and voice communications.


Dec 15-18, 2014

Conference: IEEE CloudCom 2014, Nanyang Avenue, Singapore

CloudCom promotes cloud computing platforms. It is co-sponsored by the Institute of Electrical and Electronics Engineers (IEEE) and the Cloud Computing Association. The conference attracts researchers, developers, users, students and practitioners from the fields of big data, systems architecture, services research, virtualization, security and privacy and high performance computing.

KantanMT looks forward to meeting you at some of these conferences over the next year.

KantanMT – 2013 Year in Review

KantanMT had an exciting year as it transitioned from a publicly funded business idea into a commercial enterprise that was officially launched in June 2013. The KantanMT team are delighted to have surpassed expectations by developing and refining cutting-edge technologies that make Machine Translation easier to understand and use.

Here are some of the highlights for 2013, as KantanMT looks back on an exceptional year.

Strong Customer Focus…

The year started on a high note, with the opening of a second office in Galway, Ireland, and KantanMT kept the forward momentum going as the year progressed. The Galway office is focused on customer service, product education and Customer Relationship Management (CRM), and is home to Aidan Collins, User Engagement Manager, Kevin McCoy, Customer Relationship Manager and MT Success Coach, and Gina Lawlor, Customer Relationship co-ordinator.

KantanMT officially launched the KantanMT Statistical Machine Translation (SMT) platform as a commercial entity in June 2013. The platform was tested pre-launch by both industry and academic professionals, and was presented at the European OPTIMALE (Optimizing Professional Translator Training in a Multilingual Europe) workshop in Brussels. OPTIMALE is an academic network of 70 partners from 32 European countries, and the organization aims to promote professional translator training as the translation industry merges with the internet and translation automation.

The KantanMT Community…

The KantanMT members’ community now includes top-tier Language Service Providers (LSPs), multinationals and smaller organizations. In 2013, the community grew from 400 members in January to 3400 registered members in December. In response to this growth, KantanMT introduced two partner programs with the objective of improving the Machine Translation ecosystem.

These are the Developer Partner Program, which supports organizations interested in developing integrated technology solutions, and the Preferred Supplier of MT Program, dedicated to strengthening the use of MT technology in the global translation supply chain. KantanMT’s Preferred Suppliers of MT are:

KantanMT’s Progress…

To date, the most popular target languages on the KantanMT platform are French, Spanish and Brazilian Portuguese. Members have uploaded more than 67 billion training words and built approx. 7,000 customized KantanMT engines that translated more than 500 million words.

As usage of the platform increased, KantanMT focused on developing new technologies to improve the translation process, including a mobile application for iOS and Android that allows users to get access to their KantanMT engines on the go.

KantanMT’s Core Technologies from 2013…

KantanMT have been kept busy continuously developing and releasing new technologies to help clients build robust business models to integrate Machine Translation into existing workflows.

  • KantanAnalytics™ – segment level Quality Estimation (QE) analysis as a percentage ‘fuzzy match’ score on KantanMT translations, provides a straightforward method for costing and scheduling translation projects.
  • BuildAnalytics™ – QE feature designed to measure the suitability of the uploaded training data. The technology generates a segment level percentage score on a sample of the uploaded training data.
  • KantanWatch™ – makes monitoring the performance of KantanMT engines more transparent.
  • TotalRecall™ – combines TM and MT technology. TM matches with a ‘fuzzy match’ score of less than 85% are automatically put through the customized MT engine, giving users the benefits of both technologies.
  • KantanISR™ – Instant Segment Retraining technology that allows members near-instantaneous correction and retraining of their KantanMT engines.
  • PEX Rule Editor – an advanced pattern matching technology that allows members to correct repetitive errors, making for a smoother post-editing process by reducing post-editing effort, cost and time.
  • Kantan API – critical for the development of software connectors and smooth integration of KantanMT into existing translation workflows. The success of the MemoQ connector led to the development of subsequent connectors for MemSource and XTM.
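
To illustrate the routing idea behind TotalRecall™ described in the list above, here is a minimal Python sketch. It is not KantanMT's implementation: the function names, the dictionary-based translation memory and the use of difflib's similarity ratio as a 'fuzzy match' score are all assumptions made for illustration. Only the 85% threshold comes from the description above.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 85.0  # percentage threshold taken from the TotalRecall description


def fuzzy_score(segment, tm_source):
    """Return a 0-100 similarity score (an assumed stand-in for a real fuzzy match)."""
    return 100.0 * SequenceMatcher(None, segment.lower(), tm_source.lower()).ratio()


def route_segment(segment, translation_memory, machine_translate,
                  threshold=FUZZY_THRESHOLD):
    """Use the best TM match if it scores at or above the threshold, otherwise use MT.

    translation_memory: dict mapping source segments to stored translations.
    machine_translate: hypothetical callable standing in for a customized MT engine.
    """
    best_score, best_target = 0.0, None
    for tm_source, tm_target in translation_memory.items():
        score = fuzzy_score(segment, tm_source)
        if score > best_score:
            best_score, best_target = score, tm_target
    if best_target is not None and best_score >= threshold:
        return ("TM", best_target, best_score)
    # Matches below the threshold are put through the MT engine instead.
    return ("MT", machine_translate(segment), best_score)


if __name__ == "__main__":
    tm = {"Click the Save button.": "Cliquez sur le bouton Enregistrer."}
    fake_mt = lambda text: "[MT] " + text  # placeholder, not a real engine
    print(route_segment("Click the Save button", tm, fake_mt))
    print(route_segment("Restart the server now.", tm, fake_mt))
```

In this sketch the first segment is close enough to a stored segment to reuse the TM translation, while the second falls below the threshold and is handed to the placeholder MT function.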

KantanMT sourced and cleaned a range of bi-directional domain specific stock engines that consist of approx. six million words across legal, medical and financial domains and made them available to its members. KantanMT also developed support for Traditional and Simplified Chinese, Japanese, Thai and Croatian during 2013.

Recognition as Business Innovators…

KantanMT received awards for business innovation and entrepreneurship throughout the year. Founder and Chief Architect Tony O’Dowd was presented with the ICT Commercialization award in September.

In October, KantanMT was shortlisted for the PITCH start-up competition and participated in the ALPHA Program for start-ups at Dublin’s Web Summit, the largest tech conference in Europe. Earlier in the year KantanMT was also shortlisted for the Vodafone Start-up of the Year awards.

KantanMT were silver sponsors at the annual 2013 ASLIB Conference, which adopted the theme ‘Translating and the Computer’ and took place in London in November. In October, Tony O’Dowd presented at the TAUS Machine Translation Showcase at Localization World in Silicon Valley.

KantanMT have recently published a white paper introducing its cornerstone Quality Estimation technology, KantanAnalytics, and how this technology provides solutions to the biggest industry challenges facing widespread adoption of Machine Translation.

KantanAnalytics WhitePaper December 2013

For more information on how to introduce Machine Translation into your translation workflow contact Niamh Lacy (niamhl@kantanmt.com).

Crowdsourcing vs. Machine Translation

Crowdsourcing has become more popular with both organizations and companies since the concept’s introduction in 2006, and has been adopted by companies that use this new production model to improve their production capacity while keeping costs low. This web-based business model uses an open call format to reach a wide network of people willing to volunteer their services for free or for a limited reward, for any activity including translation. The application of translation crowdsourcing models has opened the door for increased demand for multilingual content.

Jeff Howe of Wired magazine defined crowdsourcing as:

“…the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call”.

Crowdsourcing costs equate to approx. 20% of the cost of a professional translation. Language Service Providers (LSPs) like Gengo and Moravia have realised the potential of crowdsourcing as part of a viable production model, which they are combining with professional translators and Machine Translation.

The crowdsourcing model is an effective method for translating the surge in User Generated Content (UGC). Erratic fluctuations in demand need a dynamic, flexible and scalable model. Crowdsourcing is definitely a feasible production model for translation services, but it still faces some considerable challenges.

Crowdsourcing Challenges

  • No specialist knowledge – crowdsourcing is difficult for technical texts that require specialised knowledge. It often involves breaking down a text to be translated into smaller sections to be sent to each volunteer. A volunteer may not be qualified in the domain area of expertise, and so ends up translating small sections of text, out of context, with limited subject knowledge, which leads to lower quality or mistranslations.
  • Quality – translation quality is difficult to manage, and is dependent on the type of translation. There have been some innovative suggestions for measuring quality, including evaluation metrics such as BLEU and Meteor, but these are costly and time consuming to implement and need a reference translation or ‘gold standard’ to benchmark against.
  • Security – crowd management can be a difficult task and the moderator must be able to vet participants and make sure that they follow the privacy rules associated with the platform. Sensitive information that requires translation should not be released to volunteers.
  • Emotional attachment – humans can become emotionally attached to their translations.
  • Terminology and writing style inconsistency – when the project is divided amongst a number of volunteers, the final version’s style needs to be edited and checked for inconsistencies.
  • Motivation – decisions on how to motivate volunteers and keep them motivated can be an ongoing challenge for moderators.
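
To make the reference requirement in the Quality point above concrete, here is a minimal Python sketch of a BLEU-style score for a single sentence. It is a simplified illustration, not a production metric: it uses one reference, whitespace tokenization and no smoothing, whereas BLEU is normally computed over a whole test set. The function names are assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matches == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(sentence_bleu("the cat sat on the mat", reference))
    print(sentence_bleu("the cat sat on a mat", reference))
```

Without the human reference translation there is nothing to compare the candidate against, which is why such metrics need a 'gold standard' before they can be used.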

Improvements in the quality of Machine Translation have had an influence on crowdsourcing popularity and the majority of MT post-editing and proofreading tasks fit into crowdsourcing models nicely. Content can be classified into ‘find-fix-verify’ phases and distributed easily among volunteers.

There are some advantages to be gained when pairing MT technology and collaborative crowdsourcing.

Combined MT/Crowdsourcing

Machine Translation will have a pivotal role to play within new translation models, which focus on translating large volumes of data in cost-effective and powerful production models. Merging both Machine Translation and crowdsourcing tasks will create not only fit-for-purpose, but also high quality translations.

  • Quality – as the overall quality of Machine Translation output improves, it is easier for crowdsourcing volunteers with less experience to generate better quality translations. This will in turn increase the demand for crowdsourcing models to be used within LSPs and organizations. MT quality metrics will also make post-editing tasks more straightforward and easier to delegate among volunteers based on their experience.
  • Training data word alignment and engine evaluations can be done through crowd computing, and parallel corpora created by volunteers can be used to train and/or retrain existing SMT engines.
  • Security – customized Machine Translation engines are more secure when dealing with sensitive product or client information. General or publicly available information is more suited to crowdsourcing.
  • Terminology and writing style consistency – writing style and terminology can be controlled and updated through a straightforward process when using MT. This avoids the idiosyncrasies of volunteer writing styles. There is no risk of translator bias when using Machine Translation.
  • Speed – Statistical Machine Translation (SMT) engines can process translations quickly and efficiently. When there is a need for a high volume of content to be translated within a short period of time it is better to use Machine Translation. Output is guaranteed within a designated time and crowdsourcing post-editing tasks speeds up the production process before final checks are carried out by experienced translators or post-editors.

Figure: Crowdsourcing and Machine Translation model. Use of crowdsourcing for software localization. Source: V. Muntes-Mulero and P. Paladini (CA Technologies), and M. Solé and J. Manzoor (Universitat Politècnica de Catalunya).

Last chance for a FREE TRIAL for KantanAnalytics™ for all members until November 30th 2013. KantanAnalytics will be available on the Enterprise Plan.

Interview: Working on KantanMT – a Developer’s Perspective

Eduardo Shanahan, CNGL

Eduardo Shanahan, a Senior Software Engineer at CNGL spent time working on KantanMT during its early days. KantanMT asked Eduardo to talk about what it was like to work with Founder and Chief Architect, Tony O’Dowd and the rest of the team developing the KantanMT product.

What was your initial impression, when you joined DLab in DCU?

This past year was a different kind of adventure. After more than two decades working with Microsoft products like Visual Studio, it was a big change moving to Dublin City University (DCU) to be part of the Design and Innovation Lab, or DLab as we call it. The work in DLab consists of transforming code written by researchers into industrial quality products.

One of the first changes was to get a Mac and start deploying code in Linux, with no Visual Studio or even Mono. Instead I worked mostly with Python and NodeJS, and piles of shell scripts. Linux and Python were not new to me, but they did take some adjusting to.

This was a completely new environment and a new experience, and I was working in a whole new area. Back then, my relationship with Artificial Intelligence (AI) was informal to say the least, and I wasn’t even aware that something like Statistical Machine Translation (SMT) existed.

How did you get involved with working on KantanMT?

Starting out, I was working on a variety of different projects simultaneously. A few months into it though, I started working full time with a couple of researchers creating new functionality for Tony and his KantanMT product, which is based on open source Moses technology. Moses uses the aligned source and target texts of parallel corpora to train an SMT system. Once the system is trained, search algorithms are applied to find the most suitable translation matches. This translation model can be applied to any language pair.

What were your goals working on the KantanMT project?

Tony is doing a great job, deploying it on Amazon Web Services and creating a set of tools to streamline the operations for end users. His request to CNGL was to provide more advanced insight into the translation quality produced by Moses.

To accomplish this, the task was mapped to two successive projects with different researchers on each project. The pace was very intense, as we wanted state of the art results that showed up in the applications. Sandipan Dandapat, Assistant Professor in the Department of Computer Science and Engineering, IIT Guwahati and Aswarth Dara, Research Assistant at CNGL, DCU worked on adding real value to the KantanMT product during those long weeks, while I was rewriting their code time after time until it passed all the tests and then some. Our hard work paid off when KantanWatch™ and KantanAnalytics™ were born.

Each attempt to deliver was an experience in itself. Tony was quick to detect any inconsistencies and wanted to be extra sure about understanding all the details and steps of the research and implementation.

In your opinion was the work a success?

The end result is something that has made me proud. The mix between being a scientist and having a real product to implement is a very good combination. The guys at DCU have done a great job on the product base and DLab is a fantastic research and work environment. The no nonsense attitude from Tony’s side created a very interesting situation, and it’s something that we can really celebrate after a year of hard work.

The CNGL Centre for Global Intelligent Content

The CNGL Centre for Global Intelligent Content (Dublin City University, Ireland) is supported by the Science Foundation Ireland. During its academic-industry collaborative research it has not only driven standards in content and localization service integration, but it is also pioneering advancements in Machine Translation through the development of disruptive and cutting edge processing technologies. These technologies are revolutionising global content value chains across a number of different industries.

The CNGL research centre draws its talent and expertise from a combined 150 researchers from Trinity College Dublin, Dublin City University, University College Dublin and University of Limerick. The centre also works closely with industry partners to produce disruptive technologies that will have a positive impact both socially and economically.

KantanMT allows users to build a customised translation engine with training data that will be specific to their needs. KantanMT are continuing to offer a 14 day free trial to new members.

Training Data

Building a KantanMT Engine: Training Data

When the decision is made to incorporate a KantanMT engine into a translation model, the next obvious and most difficult question to answer is what to use to train the engine? This is often followed by: what are the optimum training data requirements to yield a highly productive engine? And how will I curate my training data?

The engine’s target domain and objectives should be clearly mapped out ahead of the build. If the documents are for a specific client or domain then the relevant in-domain training data should be used to build the engine. This also ensures the best possible translation results.

KantanMT recommends a minimum of 2 million training words for each domain specific engine. Higher quantities of in-domain “unique words” will also improve the potential for building an “intelligent” engine.
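
As a rough way to check a data set against this guideline, the following Python sketch counts total and unique words in a list of training segments. It is only an illustration: whitespace tokenization and lowercasing are simplifying assumptions, the function name is hypothetical, and the way the KantanMT platform itself counts words may differ.

```python
MIN_TRAINING_WORDS = 2_000_000  # minimum recommended in the text above


def training_data_stats(segments):
    """Return total words, unique words, and whether the recommended minimum is met."""
    total_words = 0
    vocabulary = set()
    for segment in segments:
        words = segment.lower().split()  # naive whitespace tokenization (assumption)
        total_words += len(words)
        vocabulary.update(words)
    return {
        "total_words": total_words,
        "unique_words": len(vocabulary),
        "meets_minimum": total_words >= MIN_TRAINING_WORDS,
    }


if __name__ == "__main__":
    sample = ["The cat sat", "the dog sat down"]
    print(training_data_stats(sample))
```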

The quality of the engine is based on the language or translation assets used to build the engine. Studies by TAUS have shown quality is more important than quantity. Engines built with “intelligently selected training data” generated higher BLEU scores than engines built with more generic data. The studies also indicated that a proactive approach to customising or adapting the engine with translation assets led to better quality results.

Translation assets are the best source of suitable training data for building KantanMT engines. They include:

Stock Training Data: KantanMT stock engines are collections of highly cleansed bi-lingual training data sets. Quality is ensured as each data set shows the source corpora and approximate number of words used to create each stock engine. These can be added to client data to produce much larger and more powerful engines. There are over a hundred different stock engines to choose from, including industry specific sets such as IT, Legal, Medical and Finance. Find a list of KantanMT Stock engines here >>

Stock engines are a good starting point if you have limited TMX (Translation Memory Exchange) files in the required domain, or if you would simply like to build bigger KantanMT engines.

Translation Memory Files: This is the best source of high quality training data since both source and target texts are aligned. Translation Memories used for previous translations in a similar domain will also have been verified for quality. This means the engine’s quality will be representative of the Translation Memory quality. As the old expression in the translation industry goes, “garbage in, garbage out”: good quality Translation Memory files will yield a good quality Machine Translation engine. The TMX file format is the optimal format for use with KantanMT, although text files can also be used.
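
To show what “aligned” means in practice, here is a minimal Python sketch that reads source and target segment pairs from a small TMX document using the standard library. The sample content, the language codes and the function name are assumptions for illustration. The sketch says nothing about how KantanMT processes uploaded files, and it ignores inline markup beyond collecting the segment text.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en-CA" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-CA"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr-CA"><seg>Enregistrez le fichier.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-CA"><seg>Close the window.</seg></tuv>
      <tuv xml:lang="fr-CA"><seg>Fermez la fenêtre.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-CA"><seg>This unit has no target.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_aligned_segments(tmx_text, source_lang, target_lang):
    """Return (source, target) pairs for translation units that contain both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                by_lang[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Skip units that are missing either side of the pair.
        if by_lang.get(source_lang) and by_lang.get(target_lang):
            pairs.append((by_lang[source_lang], by_lang[target_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in read_aligned_segments(SAMPLE_TMX, "en-CA", "fr-CA"):
        print(source, "=>", target)
```

Each translation unit already pairs a source segment with its translation, which is why a TMX file can be used as training data without a separate alignment step.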

Monolingual Translated Text Files: Monolingual text files are used to create language models for a KantanMT engine. Language models are used for word and phrase selection and have a direct impact on the fluency and recall of KantanMT engines. Translated monolingual training data should be uploaded alongside bi-lingual training data when building KantanMT engines.
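
To give a feel for how a language model built from monolingual text can influence word and phrase selection, here is a minimal Python sketch of a bigram model with add-one smoothing. SMT toolkits such as Moses use n-gram language models of this general kind, but this toy version, its function names and its sample sentences are assumptions for illustration and do not describe KantanMT's internals.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left-hand contexts in target-language text."""
    context_counts, bigram_counts, vocabulary = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocabulary.update(tokens)
        context_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocabulary)


def log_probability(sentence, model):
    """Score a sentence; higher (less negative) means more fluent under the model."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for previous, word in zip(tokens, tokens[1:]):
        # Add-one smoothing so unseen bigrams get a small non-zero probability.
        numerator = bigram_counts[(previous, word)] + 1
        denominator = context_counts[previous] + vocab_size
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    monolingual = ["the file is saved", "the file is open", "save the file"]
    model = train_bigram_model(monolingual)
    for candidate in ["the file is saved", "file the saved is"]:
        print(candidate, log_probability(candidate, model))
```

Given two candidate outputs containing the same words, the model prefers the word order it has seen in the monolingual data, which is the sense in which language models affect fluency.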

Glossary Files: Terminology or glossary files can also be used as training material. Including a glossary improves terminology consistency and translation quality. Terminology files are uploaded with your ‘files to be translated’ and should be in TBX file format.

KantanISR™: Instant segment retraining technology allows users to input edited segments via the KantanISR editor. The segments then become training data and are stored in the KantanISR cache. The new segments are incorporated into the engine, avoiding the need to rebuild. As corrected data is included, the engine will improve in quality, becoming an even more powerful and productive KantanMT engine.

KantanISR Instant Segment Retrainer
KantanISR editor

Building your KantanMT engine can be a very rewarding process. While some time is needed to gather the best data for a domain specific engine, there are many ways to enhance your engine that require little effort.

For more information about preparing training data or engine re-training, please contact Kevin McCoy, KantanMT Success Coach.