Following the launch of KantanNeural™ engines as part of our KantanFleet™ repository of pre-built MT engines, we received a number of questions about the product. To address these, we asked Tony O’Dowd, CEO and Chief Architect of KantanMT.com, a few questions about the Neural Machine Translation engines on KantanMT, the features and benefits of these engines, and the impetus behind launching KantanNeural.
Have you ever wondered who the people behind KantanMT are?
We are thrilled to announce a series of posts in which we will put 5 questions to each of our team members. These questions will delve a little deeper into their thoughts on technology, language and personal interests!
We are delighted to introduce Laura Casanellas, who bravely accepted the challenge of going first.
Welcome to Part II of the Q&A blog on How Machine Translation Helps Improve Translation Productivity. In case you missed the first part of our post, here’s a link to quickly have a look at what was covered.
Tony O’Dowd, Chief Architect of KantanMT.com, and Louise Faherty, Technical Project Manager, presented a webinar where they showed how LSPs (as well as enterprises) can improve the translation productivity of the language team, manage post-editing effort estimations and easily schedule projects with powerful MT engines. For this section, we are joined by Brian Coyle, Chief Commercial Officer at KantanMT, who came on board in October 2015 to strengthen KantanMT’s strategic vision.
We have provided a link to the slides used during the webinar below, along with a transcript of the Q&A session.
Please note that the answers below are not recorded verbatim and minor edits have been made to make the text more accessible.
Question: We are a mid-sized LSP and we would like to know what benefits would we enjoy if we choose to work with KantanMT, over building our own systems from scratch? The latter would be cheaper, wouldn’t it?
Answer (Brian): Tony and Louise have mentioned a lot of the features available in KantanMT – indeed, the platform is very feature-rich and provides a great user experience. But beyond that, what really underpins KantanMT is its access to massive computing power, which is what Statistical Machine Translation requires in order to perform efficiently and quickly. KantanMT has a unique architecture that provides instant, on-demand access at scale.
As Louise Faherty mentioned, we are currently translating half a billion words per month and we have 760 servers deployed currently. So if you were trying to develop something yourself, it would be hard to reach this level of proficiency in your MT. Whilst no single LSP would probably need this total number of servers, to give you an idea of the cost involved, that kind of server deployment in a self-build environment would cost in the region of €25m.
We also offer 99.99% uptime with triple data-centre disaster recovery. It would be very difficult and costly to build this kind of performance yourself. And with this kind of performance at your clients’ disposal, you can offer customised MT for mission-critical web-based applications such as eCommerce sites.
Finally, a lot of planning, thought, development hours and research have gone into creating what we believe is the best user interface and platform for MT, with the best functionality set and extreme ease of integration in the marketplace. So it would be difficult to start on your own and build a system as robust and high quality as KantanMT.com.
Question: Could you also establish KantanNER rules to convert prices on an eCommerce website?
Answer (Louise Faherty): Yes, absolutely! With KantanNER, you can establish rules to convert prices and so on. The only limitation is that the exchange rate will, of course, fluctuate. One option would be to calculate that information dynamically; otherwise, you would be looking at a fixed equation to convert those prices.
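KantanNER rules are configured on the platform itself rather than written in Python, but the idea of a dynamic price-conversion rule can be sketched as follows. The function name and the hard-coded rate (standing in for a live exchange-rate feed) are illustrative assumptions, not part of the product:

```python
import re

# Illustrative sketch only: mimics a named-entity rule that rewrites
# euro prices in MT output into dollar equivalents.
EUR_TO_USD = 1.08  # assumption; in practice this would come from a live feed


def convert_prices(text: str, rate: float = EUR_TO_USD) -> str:
    """Replace euro prices like '€12.99' with converted dollar prices."""
    def repl(match: re.Match) -> str:
        amount = float(match.group(1))
        return f"${amount * rate:.2f}"

    return re.sub(r"€(\d+(?:\.\d+)?)", repl, text)


print(convert_prices("Now only €10.00!"))  # → Now only $10.80!
```

Passing the rate in as a parameter is what makes the dynamic option mentioned above possible: the same rule can be re-run with a fresh rate instead of baking a fixed equation into the engine.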
Question: My client does not want us to use MT because they have had a bad experience in the past with Bing Translate – what would convince them to use KantanMT? How will the output be different?
Answer (Tony O’Dowd): One of the things you have to recognise about the KantanMT platform is that you are using it to build customised machine translation engines. You are not creating generic engines (Bing Translate and Google Translate are generic engines). You are building customised engines trained on the previous translations and glossaries that your clients have provided, and you can also use some of our stock engines that are relevant to your client’s domain.
So when you combine all of that, you get an engine that mimics the translation style of your client. Instead of a generic translation engine, you are using an engine designed to mirror the terminology and stylistic requirements of your client. If you can achieve this through Machine Translation, there is far less need for Post-Editing, which is one of the main reasons translators move away from generic, broad-based systems and choose customised ones. Clients and LSPs have tested both generic systems and customisable engines, and found that cloud-based customisable MT adds value that is not available on free, non-customisable MT platforms.
End of Q&A session
The KantanMT Professional Services Team would once again like to thank you for all your questions during the webinar and for sending in your questions by email.
Have more burning questions? Or maybe you would like to see the brilliant platform translate in a live environment? No problem! Just send an email to email@example.com and we will take care of the rest.
Want to stay informed about our new webinars? You can bookmark this page, or even better – sign up for our newsletter and ensure that you never miss a post!
We had so many questions during the Q&A in our last webinar session, ‘How to Improve Translation Productivity’, by the KantanMT Professional Services Team that we decided to split the answers into two blog posts. So, if you don’t find your question answered here, check out our blog next week for the remaining answers.
The Internet today is experiencing what is generally referred to as a ‘content explosion’! In this fast-paced world, businesses have to strive harder and do more to stay ahead of the game – especially if they are a global business or have globalization aspirations. One fool-proof way a business can successfully go global is through effective localization. Yet the huge amount of content available online makes human translation of everything almost impossible. The only viable option in today’s competitive online environment is Machine Translation (MT).
On Wednesday 21st October, Tony O’Dowd, Chief Architect of KantanMT.com and Louise Faherty, Technical Project Manager at KantanMT presented a webinar where they showed how Language Service Providers (LSPs) (as well as enterprises) can improve the translation productivity of the team, manage post-editing effort and easily schedule projects with powerful MT engines. Here is a link to the recording of the webinar on YouTube along with a transcript of the Q&A session.
The answers below are not recorded verbatim and minor edits have been made to make the text more readable.
Question: Do you have clients doing Japanese to English MT? What are the results, and how did you get them? (i.e., do you pre-process the Japanese?)
Answer (Tony O’Dowd): English to Japanese Machine Translation (MT) has indeed always posed a challenge in the MT industry. So is it possible to build a high quality, high fidelity MT system for this language combination? Well, there have been quite a few developments recently to improve the prospect of building effective engines in this language combination. For example, one of the latest changes we made on the KantanMT platform is the use of new and improved reordering models, which make translation from English to Japanese and Japanese to English much smoother, so we deliver a higher quality output. In addition, higher quality training data sets are now available for this language pair than a couple of years ago, when I started building English to Japanese engines. Back then it was really challenging. It still requires some effort to build English to Japanese MT engines, but the fact that there’s more content available in these languages makes it slightly easier for us to build high-quality engines.
We are also developing example-based MT for these engines, and so far this is showing encouraging signs of improving quality for this language pair. However, we have not deployed this development on the platform yet.
KantanMT note: For more insights into how you can prepare high-quality training data, read these tips shared by Tony O’Dowd, and Selçuk Özcan, co-founder of Transistent Language Automation Services during the webinar ‘Tips for Preparing Training Data for High Quality MT.’
Question: Have you got a webinar recorded or scheduled, where we could see how the system works hands-on?
Answer (Tony O’Dowd): If you go on to the KantanMT website, we have video links on the product features pages. So you can actually watch an explanation video while you are looking at the component.
We work in a very visual environment, and we think videos are a great way of explaining how the platform works. And, if you go on to the website, on the bottom left corner of the page, you will find our YouTube channel, which contains videos on all sorts of topics, including how to build your first engine, how to translate your first document and how to improve the output of your engines.
If you click on the Resources menu on our site, you can access a number of tutorials that will talk you through the basics of Statistical Machine Translation Systems. In other words, explore the website and you should find what you need.
KantanMT note: Some other useful links for resources are listed below:
- The KantanMT blog is full of helpful tips, tricks, information and guides on using MT effectively
- You can access KantanMT company slides on our SlideShare page
- Read our client success stories, KantanMT Case Studies
- Find answers in our FAQs
- See specs of our products on our product sheets section
- Read our whitepapers and view past webinars KantanMT webinars
- Check out our help section for help on Getting Started, File Parsing, Post-Editing and Preprocessors
Question: Do you provide any Post-Editing recommendations or standards for standardising the PE process? You said translation productivity rose to 8k words per day – this is only PE, correct?
Answer (Tony O’Dowd): I will take the second question first! The 8,000 words per day is the Post-Editing (PE) rate, yes. It is not the raw translation rate. In Machine Translation, everything comes out pretranslated, so this number refers to the Post-Editing effort – the insertions, deletions and substitutions of words you need to make to get the content to publishable quality.
Louise Faherty: What we recommend to our clients when it comes to PE is that they should actually use the MT. A lot of translators who are new to MT will try to translate manually, which is a natural tendency, of course. But we advise our clients to start from the MT output and post-edit it rather than translating from scratch. The more you use MT and the more you Post-Edit, the better your engine will become.
Tony O’Dowd: I will add something to Louise Faherty’s comments there. The best example of PE recommendations I have come across is provided by a group called TAUS. They are at the forefront of educating the industry on how to develop proficiency in PE.
Question: What do ‘PPX’ and ‘PEX’ stand for (as abbreviations)?
Answer (Louise Faherty and Tony O’Dowd): PEX stands for Post-Editing Automation. PEX allows you to take the output of an MT engine and dynamically alter it. When would you need PEX? Suppose your engine is repeating the same error over and over again. In such cases you can write a PEX file (developed in the GENTRY programming language), which looks for patterns in the engine’s output and dynamically changes them.
For example, one of our French clients did not want to have a space preceding a colon mark in the output of their MT (because this was one of their typographical standards and repeated throughout the content). So we wrote a PEX rule that forced a stylistic change in the output of the engine. This enabled the client to reduce the number of Post-Edits substantially.
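Real PEX files are written in KantanMT’s GENTRY language, so the snippet below is only a Python analogue of the idea: a pattern-based rule applied to raw MT output, using the French colon example. The function name is ours, not the platform’s:

```python
import re

# Illustrative sketch of a PEX-style post-editing rule: French typography
# normally puts a space before a colon; this client wanted it removed,
# so the rule rewrites every occurrence in the raw MT output.
def apply_pex_rules(mt_output: str) -> str:
    """Strip any whitespace preceding a colon."""
    return re.sub(r"\s+:", ":", mt_output)


print(apply_pex_rules("Remarque : voir ci-dessous"))  # → Remarque: voir ci-dessous
```

Because the rule runs automatically on every segment, a repeated stylistic error is corrected once, in one place, instead of being post-edited by hand thousands of times.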
PPX stands for Preprocessor Automation. You can use PPX files to normalise or improve the training data. It is based on our GENTRY programming language, which is available to all our clients for free.
In short then, PPX is for your training data, while PEX is for the actual raw output of your engine.
For more questions and answers, stay tuned for the next part of this post!
If you are in the language service industry, you are undoubtedly on the lookout for ways in which you can improve the productivity of your team – more translated words in less time – that’s what drives your clients as well as you. Automated Machine Translation (MT) seems to be the logical step forward in today’s world of content explosion and tightening deadlines. However, for most Language Service Providers (LSPs), the challenge lies in the actual implementation of this sophisticated technology.
For this reason, it is important that no matter what translation management tools you use, it should be integrated with a powerful MT engine that is reliable, scalable, flexible, and can be trained and re-trained constantly for maximum efficiency and quick turnaround times.
In today’s fast-paced world of content explosion on the Internet, the need for translating this organically growing content with the help of machines has become inevitable. While post-editing the machine translated content will always be required, choosing the right MT features will ensure that translators do not spend countless frustrating hours on those edits.
In this Kantanwebinar, The KantanMT Professional Services Team, Tony O’Dowd and Louise Faherty (Quinn) will show how you can improve the translation productivity of your team, and manage effort estimations and project deadlines better with a powerful MT engine.
During this webinar you will learn:
- About translation challenges (co-ordinating and managing translation projects)
- About the necessity of Machine Translation to be competitive
- How KantanMT.com can be integrated with other Translation Management Systems
Machine Translation (MT) has experienced a surge in popularity. However, achieving the right level of quality output can be challenging, even for the most expert MT engineers.
In the webinar ‘Tips for Preparing Training Data for High Quality MT’, KantanMT’s Founder and Chief Architect, Tony O’Dowd and Selçuk Özcan, Co-founder of Transistent Language Automation Services discussed how to best prepare training data to build high quality Statistical Machine Translation (SMT) engines. Here are their answers from the Q&A session.
Reading time: 5 mins
When it comes to Machine Translation, we know that quantity does not always equal quality. In your opinion, how many words will it take to build a fully functional engine?
Tony O’Dowd: Great question! Based on the entire community of Kantan users today, we have more than 7600 engines on our system. Those engines range from very small all the way up to very large. The biggest engines, which are in the eCommerce domain, contain about a billion words each.
If we exclude all the billion-word MT engines, so they don’t distort the results, the average size of a KantanMT engine today is approximately 5 million source words.
For example, if you look at our clients in the automotive industry, they have engines in and around 5 million source words, which are producing very high quality MT output.
How long does it take to build an engine of that size?
TOD: Again, using KantanMT.com as an example, we can build an MT engine at approximately 4 million words per hour. Therefore, a 5 million-word engine takes roughly 75 to 90 minutes to build. Compared with other MT providers in the industry, this is insanely fast.
This speed is possible because of our AWS cloud infrastructure. At the moment, we have 480 servers running the system. With such fast build times, our clients can retrain their engines more frequently, giving them higher levels of productivity and higher levels of quality output than most other systems. Read a client use case where speed had a positive impact on MT quality for eCommerce product descriptions (Netthandelen/Milengo case study).
How long does it take to accumulate that many words?
TOD: Most of our clients are able to deliver those words themselves, but our clients who don’t have 5 million source words will normally upload what they have and select one of our stock engines to help them reach a higher word count.
When we look at building an engine for a client, we look at the number of source words, but the key number for us is the number of unique words in an engine. For instance, if I want to have a high quality German engine in a narrow domain it might consist of 5 million source words. More importantly, the unique word count in that engine is going to be close to a million or slightly more than a million unique words.
If I have a high unique word count, I know the engine is going to know how to translate German correctly. Therefore, we don’t look at one word count, we look at a number of different word counts to achieve a high quality engine.
Another factor to consider is the level of inflection in the language, which is an indicator of how many words are needed. To educate and train the system, we need more examples of usage of those inflected forms. Generally speaking, highly inflected languages require a lot more training data: to build an engine for Hungarian, a highly inflected language, you will need in excess of 2-3 times the average word count to get workable, high quality output.
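The two corpus statistics discussed above (total source word count versus unique word count) are easy to compute on any training corpus. A minimal sketch, with a toy two-line corpus standing in for real training data:

```python
from collections import Counter
from typing import Iterable

# Minimal sketch: compute total (source) word count and unique word
# count for a corpus supplied as an iterable of segment strings.
def word_counts(lines: Iterable[str]) -> tuple[int, int]:
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    total = sum(counts.values())  # source word count
    unique = len(counts)          # unique word count (vocabulary size)
    return total, unique


total, unique = word_counts(["the car starts", "the engine starts"])
print(total, unique)  # → 6 4
```

For a highly inflected language, the same surface content generates many more distinct forms, so the unique count grows much faster relative to the total – which is why such languages need the larger corpora described above.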
What kind of additional monolingual data do you have?
TOD: There are 3 areas where we can help in sourcing suitable, relevant and high-quality monolingual data.
- We have a library of training data stock engines on KantanMT.com, which all include monolingual data in a variety of domains (Medical, IT, Financial etc.).
- In addition to stock engines, most of our clients upload their own monolingual data as PDF, DOCX or plain text files, and we normalise that data. We have an automatic process in place to cleanse the data and convert it into a suitable format for machine translation/machine learning.
- We also offer a spider service, where clients give us a list of domain-related URLs from which we can collect monolingual data. For example, we recently built a medical engine for a client in the US, in Mexican Spanish, and we collected more than 150k medical terms from health service content, which provided a great boost to the quality and, more importantly, the fluency of the MT engine.
Selçuk Özcan: At Transistent, we collect data from open source projects and open source data. First, we define some filters to ensure that we have the relevant monolingual data from the open source tools, which also includes spidering techniques. We then create a total corpus with the monolingual data we collected, which is used for training the MT engine.
What is the difference between pre-normalisation and final normalisation?
SÖ: The normalisation process is related to the TMS (Translation Management System), CMS (Content Management System) and TM (Translation Memory) systems. Pre-normalisation is applied to the text extracted from your systems to assure that the job will be processed properly. Final normalisation is then applied to the MT output to ensure that content is successfully integrated into the systems.
Can pre-normalisation and final normalisation be applied to corpora from TMs?
SÖ: It is possible to implement normalisation rules to corpora from TM systems. You have to configure your rules depending on your TM tool. Each tool has its own identification and encoding features for tags, markups, non-translatable strings and attributes.
How many words is considered too many in a long segment?
TOD: As part of our data cleansing policy, any data uploaded to a Kantan engine goes through 12 phases of data cleansing. Only segments that pass those 12 phases are included in the engine training. That may seem like a very harsh regime, but it is in place for a very good reason.
At KantanMT, the 3 things we look for in training data are:
We make sure that all the data you upload is very clean from a structural and linguistic point of view before we include it in your engine. If the training data fails any of those 12 steps, it will be rejected. For example, one phase is to check for long segments. By default, any segments with more than 40 words are rejected. This can be changed depending on the language combination and domain, but the default is 40 words or 40 tokens.
SÖ: As Tony mentioned, it also depends on the language pair. Nevertheless, you may also want to define the threshold value according to the dynamics of your system, i.e. data, domain, required target quality and so on. We usually split segments at 40–45 words.
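The length check described above is one of the simpler cleansing phases and is easy to picture in code. This is a hedged sketch of the idea only (the function and the sample pairs are ours), not KantanMT’s actual 12-phase pipeline:

```python
# Illustrative sketch of one data-cleansing phase: reject segment pairs
# whose source or target side exceeds a token threshold. The default of
# 40 tokens matches the figure quoted above and can be overridden per
# language combination and domain.
MAX_TOKENS = 40


def keep_segment(source: str, target: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if both sides of the segment pass the length check."""
    return (len(source.split()) <= max_tokens
            and len(target.split()) <= max_tokens)


pairs = [("a short sentence", "ein kurzer Satz"),
         ("word " * 50, "Wort " * 50)]  # second pair is 50 tokens long
clean = [p for p in pairs if keep_segment(*p)]
print(len(clean))  # → 1
```

Only segments that survive every phase reach engine training, which is why a harsh default here protects output quality rather than harming it.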
How long does it take to normalise the data?
SÖ: The time frame for normalising data depends on a number of factors, including the language pair, the differences between the linguistic structures you are working on, how clean the data is and the source of the data. If you have lots of formulas or non-standard characters, it will take longer to normalise that data.
For Turkish it might take an average of 10-15 days to normalise an average of 10 million words. Of course, this depends on the size of the team involved and the volume of data to be processed.
TOD: The time required to normalise data is very much data driven. A rule of thumb in the Kantan Professional Services Team is that standard text consisting mostly of words (such as text from a book, online help or user interface text, where the predominant token is a word) is normalised very quickly, because there are no mixed tokens in the data set, only words.
However, if you have numerical data, scientific formulas or product specifications such as measurements with a lot of part numbers, there is a high diversity of individual tokens as opposed to simple words. This type of data takes a little longer to normalise, because you have to instruct the engine, which you can do using the GENTRY programming language and Named Entity Recognition (NER) software.
We have GENTRY and NER built into KantanMT.com, so we can educate the engine to recognise those tokens. This is important because if the engine doesn’t recognise the data, it can’t handle it during the translation phase.
The more diverse the tokens in your input, the longer the normalisation process; conversely, the less diverse the tokens, the quicker the data can be processed. If it’s just words, the system can handle it automatically.
We use this rule of thumb when working with clients to estimate how long it will take to build their engines, as we need to be able to give them some sense of a schedule around building an actual MT engine.
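The rule of thumb above can be approximated with a rough token-diversity measure: the share of tokens that are plain words, with everything else (part numbers, measurements, formulas) treated as a mixed token. The regex and threshold interpretation here are our own illustrative assumptions, not a KantanMT metric:

```python
import re

# Rough sketch: estimate how "word-like" a data set is. A low share of
# plain-word tokens suggests mixed tokens (part numbers, measurements)
# and therefore a slower normalisation pass.
WORD = re.compile(r"^[^\W\d_]+$")  # letters only, Unicode-aware


def word_token_share(text: str) -> float:
    tokens = text.split()
    if not tokens:
        return 0.0
    words = sum(1 for t in tokens if WORD.match(t))
    return words / len(tokens)


print(word_token_share("tighten bolt M8x1.25 to 25Nm"))  # → 0.6
```

A book chapter would score near 1.0 and normalise almost automatically; a parts catalogue would score much lower and need more hand-written token-recognition rules, which is exactly the scheduling signal the rule of thumb provides.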
What volume of words would you suggest for a good Turkish engine?
SÖ: It makes no sense to work on a Turkish MT system if you do not have at least a million words of bilingual data and 3 million words of monolingual data. Even in this case, you will have to work more on analyses, testing procedures and rule sets. Ideally, you will have approx. 10 million words of bilingual data. It’s the basic equation of SMT engine training: the more data you have, the higher the quality you achieve.
How long does it take to build an engine for Turkish?
SÖ: It depends on the language pair and the field of expertise or domain. Things may be harder if you are working on a language pair whose two languages have very different linguistic structures, such as English and Turkish. However, it’s not impossible to build a mature MT system for such a language pair; you will just need to spend more time on it. Another parameter that affects the time required to reach a mature MT system is the quality of the data to be utilised for the intended system. It is hard to give a specific time estimate without looking at the data, but in general it will probably take 2 to 6 months to reach the intended production system.
To know more about the KantanMT platform, contact us (firstname.lastname@example.org) now for a platform demonstration.
In today’s world, the path to profits comes from global expansion, and everyone in business wants profits. With the goal of increased profits in mind, it is logical that business professionals constantly keep an eye out for new ways to expand their customer base and increase their bottom line.
In many cases, the most effective way to reach new customers is to speak their language, and what better way to do this than to translate and localize product content into the languages spoken, understood and used by the target audience?
When content is static and only needs a one-off translation, then traditional translation workflows do the job just fine, but when the content is a continuous stream of product descriptions or online help/chat content, a real-time scalable translation solution is the only feasible solution.
Machine Translation (MT) is that real-time scalable solution and the key to opening up new markets, reaching new customers and increasing profits. It is a productivity tool in the content production workflow with the potential to boost a company’s economic performance. However, a word of caution: there are some criteria that should be carefully considered before jumping in and reaping the economic benefits of including MT in content production.
Join Tony O’Dowd, Founder and Chief Architect of KantanMT, and Alan Houser, Co-Founder and President of Group Wellesley, Inc., as they discuss the economic arguments in favour of including Machine Translation in content production workflows.
Webinar Date: Thursday, July 16th at 5PM IST (Dublin), 9AM US West Coast and 12PM US East Coast. The webinar will last approximately one hour, including a Q&A session.
During this webinar, you will learn:
- Potential uses of Machine Translation
- How MT can drive performance to increase economic value
- When and how to adopt an MT strategy