Welcome to our second post in the ‘5 Questions’ series, which will give you a deeper insight into the people at KantanMT.
Have you ever wondered who the people behind KantanMT are?
We are thrilled to announce a series of posts in which we will ask each of our team members 5 questions. These questions will delve a little deeper into their thoughts on technology, language and their personal interests!
We are delighted to introduce Laura Casanellas, who bravely accepted the challenge of going first.
Following the announcement of a direct collaboration between KantanLabs and the ADAPT Centre for Digital Content Technology, we got in touch with Professor Andy Way of the School of Computing at Dublin City University and the ADAPT Centre to ask him about innovations in the field of automated translation, as well as his thoughts on the engagement between KantanLabs and ADAPT.
Following KantanMT’s announcement of the roll-out of the much-anticipated KantanLQR™ platform to all its Partners worldwide, Louise Irwin from the Digital Marketing Team caught up with Louise Faherty, Project Manager, Professional Services at KantanMT, to talk about the features, benefits and impetus behind creating the tool.
KantanMT.com was used in the course ‘Machine Translation and Post-editing’, which was taught for the first time in the ‘Degree in Modern Languages Applied to Translation’ at UAH. English and Spanish were the main languages used during the course.
We caught up with Professor Cristina Toledo Báez, and in this post she describes her experience of using KantanMT during the course.
Welcome to Part II of the Q&A blog on How Machine Translation Helps Improve Translation Productivity. In case you missed the first part of our post, here’s a link to quickly have a look at what was covered.
Tony O’Dowd, Chief Architect of KantanMT.com, and Louise Faherty, Technical Project Manager, presented a webinar in which they showed how LSPs (as well as enterprises) can improve the translation productivity of their language teams, manage post-editing effort estimation and easily schedule projects with powerful MT engines. For this section, we are also joined by Brian Coyle, Chief Commercial Officer at KantanMT, who joined the team in October 2015 to strengthen KantanMT’s strategic vision.
We have provided a link to the slides used during the webinar below, along with a transcript of the Q&A session.
Please note that the answers below are not recorded verbatim and minor edits have been made to make the text more accessible.
Question: We are a mid-sized LSP, and we would like to know what benefits we would enjoy if we chose to work with KantanMT over building our own system from scratch. Wouldn’t the latter be cheaper?
Answer (Brian): Tony and Louise have mentioned a lot of features available in KantanMT – indeed, the platform is very feature-rich and provides a great user experience. But beyond that, what really underpins KantanMT is access to massive computing power, which is what Statistical Machine Translation requires in order to perform efficiently and quickly. KantanMT has a unique architecture that provides instant on-demand access at scale.
As Louise Faherty mentioned, we currently translate half a billion words per month and have 760 servers deployed. So if you were trying to develop something yourself, it would be hard to reach this level of proficiency in your MT. While no single LSP would be likely to need this total number of servers, to give you an idea of the cost involved, that kind of server deployment in a self-build environment would cost in the region of €25m.
We also offer 99.99% uptime with triple data-centre disaster recovery. It would be very difficult and costly to build this kind of performance yourself. Also, with this kind of performance at your clients’ disposal, you can offer customised MT for mission-critical web-based applications such as eCommerce sites.
Finally, a lot of planning, thought, development hours and research has gone into creating what we believe is the best user interface and platform for MT, with the best functionality set and extreme ease of integration in the marketplace. So it would be difficult to start on your own and build a system as robust and high quality as KantanMT.com.
Question: Could you also establish KantanNER rules to convert prices on an eCommerce website?
Answer (Louise Faherty): Yes, absolutely! With KantanNER, you can establish rules to convert prices and so on. The only limitation is that exchange rates will, of course, fluctuate. There are also options for calculating that information dynamically; otherwise, you would be looking at a fixed equation to convert those prices.
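KantanNER’s actual rule syntax is not shown in this post, but the idea of a price-conversion rule can be sketched in plain Python. This is purely an illustration, not KantanMT’s implementation; the fixed exchange rate is a placeholder for the fixed equation (or dynamic lookup) mentioned above.

```python
import re

# Hypothetical fixed exchange rate. In practice this value could be
# fetched dynamically from a rates service, since real rates fluctuate.
EUR_TO_USD = 1.10

def convert_prices(text: str) -> str:
    """Find euro prices like '€20.00' and rewrite them in dollars."""
    def repl(match: re.Match) -> str:
        amount = float(match.group(1))
        return f"${amount * EUR_TO_USD:.2f}"
    return re.sub(r"€(\d+(?:\.\d+)?)", repl, text)

print(convert_prices("Now only €20.00!"))  # → Now only $22.00!
```

A real named-entity rule would first recognise the price as a protected token, so the MT engine does not attempt to translate it, and only then apply the conversion.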
Question: My client does not want us to use MT because they have had a bad experience in the past with Bing Translate – what would convince them to use KantanMT? How will the output be different?
Answer (Tony O’Dowd): One of the things you have to recognise about the KantanMT platform is that you are using it to build customised machine translation engines. So you are not going to create generic engines (Bing Translate and Google Translate are generic engines). You would be building customised engines trained on the previous translations and glossaries that your clients have provided. You will also be using some of our stock engines that are relevant to your client’s domain.
So when you combine that, you get an engine that will mimic the translation style of your client. Indeed, instead of a generic translation engine, you are using an engine designed to mirror the terminology and stylistic requirements of your client. If you can achieve this through Machine Translation, you will see that there is far less need for post-editing, and this is one of the most important factors that drives translators away from generic, broad-based systems and towards customised ones. Clients and LSPs have tested both generic systems and customisable engines, and found that cloud-based customisable MT adds value that is not available on free, non-customisable MT platforms.
End of Q&A session
The KantanMT Professional Services Team would once again like to thank you for all your questions during the webinar and for sending in your questions by email.
Have more burning questions? Or maybe you would like to see the brilliant platform translate in a live environment? No problem! Just send an email to email@example.com and we will take care of the rest.
Want to stay informed about our new webinars? You can bookmark this page, or even better – sign up for our newsletter and ensure that you never miss a post!
Machine Translation (MT) has experienced a surge in popularity. However, achieving the right level of quality output can be challenging, even for the most expert MT engineers.
In the webinar ‘Tips for Preparing Training Data for High Quality MT’, KantanMT’s Founder and Chief Architect, Tony O’Dowd and Selçuk Özcan, Co-founder of Transistent Language Automation Services discussed how to best prepare training data to build high quality Statistical Machine Translation (SMT) engines. Here are their answers from the Q&A session.
Reading time: 5 mins
When it comes to Machine Translation, we know that quantity does not always equal quality. In your opinion, how many words will it take to build a fully functional engine?
Tony O’Dowd: Great question! Based on the entire community of Kantan users today, we have more than 7600 engines on our system. Those engines range from very small all the way up to very large. The biggest engines, which are in the eCommerce domain, contain about a billion words each.
If we exclude all the billion-word MT engines so they don’t distort the results, the average size of a KantanMT engine today is approximately 5 million source words.
For example, if you look at our clients in the automotive industry, they have engines of around 5 million source words, which are producing very high quality MT output.
How long does it take to build an engine of that size?
TOD: Again, using KantanMT.com as an example: we can build an MT engine at approximately 4 million words per hour, so a 5 million-word engine takes roughly 60 to 90 minutes to build. Compared with other MT providers in the industry, this is insanely fast.
This speed is possible because of our AWS cloud infrastructure. At the moment, we have 480 servers running the system. With such fast build times, our clients can retrain their engines more frequently, giving them higher levels of productivity and higher levels of quality output than most other systems. Read a client use case where speed had a positive impact on MT quality for eCommerce product descriptions (Netthandelen/Milengo case study).
How long does it take to accumulate that many words?
TOD: Most of our clients are able to deliver those words themselves, but our clients who don’t have 5 million source words will normally upload what they have and select one of our stock engines to help them reach a higher word count.
When we look at building an engine for a client, we look at the number of source words, but the key number for us is the number of unique words in an engine. For instance, if I want to have a high quality German engine in a narrow domain it might consist of 5 million source words. More importantly, the unique word count in that engine is going to be close to a million or slightly more than a million unique words.
If I have a high unique word count, I know the engine is going to know how to translate German correctly. Therefore, we don’t look at one word count; we look at a number of different word counts to achieve a high quality engine.
Another factor to consider is the level of inflection in the language, which is an indicator of how many words are needed. In order to educate and train the system, we need more examples of how those inflected forms are used. Generally speaking, highly inflected languages require a lot more training data, so to build a Hungarian engine, Hungarian being a highly inflected language, you will need in excess of 2-3 times the average word count to get workable, high quality output.
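The two counts discussed above, total source words and unique words (types), are easy to picture with a minimal sketch. This is only an illustration of the metric, not KantanMT’s tokenisation, and the toy corpus stands in for real training data:

```python
from collections import Counter

def corpus_stats(segments):
    """Return (total source words, unique words) for a list of segments."""
    counts = Counter()
    for seg in segments:
        counts.update(seg.lower().split())
    return sum(counts.values()), len(counts)

# Toy corpus standing in for real bilingual training data (source side).
corpus = [
    "Das Auto ist rot",
    "Das Auto ist neu",
]
total, unique = corpus_stats(corpus)
print(total, unique)  # 8 total words, 5 unique words
```

A highly inflected language inflates the unique count relative to the total, which is one way to see why more training data is needed to cover all those forms.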
What kind of additional monolingual data do you have?
TOD: There are 3 areas where we can help in sourcing suitable, relevant, good quality monolingual data.
Selçuk Özcan: At Transistent, we collect data from open source projects and open data sources. First, we define filters to ensure that we extract relevant monolingual data from the open source tools, which also involves spidering techniques. We then build a complete corpus from the monolingual data we have collected, which is used for training the MT engine.
What is the difference between pre-normalisation and final normalisation?
SÖ: The normalisation process is related to the TMS (Translation Management System), CMS (Content Management System) and TM (Translation Memory) systems. Pre-normalisation is applied to the text extracted from your systems to assure that the job will be processed properly. Final normalisation is then applied to the MT output to ensure that content is successfully integrated into the systems.
Can pre-normalisation and final normalisation be applied to corpora from TMs?
SÖ: It is possible to implement normalisation rules to corpora from TM systems. You have to configure your rules depending on your TM tool. Each tool has its own identification and encoding features for tags, markups, non-translatable strings and attributes.
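The pre-normalisation and final-normalisation steps described above can be pictured as a mask-and-restore round trip around the MT engine. The sketch below is a simplified illustration, not any particular TM tool’s behaviour: it assumes hypothetical `<b>...</b>`-style inline tags, whereas each real TM system has its own tag and markup conventions, as noted above.

```python
import re

def pre_normalise(segment):
    """Replace inline tags (here, a hypothetical <b>...</b> style) with
    neutral placeholders so the MT engine sees clean, translatable text."""
    tags = re.findall(r"</?\w+>", segment)
    masked = segment
    for i, tag in enumerate(tags):
        masked = masked.replace(tag, f"__TAG{i}__", 1)
    return masked, tags

def final_normalise(mt_output, tags):
    """Restore the original tags in the translated output so the content
    integrates cleanly back into the TMS/CMS."""
    for i, tag in enumerate(tags):
        mt_output = mt_output.replace(f"__TAG{i}__", tag)
    return mt_output

masked, tags = pre_normalise("Click <b>Save</b> now")
print(masked)  # Click __TAG0__Save__TAG1__ now
print(final_normalise(masked, tags))  # Click <b>Save</b> now
```

In a production pipeline the same idea extends to non-translatable strings, attributes and encoded entities, with rules configured per tool.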
How many words are considered too many in a long segment?
TOD: As part of our data cleansing policy, any data uploaded to a Kantan engine goes through 12 phases of data cleansing. Only segments that pass those 12 phases are included in the engine training. That may seem like a very harsh regime, but it is in place for a very good reason.
At KantanMT, the 3 things we look for in training data are:
We make sure that all the data you upload is very clean from a structural and linguistic point of view before we include it in your engine. If the training data fails any of those 12 steps, it will be rejected. For example, one phase is to check for long segments. By default, any segments with more than 40 words are rejected. This can be changed depending on the language combination and domain, but the default is 40 words or 40 tokens.
SÖ: As Tony mentioned, it also depends on the language pair. Nevertheless, you may also want to define the threshold value according to the dynamics of your system, i.e. data, domain, required target quality and so on. We usually split segments at 40-45 words.
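The long-segment check described above can be illustrated with a minimal filter. This is a sketch of the idea only, not KantanMT’s 12-phase cleansing pipeline, and the 40-token default is taken from the discussion above:

```python
MAX_TOKENS = 40  # default threshold; tunable per language pair and domain

def filter_long_segments(pairs, max_tokens=MAX_TOKENS):
    """Keep only bilingual segment pairs where both sides are within the
    token limit; everything else is rejected from engine training."""
    kept, rejected = [], []
    for src, tgt in pairs:
        if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
            kept.append((src, tgt))
        else:
            rejected.append((src, tgt))
    return kept, rejected

pairs = [
    ("short sentence", "kurzer Satz"),
    ("word " * 50, "Wort " * 50),  # 50 tokens per side, over the limit
]
kept, rejected = filter_long_segments(pairs)
print(len(kept), len(rejected))  # 1 1
```

A real pipeline would apply this alongside the other cleansing phases (encoding checks, tag validation, length-ratio checks and so on) before any segment reaches training.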
How long does it take to normalise the data?
SÖ: The time frame for normalising data depends on a number of factors, including the language pair, the differences between the linguistic structures you are working with, how clean the data is and the source of the data. If you have lots of formulas or non-standard characters, it will take longer to normalise the data.
For Turkish, it might take 10-15 days on average to normalise 10 million words. Of course, this depends on the size of the team involved and the volume of data to be processed.
TOD: The time required to normalise data is very much data driven. A rule of thumb in the Kantan Professional Services Team is: standard text consisting mostly of words, such as text from a book, online help or user interface text, where the predominant token is a word, is normalised very quickly, because there are no mixed tokens in the data set, only words.
However, if you have numerical data, scientific formulas or product specifications, such as measurements with a lot of part numbers, there is a high diversity of individual tokens as opposed to simple words. This type of data takes a little longer to normalise, because you have to instruct the engine, which you can do using the GENTRY programming language and Named Entity Recognition (NER) software.
We have GENTRY and NER built into KantanMT.com, so we can educate the engine to recognise those tokens. This is important because if the engine doesn’t recognise the data, it can’t handle it during the translation phase.
The more diverse the tokens in your input, the longer the normalisation process; conversely, the less diverse the tokens, the quicker the data can be processed. If it’s just words, the system can handle this automatically.
We use this rule of thumb when working with clients to estimate how long it will take to build their engines, as we need to be able to give them some sense of a schedule around building an actual MT engine.
What volume of words would you suggest for a good Turkish engine?
SÖ: It makes no sense to work on a Turkish MT system if you do not have at least a million words of bilingual data and 3 million words of monolingual data. Even in this case, you will have to work more on analyses, testing procedures and rule sets. Ideally, you will have approx. 10 million words of bilingual data. It’s the basic equation of SMT engine training: the more data you have, the higher the quality you achieve.
How long does it take to build an engine for Turkish?
SÖ: It depends on the language pair and the field of expertise or domain. Things may be harder if you are working on a language pair with very different linguistic structures, such as English and Turkish. However, it’s not impossible to build a mature MT system for such a language pair; you will just need to spend longer on it. Another parameter that affects the time required to reach a mature MT system is the quality of the data to be used for the intended system. It is hard to give a specific time estimate without looking at the data, but in general it will probably take 2 to 6 months to have the intended production system.
KantanMT Founder and Chief Architect, Tony O’Dowd was recently featured in one of Ireland’s major national newspapers, The Irish Times.
The author of the article, Olive Keogh, is a business journalist who specialises in writing about innovative Irish enterprises and startups. With Olive’s kind permission, we are republishing The Irish Times article.
“It’s not widely known at home but Ireland has developed an international reputation for research in statistical machine translation. Trinity, DCU and UL are all recognised worldwide and 120 PhD students have graduated here with skills in the field in the last five years. That’s more than in any other country in Europe,” says Tony O’Dowd, the man behind KantanMT, a new scalable, high-speed machine translation system based on the Moses decoder and Amazon Web Services cloud computing infrastructure.
O’Dowd has spent almost 30 years in the software localization sector with companies such as Lotus Development Corporation and Symantec. Xcelerator, the company behind KantanMT, is O’Dowd’s second start-up, but he was also involved in the formation of FIT, a training organisation set up in 1998 to provide IT skills and training for the long-term unemployed.
“We are leveraging the Moses MT decoder and multiple streams of research from the Centre for Global Intelligent Content to make statistical machine translation (SMT) technology available to the masses,” he says.
“Traditional SMT systems are slow, expensive to deploy, time-consuming to customise and complex to manage. In short, not for the faint-hearted. I wanted to harness the economics of the cloud to solve these problems. Using hundreds of high-powered cloud-based servers to convert training data into data models also accelerated the process of customisation and the development of SMT engines.”
O’Dowd points out that in addition to the cost factor, traditional SMT solutions can produce translations of dubious quality. By focusing on advanced natural language processes and data processing algorithms, KantanMT also addresses these quality issues.
“Because of the costs involved, SMT tends to be used by large organisations with big budgets and plenty of people available to work on the system. The KantanMT platform removes this expense and complexity and makes it a far more practical and usable tool for businesses both big and small. Our clients can customise, improve and deploy their own engines in a matter of days,” O’Dowd says.
O’Dowd took his first steps as an entrepreneur in 2000 when he set up Alchemy Software Development. It quickly became a leading player in the software localization sector with over 27,000 licences in use worldwide. This success didn’t go unnoticed. The company was sold to the largest privately owned localization service provider, Translations.com, in March 2007.
Prior to setting up Alchemy O’Dowd was technology manager for Symantec Corporation Ireland and responsible for establishing the organisation’s Asian localization hub in Japan. He was also executive vice-president of Corel Corporation and spent three years as a lecturer in Trinity College Dublin teaching microprocessor design and assembly language programming.
O’Dowd began working on the idea for KantanMT in 2011 while on a year “off” to retrain himself on cloud-based technologies. He employed an MBA student to do detailed research into the barriers preventing companies using SMT and says the major leap forward in computing and storage capacity provided by the cloud enabled him to build a platform for SMT systems that would have been inconceivable without it.
Xcelerator recently raised €1.1 million in seed funding from venture capital company Delta Partners and the Enterprise Ireland High Potential Start Up fund. Early versions of KantanMT were given away free to kill competition and grab market share but first revenues (based on a usage pricing model) began flowing this time last year and O’Dowd says it is now profitable. A second round of funding is planned for later this year.
The company currently employs 11 people in its offices in Dublin and Galway, but this is expected to rise to 20-25 by the end of 2015. Its focus is the export market and its biggest customers are independent software vendors from industries such as ecommerce, finance and electronics. The company also provides MT services to the language industry.
“Starting your first business is definitely daunting as everything is new and you’re travelling down every road for the first time,” O’Dowd says.
“Next time around there is a lot of commonality and because you’ve learned by engaging with the school of hard knocks, you’re better at anticipating the problems and meeting the challenges. You also have a better network of contacts, you’re less frazzled when things don’t go right and you can actually grow the business faster and at a higher level. You also get a better hearing from the funding community as they view you as a safe pair of hands.”
KantanMT is based in the Invent Building at DCU and O’Dowd says the resources and expertise provided by the Invent team were instrumental in getting KantanMT.com off the ground.
“KantanMT.com is the fastest growing SMT platform in the localization industry today. So far over 80.5 billion words have been uploaded to the platform as training data and more than 750 million words have been translated by our clients. When you consider this has all happened in the last nine months, the company is rapidly becoming one of the biggest translation hubs in the market,” O’Dowd says.
The original article was published on Mon, Apr 27, 2015
Email firstname.lastname@example.org to learn more about how the KantanMT platform operates, or if you would like to set up a personalised demo with Tony.
This year, both KantanMT and its preferred Machine Translation supplier, bmmt, a progressive Language Service Provider with an MT focus, exhibited side by side at the tekom Trade Fair and tcworld conference in Stuttgart, Germany.
As a member of the KantanMT preferred partner program, bmmt works closely with KantanMT to provide MT services to its clients, which include major players in the automotive industry. KantanMT was able to catch up with Maxim Khalilov, technical lead and ‘MT guru’ to find out more about his take on the industry and what advice he could give to translation buyers planning to invest in MT.
KantanMT: Can you tell me a little about yourself and, how you got involved in the industry?
Maxim Khalilov: It was a long and exciting journey. Many years ago, I graduated from the Technical University in Russia with a major in computer science and economics. After graduating, I worked as a researcher for a couple of years in the sustainable energy field. But even then, I knew I wanted to come back to the IT industry.
In 2005, I started a PhD at the Universitat Politecnica de Catalunya (UPC) with a focus on Statistical Machine Translation, which was a very new topic back then. By 2009, after successfully defending my thesis, I had moved to Amsterdam, where I worked as a post-doctoral researcher at the University of Amsterdam and later as an R&D manager at TAUS.
Since February 2014, I’ve been a team lead at bmmt GmbH, which is a German LSP with a strong focus on machine translation.
I think my previous experience helped me to develop a deep understanding of the MT industry from both academic and technical perspectives. It also gave me a combination of research and management experience in industry and academia, which I am applying by building a successful MT business at bmmt.
KMT: As a successful entrepreneur, what were the three greatest industry challenges you faced this year?
MK: This year has been a challenging one for us from both technical and management perspectives. We started to build an MT infrastructure around MOSES practically from scratch. MOSES was developed by academia and for academic use, and because of this we immediately noticed that many industrial challenges had not yet been addressed by MOSES developers.
The first challenge we faced was that the standard solution does not offer a solid tag-processing mechanism; we had to invest in customising the MOSES code to make it compatible with what we wanted to achieve.
The second challenge was that many players in the MT market are constantly talking about the lack of reliable, quick and cheap quality evaluation metrics. BLEU-like scores, unfortunately, are not always applicable to real-world projects. Even if they are useful when comparing different iterations of the same engine, they are not useful for cross-language or cross-client comparison.
Interestingly, the third problem is psychological in nature: post-editors are not always happy to post-edit MT output, for many reasons, including of course the quality of the MT. However, in many situations the problem is that MT post-editing requires a different skill set compared with ‘normal’ translation, and it will take time before translators adapt fully to post-editing tasks.
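The BLEU-like scores Maxim mentions reward n-gram overlap with a reference translation, which is why they track improvements between iterations of one engine but are not comparable across languages or clients. As a deliberately simplified illustration (real BLEU combines 1-4-gram precisions with a brevity penalty; this sketch uses clipped unigram precision only):

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """A much-simplified BLEU-style score: the fraction of hypothesis
    tokens that also appear in the reference, with clipped counts."""
    hyp = hypothesis.split()
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hyp)
    overlap = sum(min(c, ref_counts[t]) for t, c in hyp_counts.items())
    return overlap / len(hyp)

ref = "the cat sat on the mat"
print(unigram_precision("the cat sat on the mat", ref))  # 1.0
print(unigram_precision("a cat sat down", ref))          # 0.5
```

Because the score depends entirely on the chosen reference and on how the language tokenises, a 0.5 for one engine, language or client tells you nothing about a 0.5 for another, which is exactly the limitation described above.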
KMT: Do you believe MT has a say in the future, and what is your view on its development in global markets?
MK: Of course, MT will have a big say in the future of language services. We can see now that the MT market is expanding quickly, as more and more companies adopt a combined TM-MT-PE framework as their primary localization solution.
“At the same time, users should not forget that MT has its clear niche”
I don’t think a machine will ever be able to translate poetry, for example, but then it does not need to: MT has proved to be more than useful for the translation of technical documentation, marketing material and other content that represents more than 90% of the daily translation load worldwide.
Looking to the near future, I see that the integration of MT and other cross-language technologies with Big Data technologies will open new horizons for Big Data, making it a truly global technology.
KMT: How has MT affected or changed your business models?
MK: Our business model is built around MT; it allows us to deliver translations to our customers quicker and cheaper than without MT, while at the same time preserving the same level of quality and guaranteeing data security. We not only position MT as a competitive advantage when it comes to translation, but also as a base technology for future services. My personal belief, which is shared by other bmmt employees is that MT is a key technology that will make our world different – where translation is available on demand, when and where consumers need it, at a fair price and at its expected quality.
KMT: What advice can you give to translation buyers, interested in machine translation?
MK: MT is still a relatively new technology, but there are already a number of best practices available for new and existing players in the MT market. In my opinion, the four key points for translation buyers to remember when considering machine translation are:
A big ‘thank you’ to Maxim for taking time out of his busy schedule to take part in this interview, and we look forward to hearing more from Maxim during the KantanMT/bmmt joint webinar ‘5 Challenges of Scaling Localization Workflows for the 21st Century’ on Thursday November 20th (4pm GMT, 5pm CET and 8am PST).
Register here for the webinar or to receive a copy of the recording. If you have any questions about the services offered from either bmmt or KantanMT please contact:
Peggy Linder, bmmt (email@example.com)
Louise Irwin, KantanMT (firstname.lastname@example.org)