Globalization is no longer a modern phenomenon. With accelerating technological advancements in every sphere including communication, manufacturing and transport, even Globalization 2.0 is a somewhat dated concept. So what’s next? Continue reading
Following the announcement of a direct collaboration of KantanLabs and the ADAPT Centre for Digital Content Technology, we got in touch with Professor Andy Way from the School of Computing in Dublin City University and ADAPT Centre to ask him about innovations in the field of automated translations as well as his thoughts on the engagement between KantanLabs and ADAPT. Continue reading
KantanMT recently published a white paper on what global companies can expect to see in 2016 for Machine Translation (MT). The MT industry is rapidly changing and moulding itself to the technical needs and globalization requirements of the present day. Our white paper puts forward six major MT trends that all businesses need to heed in order to stay relevant and ahead of their competitors.
KantanAPI enables KantanMT clients to interact with KantanMT as an on-demand web service. It also provides a number of different services including translation, file upload and retrieval and job launches.
With the KantanAPI you not only have the opportunity to integrate KantanMT into your workflow systems but also the ability to receive on-demand translations from your KantanMT engines. All these services make the experience with Machine Translation as seamless as possible.
To access the KantanMT API you will first need your ‘API token’. This token can be found in the ‘API’ tab on the ‘My Client Profiles’ page of your KantanMT account.
Once you have your token you can use the API in a number of ways
For more details on implementing your API solution via the REST interface, please see the full API technical documentation at the following link:
Login into your KantanMT account using your email and your password.
You will be directed to the ‘My Client Profiles’ page. You will be in the ‘Client Profiles’ section of the ‘My Client Profiles’ page. The last profile you were working on will be ‘Active’.
If you wish to use the ‘KantanAPI’ with another profile other than the ‘Active’ profile. Click on the profile you wish to use the ‘KantanAPI’ with, then click on the ‘API’ tab.
You will be directed to the ‘API Settings’ page. Now click on the ‘Launch API’ button.
A ‘Launch API’ pop-up will now appear on your screen asking you ‘Are you sure you want to launch the API?’ Click ‘OK’.
The ‘API Status’ will now change from ‘offline’ to ‘initialising’, the ‘Launch API’ button will now change to ‘Launching API’ .
When your KantanAPI launches the ‘API Status’ will now change from ‘initialising’ to ‘running’, the ‘Launching API’ button changes to ‘Shutdown API’ and you should now be able to click on the ‘Translate’ button.
Type the text you wish to translate in the text box and click on the ‘Translate’ button.
The translated text will now appear in the ‘Translated Text’ box. If you wish to make any changes to the translated text simply place the cursor inside the ‘Translated Text’ box and make the changes. Save these changes by clicking the ‘Retrain Engine’ button.
Test if your engine was successfully retrained by clicking the ‘Translate’ button. The retrained text will now appear in the ‘Translated Text’ box.
If you don’t wish to retrain your engine and you are happy with the translated text in the ‘Translated Text’ box. You may continue translating other text or shut down your KantanAPI by clicking the ‘Shutdown API’ button.
When you click the ‘Shutdown API’ button a pop-up will now appear asking you ‘Are you sure you want to shout down the API?’ Click ‘OK’.
The ‘Shutdown API’ button will now change to ‘Terminating API’, the ‘API status’ will now change from ‘running’ to ‘terminating’ and you shouldn’t be able to click on the ‘Translate’ or ‘Retrain Engine’ button.
You will now be directed back to the initial screen on the API Settings page.
KantanAPI™ is one of the various machine translation services offered by KantanMT to improve productivity for our clients and also enable them to be more efficient. For more information on KantanAPI or any KantanMT products please contact us at firstname.lastname@example.org.
For more details on the KantanMT API please see the following links and the video below:
I’m new to machine translation and one of the things I’ve been doing at KantanMT is learning how to refine training data with a view to building stock engines.
Stock engines are the optional training data provided by KantanMT to improve the performance of your customized MT engine. In this post I’m going to describe the process of building an engine and refining the training data.
The building process on the platform is quite simple. From your dashboard on the website select “My Client Profiles” where you will find two profiles, which have already been set up. A default profile and sample profile; both of which let you run translation jobs straight away.
To create your own customized profile select ‘New’ at the top of the left-most column. This launches the client Profile Wizard. Enter the name of your new engine; try to make this something meaningful, or use an easily recognizable standard around how you name your profiles. This makes it easier to recognize which profile is which, when you have more than one profile.
When you select ‘next’ you will be asked to specify the source and target languages from drop down menus. The wizard lets you distinguish between different variants of the same language for example Canadian English or US English. Let’s say we’re translating from Canadian English to Canadian French. If you’re not sure which variant you need, have a quick look at the training data, which will give you the language codes.
The next step gives you an option to select a stock engine from a drop down menu. The stock engines are grouped according to their business area or domain.
You will see a summary of your choices, if you’re happy with them select ‘create’. Your new engine will be shown in the list of your client profiles. However, while you have created your engine, you haven’t yet built it.
Selecting your profile from the list will make it the current active engine. By selecting the Training Data tab you can upload any additional training data easily by using the drag and drop function. Then select the ‘Build’ option to begin building your engine.
It’s always a good idea to supply as much useful training data as possible. This ‘educates’ the engine in the way your organization typically translates text.
Once the build job has been submitted, you can monitor its progress in the ‘My Jobs’ page.
When the job is completed the BuildAnalytics™ feature is created. This can be accessed by clicking on the database icon to the left of the profile name. BuildAnalytics will give you feedback on the strength of your engine using industry standard scores, as well as details about your engines word count. The tabs across the page will give you access to more detail.
The summary tab lets you to see the average BLEU, F-Measure and TER scores for the engine, and the pie charts show you a summary of the percentage scores for all segments. For more detail select the respective tabs and use the data to investigate individual segments.
A Rejects Report is created for every file of Training Data uploaded. You can use this to determine why some of your data is not being used, and improve the uptake rate of your data.
Gap analysis gives you an effective way to improve your engine with relevant glossary or noise lists, which you can upload to future engine builds. By adding these terminology files in either TBX (Terminology Interchange) or XLSX (Microsoft Excel Spreadsheet) formats you will quickly improve the engines performance.
The Timeline tag shows you the evolution of your engine over its lifetime. This feature lets you compare the statistics with previous builds, and track all the data you have uploaded. On a couple of occasions, I used the archive feature to revert back to a previous build, when the engine building process was not going according to plan.
A great way to improve your engines performance is to analyze the rejects report for the files with a higher rejection rate. Once you understand the reasons segments are rejected you can begin to address them. For example, an error 104 is caused by a difference in place holder counts. This can be something as simple as the source language using the % sign where the target language uses the word ‘percent’. In this case a preprocessor rule can be created to fix the problem.
A PEX rule editor is accessed from the KantanMT drop down menu. This lets you try out your preprocessor rules, and see the effect that they have in the data. I would suggest directly copying and pasting from the rejects report to the test area and applying your PEX rule to ensure you’re precisely targeting the data concerned. You can get instant feedback using this tool.
Once you’re happy with the way the rules work on the rejected data it’s useful to analyze the rest of the data to see what effect the rules will have. You want to avoid a situation where using a rule resolves 10 rejects, but creates 20 more. Once the rules are refined copy them to the appropriate files (source.ppx, target.ppx) and upload with the training data. Remember that the rules will run against the content in the order they are specified.
When you rebuild the engine they will be incorporated, and hopefully improve the scores.
My internship at KantanMT.com really opened my eyes to the world of language services and machine translation. Before joining the team I knew nothing about MT or the mechanics behind building engines. This was a great experience, and being part of such a smoothly run development team was an added bonus that I will take with me when I return ITB to finish my course.
Sue is currently studying for a Diploma in Computer Science from ITB (Institute of Technology Blanchardstown). Sue joined KantanMT.com on a three month internship. She has a degree in English Literature and a background in business systems, and is also a full-time mum for the last 17 years.
Email: email@example.com, if you have any questions or want more information on the KantanMT platform.
This year, both KantanMT and its preferred Machine Translation supplier, bmmt, a progressive Language Service Provider with an MT focus, exhibited side by side at the tekom Trade Fair and tcworld conference in Stuttgart, Germany.
As a member of the KantanMT preferred partner program, bmmt works closely with KantanMT to provide MT services to its clients, which include major players in the automotive industry. KantanMT was able to catch up with Maxim Khalilov, technical lead and ‘MT guru’ to find out more about his take on the industry and what advice he could give to translation buyers planning to invest in MT.
KantanMT: Can you tell me a little about yourself and, how you got involved in the industry?
Maxim Khalilov: It was a long and exciting journey. Many years ago, I graduated from the Technical University in Russia with a major in computer science and economics. After graduating, I worked as a researcher for a couple of years in the sustainable energy field. But, even then I knew I still wanted to come back to IT Industry.
In 2005, I started a PhD at Universitat Politecnica de Catalunya (UPC) with a focus on Statistical Machine Translation, which was a very new topic back then. By 2009, after successfully defending my thesis, I moved to Amsterdam where I worked as a post-doctoral researcher at the University of Amsterdam and later as a RD manager at TAUS.
Since February 2014, I’ve been a team lead at bmmt GmbH, which is a German LSP with strong focus on machine translation.
I think my previous experience helped me to develop a deep understanding of the MT industry from both academic and technical perspectives. It also gave me a combination of research and management experience in industry and academia, which I am applying by building a successful MT business at bmmt.
KMT: As a successful entrepreneur, what were the three greatest industry challenges you faced this year?
MK: This year has been a challenging one for us from both technical and management perspectives. We started to build an MT infrastructure around MOSES practically from scratch. MOSES was developed by academia and for academic use, and because of this we immediately noticed that many industrial challenges had not yet been addressed by MOSES developers.
The first challenge we faced was that the standard solution does not offer a solid tag processing mechanism – we had to invest into a customization of the MOSES code to make it compatible with what we wanted to achieve.
The second challenge we faced was that many players in the MT market are constantly talking about the lack of reliable, quick and cheap quality evaluation metrics. BLEU-like scores unfortunately are not always applicable for real world projects. Even if they are useful when comparing different iterations of the same engines, they are not useful for cross language or cross client comparison.
Interestingly, the third problem has a psychological nature; Post-Editors are not always happy to post edit MT output for many reasons, including of course the quality of MT. However, in many situations the problem is that MT post-editing requires a different skillset in comparison with ‘normal’ translation and it will take time before translators adopt fully to post editing tasks.
KMT: Do you believe MT has a say in the future, and what is your view on its development in global markets?
MK: Of course, MT will have a big say in the language services future. We can see now that the MT market is expanding quickly as more and more companies are adopting a combination TM-MT-PE framework as their primary localization solution.
“At the same time, users should not forget that MT has its clear niche”
I don’t think a machine will be ever able to translate poetry, for example, but at the same time it does not need to – MT has proved to be more than useful for the translation of technical documentation, marketing material and other content which represents more than 90% of the daily translators load worldwide.
Looking at the near future I see that the integration of MT and other cross language technologies with Big Data technologies will open new horizons for Big Data making it a really global technology.
KMT: How has MT affected or changed your business models?
MK: Our business model is built around MT; it allows us to deliver translations to our customers quicker and cheaper than without MT, while at the same time preserving the same level of quality and guaranteeing data security. We not only position MT as a competitive advantage when it comes to translation, but also as a base technology for future services. My personal belief, which is shared by other bmmt employees is that MT is a key technology that will make our world different – where translation is available on demand, when and where consumers need it, at a fair price and at its expected quality.
KMT: What advice can you give to translation buyers, interested in machine translation?
MK: MT is still a relatively new technology, but at the same time there is already a number of best practices available for new and existing players in the MT market. In my opinion, the four key points for translation buyers to remember when thinking about adopting machine translation are:
A big ‘thank you’ to Maxim for taking time out of his busy schedule to take part in this interview, and we look forward to hearing more from Maxim during the KantanMT/bmmt joint webinar ‘5 Challenges of Scaling Localization Workflows for the 21st Century’ on Thursday November 20th (4pm GMT, 5pm CET and 8am PST).
Register here for the webinar or to receive a copy of the recording. If you have any questions about the services offered from either bmmt or KantanMT please contact:
Peggy Linder, bmmt (firstname.lastname@example.org)
Louise Irwin, KantanMT (email@example.com)
One of the biggest challenges when customizing Statistical Machine Translation (SMT) is improving the engine after its initial development. While you can build a baseline engine using existing Translation Memories (TM), terminology and monolingual training data assets – the real challenge is going beyond this, and achieving even higher levels of quality. More importantly, how can you do this rapidly with minimum cost and effort? A proactive approach to measuring the quality of your training data will greatly assist in doing this.
Kantan BuildAnalytics™ is a new technology that addresses this head-on and helps SMT developers to build engines that are production ready, fast!
Kantan BuildAnalytics brings a new level of transparency to the SMT building and training process, and KantanMT users can now build higher performing engines for each domain, resulting in less post-editing requirements.
When you build a KantanMT engine, some of your training data is automatically extracted and kept to one side. This is called a Reference Data Set – and contains both source and target texts. After a KantanMT engine is built, this Reference Data Set is used to calculate a series of automated quality scores – including BLEU (Bilingual Evaluation Understudy), F-Measure and TER.
This Reference Data Set is also used to perform a Gap Analysis. Gap Analysis is a quick way to determine any missing words in the engine’s phrase-tables. I’ll come back to this later and demonstrate how Gap Analysis can improve the quality performance of KantanMT engines.
But for now, let’s focus on the automated quality scores of BLEU, F-Measure and TER.
BuildAnalytics uses the KantanMT data visualization library to graphically display the distribution of these automated scores based on the Reference Data Set. Since an automated score is calculated for each text segment within the Reference Data Set, this means we get a detailed view of how a KantanMT engine is performing and how it should generate translated output.
By analysing these scores and the Gap Analysis results, and examining the translated output, users of KantanMT are producing higher quality engines because their training data choices are more strategic and refined.
Let’s look at F-Measure first, as this is the most straightforward to understand and visualize. F-Measure scores show how precise a KantanMT engine is when retrieving words, and how many words it can retrieve or recall during translation. This is why it is commonly referred to as a Recall and Precision measurement. By expressing these two measurements as a ratio, it is a good indicator of the engines performance and its ability to translate content.
However, while your KantanMT engine may have a high F-Measure score – it doesn’t mean that these words are recalled in the correctly translated order. We need another metric to give us an indication of how well the engine translated the text and BLEU is one of the most recognized and automated metric for estimating the texts fluency.
BLEU is an automatic evaluation metric well known in both the industry and academia, which calculates an estimation of text fluency. Fluency is a measure of the correspondence between a KantanMT engine output and that of a professional translator.
Since the Reference Data Set consists of both source and human translated equivalents, which were created by a professional translator, BLEU score can be calculated by comparing the output of a KantanMT engine to this Reference Data Set.
In practice, BLEU achieves a high correlation with human judgement of quality and remains one of the most popular automated metrics in use today.
TER standards for Translation Error Rate and is used to estimate the amount of post-editing required to transform a generated translation to its original human translation equivalent. In simple terms this is a count of the number of insertions, deletions and substitutions required to transform a segment to match its original human translation equivalent.
So the lower this score, the less transformation required which means the less post-editing required too.
BuildAnalytics is a really great way to see all these automated scores in action. It uses KantanMT data visualization technology to graphically present these scores, helping developers of KantanMT engines to fine-tune their training data and maximize their engine’s quality performance.
Let’s take a closer look at how this data visualization can be used to gain insights into an engine and determine if it is a high or low performing engine, and what steps we can take to improve it.
Here’s the summary distribution graphs for an engine that contains approx. 3.2m words. It’s a small engine within a technical domain. Its overall scores are:
These Summary Graphs show the distribution of scores, grouped into bands (i.e. <40%, 40-54% etc.), for each automated score. This is very helpful in determining the scores’ overall distribution, and how the KantanMT engine is likely to be performing.
Here are the detailed distribution graphs for each automated score:
By reviewing both the Summary Graphs and the more detailed Distribution Graphs we can make some observations of how this engine would most likely perform. My observations are included as part of the commentary in the table above.
It’s important to point out that no one individual score gives an absolute of how a KantanMT engine will perform. We need to take a holistic view on how to determine a general sense of the performance of the engine by reviewing all automated scores together.
Using Kantan BuildAnalytics users can get a good sense of how a KantanMT engine will perform in a production environment and with a little practice and experimentation, they can use this knowledge to build higher performing MT engines.
I mentioned this concept earlier in the post, so let’s take a closer look at this really helpful new feature. Gap Analysis determines how many untranslated words remain in the generated translations. These missing words, or ‘Gaps’ can quickly be identified and filled by introducing the most relevant training data to your KantanMT engine and re-training it.
The Gap Analysis feature not only lists the gaps, it also presents suitable training data, which can be post-edited and resubmitted as training data to improve overall engine’s performance. This makes filling the gaps just that little bit easier!
Most quality improvements for SMT systems will be created by fine tuning terminology and filling data gaps. Post-editing raw-MT output and a focus on minimizing data gaps will significantly improve the quality performance of your KantanMT engines. This cannot be done without the involvement of professional translators. They have the skills, knowledge and linguistic expertise to finesse terminology, identify gaps and choose better training data. While BuildAnalytics helps SMT developers get engines ready for production, ultimately, it’s the professional translator that should have the final say in how production-ready it truly is!
To get the most from your Machine Translation engine, always keep in mind:
Kantan BuildAnalytics is available to Enterprise members of KantanMT, but you can also experience this quality estimation and measurement software by signing up for a free trial on KantanMT.com.