Dissemination of Machine Translation innovation is a major priority for us at KantanMT. We believe that Academic Partnerships have a huge role to play in furthering the scope of research and innovation in the field of Machine Translation, and as such we have partnered with a number of universities to help students use the KantanMT platform in a real-world scenario.
We are always looking for ways to improve the KantanMT platform. To keep our finger on the pulse of the KantanMT user experience, we asked one of the students using the platform to answer some questions about it.
Translation Error Rate (TER) is a method used by Machine Translation specialists to determine the amount of Post-Editing required for machine translation jobs. The automatic metric measures the number of edits required to bring a translated segment in line with one of the reference translations. It’s quick to use, language independent and correlates with post-editing effort. When tuning your KantanMT engine, we recommend a maximum score of 30%. A lower score means less post-editing is required!
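To make the idea concrete, here is a minimal Python sketch of a TER-style score. It is a simplification and not how KantanMT computes the metric: it counts only word insertions, deletions and substitutions, whereas full TER also treats a shift of a word sequence as a single edit. The function names `edit_distance` and `simple_ter` are made up for this illustration.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, references):
    """Fewest edits to any reference, divided by the average reference length.

    Simplified: shifts are NOT modelled, so this can overestimate true TER.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    avg_len = sum(len(r) for r in refs) / len(refs)
    if avg_len == 0:
        raise ValueError("references must contain at least one word")
    best = min(edit_distance(hyp, r) for r in refs)
    return best / avg_len


score = simple_ter("the cat sat on mat", ["the cat sat on the mat"])
print(f"{score:.1%}")  # one missing word out of six reference words
```

A segment like this one, with one edit against a six-word reference, would sit well under the recommended 30% ceiling.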
How to use TER in KantanBuildAnalytics™
The TER scores for your engine are displayed in the KantanBuildAnalytics™ feature. You can get a quick overview or snapshot in the summary tab. For a more in-depth analysis, and to calculate the amount of post-editing required for the engine’s MT output, select the ‘TER Score’ tab, which takes you to the ‘TER Scores’ page.
Place your cursor on the ‘TER Scores Chart’ to see the ‘Translation Error Rate’ for each segment. If you hold the cursor over a segment, a pop-up will appear on your screen with details of that segment under these headings: ‘Segment no.’, ‘Score’, ‘Source’, ‘Reference/Target’ and ‘KantanMT Output’.
To see a breakdown of the ‘TER Scores’ for each segment in a table format, scroll down. You will now see a table with the headings ‘No’, ‘Source’, ‘Reference/Target’, ‘KantanMT Output’ and ‘Score’.
To see an even more in-depth breakdown of a particular ‘Segment’, click on the ‘Triangle’ beside its number.
To download the ‘TER Scores’ of all segments, click on the ‘Download’ button on the ‘TER Scores’ page.
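Once the scores are downloaded, a short script can pick out the segments that need the most post-editing. The sketch below is only an illustration: it assumes the download has been saved as CSV text with the same column headings as the on-screen table (‘No’, ‘Score’ and so on) and that scores are percentages. The real export may be laid out differently, and the sample rows are invented.

```python
import csv
import io


def segments_needing_review(csv_text, max_score=30.0):
    """Return (segment number, score) pairs whose TER score exceeds max_score,
    worst first. Column names 'No' and 'Score' are assumptions."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = []
    for row in reader:
        score = float(row["Score"].rstrip("%"))
        if score > max_score:
            flagged.append((int(row["No"]), score))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Invented sample data, for illustration only.
sample = """No,Source,Reference/Target,KantanMT Output,Score
1,Hello world,Bonjour le monde,Bonjour le monde,0%
2,Restart the router,Redémarrez le routeur,Redémarrer router,66.7%
3,Save the file,Enregistrez le fichier,Enregistrer le fichier,33.3%
"""

for number, score in segments_needing_review(sample):
    print(f"Segment {number}: TER {score}%")
```

The default threshold of 30 mirrors the maximum score recommended above; pass a different `max_score` to tighten or relax it.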
This is one of the many features included in KantanBuildAnalytics, which can help the Localization Project Manager improve an engine’s quality after its initial training. To see other features used in KantanBuildAnalytics please see the links below.
Regardless of what we do in our professional careers, there is one thing that we all have in common: we want to get more done, be more productive and achieve the results we want…yesterday! For Machine Translation or Localization engineers this means finding the quickest way to get their MT engines ready to translate files.
KantanBuildAnalytics™ is a feature that solves the problem of how to quickly improve an engine after its initial training with minimum cost and effort. This post will teach you how to use KantanBuildAnalytics to get your KantanMT engines ready to translate faster.
Let’s look at some of the features available in KantanBuildAnalytics:
Fluency Analysis – work with segment level BLEU scores to find out how relevant your training data is and how it impacts engine fluency.
Recall and Precision Analysis – use segment level F-Measure scores to understand the recall and precision of your MT engines.
Gap Analysis – improve your engine quickly by creating terminology (glossary) files. Simply download a list of untranslated words or ‘gaps’ (as an Excel file), then re-upload the Excel file as new glossary training data.
Training Data Reject Reports – see any training data segments that have been rejected from the engine, and their reason for rejection, in a downloadable Excel file.
Timeline – like your Facebook timeline, see your MT engine’s history, with every action taken to improve the engine. It even lets you archive versions so if something goes wrong in the retraining, you can go back to an earlier version.
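For readers who want to see what sits behind a segment level F-Measure, here is a minimal Python sketch. It assumes simple whitespace tokenisation and word-overlap counting, which may differ from how KantanBuildAnalytics scores segments. The function name `precision_recall_f1` is made up for this example.

```python
from collections import Counter


def precision_recall_f1(hypothesis, reference):
    """Word-overlap precision, recall and F1 for one segment.

    precision = matching words / words in the MT output
    recall    = matching words / words in the reference
    """
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    overlap = sum((hyp & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = precision_recall_f1("the cat sat", "the cat sat down")
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this example every output word is correct (high precision) but one reference word is missing (lower recall), and the F-Measure balances the two.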
How to use KantanBuildAnalytics
Log in to your KantanMT account using your email and your password.
You will be directed to the ‘Client Profiles’ section of the ‘My Client Profiles’ page. The last profile you were working on will be ‘Active’.
To use ‘KantanBuildAnalytics’ with a profile other than the ‘Active’ profile, click on the profile you want to use and make sure that it has at least one successfully completed ‘Build’ job.
Then click on the ‘Build Analytics’ tab on the ‘My Client Profiles’ page.
This will take you to the ‘KantanBuildAnalytics’ page, where you will see the ‘Summary’ tab. This is selected by default. Your summary tab should give you an overview of the performance and measurement of your KantanMT engine.
And of course for the Excel lovers, it’s possible to download the full summary report as an Excel spreadsheet, so the engine’s performance information can be analysed to suit your organisation’s specific style requirements. To download the report, click on the ‘Download summary report’ button.
To ‘Deep Tune’ the engine, click on the ‘Deep Tune’ button. Be warned though: this is a thorough tuning of the engine and will take a lot of time. The bigger the MT engine, the longer the tuning process takes.
A ‘Tune Engine’ pop-up window will now appear on your screen. Click on the ‘OK’ button if you want to deep tune, or on ‘Cancel’ if you no longer wish to deep tune the engine.
To see how many segments in the training data were rejected, click on the ‘Rejects Report’ tab. This takes you to the ‘Rejects Report’ page, where you will see a list of segments and the reasons they were rejected.
To download an Excel version of the rejects report, click on the ‘Download’ button.
To create, test and manage customised preprocessing rules for your training data, click on the ‘Preprocessor Mngt’ button.
These features help MT or Localization Engineers build and develop better performing KantanMT engines. Read more about these features below, or contact a member of our sales team to start using our platform now!
I’m new to machine translation and one of the things I’ve been doing at KantanMT is learning how to refine training data with a view to building stock engines.
Stock engines are the optional training data provided by KantanMT to improve the performance of your customized MT engine. In this post I’m going to describe the process of building an engine and refining the training data.
The building process on the platform is quite simple. From your dashboard on the website select “My Client Profiles”, where you will find two profiles that have already been set up: a default profile and a sample profile. Both let you run translation jobs straight away.
To create your own customized profile select ‘New’ at the top of the left-most column. This launches the Client Profile Wizard. Enter the name of your new engine; try to make this something meaningful, or follow a consistent, easily recognizable naming convention. This makes it easier to tell your profiles apart when you have more than one.
When you select ‘next’ you will be asked to specify the source and target languages from drop-down menus. The wizard lets you distinguish between different variants of the same language, for example Canadian English or US English. Let’s say we’re translating from Canadian English to Canadian French. If you’re not sure which variant you need, have a quick look at the training data, which will give you the language codes.
The next step gives you an option to select a stock engine from a drop-down menu. The stock engines are grouped according to their business area or domain.
You will see a summary of your choices. If you’re happy with them, select ‘create’. Your new engine will be shown in the list of your client profiles. However, while you have created your engine, you haven’t yet built it.
Building Your Engine
Selecting your profile from the list will make it the current active engine. By selecting the Training Data tab you can upload any additional training data easily by using the drag and drop function. Then select the ‘Build’ option to begin building your engine.
It’s always a good idea to supply as much useful training data as possible. This ‘educates’ the engine in the way your organization typically translates text.
Once the build job has been submitted, you can monitor its progress in the ‘My Jobs’ page.
When the job is completed the BuildAnalytics™ feature is created. This can be accessed by clicking on the database icon to the left of the profile name. BuildAnalytics will give you feedback on the strength of your engine using industry standard scores, as well as details about your engine’s word count. The tabs across the page will give you access to more detail.
The summary tab lets you see the average BLEU, F-Measure and TER scores for the engine, and the pie charts show you a summary of the percentage scores for all segments. For more detail select the respective tabs and use the data to investigate individual segments.
A Rejects Report is created for every file of Training Data uploaded. You can use this to determine why some of your data is not being used, and improve the uptake rate of your data.
Gap analysis gives you an effective way to improve your engine with relevant glossary or noise lists, which you can upload to future engine builds. By adding these terminology files in either TBX (TermBase eXchange) or XLSX (Microsoft Excel Spreadsheet) format, you will quickly improve the engine’s performance.
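The idea behind a ‘gap’ can be sketched in a few lines of Python. This is an assumption about what counts as untranslated, not a description of KantanMT’s own logic: here a gap is any source word that shows up unchanged in the MT output. The function name `find_gaps` is made up for this illustration.

```python
def find_gaps(source, mt_output):
    """Return source words that appear verbatim in the MT output.

    Assumption: a word copied through unchanged was not translated.
    Proper nouns and shared vocabulary will be false positives.
    """
    source_words = {w.lower() for w in source.split() if w.isalpha()}
    output_words = {w.lower() for w in mt_output.split() if w.isalpha()}
    return sorted(source_words & output_words)


gaps = find_gaps("Please restart the router", "Veuillez redémarrer le router")
print(gaps)  # candidate terms to add to a glossary file
```

Each word in the resulting list is a candidate for a glossary entry. Reviewing the list by hand before uploading is worthwhile, since names and brand terms are meant to pass through untranslated.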
The Timeline tab shows you the evolution of your engine over its lifetime. This feature lets you compare the statistics with previous builds, and track all the data you have uploaded. On a couple of occasions, I used the archive feature to revert to a previous build when the engine building process was not going according to plan.
Improving Your Engine
A great way to improve your engine’s performance is to analyze the rejects report for the files with a higher rejection rate. Once you understand the reasons segments are rejected, you can begin to address them. For example, an error 104 is caused by a difference in placeholder counts. This can be something as simple as the source language using the % sign where the target language uses the word ‘percent’. In this case a preprocessor rule can be created to fix the problem.
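To illustrate the kind of check that leads to this rejection, here is a small Python sketch that compares placeholder counts between a source and a target segment. What KantanMT actually treats as a placeholder is not documented here, so the pattern below (a % sign or a `{0}` style token) is purely an assumption, as are the names `PLACEHOLDER` and `placeholder_mismatch`.

```python
import re

# Assumed definition of a placeholder: a '%' sign or a '{0}' style token.
PLACEHOLDER = re.compile(r"%|\{\d+\}")


def placeholder_mismatch(source, target):
    """Return True when source and target hold different numbers of placeholders."""
    return len(PLACEHOLDER.findall(source)) != len(PLACEHOLDER.findall(target))


# The % sign appears in the source but is spelled out in the target.
print(placeholder_mismatch("Sales rose 5%", "Les ventes ont augmenté de 5 pour cent"))
# Both sides carry the same single placeholder.
print(placeholder_mismatch("Hello {0}", "Bonjour {0}"))
```

The first pair would be flagged and the second would not, which matches the % versus ‘percent’ situation described above.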
A PEX rule editor is accessed from the KantanMT drop-down menu. This lets you try out your preprocessor rules, and see the effect that they have on the data. I would suggest directly copying and pasting from the rejects report to the test area and applying your PEX rule to ensure you’re precisely targeting the data concerned. You can get instant feedback using this tool.
Once you’re happy with the way the rules work on the rejected data, it’s useful to analyze the rest of the data to see what effect the rules will have. You want to avoid a situation where using a rule resolves 10 rejects, but creates 20 more. Once the rules are refined, copy them to the appropriate files (source.ppx, target.ppx) and upload them with the training data. Remember that the rules will run against the content in the order they are specified.
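Why the order matters is easiest to see with a tiny example. The sketch below uses Python regular expressions as a stand-in for preprocessor rules; it does not reproduce the PEX rule syntax or the .ppx file format. The function name `apply_rules` and the rule lists are invented for illustration.

```python
import re


def apply_rules(text, rules):
    """Apply (pattern, replacement) rules to text, one after another, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Rule 1 spells out the % sign, rule 2 then tidies up repeated spaces.
cleanup = [(r"(\d+)\s*%", r"\1 percent"), (r"\s+", " ")]
print(apply_rules("Sales rose 5 %  today", cleanup))

# The same two rules give different results depending on their order.
forward = [("colour", "color"), ("color", "hue")]
backward = list(reversed(forward))
print(apply_rules("colour", forward))   # the second rule rewrites the first rule's output
print(apply_rules("colour", backward))  # here only one rule ends up matching
```

Running a candidate rule list over both the rejected segments and a sample of accepted ones, and comparing the results, is one way to spot a rule that fixes 10 rejects but creates 20 more.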
When you rebuild the engine, the rules will be incorporated and, hopefully, improve the scores.
Sue’s 3 Tips for Successfully Building MT Engines
Name your profiles clearly – When you are using a number of profiles simultaneously, knowing what each one is (language pair/domain) will make things much easier as you progress through the building process.
Take advantage of BuildAnalytics – Use the insights and Gap analysis features to give you tips on improving your engine. Listening to these tips can really help speed up the engine refinement process.
The PEX Rule Editor is your friend – Don’t be afraid to try creating and using new PEX rules. If things go south you can always go back to previous versions of your engine.
My internship at KantanMT.com really opened my eyes to the world of language services and machine translation. Before joining the team I knew nothing about MT or the mechanics behind building engines. This was a great experience, and being part of such a smoothly run development team was an added bonus that I will take with me when I return to ITB to finish my course.
About Sue McDermott
Sue is currently studying for a Diploma in Computer Science from ITB (Institute of Technology Blanchardstown). Sue joined KantanMT.com on a three month internship. She has a degree in English Literature and a background in business systems, and has also been a full-time mum for the last 17 years.
Email firstname.lastname@example.org if you have any questions or want more information on the KantanMT platform.