Dissemination of Machine Translation innovation is a major priority for us at KantanMT. We believe that Academic Partnerships have a huge role to play in furthering the scope of research and innovation in the field of Machine Translation, and as such we have partnered with a number of universities to help students use the KantanMT platform in a real-world scenario.
We are always looking for ways to improve the KantanMT platform. To keep our finger on the pulse of the KantanMT user experience, we asked one of the students using the platform to answer some questions about it.
KantanMT has an ongoing Academic Partnership with the Centre for Multidisciplinary and Intercultural Inquiry (CMII) at University College London. A few weeks back, we published a post featuring Senior Lecturer Mark Shuttleworth and his experience using and teaching KantanMT.
This week we will share our discussion with one of the students who used the KantanMT platform during their course.
KantanMT: What made you choose this course, and how do you think it will help your career?
Student: I chose CLITG013 and CLITG014 because I wanted hands-on theoretical and practical experience of a variety of translation technologies. Today, translation agencies, private companies and government agencies are very aware of the benefits of translation technology. Thus, having solid experience of Machine Translation will, I believe, greatly boost my job prospects. The skills gained from the courses can also be applied to various roles in the translation industry, ranging from translator to editor or even translation manager.
KantanMT: Did you have any previous experience of Machine Translation? If not, how did you find using the KantanMT system?
Student: Prior to the two courses, I had no experience of Machine Translation. Having worked with the KantanMT platform, I am impressed by the programming and design effort that was put into building the system. It is a very user-friendly platform. The complicated process of training and testing a machine translation engine has been compressed into one single platform, which is self-explanatory and requires little to no instruction on how to operate it.
KantanMT: What are your thoughts on using the KantanMT platform? What did you find good, and do you have any suggestions for improvement?
Student: Firstly, as mentioned earlier, one strong point about KantanMT is its user-friendliness, which makes it convenient for both individual users and big agencies, as they do not have to go through a series of courses to be able to operate the system. Secondly, it can be shaped according to the users’ purposes.
On the downside, although the platform is easy to operate, training the KantanMT platform to its highest efficiency is not at all an easy task. One has to have a good understanding of the technology as well as a sizable amount of quality training data in hand. A firm focus on a particular field is also required to narrow down the training effort and data. At the moment, from my own experience, how the MT engine processes the input data is still very unpredictable.
A large amount of training data uploaded all at once sometimes results in a great fluctuation in BLEU or F-scores. (There were several times when the score suddenly rose very high after a big chunk of data was uploaded and then fell sharply right after another data set had been fed in.) KantanMT has tried to minimize the user’s training effort by offering ready-made training data sets in a variety of language pairs. Unfortunately, this does not extend to minor languages.
The Thai language is a minor language with a very linear and run-on style of writing. While working on an English to Thai translation project, I discovered that KantanMT still lacks the capacity to process and grammatically generate translations in languages with no full stop or space to indicate the end or the beginning of each chunk.
For suggestions, extending the training data sets to cover a greater variety of language pairs, particularly those with little exposure, would certainly attract more users to the platform. Secondly, to facilitate the training process, there should be a screen showing a breakdown analysis of how the data have been processed or indicating the areas that require more data. Thirdly, it would be ideal if the system could evaluate the quality of the data and be selective during the building process, so that a small amount of ‘garbage’ (low-quality data) does not greatly affect the machine-building outcome. Lastly, the inconsistency in the evaluation process (the score fluctuation) should be fixed, or at least users should be given a clearer explanation of what causes it and how to avoid it.
KantanMT: Firstly, we would like to thank you for your in-depth feedback. We appreciate suggestions and feedback, and always endeavour to improve our MT platform based on them.
Training Data: We understand that you had trouble working with the training data. The fluctuations in the scores that you experienced might be due to different factors. By gradually uploading small amounts of training data, you can keep the scores relatively consistent. However, adding “big chunks” of data might lead to bigger fluctuations in scores, especially depending on the nature of that data.
For example, if the user uploads a huge amount of “bad” training data (such as poorly aligned or misaligned segments), the scores will drop considerably. We provide numerous sets of stock data that are “safe” in terms of quality, so they will not generally lead to a big drop in F-measure and BLEU scores, provided the data is domain specific. Because linguistic resources in Thai are very limited, gathering quality training data is often an issue. However, at KantanMT, we are constantly adding to our language offering, and with more translation demand from our clients in this language, we can add more quality stock data, leading to better quality MT engines.
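To make the point about misaligned segments concrete, here is a minimal sketch of one common pre-upload check: dropping segment pairs whose lengths differ implausibly. This is an illustration only and not a description of how KantanMT processes data; the function name `filter_aligned_pairs`, the `max_ratio` threshold and the sample sentences are all assumptions made for the example. It counts whitespace-separated tokens, so for unsegmented scripts such as Thai a character count would be a more sensible measure.

```python
def filter_aligned_pairs(pairs, max_ratio=3.0):
    """Keep (source, target) pairs whose token-count ratio looks plausible.

    Hypothetical helper: max_ratio is an arbitrary illustrative threshold.
    """
    kept = []
    for source, target in pairs:
        src_len = len(source.split())
        tgt_len = len(target.split())
        # An empty side is almost certainly an alignment error.
        if src_len == 0 or tgt_len == 0:
            continue
        ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
        if ratio <= max_ratio:
            kept.append((source, target))
    return kept


# Invented sample data: one good pair, one suspiciously long target, one empty source.
pairs = [
    ("The engine was trained on legal documents.",
     "Le moteur a été entraîné sur des documents juridiques."),
    ("Click Save.",
     "Cliquez sur Enregistrer pour conserver toutes les modifications "
     "apportées au document avant de quitter."),
    ("", "Segment vide."),
]

clean = filter_aligned_pairs(pairs)
print(len(clean), "of", len(pairs), "pairs kept")
```

A length ratio is a crude signal: it catches gross misalignments but says nothing about whether a well-proportioned pair is actually a correct translation.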
Detailed analysis: All users can access KantanBuildAnalytics™ to receive a detailed breakdown of engine quality. Gap analysis is useful for identifying areas that need more data. It provides the user with a complete breakdown of the test set, segment by segment, which gives the user a detailed insight into the viability of the training data.
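As a rough illustration of what a segment-by-segment score can look like, the sketch below computes a generic token-overlap F-measure for each segment and lists the weak ones. It is not the calculation used by KantanBuildAnalytics™; the function `segment_f_measure`, the 0.5 cut-off and the sample segments are assumptions made for this example.

```python
from collections import Counter


def segment_f_measure(candidate, reference):
    """Token-overlap F-measure between an MT output and its reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often
    # as it appears in both segments.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented (MT output, reference) pairs standing in for a test set.
test_set = [
    ("the contract is valid", "the contract is valid"),
    ("the agreement is valid", "the contract is valid"),
    ("engine overheated", "the file could not be saved"),
]

# Segments scoring under the (arbitrary) cut-off hint at where data is thin.
weak = [ref for cand, ref in test_set if segment_f_measure(cand, ref) < 0.5]
print(weak)
```

This measure ignores word order, which is one reason it is usually read alongside an n-gram metric such as BLEU rather than on its own.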
These videos explain KantanBuildAnalytics™ in further detail:
KantanMT: What is your impression of the translation industry, and in your opinion, what do you think the industry will look like in the future?
Student: In my view, the translation industry does not exist on its own; it is in fact a part of any industry that requires multilingual communication. It is an indispensable industry in this era of globalization. In the future, to accommodate other industries that have moved or extended from offline to online platforms, the translation industry will also have to operate more in the online world and rely on technologies such as Machine Translation or online translation memory platforms to cope with greater demand and overcome physical distance.
KantanMT: Any parting words?
Student: I was a scholarship student at UCL, and I will be working for Thailand’s Ministry of Culture for two years. However, I have to wait until all the grades are issued before I can begin my exciting career, so right now I am enjoying my temporary life of leisure and working on some freelance projects. I hope everyone on my course at UCL will have another fruitful semester with a bunch of brilliant students.