KantanMT has an ongoing Academic Partnership with the Centre for Multidisciplinary and Intercultural Inquiry (CMII) at University College London to accelerate research and learning in the field of Machine Translation (MT). The postgraduate students of the department were able to use the KantanMT platform to update or gain new skills in Translation Technology. With the help of the KantanMT platform, the students learnt how to build and customise their own Statistical Machine Translation (SMT) systems in a real-world scenario.
Master’s student Rafaella Athanasiadi of University College London submitted her thesis as part of the MSc degree in Scientific, Technical and Medical Translation with Translation Technology. Rafaella was supervised by Teaching Fellow and Lecturer Dr. Emmanouela Patiniotaki, and she used KantanMT.com for her research. This guest blog post looks at some of her conclusions on Machine Translation and the Localization Industry.
As Hutchins & Somers (c1992:1) argue, “the mechanization of translation has been one of humanity’s oldest dreams.” During the 20th century, the translation process changed radically. Instead of spending endless hours in libraries to find the translation of a word, the translator now sits at the centre of dozens of assistive tools. Today there are many translation software packages, terminology extraction tools, project management components and machine translation systems, to name just a few, from which translators can choose while translating.
However, shifting the focus to audiovisual translation, it can be observed that far fewer radical changes took place in that area, at least not until machine translation was introduced in projects such as MUSA and SUMAT, which developed machine translation engines to optimise the subtitling process. Still, the results of such projects do not seem satisfactory enough to give subtitling software developers and subtitlers the confidence to implement these engines in the subtitling process.
Based on my personal research, which focused primarily on the European setting, it seems that at the moment only the freeware SRT Translator incorporates machine translation while also offering the features that subtitling software usually includes (i.e. uploading multimedia files and timecoding subtitles). Nonetheless, SRT Translator, which is not widely known among subtitlers, uses solely Google Translate by default, a general-domain machine translation engine that is arguably not suitable for audiovisual translation. The quality of the Google Translate output was tested by translating 35 subtitles of a comedy series. The output was incomprehensible and misleading in many cases.
Even though no further records of traditional subtitling software that incorporates machine translation could be found, there are many online translation platforms that allow users to upload and translate subtitles. In the European market, these can be translation software like MemoQ, SDL Trados Studio and Wordfast, which offer the ability to load subtitle files and in some cases link them to the audiovisual content they belong to; open source tools for translators like Google Translator Toolkit (GTT); or professional and private platforms like Transifex and XTM International, which are used by companies and offered to their dedicated network of translators. Nonetheless, in order to enable machine translation in all the above applications, API keys must be purchased. GTT is an exception since it can be used for free anytime and only requires a Gmail account.
The fact that subscription fees have to be paid along with the costs of API keys for each machine translation engine provider puts the usability of these platforms in question, since costs may outweigh subtitlers’ profits. Furthermore, these platforms cannot fully accommodate subtitlers’ needs; for instance, it is not always possible to upload and play multimedia files while translating the subtitles, nor are synchronization features offered for timecoding the subtitles to the audio track. Transifex, however, is an exception since this localization platform offers users the option to upload multimedia files in the translation editor while translating the subtitles.
According to Macklovitch (2000:1), a translation memory is considered to be “a particular type of translation support tool that maintains a database of source and target language sentence pairs, and automatically retrieves the translation of those sentences in a new text which occur in the database.” Even though machine translation engines were developed through different projects to reduce subtitling time as much as possible, no attempts to integrate a translation memory tool into subtitling software for optimizing subtitling could be traced during this research, at least in a European, Asian and Australian setting. As Smith (2013) argues, “traditionally subtitling has fallen outside the scope of translation memory packages, perhaps as it was thought to be too creative a process to benefit from the features such software offers.” However, as Diaz-Cintas (2015:638) discusses, “DVD bonus material, scientific and technical documentaries, edutainment programmes, and corporate videos tend to contain the high level of lexical repetition that makes it worthwhile for translation companies to employ assisted translation and memory tools in the subtitling process.”
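To make Macklovitch’s definition more concrete, the following is a minimal sketch in Python of a translation memory as a store of source and target sentence pairs with exact-match retrieval. The class, its method names and the example subtitles are all hypothetical and are not taken from any of the tools discussed here; commercial translation memory tools add fuzzy matching, metadata and much more.

```python
from typing import Dict, List, Optional


class TranslationMemory:
    """Toy translation memory: a store of source/target sentence pairs."""

    def __init__(self) -> None:
        self._pairs: Dict[str, str] = {}

    @staticmethod
    def _normalise(text: str) -> str:
        # Lower-case and collapse whitespace so trivial differences
        # do not prevent a match.
        return " ".join(text.lower().split())

    def add(self, source: str, target: str) -> None:
        # Store the translation under the normalised source sentence.
        self._pairs[self._normalise(source)] = target

    def lookup(self, source: str) -> Optional[str]:
        # Return the stored translation, or None if the sentence is new.
        return self._pairs.get(self._normalise(source))

    def pretranslate(self, sentences: List[str]) -> List[Optional[str]]:
        # Retrieve translations for every sentence of a new text.
        return [self.lookup(sentence) for sentence in sentences]


tm = TranslationMemory()
tm.add("Press the red button.", "Appuyez sur le bouton rouge.")

new_text = ["Press the  red button.", "Close the lid."]
print(tm.pretranslate(new_text))
```

The first sentence is retrieved from the memory despite the extra space, while the second has no stored pair and would be left for the translator. This is the kind of lexical repetition that Diaz-Cintas describes for documentaries and corporate videos.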
Even if such tools have not been integrated in subtitling software, translation memory components are used for subtitling purposes in cloud-based platforms such as GTT, Transifex and XTM International, as well as in translation software such as MemoQ, SDL Trados Studio, Wordfast Pro and Transit NXT, by simply creating a translation memory before or while translating. It should be noted that, among the tools discussed in this research, Transit NXT is the only translation software that can accommodate the needs of subtitlers to a high level. Apart from the addition of specialized filters to load subtitles (which also exist in MemoQ, SDL Trados Studio and Wordfast Pro), subtitlers can upload multimedia files, translate subtitles while a translation memory component is active and also synchronise their subtitles with the Transit translation editor (Smith, 2013).
Figure 1: The translation editor of Transit NXT by Smith (2013)
OOONA, a company founded in 2012, has taken a very interesting approach to subtitling by developing a unique cloud-based toolkit that is built exclusively to accommodate the needs of subtitlers. When asked the following question within the context of the MSc thesis,
Considering that other cloud-based translation platforms like GTT, Transifex and XTM International offer the option of uploading a TM or a terminology management component, do you think that it is important to offer it on a subtitling platform as well?
the representative of OOONA (Alex Yoffe) replied that not only will the company implement translation memory and terminology management components in the next phase of enhancing their platform but that they also consider these components to be very important for the subtitling process. In addition, Yoffe (2015) argued that OOONA intends to “add the option of using MT engines. Translators will be able to choose between Microsoft’s, Google’s, or customisable MT engines.” Therefore, it seems that OOONA will become a very powerful tool in the near future, with features that will optimise the subtitling process to the maximum and reshape the way that subtitling has been carried out until now. The fact that Screen Systems, Cavena and EZTitles have partnered with OOONA is an indicator of how much potential there is in this toolkit.
As can be argued based on the above, there is a lack of subtitling software with incorporated translation memory tools. Therefore, this issue was further researched by means of an online questionnaire that was disseminated to subtitling companies and freelance subtitlers. In addition, two companies that develop subtitling software, Screen Subtitling Systems and EZTitles, were asked to present their views on this topic. In both cases, their willingness to optimise the subtitling process in a semi-automated or a fully-automated way was apparent through their answers. The former company was in favour of a combination of machine translation tools with translation memory tools whereas the latter leaned towards a subtitling system with integrated translation memory and terminology management tools.
Nonetheless, the optimisation of the subtitling process has to coincide with the needs and preferences of subtitlers. Based on the respondents’ answers, it is clear that subtitlers want translation memory tools in subtitling software. In response to the question,
Which tool would you prefer to have in a subtitling software? An integrated translation memory (TM) or machine translation (MT)?
more than half of the respondents (56.8%) chose TM. Interestingly, the answer “Both” received the second highest percentage (20.5%), which indicates that subtitlers want as many assistive tools as possible.
One of the main conclusions that were drawn from this research was that machine translation engines need to be customised to produce good quality output and this can be achieved through customisable engines like KantanMT and Milengo. Moreover, translation memory tools are sought by subtitlers in subtitling software, while cloud-based platforms seem to occupy the translation industry today. Following this trend, subtitling software providers partner with online services/tools like the OOONA toolkit.
Based on the outcomes of this research, it could be said that we are certainly experiencing a new era in subtitling, since traditional PC-based subtitling software is now transforming into flexible and accessible platforms to enhance the subtitling experience as much as possible. It is only a matter of time before we see which tool and platform will rule the subtitling industry, but one thing is for sure: the technologies of the future will bring a lot of changes to the traditional way of subtitling.
Diaz-Cintas, J. (2015). Technological Strides in Subtitling. In S. Chan (ed.), Routledge Encyclopedia of Translation Technology. London: Routledge, pp. 632-643.
Hutchins, J. W. & Somers, H. L. (c1992). An introduction to machine translation. London: Academic Press.
Macklovitch, E. (2000). Two Types of Translation Memory. In Proceedings of the ASLIB Conference on Translating and the Computer (Vol. 22).
Smith, S. (2013). New Subtitling Feature in Transit NXT. 11 November 2013. [Online]. Available from: http://www.star-uk.co.uk/blog/subtitling/working-with-subtitles-in-transit-nxt/. [Accessed 01 Sept. 2015].
Yoffe, A (2015). MT and TM tools in subtitling. [Interview]. 13 August 2015.
 Relevant data are available in Appendix 1 of the MSc thesis.
I’m new to machine translation and one of the things I’ve been doing at KantanMT is learning how to refine training data with a view to building stock engines.
Stock engines are the optional training data provided by KantanMT to improve the performance of your customized MT engine. In this post I’m going to describe the process of building an engine and refining the training data.
The building process on the platform is quite simple. From your dashboard on the website select “My Client Profiles”, where you will find two profiles that have already been set up: a default profile and a sample profile, both of which let you run translation jobs straight away.
To create your own customized profile select ‘New’ at the top of the left-most column. This launches the Client Profile Wizard. Enter the name of your new engine; try to make this something meaningful, or follow a consistent, easily recognizable naming convention. This makes it easier to tell your profiles apart when you have more than one.
When you select ‘Next’ you will be asked to specify the source and target languages from drop-down menus. The wizard lets you distinguish between different variants of the same language, for example Canadian English or US English. Let’s say we’re translating from Canadian English to Canadian French. If you’re not sure which variant you need, have a quick look at the training data, which will give you the language codes.
The next step gives you an option to select a stock engine from a drop down menu. The stock engines are grouped according to their business area or domain.
You will see a summary of your choices; if you’re happy with them, select ‘Create’. Your new engine will be shown in the list of your client profiles. However, while you have created your engine, you haven’t yet built it.
Building Your Engine
Selecting your profile from the list will make it the current active engine. By selecting the Training Data tab you can upload any additional training data easily by using the drag and drop function. Then select the ‘Build’ option to begin building your engine.
It’s always a good idea to supply as much useful training data as possible. This ‘educates’ the engine in the way your organization typically translates text.
Once the build job has been submitted, you can monitor its progress in the ‘My Jobs’ page.
When the job is completed the BuildAnalytics™ report is created. This can be accessed by clicking on the database icon to the left of the profile name. BuildAnalytics will give you feedback on the strength of your engine using industry standard scores, as well as details about your engine’s word count. The tabs across the page will give you access to more detail.
The summary tab lets you see the average BLEU, F-Measure and TER scores for the engine, and the pie charts show you a summary of the percentage scores for all segments. For more detail select the respective tabs and use the data to investigate individual segments.
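If, like me, you are new to these scores, it can help to see one of them worked out by hand. The sketch below computes a simple word-level F-Measure between an engine’s output and a reference translation in Python. This is the general textbook definition (the harmonic mean of precision and recall over matching words); how BuildAnalytics calculates its own scores is not described in this post, so treat the function as an illustrative assumption rather than the platform’s method.

```python
from collections import Counter


def f_measure(candidate: str, reference: str) -> float:
    """Word-level F-measure between an MT output and a reference translation."""
    cand_words = Counter(candidate.lower().split())
    ref_words = Counter(reference.lower().split())

    # Number of words the two sentences share, respecting repeated words.
    overlap = sum((cand_words & ref_words).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_words.values())  # share of output words that are correct
    recall = overlap / sum(ref_words.values())      # share of reference words that were produced
    return 2 * precision * recall / (precision + recall)


print(f_measure("the cat sat on the mat", "the cat sat on the mat"))
print(f_measure("the cat", "the cat sat on"))
```

A perfect match scores 1.0, and a partial match scores lower because some reference words are missing from the output.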
A Rejects Report is created for every file of Training Data uploaded. You can use this to determine why some of your data is not being used, and improve the uptake rate of your data.
Gap analysis gives you an effective way to improve your engine with relevant glossary or noise lists, which you can upload to future engine builds. By adding these terminology files in either TBX (TermBase eXchange) or XLSX (Microsoft Excel Spreadsheet) format you will quickly improve the engine’s performance.
The Timeline tab shows you the evolution of your engine over its lifetime. This feature lets you compare the statistics with previous builds, and track all the data you have uploaded. On a couple of occasions, I used the archive feature to revert to a previous build when the engine building process was not going according to plan.
Improving Your Engine
A great way to improve your engine’s performance is to analyze the rejects report for the files with a higher rejection rate. Once you understand the reasons segments are rejected you can begin to address them. For example, an error 104 is caused by a difference in placeholder counts. This can be something as simple as the source language using the % sign where the target language uses the word ‘percent’. In this case a preprocessor rule can be created to fix the problem.
A PEX rule editor is accessed from the KantanMT drop-down menu. This lets you try out your preprocessor rules, and see the effect that they have on the data. I would suggest directly copying and pasting from the rejects report to the test area and applying your PEX rule to ensure you’re precisely targeting the data concerned. You can get instant feedback using this tool.
Once you’re happy with the way the rules work on the rejected data it’s useful to analyze the rest of the data to see what effect the rules will have. You want to avoid a situation where using a rule resolves 10 rejects, but creates 20 more. Once the rules are refined, copy them to the appropriate files (source.ppx, target.ppx) and upload them with the training data. Remember that the rules will run against the content in the order they are specified.
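To illustrate the idea behind this kind of rejection, here is a small Python sketch that counts placeholders on each side of a segment pair and shows how a preprocessing step can bring the counts back into line. It is only an analogy: the pattern used for a ‘placeholder’, the function names and the example sentences are my own assumptions, and the code is plain Python regular expressions, not PEX rule syntax.

```python
import re

# Assumption: a "placeholder" is a % sign or a {0}-style token. This is for
# illustration only; the platform's own definition may differ.
PLACEHOLDER = re.compile(r"%|\{\d+\}")


def placeholder_mismatch(source: str, target: str) -> bool:
    """Return True when the two segments contain different placeholder counts."""
    return len(PLACEHOLDER.findall(source)) != len(PLACEHOLDER.findall(target))


def normalise_percent(text: str) -> str:
    """Hypothetical preprocessing step: rewrite '50%' as '50 percent'."""
    return re.sub(r"(\d)\s*%", r"\1 percent", text)


source = "Sales rose by 50% last year."
target = "Les ventes ont augmenté de 50 pour cent l'an dernier."

print(placeholder_mismatch(source, target))                     # before the rule
print(placeholder_mismatch(normalise_percent(source), target))  # after the rule
```

Before the rule is applied the source has one % sign and the target has none, so the pair would be flagged. After the rule both sides contain no placeholders and the mismatch disappears.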
When you rebuild the engine they will be incorporated, and hopefully improve the scores.
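The two cautions above, that rule order matters and that a rule can create more rejects than it resolves, can be sketched in a few lines of Python. Everything here is hypothetical: the rules are ordinary regular expressions rather than PEX syntax, the reject test is a simple % count comparison, and the sample data is invented for the example.

```python
import re
from typing import Callable, List, Tuple

Rule = Tuple[str, str]  # (regex pattern, replacement); not PEX syntax


def apply_rules(text: str, rules: List[Rule]) -> str:
    # Rules run in list order, so each rule sees the previous rule's output.
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def count_rejects(pairs: List[Tuple[str, str]], rules: List[Rule],
                  is_rejected: Callable[[str, str], bool]) -> int:
    # Apply the rules to each source segment, then count rejected pairs.
    return sum(1 for src, tgt in pairs if is_rejected(apply_rules(src, rules), tgt))


def percent_mismatch(src: str, tgt: str) -> bool:
    # Assumed reject test: the two sides contain a different number of % signs.
    return src.count("%") != tgt.count("%")


# 1) Order matters: the same two rules give different results when swapped.
print(apply_rules("colour", [("colour", "color"), ("color", "hue")]))
print(apply_rules("colour", [("color", "hue"), ("colour", "color")]))

# 2) Check a rule against ALL the data, not just the rejected segments.
pairs = [
    ("Up 50%", "En hausse de 50 pour cent"),  # rejected without the rule
    ("Save 20%", "Économisez 20%"),           # fine without the rule
    ("10% off", "10% de réduction"),          # fine without the rule
]
percent_rule = [(r"(\d)\s*%", r"\1 percent")]

before = count_rejects(pairs, [], percent_mismatch)
after = count_rejects(pairs, percent_rule, percent_mismatch)
print(before, after)
```

In this invented data set the rule fixes the one rejected pair but breaks the two that were fine, so the reject count goes up. That is exactly the situation a run over the full data set is meant to catch before you upload the rules.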
Sue’s 3 Tips for Successfully Building MT Engines
- Name your profiles clearly – When you are using a number of profiles simultaneously knowing what each one is (Language pair/domain) will make it much easier as you progress through the building process.
- Take advantage of BuildAnalytics – Use the insights and Gap analysis features to give you tips on improving your engine. Listening to these tips can really help speed up the engine refinement process.
- The PEX Rule Editor is your friend – Don’t be afraid to try out creating and using new PEX rules; if things go south you can always go back to previous versions of your engine.
My internship at KantanMT.com really opened my eyes to the world of language services and machine translation. Before joining the team I knew nothing about MT or the mechanics behind building engines. This was a great experience, and being part of such a smoothly run development team was an added bonus that I will take with me when I return to ITB to finish my course.
About Sue McDermott
Sue is currently studying for a Diploma in Computer Science at ITB (Institute of Technology Blanchardstown). Sue joined KantanMT.com on a three-month internship. She has a degree in English Literature and a background in business systems, and has also been a full-time mum for the last 17 years.
Email: firstname.lastname@example.org, if you have any questions or want more information on the KantanMT platform.
KantanMT had an exciting year as it transitioned from a publicly funded business idea into a commercial enterprise that was officially launched in June 2013. The KantanMT team are delighted to have surpassed expectations, by developing and refining cutting edge technologies that make Machine Translation easier to understand and use.
Here are some of the highlights for 2013, as KantanMT looks back on an exceptional year.
Strong Customer Focus…
The year started on a high note, with the opening of a second office in Galway, Ireland, and KantanMT kept the forward momentum going as the year progressed. The Galway office is focused on customer service, product education and Customer Relationship Management (CRM), and is home to Aidan Collins, User Engagement Manager, Kevin McCoy, Customer Relationship Manager and MT Success Coach, and Gina Lawlor, Customer Relationship co-ordinator.
KantanMT officially launched the KantanMT Statistical Machine Translation (SMT) platform as a commercial entity in June 2013. The platform was tested pre-launch by both industry and academic professionals, and was presented at the European OPTIMALE (Optimizing Professional Translator Training in a Multilingual Europe) workshop in Brussels. OPTIMALE is an academic network of 70 partners from 32 European countries, and the organization aims to promote professional translator training as the translation industry merges with the internet and translation automation.
The KantanMT Community…
The KantanMT members’ community now includes top-tier Language Service Providers (LSPs), multinationals and smaller organizations. In 2013, the community grew from 400 members in January to 3400 registered members in December, and in response to this growth, KantanMT introduced two partner programs, with the objective of improving the Machine Translation ecosystem.
These are the Developer Partner Program, which supports organizations interested in developing integrated technology solutions, and the Preferred Supplier of MT Program, dedicated to strengthening the use of MT technology in the global translation supply chain. KantanMT’s Preferred Suppliers of MT are:
To date, the most popular target languages on the KantanMT platform are French, Spanish and Brazilian Portuguese. Members have uploaded more than 67 billion training words and built approx. 7,000 customized KantanMT engines that translated more than 500 million words.
As usage of the platform increased, KantanMT focused on developing new technologies to improve the translation process, including a mobile application for iOS and Android that allows users to get access to their KantanMT engines on the go.
KantanMT’s Core Technologies from 2013…
KantanMT have been kept busy continuously developing and releasing new technologies to help clients build robust business models to integrate Machine Translation into existing workflows.
- KantanAnalytics™ – segment-level Quality Estimation (QE) analysis, expressed as a percentage ‘fuzzy match’ score on KantanMT translations, which provides a straightforward method for costing and scheduling translation projects.
- BuildAnalytics™ – QE feature designed to measure the suitability of the uploaded training data. The technology generates a segment level percentage score on a sample of the uploaded training data.
- KantanWatch™ – makes monitoring the performance of KantanMT engines more transparent.
- TotalRecall™ – combines TM and MT technology: TM matches with a ‘fuzzy match’ score of less than 85% are automatically put through the customized MT engine, giving users the benefits of both technologies.
- KantanISR™ – Instant Segment Retraining technology that allows members near-instantaneous correction and retraining of their KantanMT engines.
- PEX Rule Editor – an advanced pattern matching technology that allows members to correct repetitive errors, which makes post-editing smoother by reducing effort, cost and time.
- Kantan API – critical for the development of software connectors and smooth integration of KantanMT into existing translation workflows. The success of the MemoQ connector led to the development of subsequent connectors for MemSource and XTM.
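The routing idea described for TotalRecall™ in the list above (use the TM match when it scores high enough, otherwise fall back to MT) can be sketched roughly as follows in Python. This is not KantanMT’s implementation: the similarity score comes from Python’s standard `difflib.SequenceMatcher` as a stand-in for a real fuzzy match score, and the function names, the stub MT engine and the example segments are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Tuple

FUZZY_THRESHOLD = 0.85  # the 85% cut-off mentioned in the list above


def best_tm_match(segment: str, tm: Dict[str, str]) -> Tuple[float, str]:
    """Return (score, translation) for the closest source sentence in the TM."""
    best_score, best_target = 0.0, ""
    for source, target in tm.items():
        # Stand-in similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_score, best_target = score, target
    return best_score, best_target


def translate(segment: str, tm: Dict[str, str],
              mt_engine: Callable[[str], str]) -> Tuple[str, str]:
    """Use the TM match at or above the threshold; otherwise send the segment to MT."""
    score, target = best_tm_match(segment, tm)
    if score >= FUZZY_THRESHOLD:
        return target, "TM"
    return mt_engine(segment), "MT"


def fake_mt(segment: str) -> str:
    # Stub standing in for a customized MT engine.
    return f"<MT output for: {segment}>"


tm = {"Click the Save button.": "Cliquez sur le bouton Enregistrer."}

print(translate("Click the Save button.", tm, fake_mt))
print(translate("Restart the server.", tm, fake_mt))
```

The first segment is an exact TM match and is served from the memory; the second has no close match, so it is routed to the stub MT engine.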
KantanMT sourced and cleaned a range of bi-directional domain specific stock engines that consist of approx. six million words across legal, medical and financial domains and made them available to its members. KantanMT also developed support for Traditional and Simplified Chinese, Japanese, Thai and Croatian Languages during 2013.
Recognition as Business Innovators…
KantanMT received awards for business innovation and entrepreneurship throughout the year. Founder and Chief Architect, Tony O’Dowd was presented with the ICT Commercialization award in September.
In October, KantanMT was shortlisted for the PITCH start-up competition and participated in the ALPHA Program for start-ups at Dublin’s Web Summit, the largest tech conference in Europe. Earlier in the year KantanMT was also shortlisted for the Vodafone Start-up of the Year awards.
KantanMT were silver sponsors of the 2013 ASLIB conference, ‘Translating and the Computer’, which took place in London in November, and in October Tony O’Dowd presented at the TAUS Machine Translation Showcase at Localization World in Silicon Valley.
KantanMT have recently published a white paper introducing its cornerstone Quality Estimation technology, KantanAnalytics, and how this technology provides solutions to the biggest industry challenges facing widespread adoption of Machine Translation.
For more information on how to introduce Machine Translation into your translation workflow contact Niamh Lacy (email@example.com).
Things are winding down as we are getting closer to the end of the year, but there are still some great events and webinars coming up during the month of December that we can look forward to.
Here are some recommendations from KantanMT to keep you busy in the lead up to the festive season.
Dec 02 – Dec 05, 2013
Event: IEEE CloudCom 2013, Bristol, United Kingdom
Held in association with Hewlett-Packard Laboratories (HP Labs), the conference is open to researchers, developers, users, students and practitioners from the fields of big data, systems architecture, services research, virtualization, security and high performance computing.
Dec 04, 2013
Event: LANGUAGES & BUSINESS Forum – Hotel InterContinental Berlin
The forum highlights key issues in language education, particularly in the workplace, and the new technologies that are becoming a key part of the process. The event will promote international networking and has four main themes: Corporate Training, Pre-Experience Learners, Intercultural Communication and Online Learning.
Dec 05, 2013
Webinar: Effective Post-Editing in Human and Machine Translation Workflows
Stephen Doherty and Federico Gaspari, CNGL (Centre for Next Generation Localisation) will give an overview of post-editing and different post-editing scenarios from ‘gist’ to ‘full’ post-edits. They will also give advice on different post-editing strategies and how they differ for Machine Translation systems.
Dec 07 – Dec 09, 2013
Event: 6th Language and Technology Conference, Poznan, Poland
The conference will address the challenges of Human Language Technologies (HLT) in computer science and linguistics. The event covers a wide range of topics, including electronic language resources and tools, formalisation of natural languages, parsing and other forms of NL processing.
Dec 09 – Dec 13, 2013
Event: IEEE GLOBECOM 2013 – Power of Global Communications, Atlanta, Georgia USA
The conference, organised by the second largest of the 38 IEEE technical societies, will focus on the latest advancements in broadband, wireless, multimedia, internet, image and voice communications. Some of the topics relating to localization will be presented on 10th December and include Localization Schemes, Localization and Link Layer Issues, and Detection, Estimation and Localization.
Dec 10 – Dec 11, 2013
Event: Game QA & Localization 2013, San Francisco, California USA
This event brings together QA and Localisation Managers, Directors and VPs from game developers around the world to discuss key game localization industry challenges. The event in London, June 2013 was a huge success, as more than 120 senior QA and localization professionals from developers, publishers and 3rd party suppliers of all sizes and platforms came to learn, benchmark and network.
Dec 11 – Dec 15, 2013
Event: International Conference on Language and Translation, Thailand, Vietnam and Cambodia
The Association of Asian Translation Industry (AATI) is holding an International Conference on Language and Translation or “Translator Day” in three countries: Thailand on December 11, 2013, Vietnam on December 13, 2013, and Cambodia on December 15, 2013. The events provide translators, interpreters, translation agencies, foreign language centres, NGOs, FDI-financed enterprises and other translation purchasers with opportunities to meet.
Dec 12, 2013
Webinar: LSP Partnerships & Reseller Programs 16:00 GMT (11:00 EST/17:00 CET)
This webinar, which is hosted by GALA and presented by Terena Bell, covers how to open up new revenue streams by introducing reseller programs to current business models. The webinar is aimed at world trade associations, language schools, and other non-translation companies wishing to offer their clients translation, interpreting, or localization services.
Dec 13 – Dec 14 2013
Event: The Twelfth Workshop on Treebanks and Linguistic Theories (TLT12), Sofia (Bulgaria)
The workshops, hosted by the BulTreeBank Group, serve to promote new and ongoing high-quality work related to syntactically-annotated corpora such as treebanks. Treebanks are important resources for Natural Language Processing applications including Machine Translation and information extraction. The workshops will focus on different aspects of treebanking: descriptive, theoretical, formal and computational.
Are you planning to go to any events during December? KantanMT would like to hear about your thoughts on what makes a good event in the localization industry.
Crowdsourcing has become more popular with both organizations and companies since the concept’s introduction in 2006, and has been adopted by companies that use this new production model to improve their production capacity while keeping costs low. The web-based business model uses an open call format to reach a wide network of people willing to volunteer their services for free or for a limited reward, for any activity including translation. The application of translation crowdsourcing models has opened the door to increased demand for multilingual content.
“…the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call”.
Crowdsourcing costs equate to approx. 20% of a professional translation. Language Service Providers (LSPs) like Gengo and Moravia have realised the potential of crowdsourcing as part of a viable production model, which they are combining with professional translators and Machine Translation.
The crowdsourcing model is an effective method for translating the surge in User Generated Content (UGC). Erratic fluctuations in demand need a dynamic, flexible and scalable model. Crowdsourcing is definitely a feasible production model for translation services, but it still faces some considerable challenges.
- No specialist knowledge – crowdsourcing is difficult for technical texts that require specialised knowledge. It often involves breaking down a text to be translated into smaller sections to be sent to each volunteer. A volunteer may not be qualified in the domain and so ends up translating small sections of text, out of context and with limited subject knowledge, which leads to lower quality or mistranslations.
- Quality – translation quality is difficult to manage, and is dependent on the type of translation. There have been some innovative suggestions for measuring quality, including evaluation metrics such as BLEU and Meteor, but these are costly and time consuming to implement and need a reference translation or ‘gold standard’ to benchmark against.
- Security – crowd management can be a difficult task and the moderator must be able to vet participants and make sure that they follow the privacy rules associated with the platform. Sensitive information that requires translation should not be released to volunteers.
- Emotional attachment – humans can become emotionally attached to their translations.
- Terminology and writing style inconsistency – when the project is divided amongst a number of volunteers, the final version’s style needs to be edited and checked for inconsistencies.
- Motivation – decisions on how to motivate volunteers and keep them motivated can be an ongoing challenge for moderators.
Improvements in the quality of Machine Translation have had an influence on crowdsourcing popularity and the majority of MT post-editing and proofreading tasks fit into crowdsourcing models nicely. Content can be classified into ‘find-fix-verify’ phases and distributed easily among volunteers.
There are some advantages to be gained when pairing MT technology and collaborative crowdsourcing.
Machine Translation will have a pivotal role to play within new translation models, which focus on translating large volumes of data in cost-effective and powerful production models. Merging both Machine Translation and crowdsourcing tasks will create not only fit-for-purpose, but also high quality translations.
- Quality – as the overall quality of Machine Translation output improves, it is easier for crowdsourcing volunteers with less experience to generate better quality translations. This will in turn increase the demand for crowdsourcing models to be used within LSPs and organizations. MT quality metrics will also make post-editing tasks more straightforward and easier to delegate among volunteers based on their experience.
- Training data – word alignment and engine evaluations can be done through crowd computing, and parallel corpora created by volunteers can be used to train and/or retrain existing SMT engines.
- Security – customized Machine Translation engines are more secure when dealing with sensitive product or client information. General or publicly available information is more suited to crowdsourcing.
- Terminology and writing style consistency – writing style and terminology can be controlled and updated through a straightforward process when using MT. This avoids the idiosyncrasies of volunteer writing styles. There is no risk of translator bias when using Machine Translation.
- Speed – Statistical Machine Translation (SMT) engines can process translations quickly and efficiently. When there is a need for a high volume of content to be translated within a short period of time it is better to use Machine Translation. Output is guaranteed within a designated time and crowdsourcing post-editing tasks speeds up the production process before final checks are carried out by experienced translators or post-editors.
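The idea of delegating post-editing tasks according to MT quality and volunteer experience can be sketched as a simple routing rule. Everything here is hypothetical: the `quality` score stands in for whatever MT quality metric is available, and the thresholds and queue names are illustrative assumptions rather than part of any KantanMT workflow.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    quality: float  # hypothetical MT quality estimate between 0 and 1


def route_segments(segments, easy_threshold=0.8, hard_threshold=0.5):
    """Group MT segments into post-editing queues by estimated quality.

    High-scoring segments need light edits and go to less experienced
    volunteers; low-scoring segments go to professional post-editors.
    """
    queues = {"novice": [], "experienced": [], "professional": []}
    for seg in segments:
        if seg.quality >= easy_threshold:
            queues["novice"].append(seg)
        elif seg.quality >= hard_threshold:
            queues["experienced"].append(seg)
        else:
            queues["professional"].append(seg)
    return queues
```

In practice the thresholds would need tuning against real post-editing effort, and sensitive content would be kept out of the volunteer queues altogether, as the security point above suggests.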
European Multilingual Blogging Day Friday 15th 2013
European Multilingual Blogging Day 2013, also known as EMBD2013, is organised by the European Commission. The event has a European focus and aims to promote the “multilingual dimension” of the internet. The internet is a melting pot of different languages and, thanks to the content explosion and its increased global availability, everyone’s voice can be heard.
KantanMT asked bilingual and multilingual people with an interest in multiculturalism, languages and translation to take part in celebrating the event by writing about their own experiences.
The EU currently has 24 official and working languages. The first official languages, decided in 1958, were Dutch, French, German and Italian. As the EU grew in size, so too did its number of languages. Now, all EU regulatory and legislative documents are published in all its official languages.
KantanMT would like to say a special thanks to the 11 people who contributed to EMBD2013 by writing about their own experiences:
- Croatian – Petra Postic
- Danish – Anne Overgaard, Law and Social Science Graduate
- Finnish – Leena A. Saarinen, Researcher at Finland Futures Research Centre (FFRC)
- French – Anna Curran, Senior Administrative Assistant at UCD Applied Language Centre
- German – Deirdre Hallinan, Medical Scientist, St. Vincent’s Hospital
- Irish – Declan Quinn, Translator – (FR-EN, EN-GA, GA-EN)
- Italian – Serena Peruzzi, Junior Web Content Analyst at Webroot
- Polish – Marek Mazur, KantanMT Software Developer
- Portuguese – Bruno Ferrer Silva, Barista
- Slovak – Monika Palova, Science Student
- Spanish – Emamanuela Quesada, Restaurant Manager
Croatian – Petra Postic. Being multilingual – I started learning English at seventeen. I picked it up easily and began using it a couple of years later, when I moved to Dublin. I never even thought about how much knowing the language helped me in everyday life, from looking for a job to talking to the vendors at the market. Even then, when I was exposed to English every day, I enrolled in additional courses to improve. Speaking a foreign language is not merely communicating to obtain information; it is much more. It means the birth of a true friendship where one was not originally possible; it means experiencing life in a different context and exploring other cultures from their point of view. And in my family, bilingualism, or rather trilingualism, means the possibility of living together and of the most basic thing without which a relationship would not be possible: communication.
Danish – Anne Overgaard, Law and Social Science Graduate. The overall experience I have had as a newcomer to a country where another language is spoken has been good throughout. Most of the “problems” I experienced arose at the start of my stay, and in situations where I was with many people at once. In those situations I often felt that words were flying around in the air that I could not piece together to make sense; I could only smile and nod. I do believe, however, that it is incredibly healthy to try living somewhere else, because you are then forced to just say something, even though you are afraid of sounding stupid. It can also be a help afterwards when you have to communicate with others who are not entirely comfortable in the language being spoken.
Finnish – Leena A. Saarinen, Researcher at Finland Futures Research Centre (FFRC). As an experience, multilingualism and living in a foreign-language area are best described by the word “contest”. Knowing several languages and living in a Swedish-speaking area has, in my case, led to constant competition between the different languages. At the shop checkout I notice that Spanish forces its way to the surface instead of the Swedish word, or when I phone home I can no longer find enough words that correspond to my mother tongue, because my vocabulary has filled up with the most peculiar anglicisms. The classic case, though, is perhaps catching yourself using Google Translate from English into your own mother tongue.
Yet my mother tongue partly slipping into the background to make way for new languages has been, for me, more than a series of amusing incidents with the locals. In my own experience, knowing the local language has always helped in understanding the outside world, in integrating, and in influencing my own surroundings. At the same time, my growing language capacity has also taken up space in my way of thinking and in my attitude towards others. It is not so much a matter of losing my identity or my roots as of finding a home in a completely foreign place.
French – Anna Curran, Senior Administrative Assistant at University College Dublin Applied Language Centre. Culture and translation – I started studying French at the age of 10. However, when I moved to Lyon for my Erasmus year at 20, I had trouble understanding my flatmates. During my Erasmus year, I was learning not only how to speak French, but French culture too.
I think it was this ‘cultural learning’ that inspired me to do a master’s in translation. With a better knowledge of the French language, I thought more deeply about the cultural meaning of words: how the presence of a word in one language, and its absence in another, speaks of the culture of that language and of the people who speak it.
German – Deirdre Hallinan, Medical Scientist, St. Vincent’s Hospital. Hello! My name is Deirdre. I was born in Germany, but I have lived in Ireland for 24 years. My mother is German and my father is Irish. I have spoken two languages all my life: German and English. It was important to my parents that I had both languages so that I could talk with our family, whether in Germany or in Ireland. What I like best is that I can speak German here and nobody understands it! That is especially good when you want to tell someone off! Being bilingual is also very practical for travelling. Of course, not many other countries speak German, but it is still better than nothing. It also helped a lot at school in German class; I could help my friends with their schoolwork. When I have children of my own one day, I would like them to grow up bilingual too.
Irish – Declan Quinn, Translator – (FR-EN, EN-GA, GA-EN). Irish people find it hard to imagine that our national language is so sought after abroad. It is in demand from Boston to Beijing. I myself had the opportunity to teach Irish in Europe in 2009; I was based in Brittany, in the Department of Breton and Celtic Studies at the University of Rennes 2. I had students from all the academic disciplines, all wanting to get a taste of our language. Often at the start of the academic year some students, while choosing among all the languages available to them, would ask me how difficult it would be to learn this dialect of English, that is, l’irlandais, over the course of a year. I would explain clearly that they would be learning an entirely different language that has no similarities with English. Alas, I would not see them again!
Some people had misconceptions like that about Irish, but others sought it out and took an interest in it, especially Breton speakers who are interested in the state of the language in Ireland, the only independent Celtic country (something I was often reminded of). I would not be too happy to give the honest answer!
Italian – Serena Peruzzi, Junior Web Content Analyst at Webroot. Have you ever heard it said that we have a different personality for each language we speak? I certainly believe it, and I think I am living proof!
My name is Serena, I am Italian, and I live and work in Dublin. Over the last year and a half I have developed a fair Irish accent in my spoken English, which has not only won me the affection of “native” friends and colleagues but has also given me an Irish character. Here is why: when I speak English, I complain about the weather, I tell everyone “sure you’ll be grand!” and I am polite to the people I meet in the street (and yes, I say “thanks a million” to the bus driver when I get off!). When I speak Italian, I talk loudly, I gesticulate, I often complain about Trenitalia, and I maintain a mild level of rudeness, the kind where you say sorry or please only to people you do not know.
That study recently published in The Economist is definitely true down to the last line. And besides being deadly craic to have a double personality to switch between continually, it makes me proud to feel Italian and delighted to be a bit Irish too. Is anyone else having a similar experience?
Polish – Marek Mazur, KantanMT Software Developer. My adventure with a second language, even though I learned it at school, really began when I arrived in Ireland in 2006. The decision to leave Poland was very spontaneous and quick. I did not stop to consider whether my English would be good enough to find a job in a new country and lead a ‘normal’ life. I must admit that at that stage my English was basic. Arriving in Ireland, however, quickly changed that. After a few days I got a job where I had to communicate with customers in English. So I spent my first days in Ireland not sightseeing in a new country but learning English. It was probably the most intensive course of my life. I used various books and the help of friends who spoke English fluently. Most important of all, though, was being immersed in that environment.
After a few months things were better, although the learning process continues to this day. After a year I changed jobs, having decided that it would be better for me to work among people who spoke only English. My first workplace employed almost exclusively Poles, which at the beginning was very helpful and comfortable. Over time, however, I decided that I could not shut myself away in a Polish environment. That change helped me keep developing. Later I started a degree, which I eventually completed with distinction. That was the most fruitful period in terms of learning a second language. It may not have been as intensive as right after my arrival, but every day in every class I learned many new words and met new people, and talking with them gave me confidence.
As I mentioned earlier, the process of learning a second language is still going on. It is not just learning vocabulary and grammar, but learning a culture, a mentality and how to think in another language. Every day I learn something new, and through this my knowledge and culture grow richer. Learning a second language has opened up new paths for me in my private life, career, studies, and in getting to know the world and new cultures.
Portuguese – Bruno Ferrer Silva, Barista. Part of me has always wished to express myself in more than one language; although the Portuguese language has no shortage of definitions, something has always led me to try to challenge the powers of phonemes. But the idea of taming a language, finding understanding and making oneself understood is not only fascinating but has a magnificent and unicellular impact; perhaps our experiences, emotions and cultures can be combined and, in fractions of a second, through the power of speech, we build small bridges to our souls.
Perhaps we achieve, if only for a moment, the most natural quest of a human being, which may be one of the most human feelings that we all carry, or will one day feel… to be understood.
Slovak – Monika Palova, Science Student. When I started thinking about moving from Slovakia to an English-speaking country, I thought my knowledge of English was good enough to get by. But as soon as I arrived, I found out I was wrong. I remember riding in a taxi from the airport while the taxi driver tried to chat with me. It did not work, because I could not understand a single word he said! After a month I found a job and started working with people from Ireland and other countries, which forced me to speak English constantly. After about six months things improved; I began to recognise the different Irish accents and learned the local slang. A few years later I started studying science in English, which I hope will benefit me in the future. Life in Ireland gave me the chance to learn English properly and to make friends from all over the world. Dublin is a very welcoming and multicultural city where you never feel alone. I will always be grateful that I was able to experience it.
Spanish – Emamanuela Quesada, Restaurant Manager. The beauty of being Tico. Recently I travelled around the south, Peru and Bolivia to be exact. As I had always travelled to English-speaking countries and it was my first time in the exotic south, I did not really know what to expect. Bolivia had always been a dream, but I could no longer even remember why. If I had any expectation, any fanciful notion of these lands, it was all gone with the wind, and only the absolute reality remained: the spectacular high plateaus, the desert and the moorland, the rubbish and the dryness, the altitude that makes your head fly so high you think you could never come down! I realised that it was not only us (Antonio the Italian, Rafa the Spaniard, Manu the Tica) who felt this way. In that backpacker subculture, in South America we are another republic; being a backpacker there is unique. It is a social front guided by the unexpected in the streets, by whatever the locals can offer you. Being the only Costa Rican (Tica, as we are known inside and outside our homeland), I was like a precious jewel: “Are you from Costa Rica? You don’t see many around here!”, “Are you from Costa Rica? I absolutely love your country!” or the annoying but realistic “It’s so expensive!” Of course, paradise has a price! But (and that is another matter, with thousands of pages of explanatory whys!) I finally experienced it for myself: we Ticos DO have an accent! And a strong one!! Three times: one day on a bus, a guy said two words and I asked, “Where are you from?” “From Costa Rica.” “Ayy mae, qué tuanis, me too!!” That sentence, those few words, were like setting the soul free and making the heart smile. Being able to recognise one of our own in an instant, being able to say “mae” and “pura vida”, being able to listen to Sonámbulo together… what divine pleasure… I did not know I carried it so deeply marked in me. Could it be that in an unknown world, collective identification is a healing oasis?
The next two times went the same way, though more sporadically; I met a couple more Ticos in the farthest corners of the world, and the glow, the minute of pura vida, was the same: a mixture of pride, happiness and belonging… I do not know whether this happens to many people, but I can now confirm it: proud to be Tica, my homeland calls me. Pura Vida!
Official EU Languages 2013
When the decision is made to incorporate a KantanMT engine into a translation model, the next obvious, and most difficult, question to answer is: what should be used to train the engine? This is often followed by: what are the optimum training data requirements to yield a highly productive engine? And how will I curate my training data?
The engine’s target domain and objectives should be clearly mapped out ahead of the build. If the documents are for a specific client or domain then the relevant in-domain training data should be used to build the engine. This also ensures the best possible translation results.
KantanMT recommends a minimum of 2 million training words for each domain specific engine. Higher quantities of in-domain “unique words” will also improve the potential for building an “intelligent” engine.
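As a rough way to check a corpus against this recommendation, the sketch below counts total and unique words in a collection of text lines. It is only an approximation under stated assumptions: the function name is hypothetical, words are tokenised with a simple `\w+` pattern after lowercasing, and KantanMT’s own word-counting method is not described here and may differ.

```python
import re
from collections import Counter

MIN_TRAINING_WORDS = 2_000_000  # minimum recommended above for a domain-specific engine


def corpus_stats(lines):
    """Return total words, unique words and whether the minimum is met."""
    counts = Counter()
    for line in lines:
        # Lowercase so that "The" and "the" count as one unique word.
        counts.update(re.findall(r"\w+", line.lower()))
    total = sum(counts.values())
    return {
        "total_words": total,
        "unique_words": len(counts),
        "meets_minimum": total >= MIN_TRAINING_WORDS,
    }
```

For example, `corpus_stats(open("source.txt", encoding="utf-8"))` would report on a plain-text file, where `source.txt` is a placeholder file name.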
The quality of the engine is based on the language or translation assets used to build the engine. Studies by TAUS have shown that quality is more important than quantity: engines built with “intelligently selected training data” generated higher BLEU scores than engines built with more generic data. The studies also indicated that a proactive approach to customising or adapting the engine with translation assets led to better quality results.
Translation assets are the best source of suitable training data for building KantanMT engines. They include:
Stock Training Data: KantanMT stock engines are collections of highly cleansed bi-lingual training data sets. Each data set shows the source corpora and the approximate number of words used to create the stock engine, which helps users judge its quality. These can be added to client data to produce much larger and more powerful engines. There are over a hundred different stock engines to choose from, including industry specific sets such as IT, Legal, Medical and Finance.
Stock engines are a good starting point if you have limited TMX (Translation Memory Exchange) files in the required domain, or if you would simply like to build bigger KantanMT engines.
Translation Memory Files: These are the best source of high quality training data since source and target texts are already aligned. Translation Memories used for previous translations in a similar domain will also have been verified for quality, so the engine’s quality will be representative of the Translation Memory quality. As the old expression goes, “garbage in, garbage out”; conversely, good quality Translation Memory files will yield a good quality Machine Translation engine. The TMX file format is the optimal format for use with KantanMT; however, text files can also be used.
Monolingual Translated Text Files: Monolingual text files are used to create language models for a KantanMT engine. Language models are used for word and phrase selection and have a direct impact on the fluency and recall of KantanMT engines. Translated monolingual training data should be uploaded alongside bi-lingual training data when building KantanMT engines.
Glossary Files: Terminology or glossary files can also be used as training material. Including a glossary improves terminology consistency and translation quality. Terminology files are uploaded with your ‘files to be translated’ and should be in TBX file format.
KantanISR™: Instant segment retraining technology allows users to input edited segments via the KantanISR editor. The segments then become training data and are stored in the KantanISR cache. The new segments are incorporated into the engine, avoiding the need to rebuild. As corrected data is included, the engine will improve in quality, becoming an even more powerful and productive KantanMT engine.
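To show what “aligned” source and target text looks like in a Translation Memory, here is a minimal sketch that reads segment pairs from TMX content with Python’s standard library. It assumes the usual TMX layout of `<tu>` translation units containing `<tuv xml:lang="…">` variants with a `<seg>` each; the function name, the sample data and the exact-match handling of language codes are illustrative assumptions, and this is not how KantanMT itself ingests files.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages (exact code match).
        if source_lang.lower() in segs and target_lang.lower() in segs:
            pairs.append((segs[source_lang.lower()], segs[target_lang.lower()]))
    return pairs


SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="demo" creationtoolversion="1" o-tmf="demo"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX, "en", "fr")
```

A quick review of the extracted pairs is one way to apply the “garbage in, garbage out” principle before any data is uploaded.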
Building your KantanMT engine can be a very rewarding process. While some time is needed to gather the best data for a domain specific engine, there are many ways to enhance your engine that require little effort.
For more information about preparing training data or engine re-training, please contact Kevin McCoy, KantanMT Success Coach.
There are some great events and webinars coming up over the next month, and KantanMT has put together a list of some noteworthy dates to add to the calendar.
KantanMT’s Aidan Collins, User Engagement Manager, will be attending tcworld on Thursday 7th November in Wiesbaden, Germany. Then towards the end of the month, Aidan will head to London, and present at the 35th ASLIB Translating and the Computer Conference. KantanMT are also a silver sponsor for this year’s ASLIB conference.
Nov 04 – 05, 2013
Workshop: Translation Project Management, Wiesbaden, Germany.
Angelika Zerfaß and Martin Beuster will be presenting a Translation Project Management (PM) and Localization PM workshop. This is geared towards current and future Project Managers in the localization and translation industry.
Nov 06 – 08, 2013
Event: tcworld 2013 – tekom trade fair, Rhein-Main-Hallen, Wiesbaden, Germany.
This is the largest global event for technical communication. Participating companies offer industrial, software and services for technical communication with a regional focus on Germany, Austria and Switzerland. The conference will cover topics on localization, internationalization, and globalization, management of technical communication, mobile documentation and content strategies. Contact: tekom, firstname.lastname@example.org
To set up a meeting with Aidan Collins, User Engagement Manager, email him directly at email@example.com or call him on +353 86 823 1767.
Nov 06 – 09, 2013
Event: 54th ATA Conference, San Antonio, Texas USA.
This is a great networking event for translators, project managers and industry professionals. The aim of the conference is to promote the professional development of translators and interpreters. There will be approx. 175 educational sessions in varying languages, specializations and levels. Contact: American Translators Association, firstname.lastname@example.org
Webinar: MemoQ – Getting Started guide, online.
An introductory webinar for translators who want to use MemoQ. Participants will learn how to create projects, translate using the MemoQ Editor and manage Translation Memories.
Webinar: Editing for Localization, online.
Katherine (Kit) Brown-Hoekstra is targeting Senior Technical Communicators and Content Managers with a webinar on editing for Localization.
Nov 15 – 16, 2013 (Expolingua International Fair, Nov 15 – 17)
Event: InDialog: Mapping the Field of Community Interpreting, Expolingua International Fair Berlin, Germany
This conference focuses on interpreting services and is aimed at government representatives, policy makers, service providers and anyone involved in the interpreting service workflow. InDialog is taking place in conjunction with the 26th EXPOLINGUA International Fair for Languages and Cultures. Contact: ICWE GmbH, email@example.com
Webinar: The Convergence Era: Translation As A Utility
Event: think! India, The Metropolitan Hotel & Spa, Delhi
think! India is a one day event with a regional focus on how to succeed in the expanding localization industry in India. The event is coordinated by GALA, the Globalization and Localization Association, and is part of a series of regional events, which bring language service providers (LSPs) together.
Event: 35th Translating and the Computer Conference, Paddington, London
This event covers technology and its influence on the localization and translation industry. It aims to bring translators, researchers and students in the translation and localization field together. It is also a great event for catching up on the latest computer aided translation (CAT) tools. KantanMT are sponsoring this event. Niamh Lacy and Aidan Collins will both be there to answer any questions about KantanMT’s technology.
To set up a meeting with Aidan or Niamh, email Niamhl@kantanmt.com or call her directly on +353 877526320
The United Nations (UN) is a big promoter of multilingualism, and this week is no exception. The UN Academic Impact (UNAI) and ELS Educational Services launched a student essay contest to promote international education and multilingualism. Entrants should submit an essay written in one of the six official languages of the UN (Arabic, Chinese, English, French, Russian and Spanish), as long as it is not their native tongue.
The theme of the contest, “Many Languages, One World”, focuses on multilingualism in a globalised world and supports communication between all global citizens. The UN is a global organisation that understands the challenges of making hefty volumes of content available in different languages.
In 2001, Kofi Annan, UN Secretary-General at the time, suggested there was a linguistic imbalance with the UN having a tendency towards English. The reasons behind the imbalance boiled down to high translation costs and a lack of resources.
Ten years later, in 2011, the World Intellectual Property Organization (WIPO), in collaboration with the UN, trained its Moses-based Machine Translation engine using approx. 11 years of translated UN documents (2000 – 2012), which were provided by the UN’s Documentation Division (DD). TAPTA4UN was born: a Statistical Machine Translation (SMT) engine for professional UN translators.
The UN had at first used Google Translate and Bing Translator to translate its publicly available documents, with good results. But as data from other organisations was added to those engines, the quality of the translated UN documents began to decrease.
The TAPTA engine, built with customised UN training data, provided a much higher quality Machine Translation result and higher BLEU scores compared with Google Translate. This paved the way for the ‘gText’ project, a global UN project born of the positive adoption of Machine Translation and tasked with integrating computer aided translation (CAT) tools into the document workflow.
KantanMT allows users to build a customised translation engine with training data that will be specific to their needs. KantanMT is continuing to offer a 14-day free trial to new members.