What Makes a Start-up Stand Out?

Irish Software Association

KantanMT was recently announced as a finalist in three of the eight categories of the Irish Software Awards 2016 (ISA 2016): ‘Emerging Company of the Year’, ‘Technology Innovation of the Year’ and ‘Outstanding Achievement in International Growth’. This is very exciting news for us, and our success has only been made possible by our brilliantly supportive clients and partners. The announcement led us to take a walk down memory lane and think about the things we did right over the past couple of years.

We would like to share some basic principles we followed as a company, which helped us succeed and made us one of the most recognisable brands, not only within the translation and localization industry, but also within the wider software services scene.

If you are in start-up mode, these pointers will help you achieve full commercial exploitation within the span of a year.

A Trip down Memory Lane: KantanMT in 2015

While chatting over a mouthful of mince pies, some tourtière and a few classy glasses of mulled wine this week, we at KantanMT were suddenly struck by the realisation that 2015 was perhaps one of the most sensational, successful and eventful years for the company! And the fact is, we can’t wait to start working on everything we have planned for 2016 – we are certain the new year will be even more exciting.


More Questions Answered on How MT Helps Improve Translation Productivity (Part II)

Part II

Welcome to Part II of the Q&A blog on How Machine Translation Helps Improve Translation Productivity. In case you missed the first part of our post, here’s a link to quickly have a look at what was covered.

Tony O’Dowd, Chief Architect of KantanMT.com, and Louise Faherty, Technical Project Manager, presented a webinar where they showed how LSPs (as well as enterprises) can improve the translation productivity of their language teams, manage post-editing effort estimations and easily schedule projects with powerful MT engines. For this section, we are also joined by Brian Coyle, Chief Commercial Officer at KantanMT, who came on board in October 2015 to strengthen KantanMT’s strategic vision.

We have provided a link to the slides used during the webinar below, along with a transcript of the Q&A session.

Please note that the answers below are not recorded verbatim and minor edits have been made to make the text more accessible.

Question: We are a mid-sized LSP, and we would like to know what benefits we would enjoy if we chose to work with KantanMT rather than building our own system from scratch. Wouldn’t the latter be cheaper?

Answer (Brian): Tony and Louise have mentioned a lot of the features available in KantanMT – indeed, the platform is very feature-rich and provides a great user experience. But on top of that, what really underpins KantanMT is its access to massive computing power, which is what Statistical Machine Translation requires in order to perform efficiently and quickly. KantanMT has a unique architecture that provides instant, on-demand access at scale.

As Louise Faherty mentioned, we are currently translating half a billion words per month and have 760 servers deployed. So if you were trying to develop something yourself, it would be hard to reach this level of proficiency in your MT. While no single LSP would need this total number of servers, to give you an idea of the cost involved, that kind of server deployment in a self-build environment would cost in the region of €25m.

We also offer 99.99% uptime with triple data-centre disaster recovery. It would be very difficult and costly to build this kind of performance yourself. Also, with this kind of performance at your clients’ disposal, you can offer customised MT for mission-critical web-based applications such as eCommerce sites.

Finally, a lot of planning, thought, development hours and research have gone into creating what we believe is the best user interface and platform for MT, with the best functionality set and extreme ease of integration in the marketplace. So it would be difficult to start on your own and build a system as robust and high quality as KantanMT.com.

Question: Could you also establish KantanNER rules to convert prices on an eCommerce websites?

Answer (Louise Faherty): Yes, absolutely! With KantanNER, you can establish rules to convert prices and so on. The only limitation is that exchange rates will of course fluctuate. One option would be to calculate that information dynamically – otherwise you would be looking at a fixed equation to convert those prices.
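To make the idea concrete, here is a minimal sketch of a price-conversion rule in Python. This is not KantanNER syntax – the regex, the fixed rate and the `convert_prices` helper are all illustrative assumptions; a real deployment would fetch the rate dynamically, as Louise notes.

```python
import re

# Illustrative fixed rate standing in for a live exchange-rate feed.
# In practice this would be fetched dynamically, since rates fluctuate.
EUR_TO_USD = 1.10

# Matches a euro price such as "€20.00" or "€5".
PRICE_PATTERN = re.compile(r"€(\d+(?:\.\d{2})?)")

def convert_prices(text: str, rate: float = EUR_TO_USD) -> str:
    """Replace each euro price in the text with its dollar equivalent."""
    def repl(match: re.Match) -> str:
        amount = float(match.group(1))
        return f"${amount * rate:.2f}"
    return PRICE_PATTERN.sub(repl, text)

print(convert_prices("Now only €20.00!"))  # Now only $22.00!
```

The same pattern-plus-rule shape generalises to other named entities (part numbers, dates, measurements) where the matched span is rewritten rather than translated.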

Question: My client does not want us to use MT because they have had a bad experience in the past with Bing Translate – what would convince them to use KantanMT? How will the output be different?

Answer (Tony O’Dowd): One of the things you have to recognise about the KantanMT platform is that you are using it to build customised machine translation engines. You are not creating generic engines (Bing Translate and Google Translate are generic engines). You would be building customised engines trained on the previous translations and glossaries your clients have provided, and you may also use some of our stock engines relevant to your client’s domain.

So when you combine these, you get an engine that mimics the translation style of your client. Instead of a generic translation engine, you are using one designed to mirror the terminology and stylistic requirements of your client. If you can achieve this through Machine Translation, there is far less need for post-editing – and this is precisely what drives translators away from generic, broad-based systems and towards customised ones. Clients and LSPs have tested both generic and customisable engines and found that cloud-based customisable MT adds value that is not available on free, non-customisable MT platforms.

End of Q/A session

The KantanMT Professional Services Team would once again like to thank you for all your questions during the webinar and for sending in your questions by email.

Have more burning questions? Or maybe you would like to see the brilliant platform translate in a live environment? No problem! Just send an email to demo@kantanmt.com and we will take care of the rest.

Want to stay informed about our new webinars? You can bookmark this page, or even better – sign up for our newsletter and ensure that you never miss a post!


Translation Machines in Sci-fi

Richard Brooks, CEO, K International

This blog post was written by Richard Brooks, CEO of the UK-based LSP K International, a company specialising in translation services for the legal industry, director of the Association of Language Companies, and a firm believer that life imitates art.


In science fiction, translation of the potentially infinite number of languages spoken by alien species presents a dilemma. How to deal with communication between interplanetary species without resorting to contrivance, or spending the first twenty minutes of each episode’s dialogue clumsily showing characters learning one another’s diphthongs?

The notion of a ‘universal translator’ emanated from Murray Leinster’s novella First Contact, published in 1945 (and clearly that isn’t the only debt Gene Roddenberry owes to Leinster). It’s a greatly helpful – borderline miraculous, in fact – convention of sci-fi: a technological solution to the language barrier, leaving more time for the actual narrative to unfold in one language, typically English.

With the incredible advancements in technology we’re witnessing at the moment such as Microsoft’s pilots of a Skype Translator and the industry leading work KantanMT is achieving in this area, are we seeing the beginnings of live translation – well ahead of Star Trek’s 22nd century deadline? In the meantime, let’s take a look at five of sci-fi’s finest translation machines, which beat anything real-life technology can offer – for now.


1. Star Trek: Universal Translator

An important part of Star Trek’s near-utopian vision of the future is the Universal Translator. Translating any language into another even while a person is speaking, this exceptionally handy tool means Starfleet craft in any quadrant of the galaxy can speak to new life and new civilizations without confusion.

Voiced by Star Trek creator Roddenberry’s widow Majel Barrett until her death in 2008, the development of a universal translator was, in the Trek universe, a portent of Earth’s cultures achieving universal peace. It’s difficult to imagine Google Translate having the same impact.

This convenient concept has often been copied, and occasionally parodied: in Futurama, everyone in the universe speaks English, rendering Professor Farnsworth’s one successful invention – a translation device – useless, as it merely translates English into a dead language: French!

2. The Hitchhiker’s Guide to the Galaxy: the Babel Fish

Some sci-fi plays with the concept in less serious ways. In Douglas Adams’ H2G2, to help Arthur Dent deal in some small way with anything that goes on around him, inserted into his ear is a Babel Fish, memorably described by the Guide as “small, yellow, leechlike and probably the oddest thing in the universe.”

The science (such as it is) behind the Babel Fish is that it can absorb the frequencies of outside speakers, and a translation is secreted by the fish into the hearer’s brain via his or her ear canal. In a witty reversal of Star Trek’s idealistic Federation, Adams reveals that, by allowing everyone to understand one another, the Babel Fish has actually caused more war than anything else in the universe.

3. Farscape: Translator microbes

In science fiction, as in reality, it is the individual idiosyncrasies of languages which are trickiest to master. When people in the UK from a hundred miles apart may speak different languages, not to mention a range of different dialects and accents, can auditory translation really be so smooth?

One series to acknowledge this is Farscape, where astronaut John Crichton is injected with bacteria-sized ‘translator microbes’, which are injected into – and colonise – his brain. The microbes work to make their host understand any spoken information in any language – except idioms are translated literally. This leads to a great deal of confusion for John, and opportunities for humour for the audience (all jokes are language, after all) – and also perhaps renders these microbes a more realistically-limited translator technology.

4. Doctor Who: The TARDIS’ Translation Circuit

As well as being telepathically linked with the Doctor, and granting the ability to travel to any time or place in history and the future, the TARDIS’ telepathic field is used to automatically translate what the Doctor and any companions hear or read into a language which they can understand.

While wonderfully convenient, the mind-meld involved does mean that the translation circuits won’t actually work when the Doctor is unconscious – hardly an unheard-of event. Also, because translations are time-specific, ancient civilisations won’t understand neologisms – and, neatly, the Romans have never heard the word ‘volcano’, because they haven’t yet lived to see an eruption.

5. Star Wars: C-3PO

Luke Skywalker is the ultimate sci-fi everyman: he is every bit as much in need of a guide to the universe he finds himself in as the viewing audience are. Reinforcing this are his guides, C-3PO and R2-D2, whom Luke needs with him – despite their obvious drawbacks as travelling companions – because C-3PO is programmed with millions of languages, everything from Ewok to R2’s bleeps and whistles.

When the franchise returns with The Force Awakens later this year (which most fans will rightly consider the fourth, rather than seventh, Star Wars movie), C-3PO’s translation abilities are sure to make him at least partially useful to have around.

The KantanMT team say a big Thank You to Richard for a very savvy post on translation machines in science fiction.

Richard (@RichardMBrooks) will join Tony O’Dowd, (@TonyODowd1) KantanMT Founder and Chief Architect alongside other Language industry heavyweights at the ATC Annual Conference in the Old Trafford Stadium on 24th and 25th September 2015. Register here to attend the conference. 


If you want to learn more about Machine Translation, send us an email (info@kantanmt.com) with your questions and we will be happy to answer them!

Q&A: Tips for Preparing Training Data for High Quality Machine Translation

Machine Translation (MT) has experienced a surge in popularity. However, achieving the right level of quality output can be challenging, even for the most expert MT engineers.

In the webinar ‘Tips for Preparing Training Data for High Quality MT’, KantanMT’s Founder and Chief Architect, Tony O’Dowd, and Selçuk Özcan, Co-founder of Transistent Language Automation Services, discussed how best to prepare training data for building high quality Statistical Machine Translation (SMT) engines. Here are their answers from the Q&A session.

Reading time: 5 mins

When it comes to Machine Translation, we know that quantity does not always equal quality. In your opinion, how many words will it take to build a fully functional engine?

Tony O’Dowd: Great question! Based on the entire community of Kantan users today, we have more than 7600 engines on our system. Those engines range from very small all the way up to very large. The biggest engines, which are in the eCommerce domain, contain about a billion words each.

If we exclude all the billion word MT engines so they don’t distort the results then the average size of a KantanMT engine today is approximately 5 million source words.

For example, if you look at our clients in the automotive industry, they have engines in and around 5 million source words, which are producing very high quality MT output.

How long does it take to build an engine of that size?

TOD: Again, using KantanMT.com as an example: we can build an MT engine at approximately 4 million words per hour, so a 5 million-word engine takes roughly 75 to 90 minutes to build. Compared with other MT providers in the industry, this is insanely fast.

This speed is possible because of our AWS cloud infrastructure. At the moment, we have 480 servers running the system. With such fast build times, our clients can retrain their engines more frequently, giving them higher levels of productivity and higher levels of quality output than most other systems. Read a client use case where speed had a positive impact on MT quality for eCommerce product descriptions (Netthandelen/Milengo case study).
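The build-time arithmetic above is simple enough to sketch as a back-of-envelope helper (the function and its defaults are ours, using the throughput figure Tony quotes):

```python
def build_time_hours(word_count: int, words_per_hour: int = 4_000_000) -> float:
    """Estimate engine build time from training-data size and build throughput."""
    return word_count / words_per_hour

# A typical 5 million-word engine at ~4M words/hour:
print(build_time_hours(5_000_000))  # 1.25
```

In other words, the quoted 75-90 minute build for a 5 million-word engine corresponds to a sustained throughput of roughly 4 million words per hour.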

How long does it take to accumulate that many words?

TOD: Most of our clients are able to deliver those words themselves, but our clients who don’t have 5 million source words will normally upload what they have and select one of our stock engines to help them reach a higher word count.

When we look at building an engine for a client, we look at the number of source words, but the key number for us is the number of unique words in an engine. For instance, if I want to have a high quality German engine in a narrow domain it might consist of 5 million source words. More importantly, the unique word count in that engine is going to be close to a million or slightly more than a million unique words.

If I have a high unique word count, I know the engine is going to know how to translate German correctly. Therefore, we don’t look at one word count, we look at a number of different word counts to achieve a high quality engine.
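The distinction between total and unique word counts is easy to illustrate in code. This is a simplified sketch using whitespace tokenisation and lowercasing – a production pipeline would use proper tokenisation, and this is not how KantanMT computes its counts:

```python
def word_counts(corpus: list[str]) -> tuple[int, int]:
    """Return (total words, unique words) for a list of source segments."""
    tokens = [tok.lower() for seg in corpus for tok in seg.split()]
    return len(tokens), len(set(tokens))

total, unique = word_counts(["the cat sat", "the dog sat down"])
print(total, unique)  # 7 5
```

A high unique-to-total ratio on a large corpus is the signal described above: broad vocabulary coverage within the domain.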

Another factor to consider is the level of inflection in the language, which is an indicator of how many words are needed. In order to educate and train the system, we need more usage examples of those inflected forms. Generally speaking, highly inflected languages require a lot more training data, so to build a Hungarian engine – Hungarian being an incredibly inflected language – you will need in excess of 2-3 times the average word count to get workable, high-quality output.

What kind of additional monolingual data do you have?

TOD: There are three ways we can help in sourcing suitable, relevant, high-quality monolingual data.

  1. We have a library of training data stock engines on KantanMT.com, which all include monolingual data in a variety of domains (Medical, IT, Financial etc.).
  2. In addition to stock engines, most of our clients upload their own monolingual data as PDF, DOCX or plain text files, and we normalise that data. We have an automatic process in place to cleanse the data and convert it into a suitable format for machine translation/machine learning.
  3. We also offer a spider service, where clients give us a list of domain related URLs where we can collect monolingual data. For example, we recently built a medical engine for a client in the US, in Mexican Spanish and we collected more than 150k medical terms from health service content, which provided a great boost to the quality and more importantly the fluency of the MT engine.

Selçuk Özcan: At Transistent, we collect data from open source projects and open source data. First, we define some filters to ensure that we have the relevant monolingual data from the open source tools, which also includes spidering techniques. We then create a total corpus with the monolingual data we collected, which is used for training the MT engine.

What is the difference between pre-normalisation and final normalisation?

SÖ: The normalisation process is related to the TMS (Translation Management System), CMS (Content Management System) and TM (Translation Memory) systems. Pre-normalisation is applied to the text extracted from your systems to assure that the job will be processed properly. Final normalisation is then applied to the MT output to ensure that content is successfully integrated into the systems.

Can pre-normalisation and final normalisation be applied to corpora from TMs?

SÖ: It is possible to implement normalisation rules to corpora from TM systems. You have to configure your rules depending on your TM tool. Each tool has its own identification and encoding features for tags, markups, non-translatable strings and attributes.
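A common form of TM-oriented normalisation is masking inline tags before MT and restoring them afterwards. The sketch below is our own illustration of the pre-/final-normalisation split, not Transistent’s or KantanMT’s implementation; the placeholder scheme and tag regex are assumptions:

```python
import re

# Matches simple inline markup such as <b> or </b> in a TM segment.
TAG = re.compile(r"<[^>]+>")

def pre_normalise(segment: str) -> tuple[str, list[str]]:
    """Replace tags with stable placeholders so MT treats them as opaque tokens."""
    tags = TAG.findall(segment)
    for i, tag in enumerate(tags):
        segment = segment.replace(tag, f"__TAG{i}__", 1)
    return segment, tags

def final_normalise(segment: str, tags: list[str]) -> str:
    """Restore the original tags in the MT output."""
    for i, tag in enumerate(tags):
        segment = segment.replace(f"__TAG{i}__", tag, 1)
    return segment

masked, tags = pre_normalise("Click <b>Save</b> now")
print(masked)                          # Click __TAG0__Save__TAG1__ now
print(final_normalise(masked, tags))   # Click <b>Save</b> now
```

As noted in the answer, the real rules have to be configured per TM tool, since each tool encodes tags, markups and non-translatable strings differently.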

How many words is considered too many in a long segment?

TOD: As part of our data cleansing policy, any data uploaded to a Kantan engine goes through 12 phases of data cleansing. Only segments that pass those 12 phases are included in the engine training. That may seem like a very harsh regime, but it is in place for a very good reason.

At KantanMT, the three things we look for in training data are:

  1. Quality
  2. Relevance
  3. Quantity

We make sure that all the data you upload is very clean from a structural and linguistic point of view before we include it in your engine. If the training data fails any of those 12 steps, it will be rejected. For example, one phase is to check for long segments. By default, any segments with more than 40 words are rejected. This can be changed depending on the language combination and domain, but the default is 40 words or 40 tokens.
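The long-segment check described above amounts to a simple length filter. This sketch uses whitespace tokenisation and the stated 40-token default; it is a simplification of one of the 12 cleansing phases, not the actual KantanMT implementation:

```python
MAX_TOKENS = 40  # default threshold; adjustable per language pair and domain

def filter_long_segments(segments: list[str], max_tokens: int = MAX_TOKENS) -> list[str]:
    """Keep only segments at or under the token limit; reject the rest."""
    return [s for s in segments if len(s.split()) <= max_tokens]

segs = ["short segment", " ".join(["word"] * 50)]
print(filter_long_segments(segs))  # ['short segment']
```

Overly long segments are rejected because they tend to produce poor word alignments during SMT training, which degrades engine quality.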

SÖ: As Tony mentioned, it also depends on the language pair. Nevertheless, you may also want to define the threshold value according to the dynamics of your system, i.e. data, domain, required target quality and so on. We usually split segments at 40-45 words.

How long does it take to normalise the data?

SÖ: The time frame for normalising data depends on a number of factors, including the language pair, the differences between the linguistic structures you are working with, how clean the data is, and the source of the data. If you have lots of formulas or non-standard characters, it will take longer to normalise that data.

For Turkish it might take an average of 10-15 days to normalise an average of 10 million words. Of course, this depends on the size of the team involved and the volume of data to be processed.

TOD: The time required to normalise data is very much data-driven. A rule of thumb in the Kantan Professional Services Team: standard text consisting mostly of words – text from a book, online help or perhaps user interface text, where the predominant token is a word – is normalised very quickly, because there are no mixed tokens in the data set, only words.

However, if you have numerical data, scientific formulas or product specifications such as measurements with a lot of part numbers, there is a high diversity of individual tokens as opposed to simple words. This type of data takes a little longer to normalise, because you have to instruct the engine, which you can do using the GENTRY programming language and Named Entity Recognition (NER) software.

We have GENTRY and NER built into KantanMT.com, so we can educate the engine to recognise those tokens. This is important because if the engine doesn’t recognise the data, it can’t handle it during the translation phase.

The more diverse the tokens in your input, the longer the normalisation process takes; conversely, the less diverse the tokens, the quicker the data can be processed. If it’s just words, the system can handle this automatically.
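A rough proxy for this token diversity is the share of tokens that are not plain words. The classification heuristic below (anything containing a digit or other non-alphabetic character) is our own illustration, not a KantanMT metric:

```python
def non_word_ratio(segments: list[str]) -> float:
    """Fraction of tokens containing digits or other non-alphabetic characters."""
    tokens = [tok for seg in segments for tok in seg.split()]
    if not tokens:
        return 0.0
    mixed = [tok for tok in tokens if not tok.isalpha()]
    return len(mixed) / len(tokens)

# "3000", "no." and "X-200/B" count as mixed tokens: 3 of 9.
print(non_word_ratio(["the pump runs at 3000 rpm", "part no. X-200/B"]))
```

A corpus scoring high on such a measure would, by the rule of thumb above, be flagged for a longer normalisation effort.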

We use this rule of thumb when working with clients to estimate how long it will take to build their engines, as we need to be able to give them some sense of a schedule around building an actual MT engine.

What volume of words would you suggest for a good Turkish engine?

SÖ: It makes no sense to work on a Turkish MT system if you do not have at least a million words of bilingual data and 3 million words of monolingual data. Even then, you will have to work more on analyses, testing procedures and rule sets. Ideally, you will have approx. 10 million words of bilingual data. It’s the basic equation of SMT engine training: the more data you have, the higher the quality you achieve.

How long does it take to build an engine for Turkish?

SÖ: It depends on the language pair and the field of expertise or domain. Things may be harder if you are working on a language pair whose two languages have very different linguistic structures, such as English and Turkish. However, it’s not impossible to build a mature MT system for such a pair; you will just need to spend longer on it. Another parameter that affects the time required is the quality of the data to be used for the intended MT system. It is hard to give a specific estimate without looking at the data, but in general it will probably take 2 to 6 months to reach the intended production system.

View the Slide Deck

Watch the webinar recording:

To learn more about the KantanMT platform, contact us (demo@kantanmt.com) now for a platform demonstration.

LocWorld28 Berlin – Kindle Voyage up for Grabs from KantanMT

It’s that time of year again, and the European edition of the LocWorld conference kicks off this week, running 3-5 June at the Maritim Hotel in Berlin, Germany. KantanMT will be exhibiting at Stand #17.

For anyone unfamiliar with the conference, it is considered ‘the marketplace of the language industry’ and is hosted by Multilingual magazine and the Localization Institute. Its purpose is to be the place where everyone in the language industry can get together to network, build business relationships and learn from industry peers.

If you haven’t registered already, check out the full program and you will see a great line up of sessions, roundtables and workshops to suit all areas within the localization industry.

The KantanMT team have spent the last couple of weeks preparing for the LocWorld conference, and we have some great giveaways planned. As part of the LocWorld prize draw, visitors to the KantanMT stand (#17) can drop their business cards in the big blue bowl for a chance to win a brand new Kindle Voyage. The Voyage is Amazon’s thinnest Kindle yet and has a high-resolution 300 ppi display and a new adaptive front light.


We only have one Kindle Voyage to give away, but fret not, as there is plenty of KantanMT.com branded merchandise up for grabs, including polo-shirts, 4GB USB flash drives and pens.

Tony O’Dowd, KantanMT’s Founder and Chief Architect, will be available to meet attendees interested in learning about machine translation. Stop by Stand #17, or send an email to info@kantanmt.com to arrange a one-on-one with Tony.

Tony will also be speaking at two sessions:

And when it’s all over, if anyone has time to see the city for the weekend – or even just for a few hours on Friday before returning home – here is a list of the top 10 must-see Berlin sights!

  1. Berlin Wall
  2. Holocaust Memorial
  3. The Berlin Zoological Garden
  4. Reichstag
  5. Brandenburg Gate
  6. Museum Island
  7. Berlin Cathedral
  8. Checkpoint Charlie
  9. Humboldt Universitaet
  10. Berliner Fernsehturm (Berlin TV Tower)

For all those walkers, if the weather is good, these sights can be seen via the Sandemans free walking tour, which departs daily from East Berlin at 11 AM and 2 PM at the Brandenburg Gate.

Finally, anyone interested in learning more about KantanMT who has not registered for the conference can contact us to get a FREE guest pass to the exhibition hall for a couple of hours to meet the KantanMT team.

We hope to see you in Berlin!

Scalability or Quality – Can we have both?

The ‘quality debate’ is old news, and the conversation – now heavily influenced by ‘big data’ and ‘cloud computing’ – has moved on. Instead, it focuses on the ability to scale translation jobs quickly and efficiently to meet real-time demands.

Translation buyers expect a system or workflow that provides high-quality, fit-for-purpose translations. Because of this, Language Service Providers (LSPs) have worked tirelessly, perfecting their systems and orchestrating the use of Translation Memories (TM) within well-managed workflows, alongside the professionalization of the translator industry – quality is now a given in the buyer’s eyes.

What is the translation buyers’ biggest challenge?

The translation buyers’ biggest challenge now is scale – scaling their processes, workflows and supply chains. Of course, the caveat is that they want scale without jeopardizing quality! They need systems that are responsive and transparent, and that scale gracefully in step with their corporate growth and language expansion strategy.

Scale with quality! One without the other is as useless as a wind-farm without wind!

What makes machine translation better than other processes? Looking past the obvious automation of the localization workflow, the one thing MT offers above all other translation methods is the ability to combine automation and scalability.

KantanMT recognizes this and has developed a number of key technologies to accelerate the speed of on-demand MT engines without compromising quality.

  • KantanAutoScale™ is an additional divide and conquer feature that lets KantanMT users distribute their translation jobs across multiple servers running in the cloud.
  • Engine Optimization technology means KantanMT engines now operate 5-10 times faster, reducing the amount of memory and CPU power needed, so MT jobs can be processed faster and more efficiently when using features like KantanAutoScale.
  • API optimization: KantanMT engineers went back to basics, reviewing and refining the system, enabling users to achieve performance improvements of 50-100% in translation speed. This meant translation jobs that took five hours can now be completed in less than one hour.
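The divide-and-conquer idea behind KantanAutoScale can be illustrated with a generic fan-out sketch. The `translate_chunk` worker below is a stand-in for a call to a remote MT server, not the KantanMT API; only the split/parallelise/reassemble shape is the point:

```python
from concurrent.futures import ThreadPoolExecutor

def translate_chunk(chunk: list[str]) -> list[str]:
    """Stand-in for a call to a remote MT server."""
    return [f"<translated:{seg}>" for seg in chunk]

def translate_scaled(segments: list[str], workers: int = 4) -> list[str]:
    """Split a job into chunks, translate them in parallel, reassemble in order."""
    size = max(1, len(segments) // workers)
    chunks = [segments[i:i + size] for i in range(0, len(segments), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(translate_chunk, chunks))  # preserves chunk order
    return [seg for chunk in results for seg in chunk]

print(translate_scaled(["a", "b", "c", "d"], workers=2))
```

Because `map` preserves input order, the reassembled document matches the original segment order even though chunks complete at different times.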

Scalability is the key to advancement in machine translation, and considering the speed at which people are creating and digesting content we need to be able to provide true MT scalability to all language pairs for all content.

KantanMT’s Tony O’Dowd and bmmt’s Maxim Khalilov will discuss the scalability challenge and more, in a free webinar for translation buyers; 5 Challenges of Scaling Localization Workflows in the 21st Century on Thursday November 20th at 4pm GMT, 5pm CET, 8am PST.


To hear more about optimizing or improving the scalability of your engine please contact Louise Irwin (louisei@kantanmt.com).