NMT has arrived – can we now agree on what “quality” means?

In a blog post I wrote in May 2018, I commented that, when it comes to IT, nothing develops faster than new technology (“In Technology, Things Never Slow Down – Quite the Opposite”). It seems that once a new technology is born, its exponential growth becomes unstoppable. And that, I would argue, is exactly where we are with machine translation. We are watching amazing exponential growth, with Neural Machine Translation (NMT) now the new, hot kid on the block.

Undoubtedly, NMT is now mainstream and LSPs are rapidly adopting it as a viable translation technology option. Machine Translation has followed the exponential growth path of other technological innovations. Amazingly, Google managed to condense the development of its NMT solution into four years. That compares with the 15 years it spent developing and refining statistical machine translation (SMT). In November 2016, Google gave up on developing SMT and unveiled its newly developed NMT system (GNMT). Google announced that its new NMT technology was already able to translate eight language pairs. Within 12 months GNMT was supporting another 90 language pairs. The exponential growth was astonishing: the new kid on the MT block had arrived and was soon centre stage.


A Transformer Neural Network

 Image Courtesy of: “The Illustrated Transformer” by Jay Alammar

In tandem with this rapid NMT development, relatively cheap, super-powerful computers have emerged, coupled with practically unlimited storage capacity thanks to the “Cloud” and its more recent iteration, “the Edge”. Together these opened the way for most language companies to adopt complex NMT technology as a standard customer service offer. The language industry can now use automated translation technology to process even more content, into more languages, faster than ever before.

In 2018, a www.Slator.com report revealed that in just 12 months the number of LSPs offering NMT as a service had quadrupled from a low base of five to 20. Perhaps more pertinent than the numbers is the status of the LSPs adopting the technology as their go-to language solution for bulk translation projects. Significantly too, NMT is now the standard solution of major global entities in both the private sector (e.g. Amazon, Google, Microsoft) and the public sector (e.g. the EU).

So, as LSPs move speedily to integrate Machine Translation into their service offerings and workflows, it’s perhaps a good time to look at the ongoing debate about “quality” in NMT production. Quality has always been a hot topic, fiercely debated within the industry. Whole conferences, often in exotic locations, have been given over to passionate debates trying to define the “gold standard” for quality. But the debate has one inherent flaw: there’s no consensus on what constitutes “quality”.

“Central to these issues is the acceptance that there is no longer a single ‘gold standard’ measure of quality, such that the situation in which MT is deployed needs to be borne in mind, especially with respect to the expected ‘shelf-life’ of the translation itself.”

Source: Andy Way, “Quality expectations of machine translation”

The conundrum is this: what exactly constitutes “quality”, how is it measured, and who decides which universal qualitative measure is acceptable? As with all empirical matters, it is not surprising that this vigorous debate is still ongoing, fuelled by the growing use of NMT. Slator.com’s 2018 NMT report shone a light on the debate and gave the views of different industry LSP players on this complicated subject.

The current discussion primarily focuses on the efficacy of technical quality testing using such standards as the BLEU score. Many do not see BLEU as a suitable measuring standard. Its use carried over simply because it pre-dated the emergence of NMT. Again, to quote Andy Way:

“When it comes to NMT, however, the improvements over predecessor MT— not to mention the differences in design (i.e. NMT usually runs on character-level encoder-decoder systems) — makes BLEU even less suited to quantifying output quality.”

Indeed, there are those in the industry who would declare that BLEU has now become obsolete and should be replaced by measures such as BEER, chrF and characTER (see Slator.com). These metrics all measure quality at a more granular, character level, unlike BLEU, which compares word sequences. Then again, there are those who argue that the ultimate assessment of quality can only be done by a human: a linguist with subject-matter expertise. Underlying this belief is the assertion that there needs to be an accepted gold standard within the industry, an “absolute” standard that can only be achieved by a human. And it is this “absolutist” stance that is the Achilles heel of their argument.
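To make the word-level versus character-level distinction concrete, here is a minimal Python sketch. It is not the official BLEU or chrF implementation: `word_precision` is only the word n-gram precision at the heart of BLEU (no brevity penalty, no averaging over n-gram orders), and `char_fscore` is only a single-order character n-gram F-score in the spirit of chrF. The function names, the example sentences and the default parameters are all assumptions chosen for illustration.

```python
from collections import Counter


def ngrams(items, n):
    """Count every run of n consecutive items (words or characters)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def overlap(hyp_counts, ref_counts):
    """Number of n-grams shared by hypothesis and reference (clipped counts)."""
    return sum((hyp_counts & ref_counts).values())


def word_precision(hypothesis, reference, n=2):
    """Simplified BLEU-style score: share of hypothesis word n-grams found in the reference."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    return overlap(hyp, ref) / total if total else 0.0


def char_fscore(hypothesis, reference, n=3, beta=2.0):
    """Simplified chrF-style score: F-score over character n-grams, spaces ignored."""
    hyp = ngrams(hypothesis.replace(" ", ""), n)
    ref = ngrams(reference.replace(" ", ""), n)
    matches = overlap(hyp, ref)
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp.values())
    recall = matches / sum(ref.values())
    # beta > 1 weights recall more heavily than precision
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Hypothetical example: the MT output gets two plurals wrong, nothing else.
reference = "the translators reviewed the contracts"
hypothesis = "the translator reviewed the contract"

print("word bigram precision:", word_precision(hypothesis, reference))
print("character trigram F-score:", round(char_fscore(hypothesis, reference), 3))
```

In this toy case the two missing plural endings break three of the four word bigrams, so the word-level score drops sharply, while the character-level score stays high because most of each word still matches. That is the kind of granularity the supporters of character-based metrics have in mind.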

The advent of machine translation and post-editing has focused attention on the very nature of quality: Is it proximity to a “gold standard” of perfection or is it characteristic of a product that simply serves its purpose well enough to satisfy the needs of the consumer? In other words, is quality something that should be measured and judged in absolute terms or in relative terms?

Source: The Thorny Problem of Translation and Interpreting Quality: Geoffrey S. Koby, Kent State University, Kent, Ohio, USA; Isabel Lacruz, Kent State University, Kent, Ohio, USA

The same academics state that the “absolute” stance holds that some requirements are always understood, absolute and constant, and can therefore remain unstated. The “relative” stance, by contrast, holds that the best practice in general is to state all requirements explicitly as specifications, because they can vary from project to project. At both ends of the specifications axis, some degree of accuracy and fluency is required. The difference is whether the minimum levels depend on audience or purpose or both.


Image Courtesy of: Pickering Electronics

It is the last sentence above that defines the dichotomy in the debate. Absolutists fly the gold standard flag and are unforgiving of anything less. The relative stance says no: it’s the intended audience that defines the required quality standard. The argument is best explained by the following two examples:

  1. You are a production manager in a law firm. You have been told you need a million words of Discovery Files translated from German into English within two weeks. Only then can the firm begin to understand the evidence placed before them. As a production manager you could contact an LSP and ask them to provide an army of expert legal translators who can deal with the language pair. Good luck with that, and with the cost. Or you could ask the question: what are we trying to achieve with this translation?

The answer comes back that the lawyers need the “gist” of what the files contain; at that point they will decide which files are relevant and have them fully translated. The audience is not demanding a gold standard; they want a basic quality that will allow them to decide on the next translation move. This is undoubtedly a job for NMT: the technology will satisfy the quality required, will do it within the narrow time frame, and will do it without bankrupting the firm. Case closed: NMT wins hands down.

  2. You are the production manager in a company that supplies mission-critical technology. Perhaps a medical firm that supplies an IVD product that must be implanted successfully to ensure the survival of patients. The product’s instructions for use cannot contain any ambiguity or error. There need to be zero errors (although, arguably, that is not possible). Certainly, this is not a case where you would use NMT alone to produce a raw (unedited) translation. However, you could use both NMT (because of volumes and time pressures) and one or more linguists with the subject-matter expertise. In all seriousness, no one would currently argue that NMT should be used alone. There would be a need for strong human input. It’s a no-brainer!

So those who argue that there is only one gold standard are wrong. The standard is decided by the audience of the translation. If a company wants real-time translations, it should be willing to accept a lesser standard of quality. More and more global companies are looking for this service, so much so that LSPs are now embedding NMT into their workflows. Some PMs are also finding their job spec changing as they become initiators and monitors of a constant flow of words translated in real time and displayed online. Alongside them are PMs working with traditional human translators, or with a combination of human and machine translation.

The translation paradigm has changed dramatically in just a few years. At the end of the day, companies with an eye on their bottom line will specify their own quality standard. For them, this will be dictated by the audience the translation is aimed at. If it is a large-volume, time-sensitive internal document, the customer will likely lower their quality needs and accept raw NMT. If the audience is more important, they might opt for both NMT and human editing.

Other companies, as outlined above, will declare zero tolerance of errors and demand the human-only track. One thing is clear: it is no longer a binary choice of human versus machine. The new paradigm sees the need for both. The production model is evolving, as are the job functions at all levels within LSPs. From finance through to production, all will need to fully understand how the paradigm has shifted and restructure accordingly. To snooze is to lose.

Aidan Collins, Marketing Manager at KantanMT
