Nikos Katris submitted his thesis, ‘Evaluation of Two Statistical Machine Translation Systems within a Greek-English Cross-Language Information Retrieval Architecture’, to the University of Limerick in October 2015. In his research, he compared the results of KantanMT with those of the Moses system for information retrieval.
Nikos was supervised by Dr Richard Sutcliffe at the University of Limerick’s College of Science and Engineering Department of Computer Science and Information Systems (CSIS). Nikos kindly agreed to discuss his research in an interview. The University of Limerick and the Localisation Research Centre are KantanMT’s academic partners.
KantanMT: What made you decide to go for a topic related to Cross-Language Information Retrieval (CLIR)?
During my studies at the University of Limerick, I was introduced to the fascinating world of Natural Language Processing by my supervisor, Dr. Richard Sutcliffe, and I knew I wanted to pursue this domain further. This, combined with my background as a professional translator, made me want to focus on machine translation (MT) as the topic for my thesis. However, I didn’t want to go down the usual route and approach MT as a technology used solely to translate content for consumption by human end users. I wanted to look into other types of MT applications, and that’s how I stumbled upon the field of cross-language information retrieval, which is kind of funny if we consider that all of us use information retrieval technology several times during a typical day.
KantanMT: Can you describe briefly how CLIR works?
Put simply, information retrieval (IR) means asking a system a question and having it retrieve the relevant documents that “answer” that query. Google is hands-down the most popular IR application. Whenever we want to find out about the latest Game of Thrones spoilers, we type a query or a sentence into the search box, and Google uses the keywords it contains to retrieve relevant web pages.
However, this process is usually monolingual. We type our queries in one language and the relevant documents are also written in the same language. Cross-language information retrieval (CLIR) uses MT technology to retrieve relevant documents in a language different from the language of the query. For example, an English-speaking medical doctor can use such a system to retrieve relevant research papers for a specific condition written in German by typing a query in his native language, i.e. English.
KantanMT: What were the challenges with Greek language in CLIR?
Greek is not one of the most researched languages in CLIR. However, certain characteristics of the language are the main source of problems for any NLP-related task. Greek is a language with complex inflectional and conjugational behaviour, using a variety of endings, suffixes and prefixes as well as accents. As a result, a noun can have 4-7 forms and a verb can have up to 250. This makes the computational processing of Modern Greek a very difficult task compared to English.
Dictionary-based approaches face numerous problems, because the dictionaries usually have limited coverage due to the morphological complexity of the language. Stemming techniques can improve this, although they increase the level of uncertainty, since more words with different meanings are conflated into the same stem. Therefore, there are two levels of uncertainty that a CLIR system involving Greek must deal with: one caused by the stemming process and one due to the natural ambiguity of the language (Kotsonis et al., 2008).
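To make the conflation problem concrete, here is a minimal Python sketch. It is not the stemmer used in the thesis or in any Greek NLP library: the suffix list and the `toy_stem` function are hypothetical and exist only for illustration. It shows how removing accents and stripping endings maps νόμος (law) and νομός (prefecture), two words with different meanings, onto the same stem.

```python
import unicodedata

# Hypothetical, deliberately tiny list of noun endings, longest first.
# A real Greek stemmer uses far larger rule sets.
SUFFIXES = ("ους", "ος", "ου", "οι", "ων", "ο", "ε")


def strip_accents(word):
    """Remove accent marks by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(word):
    """Lowercase, remove accents, then strip the first matching ending."""
    word = strip_accents(word.lower())
    for suffix in SUFFIXES:
        # Keep at least two characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for form in ("νόμος", "νόμου", "νόμοι", "νομός"):
        print(form, "->", toy_stem(form))
```

The inflected forms of νόμος collapse to one stem, which helps recall, but νομός lands on the same stem as well, which is exactly the added uncertainty described above.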
KantanMT: What domain did you choose for your research and where did you get your training data?
The most essential resource was the document collection on which the search for relevant documents was conducted. The domain of the document collection determined the domain of the corpora used for training the MT systems. The document collection chosen was the OHSUMED database, a subset of MEDLINE, the online medical information database. OHSUMED contains 348,566 references consisting of titles and/or abstracts from 270 medical journals.
For the MT part of the experiment, I used the EMEA corpus, which is provided by the European Medicines Agency (EMA), for building the translation and language models. For the tuning process, I used the QTLP English-Greek Corpus for the medical domain, which was downloaded from the META-SHARE repository. Finally, the European Centre for Disease Prevention and Control (ECDC) Greek-English Translation Memory subcorpus was used for the calculation of the BLEU score. The ECDC corpus is a translation memory provided by a European Union (EU) agency.
KantanMT: Can you describe briefly the process you took for the evaluation?
The set of queries from the OHSUMED collection was translated into Greek by an independent medical doctor. The Greek queries were then translated back into English using KantanMT and Moses. These machine-translated queries were fed into Apache Solr, the IR tool used for retrieving the relevant documents from the English document collection.
The results from the two sets of translated queries (one set for each MT system) were then compared to the gold standard, i.e. the documents which are truly relevant, to calculate the precision, recall and F-measure values for each system. Moreover, we examined whether there is a correlation between translation quality and IR performance, using the BLEU metric as the quality score for the output of each MT system. Finally, we also calculated the precision, recall and F-measure values produced when feeding the IR system with the initial English human-produced queries.
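As a rough sketch of the retrieval step, the snippet below builds a request URL for Solr's standard `/select` handler from a machine-translated query. The core name `ohsumed`, the default field `abstract`, the local host address and the helper name `build_solr_query_url` are all assumptions made for illustration; the interview does not describe how Solr was configured in the thesis.

```python
from urllib.parse import urlencode


def build_solr_query_url(
    query_text,
    base_url="http://localhost:8983/solr",  # assumed local Solr instance
    core="ohsumed",                          # hypothetical core name
    default_field="abstract",                # hypothetical field to search
    rows=10,
):
    """Return a Solr /select URL that searches default_field for query_text."""
    params = {
        "q": query_text,       # the (machine-translated) English query
        "df": default_field,   # default search field
        "fl": "id,score",      # return only document ids and scores
        "rows": rows,          # number of results to return
        "wt": "json",          # response format
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_solr_query_url("chronic fatigue syndrome treatment", rows=5))
```

The same function would be called once per query and per MT system, so that the two result lists can be compared against the relevance judgements.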
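For readers unfamiliar with these measures, the following Python sketch shows how precision, recall and F-measure can be computed for a single query. It assumes set-based evaluation and the balanced F-measure (F1); the interview does not say which variant was used in the thesis, and the document ids in the example are made up.

```python
def precision_recall_f1(retrieved, relevant):
    """Compare retrieved document ids with the gold-standard relevant ids."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up ids: 4 documents retrieved, 3 truly relevant, 2 in common.
    p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Running this for every query, once per MT system and once for the original English queries, gives the figures that are then averaged and compared.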
KantanMT: Did you generate any unusual or interesting results that you could share?
The aim of this paper was to evaluate KantanMT and Moses as two statistical machine translation applications within a cross-language information retrieval architecture. KantanMT proved to be slightly better than Moses in the sense that queries translated with KantanMT provided more relevant documents compared to the ones translated with Moses. Moreover, our experiments showed that the use of SMT in CLIR produces quite good results compared to a monolingual retrieval scenario.
The key result of this research, however, is that the BLEU score of an SMT system does not necessarily correlate with the results it generates in a CLIR system. In the CLIR experiment, queries translated by KantanMT retrieved more relevant documents than those translated by Moses. However, in the BLEU evaluation Moses achieved better results than KantanMT. That said, it would be interesting to evaluate translation quality using other metrics as well.
KantanMT: Has this research helped your career and what are you working on now?
I work as a freelance translator, localizer and CAT tools trainer. My Master’s degree helped me to expand my knowledge of the localization industry and to improve my skills as a translator. My research on CLIR, however, did open new doors for me and my long-term academic plans will most certainly focus on Natural Language Processing.
About Nikolaos Katris
Nikos Katris studied translation in the Department of Foreign Languages, Translation and Interpretation of the Ionian University, and holds a Master of Science in Multilingual Computing and Localization from the University of Limerick. He works as a freelance translator and localizer and teaches CAT tools at the Centre of Translation, Interpretation and International Relations of City Unity College in Athens. You can find Nikos on LinkedIn and Twitter.