14 April 2016

The Future of Language Resources for Machine Translation (LR4MT)

In a recent brief survey of language service providers (LSPs), LT-Innovate asked the translation industry how it sees the future of language data availability specifically for machine translation. The results provide food for thought when it comes to planning for the improved usability of digital text resources in the years ahead. It looks as if new developments in machine translation (MT) technology will work in parallel with a growing need for the right data.

First, statistical machine translation is clearly on most LSPs’ radar screens. Hard data on the actual size of the user market for MT systems is impossible to calculate today, as is information on who uses which of the free or paid services available online in their everyday work. But everyone who responded to our survey claims they will be “using” MT in the next 2 to 3 years.

Preparing for this transition is therefore vital for the nascent language data resource sector.

Overall, 30% of our LSP respondents reckon that the data they will need to prime their MT engines will come from their clients, 78% will use their in-house translation memories and similar assets, and 70% will try to find third-party sources from outside their immediate business nexus.

Some 15% of them would be prepared to buy such data, 70% will crawl the web, and a total of 83% expect that more free resources will become available.


Judging by our current findings from mapping publicly available LR4MT in Europe, the chances that they will find the relevant resources easily look relatively small. Sharing language data is not a high-visibility phenomenon so far.

However, 39% said they did not have the necessary engineering resources in-house to transform the content they might find into viable MT data. This suggests there could be a small market for cleaning and aligning language data harvested from the web or from well-known repositories.
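
To make that kind of service concrete, here is a purely illustrative sketch (not something any respondent described) of the most basic cleaning step such an offering might automate, assuming the harvested data has already been split into candidate sentence pairs; the function name and thresholds are hypothetical.

    # Illustrative sketch only: basic cleaning of harvested parallel text before MT training.
    # Assumes sentence-split source/target pairs; function name and thresholds are hypothetical.

    def clean_parallel(pairs, max_len=100, max_ratio=3.0):
        """Drop empty, overlong, badly length-mismatched and duplicate sentence pairs."""
        seen = set()
        cleaned = []
        for src, tgt in pairs:
            src, tgt = src.strip(), tgt.strip()
            if not src or not tgt:
                continue  # empty on either side
            s_len, t_len = len(src.split()), len(tgt.split())
            if s_len > max_len or t_len > max_len:
                continue  # overlong segments are of little use for training
            if max(s_len, t_len) / max(min(s_len, t_len), 1) > max_ratio:
                continue  # suspicious length ratio, probably misaligned
            if (src, tgt) in seen:
                continue  # exact duplicate
            seen.add((src, tgt))
            cleaned.append((src, tgt))
        return cleaned

    pairs = [("Hello world .", "Bonjour le monde ."),
             ("Hello world .", "Bonjour le monde ."),  # duplicate
             ("Terms and conditions", "")]             # empty target side
    print(clean_parallel(pairs))  # -> [('Hello world .', 'Bonjour le monde .')]

Real cleaning pipelines go much further (encoding repair, language identification, proper sentence alignment), which is precisely why the engineering gap reported above matters.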

When it comes to the desired quality criteria for usable language resources, by far the most important criterion (84%) was, unsurprisingly, domain relevance. Indeed, small customer- or domain-specific language models for MT are typically considered to outperform general models by a very large factor. This suggests that some serious effort will need to go into pinpointing domain relevance in any language resource supply platform, rather than relying, say, on volume as a virtue in itself.
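
By way of illustration only, one crude way a resource supply platform could rank candidate data for domain relevance is to score it against a small in-domain seed corpus. The term-overlap measure below is an assumption made for this sketch, not a method proposed by respondents; production selection pipelines typically rely on stronger signals such as language-model cross-entropy.

    # Illustrative sketch only: scoring candidate sentences for domain relevance against
    # a small in-domain seed corpus. The term-overlap measure is an assumption made for
    # this sketch; real selection pipelines often use stronger signals.
    from collections import Counter

    def domain_score(candidate_sentences, seed_sentences, top_k=200):
        """Share of each candidate's tokens found among the seed corpus's most frequent terms."""
        seed_counts = Counter(tok.lower() for s in seed_sentences for tok in s.split())
        domain_terms = {tok for tok, _ in seed_counts.most_common(top_k)}
        scores = []
        for sent in candidate_sentences:
            toks = [t.lower() for t in sent.split()]
            overlap = sum(1 for t in toks if t in domain_terms)
            scores.append(overlap / max(len(toks), 1))
        return scores

    seed = ["the patient received a 5 mg dose", "adverse reactions were reported"]
    candidates = ["dose adjustments for the patient", "football results from last night"]
    print(domain_score(candidates, seed))  # the in-domain sentence scores markedly higher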


Appropriately, there was also considerable emphasis on leveraging the semantic characteristics of the language data needed for MT. Semantically enriched data, as proposed by such EC-funded projects as LIDER (a Linked Open Data-based ecosystem of free, interlinked, and semantically interoperable language resources) and BabelNet (a multilingual dictionary underpinned by a rich semantic network), clearly has potential as a future resource. We therefore need to examine the fastest and most efficient way to transform this potential technology stack into an operational reality. We can also expect to hear much more from Coreon about multilingual knowledge management as a fundamental business tool.

So what can we expect for a more effective and efficient deployment of LR4MT? In general, respondents are looking towards new hybrid models of machine translation involving the integration of transfer/grammar and semantic modules into the plain vanilla statistical model as it exists today. This suggests that language technology and data resource quality will need to evolve closely in parallel.

They also expect deep learning to be applied to MT, together with such processes as continuous retraining during the MT post-editing phase. In other words, we are just at the beginning of a new cycle of more artificial intelligence-driven MT systems that will be able to learn as they go and extract even more value from relevant data resources. But as one respondent pointed out, the ultimate litmus test for the value of translation resource data is whether or not the original translations it contains are any good. Tools to tame the elusive beast of rapid translation quality evaluation will still need to be part of the mix.
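
To give a feel for what continuous retraining during post-editing implies, here is a minimal, purely hypothetical sketch of such a feedback loop. The engine interface (translate/retrain), the dummy engine and the batch threshold are all assumptions for illustration, not a description of any existing product.

    # Illustrative sketch only: the feedback loop implied by continuous retraining during
    # post-editing. The engine interface (translate/retrain), the dummy engine and the
    # batch threshold are assumptions for illustration, not any vendor's actual API.

    class DummyEngine:
        """Stand-in for a real MT engine; a real system would update its models in retrain()."""
        def translate(self, source):
            return f"<machine translation of: {source}>"
        def retrain(self, pairs):
            print(f"retraining on {len(pairs)} post-edited segment pairs")

    class AdaptiveMTLoop:
        def __init__(self, engine, batch_size=2):
            self.engine = engine
            self.batch_size = batch_size
            self.feedback = []  # (source, post-edited target) pairs

        def translate(self, source):
            return self.engine.translate(source)

        def record_post_edit(self, source, post_edited):
            """Store the human correction and retrain once enough new pairs have accrued."""
            self.feedback.append((source, post_edited))
            if len(self.feedback) >= self.batch_size:
                self.engine.retrain(self.feedback)
                self.feedback.clear()

    loop = AdaptiveMTLoop(DummyEngine())
    print(loop.translate("Sehr geehrte Damen und Herren"))
    loop.record_post_edit("Sehr geehrte Damen und Herren", "Dear Sir or Madam")
    loop.record_post_edit("Mit freundlichen Grüßen", "Kind regards")  # triggers retraining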

What specific needs for or constraints on MT data resources do you foresee in Europe in the near future? Tell us here or respond to our survey.

Jo Céline