14 July 2016

Is Multilingualism a Joke?

That the Brits are developing some sort of black humour when it comes to Brexit and its consequences is understandable. That every bit of nonsense is then published by renowned newspapers such as The Guardian is another matter.
In Brussels, apparently, there is a joke going around about the soon-to-be new UK Commissioner, as reported in today's Guardian: "EU officials have stressed that the next British commissioner cannot expect a post as influential as the one Hill has given up. The Brussels beltway joke is that the British commissioner should be put in charge of multilingualism – a job created in 2007 for Romania when it joined the EU halfway through the European commission".

Multilingualism is a crucial part of Europe's cultural heritage and an important topic at all levels, be it economic, social or cultural. The tenure of the Commissioner for Multilingualism showed that languages, language technologies and multilingualism figured high on several agendas - but they were dropped as soon as the short-lived post came to an end.

Now, for exactly 14 days, the term "multilingualism" has again appeared in the title of a unit at DG CONNECT (Unit G3), "Learning, Multilingualism & Accessibility", which describes itself as follows: "The mission of the unit is to make the Digital Single Market more accessible, secure and inclusive. To this end, the unit supports policy, research, innovation and deployment of learning technologies and key enabling digital language technologies and services to allow all European consumers and businesses to fully benefit from the Digital Single Market." (emphasis added).

The importance of languages for the Digital Single Market is slowly being recognized (at least in speeches by Vice-President A. Ansip and Commissioner Oettinger). Therefore, I do not find the idea of a Commissioner for Multilingualism that bad, and it is surely not a "joke". In fact, I rather like the idea of a native English speaker becoming the Commissioner for Multilingualism. If only to see the French reaction to that "joke"...

14 April 2016

The Future of Language Resources for Machine Translation (LR4MT)

In a recent brief survey of language service providers (LSPs), LT-Innovate attempted to find out how the translation industry sees the future of language data availability, specifically for machine translation. The results provide food for thought when it comes to planning for the improved usability of digital text resources in the years ahead. It looks as if new developments in machine translation (MT) technology will go hand in hand with a growing need for the right data.

First, statistical machine translation is clearly on most LSPs’ radar screens. Hard data on the actual size of the user market for MT systems is impossible to obtain today, as is information on who uses which free or paid online services in their everyday work. But everyone who responded to our survey claims they will be “using” MT in the next 2 to 3 years.

Preparing for this transition is therefore vital for the nascent language data resource sector.

Overall, 30% of our LSP respondents reckon that the data they will need to prime their MT engines will come from their clients, 78% will use their in-house translation memories and similar assets, and 70% will try to find third-party sources from outside their immediate business nexus.

15% of them would be prepared to buy such data, 70% of them will crawl the web, and a total of 83% of them expect that more free resources will become available.

Judging by our current findings from mapping publicly available LR4MT in Europe, their chances of easily finding the relevant resources look relatively small. Sharing language data is not a high-visibility phenomenon so far.

However, 39% said they did not have the necessary in-house engineering resources to transform the content they might find into viable MT data. This suggests there could be a small market for cleaning and aligning language data harvested from the web or from well-known repositories.

When it comes to the desired quality criteria for usable language resources, by far the most important criterion (84%) was, unsurprisingly, domain relevance. Indeed, small customer- or domain-specific language models for MT are typically considered to outperform general models by a very large factor. This suggests that some serious effort will need to go into pinpointing domain relevance in any language resource supply platform, rather than relying, say, on volume as a virtue in itself.

Appropriately, there was also considerable emphasis on leveraging the semantic characteristics of language data needed for MT. Semantically enriched data, as proposed by such EC-funded projects as LIDER (a Linked Open Data-based ecosystem of free, interlinked, and semantically interoperable language resources) and BabelNet (a multilingual dictionary underpinned by a rich semantic network), clearly have potential as a future resource. We therefore need to examine the fastest and most efficient way to transform this potential technology stack into an operational reality. We can also expect to hear much more from Coreon about multilingual knowledge management as a fundamental business tool.

So what can we expect for a more effective and efficient deployment of LR4MT? In general, respondents are looking towards new hybrid models of machine translation involving the integration of transfer/grammar and semantic modules into the plain vanilla statistical model as it exists today. This suggests that language technology and data resource quality will need to evolve closely in parallel.

They also expect deep learning to be applied to MT, together with such processes as continuous retraining during the MT post-editing phase. In other words, we are just at the beginning of a new cycle of more artificial intelligence-driven MT systems that will be able to learn as they go and extract even more value from relevant data resources. But as one respondent pointed out, the ultimate litmus test for the value of translation resource data is whether or not the original translation is any good. Tools to tame the elusive beast of rapid translation quality evaluation will still need to be part of the mix.

What specific needs for or constraints on MT data resources do you foresee in Europe in the near future? Tell us here or respond to our survey.

Jo Céline

25 February 2016

ESIF Funding Opportunities

We all know the situation: there is a perfectly innovative idea and a dream team available and willing to cooperate, but, alas, no money. It is the same for language technologies as for other topics - no, it is a bit worse, as there is no dedicated budget in H2020 for LT alone. Hence, it was time to look for other sources of funding. LT-Observatory, a project funded by H2020 (yes :-), looked into alternative opportunities and started with ESIF. This lesser-known acronym stands for European Structural and Investment Funds, funds distributed by the EC but administered by national/regional authorities. Those who want to know more about ESIF can click here.

Member States and/or regions identify priorities, so-called "Research & Innovation Smart Specialisation Strategies" or RIS 3. Other priorities include SME competitiveness. While no priority mentions language technologies directly, many see a window of opportunity for LT projects within their priorities. The Member States' and regions' information can be found on the National Funding Opportunities web page. Or you can download a PDF document.

Our tip: Visit the website in the coming months, as we will add more national and regional funding opportunities. If you try it out and are successful, let us know so that we can feature your "success story" online.