22 May 2014

Language Technology is the drill to make Big Data "oil" flow in Europe!

It has always surprised me how much is written and said about Big Data without anyone pointing to the main barrier to the data revolution: our many languages (more than 60 in Europe alone). The numbers surely differ from sector to sector, but a fair guess would be that half of all big data is unstructured, i.e. text. Most multimedia data is also converted to text (speech-to-text, tagging, metadata) before further processing. And text in Europe is always multilingual.

Europe prides itself on an “undeniable competitive advantage, thanks to [its] computer literacy level”. In fact, we have had this advantage for decades, but so far it hasn’t helped much. Good brains and companies are systematically bought up by our American friends. No, we should rather focus on what is specific to Europe: on what we have and the US doesn’t, even if it looks like a disadvantage at first sight.

What makes Europe special and different is the fact that we are trying to build a Single Market in spite of our different cultures and systems. Our multilingualism is always seen as a challenge, a big disadvantage. Most Big Data applications only work well in English and, with some luck, okayish in German, Spanish, or French. Smaller EU countries with less widely spoken languages are basically excluded from the data revolution. The dominance of English in content and tools is the reason for the US lead in Big Data. Many European companies have reacted to this and now use English as their corporate language. But Big Data is often big because it originates from customers and citizens, and they prefer to use their own languages.

What if we managed to turn this perceived handicap of a multilingual Europe into an asset? Overcoming the language barriers would be a great step towards a Single Market. We would make sure that smaller Member States participate in, and perhaps become drivers of, the data revolution. Even more importantly, Europe would become the fittest for the global markets. The BRICs and all other emerging economies no longer accept the dominance of English. Europe has a unique chance... if it solves a problem the Americans do not have, or will discover too late.

The real opportunity is therefore to create the Digital Single Market for content and data, independently of their linguistic origin. This would require that we overcome the language silos in which most data remains captive and make all data language-neutral.

To achieve this, we urgently need a European Language Cloud. The European Language Cloud is a web-based set of APIs that provides all text-based Big Data applications with the basic functionality needed to build products for all languages of the Digital Single Market and of Europe’s main trading partners. For more information, see my previous post.

While the European language technology industry might not have all the solutions readily available to deliver the European Language Cloud, many language resources could be pooled as a first step. In addition, many technologies are now entering a phase of maturity (after decades of European investment in R&D) and could be harnessed - through a set of common APIs - into a viral Language Infrastructure. This would go a long way towards delivering the European Language Cloud... without which the Big Data oil will continue to flow only from English grounds.

Jochen Hummel
CEO, ESTeam AB - Chairman, LT-Innovate