22 May 2014

Language Technology is the drill to make Big Data "oil" flow in Europe!



It has always surprised me how much is written and talked about Big Data without pointing to the main barrier to the data revolution: our many languages (more than 60 in Europe alone). The numbers surely differ from sector to sector, but a fair guess would be that half of big data is unstructured, i.e. text. Most multimedia data is also converted to text (speech-to-text, tagging, metadata) before further processing. Text in Europe is always multilingual.

Europe prides itself on an “undeniable competitive advantage, thanks to [its] computer literacy level”. In fact, we have had this advantage for decades, but so far it hasn’t helped much: good brains and companies are systematically bought up by our American friends. No, we rather have to focus on what is specific to Europe. On what we have and the US doesn’t. Maybe even on something that, at first sight, looks like a disadvantage.

What makes Europe special and different is the fact that we are trying to build a Single Market in spite of our different cultures and systems. Our multilingualism is always seen as a challenge, a big disadvantage. Most Big Data applications only work well in English and, with some luck, okayish in German, Spanish, or French. Smaller EU countries with lesser-spoken languages are basically excluded from the data revolution. The dominance of English in content and tools is the reason for the US lead in Big Data. Many European companies have reacted to this and now use English as their corporate language. But Big Data is often big because it originates from customers and citizens. And they tend to use their own languages.

What if we managed to turn this perceived handicap of a multilingual Europe into an asset? Overcoming the language barriers would be a great step towards a Single Market. We would make sure that smaller Member States participate in, and perhaps become drivers of, the data revolution. Even more importantly, Europe would become the fittest player in global markets. The BRICs and all other emerging economies no longer accept the dominance of English. Europe has a unique chance... if it solves a problem the Americans do not have, or discover too late.

The real opportunity is therefore to create the Digital Single Market for content and data, independently of their (linguistic) origin. This would require that we overcome the language silos in which most data remains captive and make all data language-neutral.

To achieve this, we urgently need a European Language Cloud: a web-based set of APIs that provides the basic functionality to build text-based Big Data products for all languages of the Digital Single Market and of Europe’s main trading partners. For more information, see my previous post.
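To make the idea concrete, here is a minimal sketch of what an application built on such a cloud might do: identify the language of incoming text, then map its content onto language-neutral entity identifiers so that data from different languages lands in one shared representation. The `LanguageCloud` client, its methods and the toy entity catalogue are purely illustrative assumptions, not an existing API:

```python
# Hypothetical sketch of a "European Language Cloud" client.
# The class, its methods and the entity catalogue are illustrative
# assumptions only; no such API exists yet.

class LanguageCloud:
    """Stub client standing in for a web-based set of language APIs."""

    def identify(self, text):
        # A real service would run language identification; this stub
        # just guesses German from umlauts/eszett for illustration.
        return "de" if any(c in "äöüß" for c in text) else "en"

    def entities(self, text, lang):
        # Language-neutral entity layer: the same entity ID comes back
        # regardless of the surface language of the text.
        catalog = {"Berlin": "Q64", "Berlins": "Q64"}  # toy cross-lingual mapping
        words = text.replace(".", "").split()
        return [catalog[w] for w in words if w in catalog]


cloud = LanguageCloud()
for sentence in ["Berlin is growing.", "Berlins Wirtschaft wächst."]:
    lang = cloud.identify(sentence)
    print(lang, cloud.entities(sentence, lang))
```

Both sentences, one English and one German, resolve to the same language-neutral entity ID, which is exactly the property that would let Big Data applications work across language silos.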

While the European language technology industry might not have all the solutions readily available to deliver the European Language Cloud, many language resources could be pooled as a first step. In addition, many technologies are presently entering a phase of maturity (after decades of European investment in R&D) and could be harnessed - through a set of common APIs - into a viral Language Infrastructure. This would go a long way towards delivering the European Language Cloud... without which the Big Data oil will only continue to flow from English grounds.

Jochen Hummel
CEO, ESTeam AB - Chairman, LT-Innovate

TKE 2014 – Ontology, Terminology and Text Mining 19-21 June, Berlin

The 11th International Conference on Terminology and Knowledge Engineering (TKE) takes place 19-21 June in Berlin. Learn from more than twenty selected presentations in two parallel tracks and attend two exciting keynotes on the most recent research in ontology, terminology and text mining:
  • "Ontology and the Illusion of Knowledge: Mines of text and nuggets of enlightenment" - Khurshid Ahmad, Trinity College Dublin.
  • "The Sphere of Terminology: Between Ontological Systems and Textual Corpora" - Kyo Kageura, University of Tokyo.
On Saturday, 21 June, a workshop on ISO language codes welcomes your active participation. Last but not least, TKE organizers and sponsors welcome you to a great conference dinner on Thursday, 19 June. Register now; available seats are limited!

The event is organized by GTW and DIN Deutsches Institut für Normung e. V. in cooperation with INRIA, Coreon, Cologne University of Applied Sciences, Copenhagen Business School, Termnet and other associations and consortia, national and international organizations.

18 May 2014

2014 - The year of the verticals for Europe's language technology industry

In a recent interview, José Carlos Gonzalez, CEO of the Spanish firm Daedalus, said with great verve that his “goal for 2014 is to cover progressively the specific needs of our clients by helping them to develop solutions in vertical markets, freeing them from the complexity of language processing technology.”

Freeing verticals from the complexity of language technologies is a necessary step forward. But it means knowing the specific needs of industries, and how solutions can be invented that address the infrastructural conditions of these often large-scale players requiring fairly long-term commitments.

At LT-Innovate, we believe that 2014 will be the year of the verticals. This means that instead of endlessly repeating what our language technology could do if there was, as the poet said, world enough and time (and above all money), we should deliver solutions that industries actually need.

We kick-started this process of market analysis some 18 months ago and have built up a useful body of knowledge about gaps, want-to-haves, on-going problems, and the sheer lack of awareness among various verticals of the potential benefits of LT. We recently published our findings on these markets to help our members compare their experience and insight with our own efforts at trying to identify opportunities.

Each industry naturally has its specific needs, even though all of them tend to follow the trend towards breaking down information silos and stepping up cross-lingual data sharing while keeping costs down.

We found that the increasingly globalising Manufacturing industry tended to expect massively unified information centres with localised interfaces; that Tourism needed deep, multilingual sentiment analysis applications; and that Media & Publishing increasingly required integrated multimodal (speech/text/image) monitoring, using multilingual speech recognition among other technologies.

We also learnt that whatever the structure of the industry, there are multiple touch points in most workflows where LT can play a role in lowering costs, improving efficiency and contributing to what we can call digital integration. Spoken interfaces can improve productivity in numerous industrial jobs, from store-room workers to clinicians writing up patient reports.

Likewise, the need for cross-lingual access to information of all sorts is now a constant in nearly every European vertical. Today it tends to be addressed by point applications; tomorrow we can expect far more integrated solutions that adapt more effectively to specific requirements in the online workplace.

This year LT-Innovate hopes to leverage this initial knowledge base to build a clearer picture of where language & speech technology can play a differentiating, even disruptive, role in simplifying processes, adding value to operations, lowering costs and breaking down data silos in different industries in Europe. So stay in touch.