According to Sen. Lindsey Graham, a six-month trip to Russia by the older Boston Marathon bombing suspect, Tamerlan Tsarnaev, slipped through the cracks because the FBI’s eyes and ears were thrown off by a “misspelling” of the suspect’s name. It is impossible to say for certain without seeing how the name was spelled; the “misspelling” could just as well be a completely legitimate alternate transliteration of Tsarnaev’s name.
It is largely unknown to people living within the linguistic universe of one language, but a basic fact of life to people who deal with multiple languages: most often, there is more than one correct way to translate (or even transliterate). The indeterminacy of translation applies to names as well, because no two writing systems are the same. The spelling of a name is unambiguous in the original language, but a specific combination of sounds (or ideograms, as in Chinese) may have no single definite equivalent in another language, and this is where choice enters. For example, the surname of the bombing suspects, unambiguously Царнаев in Cyrillic, may be spelled Tsarnaev, Tsarnayev, Tzarnaev, Tzarnayev, even Zarnaev or Czarnaev. But is a customs clerk at passport control (or even an FBI investigator) equipped with the knowledge that Tzarnayev is the same as Tsarnaev?
While romanization of the Cyrillic script is a relatively straightforward task, other scripts diverge far more. Arabic and Hebrew normally omit vowels; a letter that can be read as either “u” or “o” will inevitably produce both variants. A classic example is the name Muhammad, which can be spelled Mohammad or Mohammed in English. Abdullah can be spelled Abd-Allah or Abd Allah.
Even positioning can be an issue. While Indo-European names normally start with a given name and end with a surname, this is not the case in the Far East. Chinese names, for example, follow a very rigid and simple, but different, logic: the first character is a surname, drawn from a set of a few hundred Chinese surnames – e.g. Wang, Hu, Li, Dong. The remaining one or two characters are near-random. With the given name following the surname, the terms “first name” and “last name” are confusing at best, and sometimes deemed “culturally insensitive” (because they neglect a huge portion of the world’s population). If a form or a questionnaire asks for a “name”, and the answer is processed with the assumption that the first part is the given name, an error is pretty much guaranteed. Many Chinese who move abroad switch the order to the Indo-European style, yet may keep the original order in private correspondence – which makes matters even more complicated.
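As a rough illustration of how software can avoid the “first word = given name” trap, a matcher might first check the leading token against a surname list. This is only a sketch under stated assumptions: the tiny surname set and the function name are invented for this example, not any real system’s API.

```python
# Hypothetical helper: split a romanized Chinese-style name into
# (surname, given name) by testing the first token against a list of
# common surnames, instead of assuming Indo-European order.

COMMON_SURNAMES = {"wang", "hu", "li", "dong", "zhang", "chen"}

def split_name(full_name: str) -> tuple[str, str]:
    """Return (surname, given_name), assuming surname-first order
    when the leading token is a known Chinese surname."""
    parts = full_name.split()
    if parts and parts[0].lower() in COMMON_SURNAMES:
        # Chinese order: surname first.
        return parts[0], " ".join(parts[1:])
    # Fall back to Indo-European order: last token is the surname.
    return parts[-1], " ".join(parts[:-1])
```

Note that a name like “Li Wang”, where both tokens are plausible surnames, remains ambiguous; a real system would need additional evidence (the original script, or document context) to resolve it.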
Of course, these issues are not new, and have been addressed by the authorities in most countries with high cross-border traffic, including the US. Many software solutions, containing large datasets of names, help match differently spelled variants. It is not a problem today for a customs clerk to get a notification that Mohammed is the same as Muhammad. Most of these systems would, however, struggle to match Tsarnaev and Tzarnayev if one of the variants is missing from the database.
Simply building plain lists does not work: no matter how big they are, the variety of names in the real world is bigger. In the case of Chinese names, it is outright impractical: because the given names are near-random, the matching efficiency is too low. A more comprehensive and robust approach is to combine a search for listed alternative spellings in the database (along with the “master spelling” in the original script) with dynamic generation of all the possibilities, based on a list of ambiguous phonemes (e.g. ae => aye, ts => tz, cz).
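The dynamic-generation step can be sketched roughly as follows. This is a minimal illustration, not Carabao’s actual interface: the phoneme table, function names, and matching rule are all assumptions made for the example.

```python
# Sketch: generate spelling variants of a romanized name by swapping
# interchangeable phoneme spellings, then match two names if their
# variant sets overlap.

# Each tuple lists interchangeable Latin spellings of one sound.
AMBIGUOUS_PHONEMES = [
    ("ts", "tz", "cz"),   # Tsarnaev / Tzarnaev / Czarnaev
    ("ae", "aye"),        # ...naev / ...nayev
]

def variants(name: str) -> set[str]:
    """Generate every spelling produced by swapping ambiguous phonemes."""
    forms = {name.lower()}
    for group in AMBIGUOUS_PHONEMES:
        expanded = set()
        for form in forms:
            for spelling in group:
                if spelling in form:
                    # Substitute every alternative spelling in the group.
                    for alt in group:
                        expanded.add(form.replace(spelling, alt))
        forms |= expanded
    return forms

def same_name(a: str, b: str) -> bool:
    """Two spellings match if their variant sets intersect."""
    return bool(variants(a) & variants(b))
```

With this approach, Tsarnaev and Tzarnayev match even though neither variant was ever stored in a list; the list only has to cover the ambiguous phonemes, not the names themselves.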
But there is yet another issue.
When moving to a new country, many people choose to adapt their name to the local standard – not getting a new name but modifying the old one slightly. For example, Mikhail may become Michael; Pyotr may become Peter. Sometimes, local equivalents may sound completely different (e.g. Vladimir would often become Ze’ev in Israel). The local usage patterns also change: Russian surnames have special feminine endings, which are often dropped abroad.
All this may sound nightmarishly difficult, but it does not have to be. Not with LinguaSys’ Carabao toolkit.
The same semantic-network-based approach with which we solve all the other linguistic issues can resolve the issue of localized names quite easily. If we link the families in which the “localized” names are stored as hyponyms of the more “generic” name families, all the caller application needs to do is verify whether the family of the name in the input and the family of the name against which we are checking stand in a hypernym/hyponym relation (or sit in the same family, if it is an alternative spelling). Because our semantic network is crosslingual, it can also detect the name in any language we support.
This combination of techniques makes name matching easy, if not trivial, minimizing the chance of matches slipping through the net.
As long ago as the 1950s, at the dawn of the computer era, making machines fluent in human language was one of the goals that researchers hoped to achieve. Sci-fi writers and futurists alike envisioned talking robots in the early 2000s – even if very few could predict super-intelligent devices fitting in one’s pocket, global communication, satellite-driven navigation accessible to all, and cars that park themselves.
Yet machines fluent in human language went mainstream only a couple of years ago. Siri, released by Apple, and the appearance of IBM Watson on Jeopardy captured the imagination of the public. Reactions ranged from amusement to genuine fear that machines were taking over the world, comparable to the fright of the first viewers of the Lumière brothers’ film The Arrival of a Train. But these are not robots. It’s only appointment-scheduling and trivia-answering software – and only in a handful of languages.
Why is natural language software so far behind the rest of the software world?
“Give a man a fish; you have fed him for today. Teach a man to fish, and you have fed him for a lifetime.”
But what if fishing were not so simple? Imagine a world in which you needed specific equipment to catch every species of fish – I don’t mean big, small, freshwater, or saltwater; I mean literally one different piece of equipment for every kind, and different from area to area: one type for big lakes, another for small ponds, a third for rivers. Imagine that the equipment cost an arm and a leg to build, and was depressingly difficult to find. Alternatively, you have free handouts (open-source fishing equipment): a hook (or a thread); you bring the rest.
Welcome to the world of natural language processing.
Would you like to find out whether the message expressed in a post is favourable or unfavourable? Look for sentiment mining software.
Need to find proper names? Entity extraction software.
Understand what the user is saying? Natural language user interfaces.
Every natural language processing need has to be attended to separately, with complex machinery built from scratch, taking resources and time. And no, purely statistical engines are not a panacea; in fact, they are the worst offenders when it comes to limiting themselves to a particular task, without a hope of ever growing beyond it, and their supposed ease of development is not at all a given. It’s not that the developers are greedy; natural language software is anything but a way to make money fast.
Doesn’t it make sense to have a shared engine that understands language, and extend it for specific applications?
This is the main idea behind the Carabao Language Kit: an architecture that makes it possible to reuse the language models many times, for many purposes, and to make the linguistic development itself as rapid as possible. While it took a while to work out the details, the investment paid off. The bulk of natural language processing tasks boils down to two processes: analysis and transformation. This is precisely what Carabao does with its language models. Carabao plays the role of a linguistic virtual machine™ of sorts, embedding linguistic intelligence inside a third-party application and handling all the natural language dirty work, as shown in the figure below:
Today, Carabao is used for:
- Machine translation
- Multilingual Natural Language User Interface
- Crosslingual retrieval
- Entity and relationship extraction
- Conversion of unstructured text to XML
- Domain of discourse extraction
- Conversion of Chinese and Japanese proper names and addresses to English
- Translation of person and company names in Chinese Telegraphic Codes to English
- Opinion mining (not only in English, but also in German, Spanish, Portuguese, Japanese, Russian)
More exotic and diverse applications are being discussed and studied, and as the usage of Carabao expands, developers come up with new creative mashups. Perhaps the most intriguing one (even if not very practical at this time) is universal decipherment. In his 2011 paper “A semantic ‘engine’ for universal translation”, SETI scientist John R. Elliott describes a very similar engine as a pseudo-Rosetta stone for decoding hypothetical signals from extraterrestrial intelligence and manuscripts written in undeciphered languages.
Language models developed with Carabao are more valuable than a regular machine translation / entity extraction / opinion mining model, because developing one language model essentially means getting several applications at once.
This means not only that the shelf life of the language models is as long as that of any application using them, but also that every application benefits from improvements to the linguistic model triggered by any other application. Essentially, it means more eyeballs and more motivation to polish the product, letting the language models as well as the linguistically abstract kernel accumulate the “street wisdom” of being part of a real-world product.
Certain parts of the lexical database are crosslingual and reused by all the models: for instance, a truck is a type of vehicle not just in Japanese or Spanish – it is one in real life. Once the link is defined, all the models use this bit of information.
Products in which Carabao is embedded do not have to know anything about the languages they are working with. The analytical applications become “fluent” in a new language as soon as a language model is added to the set of included language models.
So perhaps the Linguistic Virtual Machine™ paradigm will provide a powerful, rapidly evolving tool for the developers of natural language applications to catch up with the rest of the software world.
*Puck, A Midsummer Night’s Dream
This is the very first blog post for this company that Vadim, Can, and I started three short years ago. We had a vision – and, fortunately for us, some fantastic technology – with which we could make a difference in the field of human language technology. Machine translation has been around for over 50 years in a state of near-stasis, with one or two large bumps from advanced transfer systems in the 90s and then statistical systems in the twenty-ohs.
Meanwhile, text analytics and natural language understanding were upstarts facing a wild, wild west of technologies, methodologies, experts, and inexperts. Expert systems could now read through millions of tweets, social posts, blogs, and more, and claimed to be able to tell the sentiment (happy, sad, angry, mad) as well as extract and aggregate analytic information from these millions of social network comments. It would be a brave new world where these computer programs could gather data and, as you walked down the street, the signs would visually change to meet your personal needs (Minority Report, 2002).
While we were never actually doused with Puck’s magic potion, we did see that for all of those things to happen, there had to be a layer of language transparency. How can you measure the sentiment in 200 million tweets a day when 50% of them are NOT in English? How can you glean important information from social networks if you only look at one language? Early systems tried using machine translation to translate massive amounts of data into English, but that is truly a fool’s errand. Not that the MT was bad, but all MT has an error rate, and that error rate, over millions of posts, can render the conclusions useless – or worse, wrong. So we set out to make sure that we do all sentiment, analytics, and understanding in the native language. And so far, it is going pretty well!
I’m not sure, at this point, how often we will blog. I expect to pass the pen to others on our team to blog about their favorite topics in the linguistic domain. We have people all over the world working on different parts of our Carabao Linguistic Virtual Machine™, and they all bring different perspectives to the linguistic world that makes up our 9-to-5 days. We are an eclectic group, and I hope that means we will be posting some interesting, thought-provoking, and certainly heretical ideas. We may be fools, and we may be mortal, but we don’t live in a fairies’ forest like our friends in A Midsummer Night’s Dream. Our world is very much grounded in real-world use, requirements, and the very serious business of delivering meaning and language transparency to our clients, in whatever languages they require.