Machine translation is the science of taking human language content, usually text (although speech-to-speech systems are being developed), and using a computer to digest that content and produce output in another language that is faithful to the original. Simply put, as an example: English in, German out. So why is that so unbelievably hard to do?
Going back through the history of machine translation (hereafter MT), early scientists believed it was a word-for-word lookup and replacement problem. Even the uninitiated can quickly see that this solution would break down. For one thing, there was no word rearrangement. How would a Spanish speaker read an English sentence translated into Spanish if all of the adjectives came before the noun instead of after it? That’s only a tiny fraction of the problems, but it is an easy one to understand. So this was obviously a case of needing rules for word rearrangement, which in turn required the MT engine to figure out the part of speech of each word in the source. How could you move the adjectives behind the noun if you didn’t know which words were the adjectives and which noun they were acting on? So “Direct” systems were born, where the engine would rearrange the words based on the source and then flip them into the target language. Pretty soon, computers got a whole lot more powerful, and with more computing power came even more rules. Direct systems morphed into “Transfer” systems, where you could actually break the engine down into analysis, transfer, and synthesis stages. The quality improved significantly, but that is sort of like saying it was less bad. Still, investment dollars were plentiful, because the holy grail of a near-perfect system would literally change the world.
In the 80s, scientists started playing with “hidden Markov models” (don’t worry, I’m not going to explain what they are) and statistical MT. Basically, the question is: given the words around a word, and given the whole corpus that the engine has already read, what are the odds that the answer is this sense of the word and not that other sense? Of course, more data is good data in that world, and statistical MT is what actually got the world on board, through Google Translate and other free systems.
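
To make the early part of that history concrete, here is a minimal sketch of the difference between word-for-word replacement and a “Direct” style system that moves adjectives behind the noun. Everything in it is an assumption for illustration: the tiny English-to-Spanish lexicon, the part-of-speech tags, and the function names are hypothetical and do not come from any real MT engine.

```python
# Hypothetical toy lexicon: English word -> (Spanish word, part-of-speech tag).
LEXICON = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
    "is": ("es", "VERB"),
    "new": ("nuevo", "ADJ"),
}


def word_for_word(sentence):
    """Replace each source word with its dictionary entry, keeping source order."""
    return " ".join(LEXICON[word][0] for word in sentence.lower().split())


def direct_translate(sentence):
    """Rearrange on the source side first, then flip words to the target language."""
    reordered = []
    pending_adjectives = []
    for word in sentence.lower().split():
        tag = LEXICON[word][1]
        if tag == "ADJ":
            # Hold adjectives until we know whether a noun follows them.
            pending_adjectives.append(word)
        elif tag == "NOUN":
            # Emit the noun first, then the adjectives that preceded it.
            reordered.append(word)
            reordered.extend(pending_adjectives)
            pending_adjectives = []
        else:
            # No noun followed, so the held adjectives stay where they were.
            reordered.extend(pending_adjectives)
            pending_adjectives = []
            reordered.append(word)
    reordered.extend(pending_adjectives)
    return " ".join(LEXICON[word][0] for word in reordered)


print(word_for_word("the red car is new"))     # el rojo coche es nuevo
print(direct_translate("the red car is new"))  # el coche rojo es nuevo
```

Even this toy needs to know the part of speech of every word before it can move anything, and it still ignores gender agreement, unknown words, and every other problem a real engine has to face.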
I’m going to go out on a limb here and declare my belief that Google Translate has failed with its current technology. If more data were the answer, it should have worked by now: Google has been scanning the web for existing translations to digest since 2007, and it is still just “less bad”.
So why have all of the millions of person-hours that have been put into statistical MT failed? Have you ever heard the expression “if all you have is a hammer, everything looks like a nail”? They are using the wrong tools and, to some degree, have been for decades. Think about it. Where do languages sit at the university level? Liberal Arts. Writing is an act of art, not science. Translating art to art (translating Moby Dick from English to Spanish) is NOT a scientific endeavor; it is an artistic endeavor. Good translators are artists. They capture the mood, the meaning, and even the essence of the source, and try to replicate that experience for the reader of the translation. They are literally painting a verbal picture for the author, in another language.
OK, so how do we get closer to the artist? For one thing, we need to understand the meaning of the source. We need to understand things like intensity, ambiguity, culture, and many more things that are very hard to quantify scientifically. At LinguaSys, where the founders have decades of experience in this space, we understood all of this, and it shows in the path we have taken over more than a decade of developing our MT technology. Before we begin to think about translating source content, we deeply analyze it both semantically and syntactically. By no means are we done, but what we have been able to produce for our clients to date is a system that retains fidelity, sometimes at the expense of good syntax. More importantly, we understand where to use MT and where not to use it; most MT fails when it is used in the wrong place. We have proven that MT works great when there is a given domain that we can concentrate on, such as, for one client, financial services. That is a limited domain, and knowing that the source is going to belong to it gives us strong clues about how to disambiguate words that have multiple meanings. Based on the state of the art today, and we are, I believe, the state of the art today, MT for the masses (Bing, Google, etc.) is one big FAIL. Use it at your own risk. Your mileage may vary.
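
The point about a limited domain can be shown with a small sketch. This is not the LinguaSys engine, only an illustration of the idea: the English-to-German sense table, the domain labels, and the function name are all hypothetical.

```python
# Hypothetical sense inventory: English word -> {domain label: German translation}.
SENSES = {
    "bank": {"finance": "Bank", "geography": "Ufer"},
    "interest": {"finance": "Zinsen", "general": "Interesse"},
    "security": {"finance": "Wertpapier", "general": "Sicherheit"},
}


def candidate_translations(word, domain=None):
    """Return the translations still in play once the domain is taken into account."""
    by_domain = SENSES[word]
    if domain in by_domain:
        # A known domain narrows the choice to a single sense.
        return [by_domain[domain]]
    # Without a domain clue, every sense remains a candidate.
    return sorted(by_domain.values())


print(candidate_translations("interest"))             # ['Interesse', 'Zinsen']
print(candidate_translations("interest", "finance"))  # ['Zinsen']
```

With no domain, the engine is left guessing between senses. Knowing in advance that the text is financial removes the ambiguity before any translation happens.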
Wherever new invention in AI takes us, it must concentrate more on semantics and less on statistics. Yes, statistics help a lot and can play a big role as another voting part of the engine when it makes a disambiguation decision, but for MT, or Speech Recognition, or any of the other imperfect technologies that are “less bad” on a daily basis, the secret sauce to getting them “good” is semantics.
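
As a rough sketch of what “another voting part of the engine” could look like, the example below combines a statistical vote with a semantic vote when choosing a sense for the English word “bank”. The co-occurrence counts, the sense labels, the scores, and the weights are all made up for illustration; a real engine would derive them from its own analysis.

```python
from collections import Counter

# Hypothetical counts of context words seen near each sense of "bank".
COOCCURRENCE = {
    "financial": Counter({"loan": 40, "account": 35, "river": 1}),
    "riverside": Counter({"river": 30, "water": 25, "loan": 1}),
}


def statistical_vote(context):
    """Score each sense by how often it co-occurred with the context words."""
    raw = {
        sense: sum(counts[word] for word in context)
        for sense, counts in COOCCURRENCE.items()
    }
    total = sum(raw.values()) or 1  # avoid dividing by zero on unseen context
    return {sense: score / total for sense, score in raw.items()}


def semantic_vote(domain):
    """Stand-in for semantic analysis: a known domain strongly favors one sense."""
    if domain == "finance":
        return {"financial": 0.9, "riverside": 0.1}
    return {"financial": 0.5, "riverside": 0.5}


def disambiguate(context, domain, stat_weight=0.3, sem_weight=0.7):
    """Pick the sense with the highest weighted sum of both votes."""
    stat = statistical_vote(context)
    sem = semantic_vote(domain)
    combined = {
        sense: stat_weight * stat[sense] + sem_weight * sem[sense]
        for sense in COOCCURRENCE
    }
    return max(combined, key=combined.get)


print(disambiguate(["river"], "finance"))      # financial
print(disambiguate(["river", "water"], None))  # riverside
```

In the first call the statistics alone would pick the riverside sense, but the semantic vote carries more weight and overrules them. In the second call there is no semantic clue, so the statistics decide.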