The Internet is thoroughly multilingual, with text in dozens of languages readily available. But human nature being what it is, few of us will learn Portuguese just to read the Jornal de Noticias (www.jnoticias.pt). What if your computer could translate the newspaper into your native tongue?
The field of machine translation (MT) has been investigating such systems for decades. Heavily funded in the 1950s by the US government, early efforts involved translating between Russian and English.
One of the first non-military applications was the Canadian METEO system, which translates English weather forecasts to French. Businesses are now interested in MT, with huge investments in Asia and Europe, regions especially affected by the increasingly multilingual economy.
Today's MT systems do a reasonable job, though they're hardly fluent. AltaVista and SYSTRAN (a major commercial MT vendor) have teamed up to provide free automatic translation between English and a handful of European languages at babelfish.altavista.digital.com.
Alas, as any translator will tell you, translation is hard work. Language is often ambiguous, and computers haven't mastered the common sense knowledge needed to resolve ambiguity. Nevertheless, substantial progress has been made, based on several different approaches.
The so-called "transfer" approach is used by SYSTRAN and many other systems. First, the source text is analysed to classify words as nouns, prepositions, and so on. Rules are then used to determine how these words group into larger phrases.
More rules then group phrases into still larger structures. The result is a "parse tree" describing how each word fits into the sentence's overall meaning.
The actual translation occurs in the "transfer" phase: a second set of rules tells how to convert the source parse tree into a parse tree of the target language. In the third step, sentences are generated by using the target language's parsing rules in reverse, building a sequence of words in the target language that express the meaning in the translated parse tree.
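A toy sketch may make the three phases concrete. The lexicon, tags, and reordering rule below are invented for illustration and bear no relation to SYSTRAN's actual rules; among much else, this sketch hard-codes gender agreement into the lexicon.

```python
# Toy transfer-style translation: parse, transfer, generate.
LEXICON = {"the": "la", "red": "rouge", "car": "voiture"}

def parse(words):
    """Classify each word, then group them into a noun-phrase tree."""
    tag = {"the": "DET", "red": "ADJ", "car": "NOUN"}
    return ("NP", [(w, tag[w]) for w in words])

def transfer(tree):
    """Transfer rule: English DET-ADJ-NOUN order becomes DET-NOUN-ADJ in French."""
    label, tagged = tree
    order = {"DET": 0, "NOUN": 1, "ADJ": 2}
    return (label, sorted(tagged, key=lambda wt: order[wt[1]]))

def generate(tree):
    """Read the target-language words off the transferred tree."""
    return " ".join(LEXICON[w] for w, _ in tree[1])

print(generate(transfer(parse(["the", "red", "car"]))))  # la voiture rouge
```

Real systems, of course, need thousands of such rules, plus machinery for choosing among competing parses.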
The transfer approach is reasonably successful, but it faces the so-called N² problem. Ideally, to add a new language to an MT system, you would just give it rules for understanding the language. But the rules used by transfer systems must be tailored to both the source and target languages.
So if the system translates between N languages, it needs N*(N-1), or approximately N², sets of transfer rules, one per ordered language pair. The N² problem doesn't mean that transfer systems are impossible, just that building and maintaining them is difficult.
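The arithmetic is easy to check: every ordered pair of distinct languages needs its own rule set.

```python
# Count the transfer rule sets a system needs for n languages:
# one per ordered pair of distinct languages.
def transfer_rule_sets(n):
    return n * (n - 1)

for n in (2, 5, 10):
    print(n, "languages need", transfer_rule_sets(n), "rule sets")
# 2 languages need 2 rule sets
# 5 languages need 20 rule sets
# 10 languages need 90 rule sets
```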
To make MT systems easier to build, an alternative "interlingual" approach has been advocated. The idea is as abstract as it is powerful: our thoughts are structured in our heads using a language-neutral representation, so using language is a matter of converting between this internal representation and sentences.
This idea can be readily applied to translation. First, convert the source text into its interlingual representation, then convert this interlingual representation into the target language.
The good news is that with N languages, we need only N modules in the system, one per language, so the interlingual approach becomes more attractive as the number of languages grows. But a problem - and it's a serious one - is that we don't understand human minds well enough to design a good interlingual representation.
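A toy pipeline shows why N modules suffice. The "interlingua" here is just a table of invented concept labels; designing a real language-neutral representation is the hard, unsolved part.

```python
# Toy interlingual translation: each language contributes one
# analyse/generate pair, and any analyser composes with any generator.
INTERLINGUA = {"GREETING": {"en": "hello", "fr": "bonjour", "pt": "ola"}}

def analyse(sentence, lang):
    """Map a sentence into the language-neutral representation."""
    for concept, words in INTERLINGUA.items():
        if words.get(lang) == sentence:
            return concept
    raise ValueError("sentence not understood")

def generate(concept, lang):
    """Render the language-neutral representation in the target language."""
    return INTERLINGUA[concept][lang]

def translate(sentence, source, target):
    # N modules cover all N*(N-1) language pairs.
    return generate(analyse(sentence, source), target)

print(translate("hello", "en", "fr"))  # bonjour
```

Adding Portuguese to this sketch means adding one column to the table, not writing new rules for every English-Portuguese and French-Portuguese pairing.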
While transfer and interlingual systems rely on hand-crafted rules, another possibility is to build an MT system that learns.
To translate between English and French, a system could first study the proceedings of the Canadian parliament, which are translated into both languages.
One such learning strategy is "example-based" MT. During learning, the translations are first aligned sentence-by-sentence. Sentences are then split into phrases, and rules are applied to determine which phrases correspond in the source and target.
The pairs are then stored in a huge index. At translation time, the system breaks a sentence into phrases, retrieves the closest match for each from the index, and splices the corresponding target phrases into the translation. The key point is that the system - not the programmer - builds the index.
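In miniature, the idea looks like this. The three-pair "corpus" is invented, and real systems use far richer phrase alignment and matching than the crude string similarity below.

```python
# Toy example-based MT: learn a phrase index from a parallel corpus,
# then translate by retrieving the closest stored source phrase.
from difflib import get_close_matches

corpus = [("good morning", "bonjour"),
          ("good night", "bonne nuit"),
          ("thank you very much", "merci beaucoup")]

# Learning: the system, not the programmer, builds the index.
index = {source: target for source, target in corpus}

def translate(phrase):
    """Retrieve the target phrase for the closest indexed source phrase."""
    best = get_close_matches(phrase, list(index), n=1, cutoff=0.0)[0]
    return index[best]

print(translate("good morning"))  # bonjour
```

An unseen phrase such as "thank you" is handled by falling back to its nearest neighbour in the index, for better or worse.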
One problem with example-based MT is that programmers must supply rules for deciding which phrases to index. These rules are simpler than those needed by transfer or interlingual MT systems, but a fourth, "statistical", approach aims to have the system learn everything.
Statistical MT systems adopt a ridiculously simple notion of translation. The idea is to predict how a translator would handle any given word, based on simple statistics gleaned from a collection of translations.
When learning to translate English to French, the system might keep track of the frequency of an English word being translated as each French word ("dog" usually becomes "chien" but rarely "pomme"), or how many words an English word translates to ("not" usually becomes two French words, sometimes none or one, but rarely three).
As with example-based MT, the key is that these statistics are gathered by the system, not the programmer.
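Here is a toy version of the statistics such a system might gather. Real statistical MT is far more involved, and this tiny hand-made set of word alignments is purely illustrative.

```python
# Gather translation-frequency statistics from word-aligned examples,
# then estimate how a translator would render a given English word.
from collections import Counter, defaultdict

# Word-aligned training pairs gleaned from translated text.
aligned = [("dog", "chien"), ("dog", "chien"), ("dog", "chienne"),
           ("not", "pas"), ("house", "maison")]

counts = defaultdict(Counter)
for english, french in aligned:
    counts[english][french] += 1

def p(french, english):
    """Estimated probability that `english` is translated as `french`."""
    return counts[english][french] / sum(counts[english].values())

print(p("chien", "dog"))   # "dog" became "chien" in 2 of its 3 examples
print(p("pomme", "dog"))   # never observed, so probability 0.0
```

With millions of aligned sentences rather than five word pairs, such tables become surprisingly good predictors.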
Manipulating these numbers consumes vast amounts of memory and processing, and statistical MT was thought impractical when proposed in the 1950s. But with computers getting more powerful by the day, statistical MT today rivals other techniques - all the more surprising given its lack of linguistic sophistication.
With all these techniques available, when will they come into everyday use? The answer depends on what you want. If you insist that MT systems be completely automatic and produce fluent translations, you're going to have to wait. But if you just need to decide whether a document is relevant, you might be satisfied with today's mediocre translations.
Even if you need high-quality translations, it might be faster to amend the computer system's output than to translate by hand. MT researchers are investigating an alphabet soup of ways humans and machines can co-operate.
With human-assisted MT (HAMT) systems, a person helps the machine by resolving subtle ambiguities, while the roles are reversed with machine-assisted human translation (MAHT) systems.
Nicholas Kushmerick is at: nick@compapp.dcu.ie
For more information, see www.compapp.dcu.ie/nick/itr/4.html