1 pepper rating

When polled, everyone involved in multilingual content management agrees that terminology management is essential — for effective communication, for protecting certain types of intellectual property, and for optimized searching both inside and outside the organization. Yet the various processes associated with multilingual terminology management (searching, extracting, validating, formatting and delivering) are still time-consuming, expensive (Lingway’s Bernard Normier reckons a “term” can cost between two and five dollars to create), and therefore obvious targets for a technology fix. There are numerous efforts underway to make existing terminology banks or bases interoperable, using a standard such as TBX. But these mainstream terminology bases have usually been handcrafted (so they tend to be systematically out of date) and are piecemeal (they have neither adequate multilingual coverage, nor adequate subject matter timeliness).

Lingway’s Terminology Builder solution is based on the idea that terminology can be mined quickly from multilingual corpora using computational methods that align terms with their cross-language equivalents. First, extract a list of likely terms from a technical domain corpus containing documents in various languages, using any combination of statistical and linguistic analysis to pull out plausible noun phrases and verb-object groups from the source-language documents. Then use these candidate terms as search queries against the “other” language texts in the corpus, supplying a simple word-for-word translation to see what’s there. The other-language terms that occur most frequently in the results of that search are considered the best candidates for the target language terminology. Finally, format the multilingual lists to populate Systran’s dictionaries.
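To make the mining steps concrete, here is a minimal Python sketch of the general idea. It is not Lingway’s implementation: the stopword filter stands in for real linguistic analysis, and the function names, the `seed_dict` word-for-word dictionary and the toy English/French sentences are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny stopword list stands in for real linguistic filtering.
STOPWORDS = {"the", "a", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_terms(docs, n=2, min_count=2):
    """Statistical step: keep stopword-free n-grams seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        for gram in ngrams(tokenize(doc), n):
            if not any(word in STOPWORDS for word in gram):
                counts[gram] += 1
    return [gram for gram, count in counts.most_common() if count >= min_count]

def best_equivalent(term, seed_dict, target_docs):
    """Translate the term word for word, then return the most frequent
    target-language n-gram that contains every translated word."""
    if not all(word in seed_dict for word in term):
        return None  # no word-for-word translation available
    translated = {seed_dict[word] for word in term}
    counts = Counter()
    for doc in target_docs:
        tokens = tokenize(doc)
        # Allow one extra token, e.g. for a linking word such as "de".
        for n in range(len(term), len(term) + 2):
            for gram in ngrams(tokens, n):
                if translated <= set(gram):
                    counts[gram] += 1
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical toy corpus and seed dictionary, for illustration only.
source_docs = [
    "Press the brake pedal firmly.",
    "The brake pedal is worn.",
    "Check the brake pedal.",
]
target_docs = [
    "Appuyez fermement sur la pédale de frein.",
    "La pédale de frein est usée.",
    "Vérifiez la pédale de frein.",
]
seed_dict = {"brake": "frein", "pedal": "pédale"}

term_pairs = {
    term: best_equivalent(term, seed_dict, target_docs)
    for term in candidate_terms(source_docs)
}
```

On this toy corpus the only surviving candidate is “brake pedal”, which gets paired with “pédale de frein”. A real system would need proper phrase chunking, a much larger seed dictionary and, as discussed further on, human validation of the resulting pairs.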

This solution takes an approach similar to the recently announced Google project to provide “translation” (that is, to dig out existing multilingual equivalents from the vast mine of texts on the web by seeking various formal and statistical parallels). It adds some linguistic processing to dig down to the structures beneath the strings, and it produces translations not of texts but of terms. We can expect to see more and more use of knowledge/text/content mining for various multilingual applications. But the technology won’t solve endemic questions in terminology management, especially where terms are considered part of an organization’s semantic DNA.

First, high-quality term bases for translation automation built this way will always depend on the right mix of available subject matter data in the relevant languages. This means that building the right corpus may prove more important than the speed and cost benefits of the terminology extraction itself. Second, organizations will still need a validation process that secures buy-in from many departments and individuals.

We’re still more than a few years away from putting a squishy fish in our ears — or an omniscient linguistic server on our networks — for fully automated and rhetorically compelling translation. But each and every step like this one from Lingway — and similar ones from MultiCorpora and KCSL — will get us closer.
