Apr 08 3
Unnatural language processing
Posted by Angus Kidman @ 14:22 3 comments
Indexing a large chunk of data is a bit like joining Weight Watchers: it's a useful first step, but it doesn't immediately solve the problem of how you're going to deal with all that blubber.
Getting the indexing to be more intelligent — that is, working out what's actually being said, rather than just what sequences of characters are in place — is nearly as challenging as resisting just one more Tim Tam.
I was reminded of this last week when enterprise analysis software specialist SAS announced that it was buying out Teragram, a company which specialises in just that kind of process.
More specifically, Teragram uses "large annotated dictionaries containing several hundred million words in more than 30 languages" to help categorise documents according to criteria set by users. SAS will use the technology to enhance its Text Miner software and other products (though it will maintain Teragram as a separate "SAS company").
Now, I did an honours degree in linguistics rather too long ago, so I have a bit of an interest in how language processing works. Teragram is essentially using the brute force approach: lots and lots of data to handle lots and lots of potential scenarios.
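To make the brute force idea concrete, here is a minimal sketch of dictionary-driven categorisation in Python. It is not Teragram's method, which isn't public in any detail here. The tiny `ANNOTATED_DICTIONARY`, its category labels and the `categorise` function are all made up for illustration, and a real system would hold vastly more entries and richer annotations.

```python
import re
from collections import Counter

# Hypothetical annotated dictionary: word -> category label.
# The entries and labels are invented for this sketch.
ANNOTATED_DICTIONARY = {
    "invoice": "finance",
    "payment": "finance",
    "refund": "finance",
    "server": "it",
    "password": "it",
    "outage": "it",
}


def categorise(text, dictionary=ANNOTATED_DICTIONARY):
    """Return the category whose dictionary words occur most often, or None."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Count one vote per token that appears in the dictionary.
    votes = Counter(dictionary[w] for w in words if w in dictionary)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(categorise("The server outage delayed one payment"))
```

The weakness is visible straight away: any word that isn't in the dictionary contributes nothing, so coverage depends entirely on how much data you pile in.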
In an era where processing power is ludicrously cheap, that's not a terrible approach. But it's nowhere near as elegant as an algorithm based on a more nuanced understanding of how language actually works. Our understanding of language is still far too fragmentary to make such an approach entirely feasible, but it remains a worthy goal — and it would produce smaller, faster software in the long run.
Such coding concerns aside, there's another, more insidious problem. Processing text is hard enough when it's written in a relatively coherent fashion. But as anyone who hangs around on message boards, Wikipedia talk pages or classrooms can tell you, in the SMS-speak age, assuming that to be the case is dangerous.
For an increasing number of people of all ages, capital letters are a foreign language, punctuation is a waste of space, accuracy in spelling is optional and sentences are like you know words what go 2gether they dont have to mak sense much lol.
While business communications should arguably still be more formal (and accurate), I wouldn't want to stake money (or a Tim Tam) on it.
If the trend continues, text mining intelligence will also need a degree of "un-intelligence" — trying to extract meaning from something that really didn't have any meaning in the first place. That might well require a lot more examples, which is good for storage manufacturers if nothing else.
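As a rough illustration of why that means more examples, here is a small Python sketch that maps SMS-style spellings back to standard words before any further analysis. The `SMS_FORMS` table and the `normalise` function are hypothetical, and the entries are only samples. Each new misspelling needs its own entry, which is exactly where the storage goes.

```python
import re

# Hypothetical lookup table of SMS-style spellings (illustrative entries only).
SMS_FORMS = {
    "2gether": "together",
    "dont": "don't",
    "mak": "make",
    "u": "you",
    "gr8": "great",
}


def normalise(text, table=SMS_FORMS):
    """Lowercase the text and replace known SMS-style tokens with standard words."""
    # Keep digits in tokens so forms like "2gether" survive tokenisation.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Unknown tokens pass through unchanged.
    return " ".join(table.get(token, token) for token in tokens)


print(normalise("words what go 2gether they dont have to mak sense"))
```

Note that this only repairs spelling. It does nothing for a sentence that had no clear meaning to begin with.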









My approach to the analysis of idioms is based on determining the etymology of the idiom. It is no better or more accurate than determining the etymology of any other word or phrase. But the phonetic aspect is often easier because most idioms have more syllables than most single words.
To use an idiom competently/properly does not require any knowledge of its etymology. However, this knowledge may help an L2 student remember an idiom and how/when to use it.
When I was a young kid, all of my friends and I knew the meaning of "escape by the skin of my teeth" and not a single one of us knew it was the translation of B'3or SHinai, a Hebrew pun on the word B'QoSHi (which means barely, hardly, with difficulty) in the biblical book of Job 19:20.
The majority of idioms are transliterated (not translated) from a foreign language directly into words that look/sound/feel like the target language. For English idioms, there are not a lot of foreign languages involved: Germanic languages, Latin, Aramaic (during the 600 years it was a lingua franca), French (1066), Hebrew & Greek (biblical translation), Arabic (7 Crusades, Spanish Armada 1588 => Black Irish), Yiddish (in England prior to the Expulsion in 1290; 1840s from Germany, early 1900s from Eastern Europe), etc.
A minority of idioms are the translation of foreign idioms. These are more difficult to analyze because one needs to know not only the language of the source but also the language into which the original transliteration (sic) was made, which may or may not be the same. Additional intermediate translations (sic) should not affect the result if they were faithful.
A cute English translation idiom is "count sheep !" to go to sleep. This is probably the translation of a Hebrew pun S'PoR TSo@N on the Latin phrase sopor (as in soporific) sond (as in soundly / deeply). This English idiom has been retranslated back into Israeli Hebrew as LiSPoR KeVeS = to count sheep.
In a few cases, the "original" was a euphemism and not "plain text". I suspect this is the case with "kick the bucket". It seems to be the direct transliteration of a Semitic euphemism for dying: to make love in Paradise. Using 3 for aiyin with its ancient G/K-sound: 3aGaV = make physical love + B'3aiDeN = in Eden. 3G => Kick, vB3Dn => BucKeT.
In other words, this type of idiom formation represents the target-language-ification of a foreign word or phrase. It can be most easily illustrated with a foreign phrase that did *not* become an idiom: Latin e pluribus unum = out of many, one. This is a motto of the USA. If it had become an idiom, it might have become "a flower bush you name" but would retain its original Latin meaning. It would probably acquire a folk etymology, such as: we could give a flower bush many names, but we usually give it only one.
Transliteration idioms are most easily formed at a time when most target-language speakers do not read and write. They hear a foreign word/phrase, understand its meaning in context, and convert its sounds into target-language words they do know.
For a rare modern example, "face the music" is attested in the United States from the 1840s. This "music" is probably from Yiddish MoSKoNeh = inference, deduction, hence, consequences, from Hebrew MaSKaNah with the same meaning.
Etymology is not an exact science. The three etymologies that a non-linguist is most likely to "know" are all false. Muscle is not from Latin musculus = a small mouse. Sabotage is not from French sabot = an old shoe. And cabal is from Hebrew het-bet-lamed = to plot, scheme, not from Hebrew Kabbalah = esoteric knowledge, literally, received (tradition). Porcelain has nothing to do with a porcine vulva, and gossamer is from Latin Gossypium = cotton, not from goose + summer :-). But that is another story.
For more idiom etymologies, do a Google search for < idioms Hebrew "izzy cohen" >
Best regards,
Israel "izzy" Cohen
http://tech.groups.yahoo.com/group/BPMaps/