Snorage by Angus Kidman

If everyone thinks storage is so boring, how come we always want more of it? Angus Kidman dives into the murky world of enterprise storage, covering everything from the best way to manage a storage area network to the wisdom of trying to ban USB keys and iPods. Go on -- you know size matters.

Unnatural language processing

Posted by Angus Kidman @ 14:22 3 comments

Indexing a large chunk of data is a bit like joining Weight Watchers: it's a useful first step, but it doesn't immediately solve the problem of how you're going to deal with all that blubber.

Getting the indexing to be more intelligent — that is, working out what's actually being said, rather than just what sequences of characters are in place — is nearly as challenging as resisting just one more Tim Tam.

I was reminded of this last week when enterprise analysis software specialist SAS announced that it was buying out Teragram, a company which specialises in just that kind of process.

More specifically, Teragram uses "large annotated dictionaries containing several hundred million words in more than 30 languages" to help categorise documents according to criteria set by users. SAS will use the technology to enhance its Text Miner software and other products (though it will maintain Teragram as a separate "SAS company").

Now, I did an honours degree in linguistics rather too long ago, so I have a bit of an interest in how language processing works. Teragram is essentially using the brute force approach: lots and lots of data to handle lots and lots of potential scenarios.
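
For the curious, the dictionary-lookup idea can be sketched in a few lines of Python. This is a toy illustration only, not Teragram's actual method: the dictionary contents, the category labels and the categorise function are all assumptions made up for the example, and a real system would hold hundreds of millions of entries rather than six.

```python
import re
from collections import Counter

# Hypothetical annotated dictionary: each known word maps to a category label.
# These entries are invented for illustration; they are not Teragram's data.
ANNOTATED_DICTIONARY = {
    "disk": "storage",
    "tape": "storage",
    "backup": "storage",
    "flight": "travel",
    "airport": "travel",
    "ticket": "travel",
}


def categorise(text, dictionary=ANNOTATED_DICTIONARY):
    """Return the category whose dictionary words occur most often in text.

    Returns None when no word in the text appears in the dictionary.
    """
    # Lower-case the text and split it into simple alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Count one hit per occurrence of a word the dictionary knows about.
    hits = Counter(dictionary[w] for w in words if w in dictionary)
    if not hits:
        return None
    return hits.most_common(1)[0][0]


print(categorise("The nightly backup went to tape, then to disk."))  # storage
print(categorise("My flight left the airport without me."))          # travel
print(categorise("lol wot"))                                         # None
```

The catch shows up straight away: any word the dictionary has never seen is simply ignored, which is why the brute force approach needs so much data.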

In an era where processing power is ludicrously cheap, that's not a terrible approach. But it's nowhere near as elegant as an algorithm based on a more nuanced understanding of how language actually works. Our understanding of language is still far too fragmentary to make such an approach entirely feasible, but it remains a worthy goal — and it would produce smaller, faster software in the long run.

Such coding concerns aside, there's another more insidious problem. Processing text is hard enough when it's written in a relatively coherent fashion. But as anyone who hangs around on message boards, Wikipedia talk pages or classrooms can tell you, in the SMS-speak age that's a dangerous assumption.

For an increasing number of people of all ages, capital letters are a foreign language, punctuation is a waste of space, accuracy in spelling is optional and sentences are like you know words what go 2gether they dont have to mak sense much lol.

While business communications should arguably still be more formal (and accurate), I wouldn't want to stake money (or a Tim Tam) on it.

If the trend continues, text mining intelligence will also need a degree of "un-intelligence" — trying to extract meaning from something that really didn't have any meaning in the first place. That might well require a lot more examples, which is good for storage manufacturers if nothing else.

Talkback 3 comments

    Unnatural language processing Anonymous -- 04/04/08

    My approach to the analysis of idioms is based on determining the etymology of the idiom. It is no better or more accurate than determining the etymology of any other word or phrase. But, the phonetic aspect is often easier because most idioms have more syllables than most single words.

    To use an idiom competently/properly does not require any knowledge of its etymology. However, this knowledge may help an L2 student remember an idiom and how/when to use it.

    When I was a young kid, all of my friends and I knew the meaning of "escape by the skin of my teeth" and not a single one of us knew it was the translation of B'3or SHinai, a Hebrew pun on the word B'QoSHi (which means barely, hardly, with difficulty) in the biblical book of Job 19:20.

    The majority of idioms are transliterated (not translated) from a foreign language directly into words that look/sound/feel like the target language. For English idioms, there are not a lot of foreign languages involved: Germanic languages, Latin, Aramaic (during the 600 years it was a lingua franca), French (1066), Hebrew & Greek (biblical translation), Arabic (7 Crusades, Spanish Armada 1588 => Black Irish), Yiddish (in England prior to the Expulsion in 1290; 1840s from Germany, early 1900s from Eastern Europe), etc.

    A minority of idioms are the translation of foreign idioms. These are more difficult to analyze because one needs to know not only the language of the source but also the language into which the original transliteration (sic) was made, which may or may not be the same. Additional intermediate translations (sic) should not affect the result if they were faithful.

    A cute English translation idiom is "count sheep !" to go to sleep. This is probably the translation of a Hebrew pun S'PoR TSo@N on the Latin phrase sopor (as in soporific) sond (as in soundly / deeply). This English idiom has been retranslated back into Israeli Hebrew as LiSPoR KeVeS = to count sheep.

    In a few cases, the "original" was a euphemism and not "plain text". I suspect this is the case with "kick the bucket". It seems to be the direct transliteration of a Semitic euphemism for dying: to make love in Paradise. Using 3 for aiyin with its ancient G/K-sound: 3aGaV = make physical love + B'3aiDeN = in Eden. 3G => Kick, vB3Dn => BucKeT.

    In other words, this type of idiom formation represents the target languag-ification of a foreign word or phrase. It can be most easily illustrated with a foreign phrase that did *not* become an idiom: Latin e pluribus unum = out of many, one. This is a motto of the USA. If it had become an idiom, it might have become "a flower bush you name" but would retain its original Latin meaning. It would probably acquire a folk etymology, such as: we could give a flower bush many names, but we usually give it only one.

    Transliteration idioms are most easily formed at a time when most target-language speakers do not read and write. They hear a foreign word/phrase, understand its meaning in context, and convert its sounds into target-language words they do know.

    For a rare modern example, "face the music" is attested in the United States from the 1840s. This "music" is probably from Yiddish MoSKoNeh = inference, deduction, hence, consequences, from Hebrew MaSKaNah with the same meaning.

    Etymology is not an exact science. The three etymologies that a non-linguist is most likely to "know" are all false. Muscle is not from Latin musculus = a small mouse. Sabotage is not from French sabot = an old shoe. And cabal is from Hebrew het-bet-lamed = to plot, scheme, not from Hebrew Kabbalah = esoteric knowledge, literally, received (tradition). Porcelain has nothing to do with a porcine vulva, and gossamer is from Latin Gossypium = cotton, not from goose + summer :-). But that is another story.

    For more idiom etymologies, do a Google search for < idioms Hebrew "izzy cohen" >

    Best regards,
    Israel "izzy" Cohen
    http://tech.groups.yahoo.com/group/BPMaps/

    virgin blue Anonymous -- 06/04/08

    Virgin Blue: stay away from those bastards.

    I had the misfortune of organising a flight from Brisbane to Melbourne with them on 31.03.08.

    Due to problems that were not my fault, I was delayed getting to the airport, but I still managed to arrive within a reasonable limit to board the plane.
    When I got to the Virgin Blue counter it was 4.53 am (27 min to the flight). The girl tried to book me in but the system did not let her do it.
    She asked me to go to the "service centre" next to the counters, and this is where the trouble continued.
    I asked the person (later I found out that his name is "LEE HADEN"), "Can you get me on this plane? I do not have check-in luggage so I can go straight to the cabin." Without even pretending to make any effort he said, "You pay $50 and you will go on the next plane." Since I had to attend to urgent matters in Melbourne, I said, "Can you just try to put me on this one? There are still over 20 min to the departure."
    I was surprised (to say the least) when, in an aggressive and arrogant tone, he replied, "Pay $50 or you will not travel at all." You do not expect such an attitude here in Australia from a company which claims to be "reputable".
    I said, "Why don't you try to help me?" And that moron, feeling the power he possessed at that moment, said, "OK, YOU WILL NOT TRAVEL AT ALL."
    I asked, "Who do you think you are?" and when he replied with his own question, "Who are you?", I told him, "I am the customer of your company and it is your job to help me."
    A few words were exchanged and he was absolutely determined that he would not let me change the ticket to the next flight.
    Since I had to go to Melbourne as soon as possible, without having time to ring the travel agency, etc., the only thing I could do was go to Qantas and buy a ticket from them. So I paid $261 and had a flight at 6am.

    I wonder who was stupid enough to employ such a creature in the "service department", or maybe it is the unofficial policy of Virgin Blue to treat customers with contempt and that parasite was doing a good job.
    I got the message and I do not intend to use them again, and neither will all my friends, neighbours, acquaintances, people I know and people I do not know.

    Peter M.

    dealing with cretins in uniform Anonymous -- 08/04/08 (in reply to #320099268)

    You're obviously too young to remember when "the customer is always right." The guy must have been having a bad day, and had left his professionalism at home. I hope you send a copy of your email to the airline. If the idiot wouldn't book you, you should: "Book him, Danno!" (They do have nice leather seats, though!)
