Google: Unicode vanquishes ASCII

Unicode has overtaken ASCII as the most popular character encoding scheme on the World Wide Web. Also vanquished at almost exactly the same time was the Western European encoding.

Unicode is a character encoding standard that accommodates dozens of languages as well as Roman characters with diacritical marks. ASCII, a tried-and-true, decades-old standard, is limited to 128 characters (256 in its 8-bit extensions) and has a hard time extending beyond the range of a century-old Remington typewriter.

Mark Davis, Google's senior international software architect, said in a blog post that Unicode vanquished ASCII and Western European within 10 days in December.

"What's more impressive than simply overtaking them is the speed with which this happened," he added, pointing to a graph showing the meteoric rise of Unicode.

Google's a fan of Unicode Web sites. When the company processes data from Web sites, it first converts anything that is not already in Unicode. That improves international search abilities.
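
As a rough illustration of that kind of conversion, here is a minimal Python sketch. It is not Google's actual pipeline; the helper name `to_unicode` and the Latin-1 fallback are assumptions made purely for the example. It decodes page bytes from a declared legacy encoding into a Unicode string, then re-encodes the result as UTF-8.

```python
def to_unicode(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode raw page bytes into a Unicode string (hypothetical helper)."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Assumed fallback: Latin-1 maps every byte to a code point,
        # so this decode cannot fail (though it may guess wrong).
        return raw.decode("latin-1")


# A Western European (ISO-8859-1) page fragment: "café" stored in 4 bytes.
western_bytes = "café".encode("iso-8859-1")

text = to_unicode(western_bytes, "iso-8859-1")  # now a Unicode string
utf8_bytes = text.encode("utf-8")               # same text, stored as UTF-8

print(text, len(western_bytes), len(utf8_bytes))
```

Once every page is held as Unicode text, the same search code can handle all of them regardless of how each page was originally encoded.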

"The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover," he said.

Google just converted to Unicode 5.1, he added, "so people speaking languages such as Malayalam can now search for words containing the new characters."

One disadvantage Unicode has over ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character because Unicode uses more bytes to enumerate its vastly larger range of alphabetic symbols.
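
How many bytes a character actually occupies depends on which Unicode encoding form is used. The short Python sketch below counts the bytes needed for the same text under several encodings; the function name `bytes_per_encoding` and the sample characters are chosen only for illustration.

```python
def bytes_per_encoding(text: str) -> dict:
    """Return the encoded size of `text` in bytes for several encodings.

    A value of None means the text cannot be represented in that encoding.
    """
    sizes = {}
    for name in ("ascii", "utf-8", "utf-16-le", "utf-32-le"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


# A plain Roman letter, an accented letter, and a Malayalam letter.
for sample in ("A", "é", "മ"):
    print(repr(sample), bytes_per_encoding(sample))
```

A plain Roman letter takes one byte in UTF-8, the same as in ASCII, and two or four bytes in UTF-16 or UTF-32. Accented and non-Roman characters need more than one byte in UTF-8 and cannot be stored in ASCII at all.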

Talkback 2 comments

    The last paragraph is incorrect Dean -- 06/05/08

    "One disadvantage Unicode has over ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character"

    The vast majority of web pages would be encoded as UTF-8, which requires only 1 byte per character to store US-ASCII text. That paragraph is simply incorrect.

    Not to mention the fact that video and images take up vastly more space than the average piece of text. The 300x250 pixel image making up just one of the ads on this very page is bigger than the entire HTML file it is contained in, for example.

    Unicode needs to get bent Anonymous -- 08/05/08

    Unicode is just a holdover from the old PC-correctness era of the 1990s. Internationalization is no longer important as everyone is practically required to learn Spanish, especially in the US.
