Google: Unicode vanquishes ASCII

Unicode has overtaken ASCII as the most popular character encoding scheme on the World Wide Web. Also vanquished at almost exactly the same time was the Western European encoding.

Unicode is a character encoding standard that accommodates dozens of languages as well as Roman characters with diacritical marks. ASCII, a tried-and-true, decades-old standard, is limited to 128 characters (256 in its 8-bit extensions) and has a hard time extending beyond the range of a century-old Remington typewriter.

Mark Davis, Google's senior international software architect, said in a blog post that Unicode vanquished ASCII and Western European within 10 days in December.

"What's more impressive than simply overtaking them is the speed with which this happened," he added, pointing to a graph showing the meteoric rise of Unicode.

Google's a fan of Unicode Web sites. When it processes data from Web sites, it first converts any text that isn't already in Unicode. That improves international search abilities.
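To illustrate what converting a page into Unicode can involve, here is a minimal Python sketch. It is not Google's pipeline: the helper name to_unicode, its declared_encoding parameter and the Latin-1 fallback are all assumptions made for the example.

```python
def to_unicode(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode raw page bytes into a Unicode string (hypothetical helper)."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Assumed fallback: Latin-1 maps every byte value to a code point,
        # so this decode cannot fail, though it may guess wrongly.
        return raw.decode("latin-1")


# A Western European (Latin-1) page fragment: 'e' with acute accent is byte 0xE9.
western = b"caf\xe9"

print(to_unicode(western, "latin-1"))  # decoded directly
print(to_unicode(western, "utf-8"))    # invalid as UTF-8, so the fallback is used
```

Once the text is a Unicode string, the same search and comparison code can handle it regardless of the encoding the page was served in.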

"The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover," he said.

Google just converted to Unicode 5.1, he added, "so people speaking languages such as Malayalam can now search for words containing the new characters."

One disadvantage Unicode has over ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character because Unicode uses more bytes to enumerate its vastly larger range of alphabetic symbols.
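How much memory a Roman character needs depends on which Unicode encoding form is used. The Python sketch below counts the bytes for the same text under several encodings. The function name encoded_sizes and the choice of encodings are assumptions made for the example.

```python
def encoded_sizes(text):
    """Return a dict mapping encoding name to the encoded size in bytes."""
    # The "-le" variants are used so that no byte-order mark is counted.
    encodings = ["ascii", "utf-8", "utf-16-le", "utf-32-le"]
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None  # the text cannot be represented in this encoding
    return sizes


print(encoded_sizes("Unicode"))
# {'ascii': 7, 'utf-8': 7, 'utf-16-le': 14, 'utf-32-le': 28}

print(encoded_sizes("caf\u00e9"))
# {'ascii': None, 'utf-8': 5, 'utf-16-le': 8, 'utf-32-le': 16}
```

Plain Roman letters take one byte each in UTF-8, the same as in ASCII. The cost doubles only in UTF-16 and quadruples in UTF-32.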

Talkback 2 comments

    The last paragraph is incorrect
    Dean -- 06/05/08

    "One disadvantage Unicode has over ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character"

    The vast majority of web pages would be encoded as UTF-8, which requires only 1 byte per character to store US-ASCII text. That paragraph is simply incorrect.

    Not to mention the fact that video and images take up vastly more space than the average piece of text. The 300x250 pixel image making up just one of the ads on this very page is bigger than the entire HTML file it is contained in, for example.

    Unicode needs to get bent
    Anonymous -- 08/05/08

    Unicode is just a holdover from the old PC-correctness era of the 1990s. Internationalization is no longer important as everyone is practically required to learn Spanish, especially in the US.
