The magic that makes Google tick

By Matt Loney, ZDNet UK
02 December 2004 10:43 AM
The numbers alone are enough to make your eyes water.

  • Over four billion Web pages, each an average of 10KB, all fully indexed.
  • Up to 2,000 PCs in a cluster.
  • Over 30 clusters.
  • 104 interface languages including Klingon and Tagalog.
  • One petabyte of data in a cluster -- so much that hard disk error rates of 10^-15 begin to be a real issue.
  • Sustained transfer rates of 2Gbps in a cluster.
  • An expectation that two machines will fail every day in each of the larger clusters.
  • No complete system failure since February 2000.
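
As a rough sanity check on two of the figures above, the short Python sketch below estimates how many read errors a petabyte implies and how many machines a cluster might lose per day. It rests on assumptions the article does not state: that the disk error rate is 10^-15 errors per bit read, that a petabyte is 10^15 bytes, and that failures are spread evenly over time. The once-in-three-years machine lifetime is the figure quoted from the article in the Talkback section.

```python
# Back-of-envelope check of two figures from the list above.
# Assumptions (not stated in the article): the error rate is per bit read,
# a petabyte is 10^15 bytes, and machine failures are spread evenly.

PETABYTE_BYTES = 10**15
BIT_ERROR_RATE = 1e-15           # assumed to mean errors per bit read

bits_per_cluster = PETABYTE_BYTES * 8
expected_bit_errors = bits_per_cluster * BIT_ERROR_RATE

MACHINES_PER_CLUSTER = 2000      # "up to 2,000 PCs in a cluster"
MEAN_LIFETIME_DAYS = 3 * 365     # "expected to fail once in three years"

failures_per_day = MACHINES_PER_CLUSTER / MEAN_LIFETIME_DAYS

print(f"Expected errors per full read of one cluster: {expected_bit_errors:.0f}")
print(f"Expected machine failures per day: {failures_per_day:.1f}")
```

Under those assumptions the estimate works out to roughly eight bit errors per full pass over a cluster's data and just under two machine failures a day, which is in line with the article's "two machines will fail every day".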

    It is one of the largest computing projects on the planet, arguably employing more computers than any other single, fully managed system (we're not counting distributed computing projects here), some 200 computer science PhDs, and 600 other computer scientists.

    And it is all hidden behind a deceptively simple white Web page that contains a single one-line text box and a button that says Google Search.

    When Arthur C. Clarke said that any sufficiently advanced technology is indistinguishable from magic, he was alluding to the trick of hiding the complexity of the job from the audience, or the user. Nobody hides the complexity of the job better than Google does. So long as we have a connection to the Internet, the Google search page is there day and night, every day of the year, and it is not just there: it returns results. Google recognises that the results are not always perfect, and that there are still issues -- more on those later -- but when you understand the complexity of the system behind that Web page you may be able to forgive the imperfections. You may even agree that what Google achieves is nothing short of sorcery.

    On Thursday evening, Google's vice-president of engineering, Urs Hölzle, who has been with the company since 1999 and who is now a Google fellow, gave would-be Google employees an insight into just what it takes to run an operation on such a scale, with such reliability. ZDNet UK snuck in the back to glean some of the secrets of Google's magic.

    Google's vision is broader than most people imagine, said Hölzle: "Most people say Google is a search engine but our mission is to organise information to make it accessible."

    Behind that, he said, comes a vast scale of computing power based on cheap, no-name hardware that is prone to failure. There are hardware malfunctions not just once, but time and time again, many times a day.

    Yes, that's right, Google is built on imperfect hardware. The magic lies in writing software that accepts that hardware will fail and deals with that reality quickly, said Hölzle.

    Google indexes over four billion Web pages, using an average of 10KB per page, which comes to about 40TB. Google is asked to search this data over 1,000 times every second of every day, and typically comes back with sub-second response times. If anything goes wrong, said Hölzle, "you can't just switch the system off and switch it back on again."
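
The arithmetic behind those figures is easy to reproduce. The sketch below assumes decimal units (1KB = 1,000 bytes, 1TB = 10^12 bytes), which the article does not specify, and treats "over 1,000 times every second" as exactly 1,000 to get a lower bound on daily queries.

```python
# Checks the "about 40TB" figure and turns the query rate into a daily total.
# Decimal units (1KB = 1,000 bytes, 1TB = 10^12 bytes) are an assumption.

PAGES = 4_000_000_000            # "over four billion Web pages"
AVG_PAGE_BYTES = 10 * 1000       # "an average of 10KB per page"
QUERIES_PER_SECOND = 1000        # "over 1,000 times every second"

index_terabytes = PAGES * AVG_PAGE_BYTES / 10**12
queries_per_day = QUERIES_PER_SECOND * 24 * 60 * 60

print(f"Raw page data: {index_terabytes:.0f}TB")
print(f"Queries per day at 1,000 per second: {queries_per_day:,}")
```

With these assumptions the page data comes to 40TB, matching the article, and the query rate comes to at least 86.4 million searches a day.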

    How to slam spam
    The job is not helped by the nature of the Web. "In academia," said Hölzle, "the information retrieval field has been around for years, but that is for books in libraries. On the Web, content is not nicely written -- there are many different grades of quality."

    Some pages, he noted, may not even have text. "You may think we don't need to know about those but that's not true -- it may be the home page of a very large company where the Webmaster decided to have everything graphical. The company name may not even appear on the page."

    Google deals with such pages by regarding the Web not as a collection of text documents, but as a collection of linked text documents, with each link containing valuable information.

    "Take a link pointing to the Stanford university home page," said Hölzle. "This tells us several things: First, that someone must think pointing to Stanford is important. The text in the link also gives us some idea of what is on the page being pointed to. And if we know something about the page that contains the link we can tell something about the quality of the page being linked to."

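One way to picture the second point in Hölzle's example, that link text describes the page being pointed to, is an index built only from anchor text. The Python sketch below is an illustration, not Google's implementation: the link data, the example URLs other than Stanford's, and the function names build_anchor_index and search are all hypothetical.

```python
from collections import defaultdict

# Hypothetical link data: (source page, target page, anchor text).
# Every URL except the Stanford home page is made up for illustration.
LINKS = [
    ("http://example.edu/courses", "http://www.stanford.edu/", "Stanford University"),
    ("http://example.org/colleges", "http://www.stanford.edu/", "Stanford home page"),
    ("http://example.com/news", "http://bigcorp.example/", "BigCorp official site"),
]


def build_anchor_index(links):
    """Map each lowercase anchor-text word to the pages those links point to."""
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index


def search(index, query):
    """Return pages whose incoming anchor text contains every query word."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


anchor_index = build_anchor_index(LINKS)
print(search(anchor_index, "bigcorp"))
```

Here the made-up BigCorp page is found by a search for "bigcorp" even though nothing on the page itself was indexed, which is how an all-graphics home page can still be matched.
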
    This knowledge is encapsulated in Google's famous PageRank algorithm, which looks not just at the number of links to a page but at the quality or weight of those links, to help determine which page is most likely to be of use, and so which is presented at the top of the list when the search results are returned to the user. Hölzle believes the PageRank algorithm is 'relatively' spam resistant, and those interested in exactly how it works can find more information here.
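
For readers who want a feel for the idea, the sketch below implements the simplified, published form of PageRank by repeated iteration. It is not Google's production algorithm: the damping factor of 0.85 follows the original academic description, while the toy link graph, the page names and the handling of pages with no outgoing links are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy Web: "endorsed" and "other" each have one incoming link.
WEB = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "hub": ["endorsed"],
    "d": ["other"],
}

ranks = pagerank(WEB)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph "endorsed" and "other" each have exactly one incoming link, but "endorsed" ends up with the higher rank because its link comes from a page that is itself well linked.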

  • Talkback 20 comments

      Anonymous -- 03/12/04

      Can't resist being a little PC and finding the parallel between Klingon and Tagalog a bit weird. The latter is a real language spoken by tens of millions of people. Its name sure sounds funny, but is that enough? Better to mention Google's other funny options like "Bork, bork, bork" and "Elmer Fudd".

      Anonymous -- 03/12/04

      Very nice overview. Our compute farm has no direct correlation to google's, but there's still a lot we're learning from their work.

      Anonymous -- 03/12/04

      Nice article. Always nice to get an "inside" look at one of the most popular and useful web applications on the planet. Hope to see more stuff like this - especially about Google.

      Anonymous -- 03/12/04

      Thanks for the story. More interesting insights on the Google cluster can also be found at http://www.tnl.net/blog/entry/How_many_Google_machines

      Anonymous -- 03/12/04

      This was a really cool article. Thanks

      Anonymous -- 03/12/04

      Hey, a "give me less commercial" button sounds wonderful. If you were looking for a camera then clicking this button would probably bring up technical articles, reviews, techniques etc, rather than 1000 results from shops and price comparison spammers.

      Anonymous -- 04/12/04

      How have you guys not done a story on the Google Sandbox??!

      Anonymous -- 05/12/04

      Klingon and Tagalog? I'm curious if you were aware that Tagalog was a real language since you put it in the same sentence as Klingon.

      Anonymous -- 05/12/04

      Why is Tagalog placed in the same context as Klingon? It's hardly a rare language - it's used by the entire country of the Philippines!

      Wouldn't a more culturally sensitive connotation be in order here? Tagalog is hardly in the same class as an artificial language created for a science fiction series!

      Anonymous -- 05/12/04

      Very cool article. Great insight on how the website works. Now, if only I could fit all of that power into my PC, maybe I could play some really top notch games.

      Anonymous -- 06/12/04

      THE PAGE IS HANGING!

      In both IE and Firefox!

      Anonymous -- 07/12/04

      Congrats to you and Matt Loney; good stuff.
      Got onto you from Wired.

      Anonymous -- 07/12/04

      Spelling and grammar make this article a pain, although the subject matter is interesting. Too bad parts of it are nearly unreadable.

      Anonymous -- 08/12/04

      Jesus, I never knew any of this, it's really interesting. What surprised me was "Google runs its systems on cheap, no-name 1U and 2U servers -- so cheap that Google refers to them as PCs. After all each one has a standard x86 PC processor, standard IDE hard disk, and standard PC reliability -- which means it is expected to fail once in three years."!! lol, using IDE!! Even I have SATA, but with such a large business I can understand why they would use this as it is more cost effective. Cheers!!! Mike!! ... www.suprmobo.net!!

      Anonymous -- 08/12/04

      Get a proofreader. 10-15 was supposed to be 10 to the power 15. And there are other errors and typos too. Embarr****ing.

      Anonymous -- 09/12/04

      What amazes me is that with 200 computer doctors and 600 other computer science people, the results from Google in an average search are, many times, similar to other engines, such as Altavista and Webcrawler, with the same people and companies spamming the top slots.

      Anonymous -- 25/12/04

      Uh...Google keeps locking up....(Just Kiddin')

      Have A Merry Christmas!

      Anonymous -- 28/12/04

      "104 interface languages including Klingon and Tagalog."

      Do your research man. You're crossing the racist line.

      Ravi Shiraguppi -- 08/10/07

      wah... Nice Article.

      Ravi Shiraguppi.
      Sangolli Rayanna nagar,
      Dharwad.Karnataka.
      INDIA
