This story was printed from ZDNet Australia.
In search of intelligent search

By Charles Babcock, Inter@ctive Week
November 07, 2000
URL: http://www.zdnet.com.au/news/business/soa/In-search-of-intelligent-seach/0,139023166,120106737,00.htm


On the internet, no one knows you're a dog. Except maybe the big outdoor store REI.

A search for hiking boots on the retailer's Web site yielded 84 selections. Top on the list: Ruff Wear Bark 'N Boots, $34, booties for dogs.

Why would the search engine on REI's site offer mutt booties to a serious boot buyer? David Harris, vice president of marketing at EasyAsk, which makes search engines for e-commerce catalogs, thought it was likely a result of REI's knack for presenting outdoor wear in an appealing manner.

"Somewhere on the site its content says, 'When you're out hiking with your dog, it's great to wear these boots,' " he speculated. The search engine indexing mechanism didn't try to distinguish what was more important - the hiking footwear or the dog. Hence, dog booties were its first selection.

REI is not alone. As Interactive Week staffers turned their attention to how search is implemented, they found many examples of poor results.

Suspicious of the unshaven character eyeing your new BMW? A check for the "10 most wanted" criminals at the Federal Bureau of Investigation Web site will tell you about the apprehension of a burglar in Lee County, Fla., one of the 10 most wanted in that locale. But that's the only reference to wanted criminals on the first page.

At www.whitehouse.gov, a query of "Whitehouse" brings up the President's Council on Sustainable Development and the Office of Management and Budget, links one might expect from a heavy-hitting government site. But changing the keyword to "White House" produced an application for White House internships. Maybe that's what the president wanted. But it's more likely another example of what makes Internet users ask the real question: "Must search stink?"

That's the title of a recent report by Paul Hagen, a senior analyst at the Cambridge, Mass., market research firm Forrester Research. Hagen and others said search doesn't have to stink. Nevertheless, it often does. And - despite new technology, or even properly implemented existing technology that can eliminate many of the more onerous problems - it likely will for the foreseeable future.

For the Forrester survey, a staffer typed "HTML" into the IBM site search engine and got back all the pages where content coders had forgotten to close the initial HTML bracket.

A search for "buy" at Lucent Technologies' site took a visitor into the company's human resource documents that told employees how to buy extra vacation days, complete with links to executive compensation policies, according to Must Search Stink?

In the Interactive Week review, reporters looking for problems said they didn't have to go any further than the magazine's own site on ZDNet, powered by respected search engine Thunderstone. In some cases, the search function had clearly been "dumbed down" to produce lots of results for unskilled surfers, a common problem with search on the Web, experts said. The REI search engine came up with the dog booties because "it is designed to capture everything relevant," said Jennifer Lind, an REI spokeswoman.

At Office Depot's online store, trying the vague keyword "tapes" or the potentially confusing "3 ring binder" produced good results. But searching for Quick Books yielded only references to business and computer books - not to QuickBooks, the popular small-business accounting software package sold by Office Depot.

Keyword technology

Search on many of the Web sites is based on keyword technology, which parses a word or phrase and quickly scrutinizes an index of text references that match.

The index is constructed from feedback that is supplied by automated crawlers, sometimes called bots or spiders, which comb all the content on a given Web site, a domain of several sites or the Web itself. Most crawlers capture keyword references to content, based on titles of pages and frequently used nouns in the first few paragraphs of content. Those references are stored in the index underlying the search engine.
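The indexing approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any engine named in this story: the page data is invented, the function names are hypothetical, and the index simply records every word in the title and the opening words of the body rather than picking out frequently used nouns.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(pages, lead_words=50):
    """Map each keyword to the ids of pages whose title or opening text contains it."""
    index = defaultdict(set)
    for page_id, (title, body) in pages.items():
        # Assumption: only the title and the first `lead_words` words are indexed.
        words = tokenize(title) + tokenize(body)[:lead_words]
        for word in words:
            index[word].add(page_id)
    return index

def search(index, query):
    """Return ids of pages that contain every word in the query."""
    word_sets = [index.get(word, set()) for word in tokenize(query)]
    return set.intersection(*word_sets) if word_sets else set()

# Invented pages, loosely modeled on the REI example.
pages = {
    "boots": ("Men's hiking boots", "Leather boots for rugged trails."),
    "dog": ("Bark 'N Boots", "Booties for dogs. Great for hiking with your dog."),
}
index = build_index(pages)
print(search(index, "hiking boots"))
```

Both pages match "hiking boots" here, because the index records only that a word appears on a page, not which product the page is actually about.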

But the frequency with which keywords occur in the index is where the trouble begins. Many businesses use freeware search engines or off-the-shelf software packages, which have varying levels of indexing and classifying capability.

A search engine powerhouse, such as AltaVista, Excite or Northern Light, covers the entire Web, building a huge index based on its crawlers' survey of millions of Web pages.

AltaVista has indexed 350 million pages - the number that's left after duplicates have been stripped out, offensive sites have been removed by "family filtering" and sites run by spammers, who load up pages with popular keywords in hopes of attracting traffic, have been eliminated. Even so, a simple keyword search on AltaVista for "men's brown belts" will yield 6,953,460 hits, because each word in the phrase is found on many sites.

"You have to control what gets displayed at the top of the results," John Piscitello, senior business manager at AltaVista, said he tells customers such as e-commerce software seller Ariba and book and music seller Amazon.com. "It's a lot like shelf space. The majority of users will only look at what's on the first page," much as shoppers in a store look for favorite brands in the prime shelf space.

But getting the most relevant results onto that first page remains a supreme challenge.

If you are trying to find out who said, "The business of America is business," and you search on the words without surrounding them with quote marks, the results vary widely from search engine to search engine, with most of the results referring to business topics. If you search with the sentence in quotes, AltaVista, Google, Lycos, Netscape Search, Northern Light and other major search engines return Calvin Coolidge as the source of the comment on the first page. On the other hand, if you ask for an "I Feel Lucky" single answer from Google's natural language-capable site, you get back the day's headlines from Business News America, a Latin American news service, the Interactive Week test showed.

One of the best search engines at deciding relevance is Ask Jeeves. Its replies to natural language queries get away from keyword limitations and are frequently appropriate. But they are based on human editors who observe frequently asked questions and make sure Jeeves' results include the sites most likely to have the desired answers.

Ask Jeeves a question that his background editors haven't anticipated - for example, The City of New Orleans ran over how many miles of track? - and Jeeves' reply is as nonsensical as those of other search engines. The City of New Orleans is a former passenger train. But Jeeves' first response to the question is to offer to tell you how far it is from New Orleans to New Orleans.

On the other hand, Ford is implementing Ask Jeeves to help it supply answers to the most commonly asked questions concerning Ford Explorer tires, said Sean Murphy, vice president of product management at Ask.com, supplier of Ask Jeeves.

Most businesses, however, can't justify the expense of staffing their search function to the Ask Jeeves level, Hagen said.

Northern Light indexes 310 million pages, but includes all the words in a document, not just titles or the first few paragraphs. So when it hunts for where the words "men's brown belts" occur, it gets 72,622 results. Put the phrase in quotes and the search engine is restricted to those documents where the three words occur in order together. Northern Light then comes back with 39 results, primarily focused on karate and secondarily on shopping, one or the other of which might reflect the searcher's interest.
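The difference between the two kinds of query can be shown with a small full-text index that remembers where each word occurs. The sketch below is an assumption about how such a phrase search could work, not a description of Northern Light's software; the documents and function names are invented.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [word positions]} over the full text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def any_order(index, query):
    """Documents that contain every query word, anywhere in the text."""
    words = query.lower().split()
    doc_sets = [set(index[w]) if w in index else set() for w in words]
    return set.intersection(*doc_sets) if doc_sets else set()

def exact_phrase(index, query):
    """Documents where the query words occur next to each other, in order."""
    words = query.lower().split()
    hits = set()
    for doc_id in any_order(index, query):
        starts = index[words[0]][doc_id]
        if any(all(start + offset in index[word][doc_id]
                   for offset, word in enumerate(words))
               for start in starts):
            hits.add(doc_id)
    return hits

# Invented documents for illustration.
docs = {
    "karate": "men's brown belts in karate train for years",
    "shop": "brown leather belts and men's wallets",
}
index = build_positional_index(docs)
print(any_order(index, "men's brown belts"))
print(exact_phrase(index, "men's brown belts"))
```

Without quotes, both documents qualify; the quoted phrase keeps only the one where the three words appear together in order.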

The relevance of search

For the sites of companies and organizations, one of the most important factors in deciding the relevance of a search is not the type of search engine used but how well the content behind a site has been tagged and labeled, Hagen said. Although many businesses are generating content at a prodigious rate, they could employ content management systems from Autonomy, Allaire, Mindwave Software or Semio to make certain their content puts its best foot forward to a search engine.

One department of a company may aggressively tag its pages; another might neglect to tag them at all or use the same tag over and over again for a set of pages. "The tagged content rises to the top of the search results, even if it is not the best fit for the query," Hagen said. Tags used over and over show up as duplicates in the results, leaving the user baffled as to what content lies behind them.

All documents made available to a site's search tool should have different titles and descriptions that an average user can recognize, Hagen said. "Unfortunately, rewriting those portions of the content requires a manual effort," he said. If a firm can't make that investment in its site, Hagen said, it should simply remove untitled and undescribed documents from its index, since they only clutter up user results.

Although few sites practice it, one of the best ways to present relevant results to end users is to categorize them. That's one of the specialties of search engine Northern Light. Instead of just a stack of results extending 20 or 30 pages deep, Northern Light also brings back a set of labeled folders with different results in each. Thus a search for "red beans and rice" brought back both a burrito recipe folder and a rhythm and blues folder, for the Monterey, Calif., band of the same name.
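The idea of folders can be illustrated with a minimal sketch. It assumes each result already carries a category label, which is the hard part in practice and is not shown; the titles, labels and function name are invented, and nothing here reflects how Northern Light actually builds its folders.

```python
from collections import defaultdict

def group_into_folders(results):
    """Group (title, category) results into labeled folders, keeping result order."""
    folders = defaultdict(list)
    for title, category in results:
        folders[category].append(title)
    return dict(folders)

# Invented results for a search on "red beans and rice".
results = [
    ("Red beans and rice burrito", "Recipes"),
    ("Red Beans and Rice live in Monterey", "Rhythm and blues"),
    ("Slow-cooker red beans and rice", "Recipes"),
]
folders = group_into_folders(results)
for label, titles in folders.items():
    print(label, titles)
```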

Webvan Group, the online grocery retailer, is about to implement both a keyword and a category search on its site because it was missing sales due to ineffective searches.

In one case, a customer attempting to buy Murphy's Oil Soap, a wood cleaner, couldn't find it without knowing its correct name. Searching on soap or oil won't yield the product, but, by the end of November, searching "wood cleaner" will find it, according to Bill Brougher, director of technical alliances at the Foster City, Calif., company.

Webvan is adding the Mercado Software search engine, IntuiFind, to its site in November. Like EasyAsk's and a handful of other search engines, IntuiFind can search databases of structured information as well as unstructured text, providing a cross-search of two or more types of sources - text content and databases.

"We recently put in categories of food on our site, even though we already carried the products," Brougher said. A customer may find a Japanese noodle she wants more directly by searching in the "Japanese" food category, with its underlying database, than by a general search on "noodle," he said.

Likewise, if you go to Blockbuster.com and search "Belushi" with "frat," you will come up with the name of the movie, Animal House, even though minimal or misleading keywords have been provided to the search engine. "Belushi" shows up in the Blockbuster actors database, and "fraternity" shows up in a file that describes Belushi movies; both are linked to Animal House, said Yaron Dycian, product marketing director at Mercado.
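A cross-search of this kind can be sketched as a lookup in a structured table combined with a scan of free text, keeping only the titles found by both. The data, the prefix-matching rule and the function name below are assumptions made for illustration; they are not Mercado's or Blockbuster's implementation.

```python
# Hypothetical structured table: actor surname -> movie titles.
ACTORS = {"belushi": {"Animal House", "The Blues Brothers"}}

# Hypothetical unstructured descriptions, keyed by movie title.
DESCRIPTIONS = {
    "Animal House": "A rowdy fraternity battles the dean of Faber College.",
    "The Blues Brothers": "Two brothers reunite their band to save an orphanage.",
}

def cross_search(actor_term, text_term):
    """Return titles matched by both the actor table and the description text."""
    by_actor = ACTORS.get(actor_term.lower(), set())
    prefix = text_term.lower()
    # Assumption: a description matches when any of its words starts with the term,
    # so "frat" finds "fraternity".
    by_text = {
        title
        for title, description in DESCRIPTIONS.items()
        if any(word.startswith(prefix) for word in description.lower().split())
    }
    return by_actor & by_text

print(cross_search("Belushi", "frat"))
```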

On the other hand, categorization that does not work is particularly frustrating, Forrester's Hagen said.

Searching for "sleeping bags" on Backcountry Gear's site yields no sleeping bags, even though sleeping bags is one of the categories listed under Our Gear Shops at the top of the site, he said.

Conducting the search

Conducting a search does not have to be conceived as a one-shot process that brings back exactly the right result on the first try, Hagen said. It can be more of a dialogue between the site and the visitor, with the visitor's feedback offering a means of narrowing the search.

Indeed, University of California at Berkeley search researcher Marti Hearst said next-generation search will take a step back from search engines and implement search paths based on hyperlinked material. Leading the user from one hyperlink to the next provides the user with a more limited search path, something like boarding a train as opposed to going cross-country in an all-terrain vehicle.

Following described links gets the user more easily to his or her destination, while moving down untracked terrain, Hearst said, "may get you wedged between two boulders on the side of a cliff."

In addition, promising improvements in search technology are just around the corner or already here, waiting to be implemented. While keyword searches are based predominantly on one or two keywords, so-called matching engines are able to take multiple variables and match them up with another set of variables.

Matching engines are particularly useful when trying to pair a job applicant with an employer, said Sean Luitjens, director of strategic technology at recruitment site Monster.com. That ability is important to Monster, which lists 402,828 job postings.

Matching engines can deal with many more factors than keyword search engines, said George Karypis, computer science professor at the University of Minnesota. They can also outperform collaborative filtering engines, which use another technology to classify individuals. Matching engines still under development at Burning Glass Technologies and iXmatch are likely to bring a new level of search to sites in the near future, Karypis said.
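As a rough illustration of matching on several variables at once, the sketch below scores a job seeker against each posting on skills, location and experience, then ranks the postings. The fields, the weights and the sample data are all invented; the engines named above are not public, and this is not a description of how they work.

```python
def match_score(candidate, job):
    """Combine three factors into a score between 0 and 1.

    Assumes every job lists at least one skill and a positive min_years.
    The 0.5 / 0.2 / 0.3 weights are arbitrary choices for this sketch.
    """
    skill_fit = len(candidate["skills"] & job["skills"]) / len(job["skills"])
    location_fit = 1.0 if candidate["city"] == job["city"] else 0.0
    experience_fit = min(candidate["years"] / job["min_years"], 1.0)
    return 0.5 * skill_fit + 0.2 * location_fit + 0.3 * experience_fit

def best_matches(candidate, jobs):
    """Return job names ordered from best to worst match."""
    return sorted(jobs, key=lambda name: match_score(candidate, jobs[name]),
                  reverse=True)

# Invented candidate and postings.
candidate = {"skills": {"python", "sql"}, "city": "Boston", "years": 4}
jobs = {
    "developer": {"skills": {"python", "sql"}, "city": "Austin", "min_years": 8},
    "chef": {"skills": {"cooking"}, "city": "Boston", "min_years": 1},
    "analyst": {"skills": {"sql", "excel"}, "city": "Boston", "min_years": 2},
}
print(best_matches(candidate, jobs))
```

A keyword search on "python" alone would surface only the developer posting; scoring all three factors puts the local analyst job first.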

In addition, search engines based on eXtensible Markup Language are appearing. These make use of the flexible tags on XML documents and can better describe the contents of those documents in search results, according to representatives of two such engines: XYZFind, an XML search engine backed by BEA Systems, and Xdex from Sequoia Software.

However, the biggest improvement won't come until Web operators critically review their own sites to see if the search functions are working, something they often fail to do, according to Forrester's Hagen. Effective search on a site enables customers to serve themselves, which "costs 30 times less than phone calls and 10 times less than e-mails," Hagen said.

Yet at 68 percent of the sites tested by Forrester, fewer than half of the results had anything to do with the query, and nearly two-thirds of the sites failed to list the best results on the first page.

Search has always been part of the Webvan site, but ease of search and obtaining quick, relevant results "is critical," Brougher said.

"The foundation of our business is saving the customer time. Think of the customer walking up and down the aisles of a supermarket looking for something," vs. finding it in seconds at the Webvan site, he said. If the search is lengthy, Webvan's justification for being in business goes away, he said.

Gaga over Google

By Connie Guglielmo and Charles Babcock

As the director of technology at search engine company Google, Craig Silverstein doesn't understand why Web sites don't put more into their search functions.

"It's frustrating to me to get a bad search, because we know it's possible to do a good search," he said. "I want to shake them and say, 'C'mon, why don't you get a good search engine?' "

One of the reasons companies don't invest more in search is that they are overconfident in the efficacy of the design of their sites, according to Silverstein.

"They think their site is so well-defined that people don't need to search because the navigation is so clear. Maybe they're so close to the site that they don't see the problems," he said. Google is notable for the relevance of its results, which are based on an analysis of a Web site's value by its Googlebot Web crawler. The Googlebot scores each page it finds by examining what sites link to the page.

"If high-quality sites point to you, then you're probably a high-quality site," Silverstein said.

By counting the links and evaluating their quality via the PageRank component of the Google engine, Google comes up with a rating for a given page and awards it a position higher on the list of results than pages that don't rank as well.
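The link-scoring idea can be sketched with a simplified version of the commonly published PageRank formulation. This is an illustration only, not Google's production system: the damping value of 0.85, the fixed number of iterations, the tiny link graph and the handling of pages with no outgoing links are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by the scores of the pages linking to them.

    `links` maps each page to the list of pages it links to; every
    link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its score over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Invented four-page Web: three pages point to "c", which points back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page "c" ends up on top because three pages link to it, and "a" outranks "b" and "d" because its single incoming link comes from the highly rated "c".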

This ranking process gives Google an identifiable profile among the many search engines now available on the Web, according to Brian Cooper, author of the guidebook Searching the Internet. Google.com is not attempting to be a portal with many purposes and specialized links. Rather, it is a simple, straightforward search page offering a broad search or an "I Feel Lucky" button that takes the user to Google's best guess of where he or she most wishes to go.

The page-ranking system was developed by Larry Page and Sergey Brin at Stanford University, and they came up with the Lucky button based on their confidence in the search engine's ability to deliver results.

"Many times, it is uncannily accurate," Cooper said.

Silverstein said Google, like other search engines, is based on searching text on the Web, but the company is working on developing the ability to search audio and video files, as well as 3-D images.

Google was founded in 1998 and remains privately held. The Mountain View, Calif., company received $25 million in funding from Sequoia Capital and Kleiner Perkins Caufield & Byers.

Following the northern star

Founded in 1995 in the basement of an old Cambridge, Mass., mill building, Northern Light Technology offers a search engine that has been adopted by many sites for the comprehensiveness of its searches and its ability to categorize the results.

In addition to the typical stack of results, Northern Light brings results to a user in category folders.

"These Custom Search Folders are unique to your search, rather than representing the search provider's own Web directory categories," said Brian Cooper in his guide, Searching the Internet.

By selecting the folder with the title that most closely reflects the subject in which you are interested, you are most likely to find the results you want, stripping away much of the chaff of search. Among the folders brought back by Northern Light after a search for "The business of America is business," is one on Calvin Coolidge, the originator of the quote. Other folders deal with economics, the Great Depression, John F. Kennedy and the New Testament. The latter includes articles on the relationship between business and the ministry and business and ethics.

The Northern Light index includes a special collections database of 6,000 licensed publications and 25 million pages of their content, on top of 310 million Web pages, a total that prompts it to advertise itself as "the largest search engine throughout 1999." The measure is tricky, given the proportions of the Web.

By comparison, AltaVista says it indexes 350 million pages and Google advertises that it searches more than 1.2 billion - without stating on its home page exactly how many it indexes. AltaVista spokespeople say its crawlers search more than 1 billion pages to come up with the ones it wishes to index.

Northern Light is building up its index consistently with quality Web crawling, said Joyce Ward, vice president of enterprise marketing. When it comes to searching an individual site, it not only matters what search engine has been implemented, but how thoroughly it sends crawlers to comb the site and related material on the Web, collecting the latest documents and information and adding it to the index, Ward said.

"We load new licensed content every 90 seconds," she said, and a total of 10 million documents a week, week after week.

Because Northern Light indexes every word in a document, rather than just the title or first few major references, it is possible to type in a paragraph of a student paper that looks uncomfortably familiar and see if it has been plagiarized, she said.

Copyright © 2009 CBS Interactive, a CBS Company. All Rights Reserved.
ZDNET is a registered service mark of CBS Interactive. ZDNET Logo is a service mark of CBS Interactive.