

To: mr.mark who wrote (15312) 1/25/2001 2:06:55 PM
From: SIer formerly known as Joe B.
 
Mining the 'Deep Web' With Specialized Drills
January 25, 2001
nytimes.com

By LISA GUERNSEY

TWO weeks ago, online
newspapers and magazines
were buzzing with news about
Linda Chavez, President Bush's
first choice for labor secretary.

But from the results coming up in
most popular search engines, you
would never have known it.
Instead of retrieving articles about
an illegal immigrant who had lived
in Ms. Chavez's home, a Google
search on "chavez" led to several
encyclopedia entries on Cesar
Chavez, the American labor leader
and advocate of farmworkers'
rights.

Lycos turned up several Web sites
with information about Eric
Chavez, an Oakland A's third
baseman. On Alta Vista, some of
the first results linked to Ms.
Chavez's old columns for an online
magazine, but none of the links
provided even a hint of the fact
that she had become front-page
news.

"I don't see anything that anyone
would feel is relevant to her given
the context of this past week," said
Danny Sullivan, the editor of
SearchEngineWatch.com, as he
typed "chavez" into other search
engines.

His demonstration illustrated a
problem that has long been
apparent to anyone casting about for
online news reports: search
engines can be pitifully inadequate,
partly because they rely on
Web-page indexes that were
compiled weeks before. It is not
just timely material that seems to
escape their reach. Pages deep
within Web sites are also often
missed, as are multimedia files,
bibliographies, the bits of
information in databases and pages
that come in P.D.F., Adobe's
portable document format.

In fact, traditional search engines
have access to only a fraction of 1
percent of what exists on the Web.
As many as 500 billion pieces of
content are hidden from the view
of those search engines, according
to BrightPlanet.com, a search
company that has tried to tally
them. To many search experts, this
is the "invisible Web." BrightPlanet
prefers the term "deep Web," an
online frontier that it estimates may
be 500 times larger than the
surface Web that search engines
try to cover. And that uncharted
territory does not include Web
pages that are behind firewalls or
part of intranets.

To dig deeper into the Web, a
new breed of search engine has
cropped up that takes a different
approach to Web page retrieval.
Instead of broadly scanning the
Web by indexing pages from any
links they can find, these search
engines are devoted to drilling
further into specialty areas —
medical sites, legal documents,
even Web pages dedicated to
jokes and parody. Looking for
timely financial data? Try
FinancialFind.com. Seeking
sketches of molecular structures or
even scientific humor?
Biolinks.com may help.

"Instead of grabbing everything on
the Web and then trying to deal
with this big mess," Mr. Sullivan
said, these boutique search engines
have decided to do some filtering.
"They may say, we'll pick 40 sites
that we know are related to this
topic," he said. "And that means
you won't get these irrelevant
search results."

Some search engines go even
further, sending out finely tuned
software agents, or bots, that learn not only which pages to search, but
also what information to grab from those pages. Either way, the theory is
the same: The smaller the haystack, the better chance of finding the
needle.

Finding those smaller haystacks can be a challenge in itself. It is the same
problem faced by patrons who walk into a library, said Gary Price, a
librarian at George Washington University and co-author of the
forthcoming book "The Invisible Web" (CyberAge Books). People may
know to come to the library, but they probably do not know which
reference books to pull off the shelf. Of course, in such cases, patrons
can at least consult a reference librarian. On the Web, people are usually
fending for themselves.

"The end user should have a better idea of all the different options that
exist," Mr. Price said. "But this is easier said than done."

Lately, however, a few specialty search engines have been popping up
on lists of most-visited Web sites — evidence that people are learning to
find them. MySimon, a service that specializes in culling product prices
and information across 2,500 shopping sites, is one of the most popular.
In December, the site attracted 5 million unique visitors, a huge increase
from its 1.9 million visitors a year before, according to Jupiter Media
Metrix, an Internet research firm. FindLaw.com, a search engine and
Web-based directory of legal information, has as many as 900,000
visitors a month.

Moreover.com, a site that opened in 1999 with a search engine that
gathers headlines from 1,800 online news sources, has also appeared on
Jupiter Media Metrix's reports of Web use, which track only sites with at
least 200,000 visitors a month. Last month, about 340,000 people
visited Moreover.com's pages — and that is without any consumer
marketing from the company, which offers the search engine free as a
teaser for businesses that might buy its search software.

Like most specialty search engines, Moreover manages to find those
news stories because its bots have been designed to hunt for only specific
pages within a specific realm of the Web. They are like sniffing dogs that
have been given a whiff of a scent and are taught to disregard everything
else. Font tags in the source code underlying the Web page, for example,
are a giveaway. Between 6 and 18 words in large type near the top of a
Web page look a lot like headlines. In most cases they are, and the site's
bots retrieve them, using the headline as the link in the list of search
results.
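
The headline rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Moreover's actual code: the class and function names, the font-size threshold of 4, the choice to examine only the first three large-type runs as a stand-in for "near the top," and the sample page are all hypothetical. Only the 6-to-18-word range comes from the article.

```python
from html.parser import HTMLParser


class LargeFontCollector(HTMLParser):
    """Collects text found inside <font size="N"> tags with N >= min_size."""

    def __init__(self, min_size=4):  # threshold is an assumption
        super().__init__()
        self.min_size = min_size
        self.stack = []   # one flag per open <font>: is it large type?
        self.buffer = []  # text pieces of the large-type run being read
        self.runs = []    # completed large-type runs, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            size = dict(attrs).get("size") or ""
            self.stack.append(size.isdigit() and int(size) >= self.min_size)

    def handle_endtag(self, tag):
        if tag == "font" and self.stack:
            was_large = self.stack.pop()
            # Flush once the outermost large <font> has closed.
            if was_large and not any(self.stack):
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.runs.append(text)
                self.buffer = []

    def handle_data(self, data):
        if any(self.stack):
            self.buffer.append(data)


def find_headline(html, min_words=6, max_words=18, top_n=3):
    """Return the first large-type run near the top that is headline-length."""
    collector = LargeFontCollector()
    collector.feed(html)
    collector.close()
    for text in collector.runs[:top_n]:
        if min_words <= len(text.split()) <= max_words:
            return text
    return None


# Made-up sample page: a short masthead followed by a ten-word headline.
SAMPLE_PAGE = (
    '<html><body><font size="6">Example Daily News</font>'
    '<font size="5">Search Engines Struggle to Keep Up With '
    'Breaking News Online</font>'
    '<p>Body text in normal type.</p></body></html>'
)
```

Calling find_headline(SAMPLE_PAGE) skips the three-word masthead because it is too short and returns the ten-word run. A page whose only large type is a single word would yield None.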

Once in a while, however, those supposed headlines turn out to be
something else, like a copyright disclaimer page. So to filter further,
Moreover's spiderlike bots learn the structure of the Web address, noting
which words and numbers show up between the slashes. If an address
ends with the word "copyright," a bot may decide to disregard that page.
Similar rules are used to categorize the news articles so that people can
narrow their searches before even entering a search term. "Our spiders
are very good readers," said Nick Denton, Moreover's chief executive.
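
A rough sketch of that kind of address rule follows. It assumes the words between the slashes are enough to decide, and it is not Moreover's real rule set: the function names, the skip list, the category table and the example.com addresses are all invented for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical rule tables; a real service would maintain far larger ones.
SKIP_WORDS = {"copyright"}
CATEGORY_WORDS = {
    "business": "Business",
    "sports": "Sports",
    "tech": "Technology",
}


def path_words(url):
    """Return the lowercased words between the slashes, without extensions."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    return [posixpath.splitext(segment)[0].lower() for segment in segments]


def should_skip(url):
    """True if the address ends with a word on the skip list."""
    words = path_words(url)
    return bool(words) and words[-1] in SKIP_WORDS


def categorize(url):
    """Return the category of the first recognized path word, or None."""
    for word in path_words(url):
        if word in CATEGORY_WORDS:
            return CATEGORY_WORDS[word]
    return None
```

With these tables, should_skip("http://news.example.com/legal/copyright.html") is true, while an address such as "http://news.example.com/2001/01/25/business/story.html" is kept and filed under "Business".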

MySimon also employs bots that are designed to hunt for very specific
information. But first the bots must watch the click-through routines of
MySimon employees who have learned the ins and outs of particular
online shops — like exactly which pages typically provide prices, sizes or
shipping fees. Once trained, the bots follow those paths themselves,
prowling shops for information to put into databases and then display
online. For example, one bot is assigned to Amazon.com's bookshelves;
another is assigned to its electronics merchandise.

"What we're doing is teaching our agents to shop on behalf of
consumers," said Josh Goldman, president of MySimon.

Meanwhile, general search engines have also decided to offer smaller
fields for foraging. Northern Light has a news search service that
searches a two-week archive of articles on 56 news wires. It also offers
a "geosearch" service that allows people to look for businesses based
within a few miles of a given address. Google recently opened an "Uncle
Sam" area, where people can search for governmental material.

Services that limit searches to audio or video files — typically found
under the heading "multimedia search" — are now offered on sites like
Alta Vista, Excite and Lycos. And shopping search engines are linked
from almost all of the major search sites.

But again, many Web users do not know that the narrow searching tools
exist. So reference librarians and library Web sites are now directing their
patrons to those areas on the Web. Mr. Sullivan, Mr. Price and Chris
Sherman, a search guide on About.com who is working with Mr. Price
on the "Invisible Web" book, are among the several information-retrieval
experts who have built online directories of specific search sites. Another
tool is the LexiBot, a downloadable program designed by BrightPlanet to
demonstrate the search technology it sells to businesses. The LexiBot,
which costs $89.95 but is free for the first 30 days, gathers information
simultaneously from 600 search sites and databases — including the
databases that form the basis of specialty search engines.

The harder part may be to change people's behavior. All the boutique
search engines in the world will not alter the fact that the majority of Web
surfers are still inclined to type a single keyword into a huge, general
search engine and hope for the best. The thought of narrowing a search
— by either going to a specialty search page or clicking through a menu
of choices on a general search site — does not seem to occur to most
users, Mr. Sullivan said.

He poses this challenge to the major search sites: Wouldn't search
engines be more helpful if they would automatically narrow a search
without requiring their users to make that realization on their own?

"Can you automatically detect what database to search," he asked in
posing his challenge, "based on what people have typed in?" During the
second week of January, for example, perhaps a search engine could
have been directed to steer people to news sites whenever they typed in
words that made headlines, like "chavez."
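
One naive way to picture Mr. Sullivan's suggestion is sketched below. It assumes the search engine already keeps a set of words that are currently making headlines; how that set would be built is left open, and the function name and the "news" and "web" labels are hypothetical.

```python
def choose_database(query, headline_terms, default="web"):
    """Route a query to the news index if any word is currently in the news."""
    words = query.lower().split()
    if any(word in headline_terms for word in words):
        return "news"
    return default


# Hypothetical snapshot of headline words for the second week of January.
HEADLINE_TERMS = {"chavez"}
```

Here choose_database("chavez", HEADLINE_TERMS) returns "news", while a query with no headline words falls through to the general index.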

A few search engines have tried to take that step, with mixed results. For
example, when Mr. Sullivan typed "chavez" into the search box at Ask
Jeeves earlier this month, the site pointed to a recent news story — a link
provided by Ask Jeeves' editors who were assembling information about
potential members of a Bush cabinet. Using the same search a few weeks
later, the news reports were nowhere to be found. (Paul Stroube, the
company's vice president for Web production, said that the news link
disappeared because Ms. Chavez was taken off Ask Jeeves' list of
President Bush's nominees.)

Unless the big search engines get better at delivering timely information,
searchers might be better off with Moreover.com and other
news-oriented search services. With those, Mr. Sullivan has found
success. Two weeks ago, in a Moreover search using the word "chavez,"
more than 30 relevant stories appeared, at least half of which had been
posted that day.