[FreePint] The Invisible Web
By Chris Sherman
freepint.co.uk
There's a big problem with most search engines, and it's one many people aren't even aware of. The problem is that vast expanses of the Web are completely invisible to general-purpose search engines like AltaVista, HotBot and Google. Even worse, this "Invisible Web" is in all likelihood growing significantly faster than the visible Web you're familiar with.
So what is this Invisible Web and why aren't search engines indexing it? To answer this question, it's important to first define the "visible" Web, and describe how search engines compile their indexes.
The Web was created a little over ten years ago by Tim Berners-Lee, a researcher at the CERN high-energy physics laboratory in Switzerland. Berners-Lee designed the Web to be platform-independent, so that researchers at CERN could share materials residing on any type of computer system, avoiding cumbersome and potentially costly conversion issues. To enable this cross-platform capability, Berners-Lee created HTML, or HyperText Markup Language - essentially a dramatically simplified version of SGML (Standard Generalized Markup Language).
HTML documents are simple: they consist of a "head" portion, with a title and perhaps some additional metadata describing the document, and a "body" portion, the actual document itself. The simplicity of this format makes it easy for search engines to retrieve HTML documents, index every word on every page, and store them in huge databases that can be searched on demand.
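As a rough illustration of "index every word on every page," here is a minimal sketch in Python using only the standard library. It is not how any particular search engine works. The two sample pages, their example.com addresses, and the names PageText and build_index are all invented for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class PageText(HTMLParser):
    """Collects the title and every word of one HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.words = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        # Title words are indexed along with the body words.
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, html in pages.items():
        parser = PageText()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index[word].add(url)
    return index


# Hypothetical two-page "Web", used only for illustration.
PAGES = {
    "http://example.com/a": "<html><head><title>Spiders</title></head>"
                            "<body>Spiders crawl the Web</body></html>",
    "http://example.com/b": "<html><head><title>Libraries</title></head>"
                            "<body>Librarians search the Web</body></html>",
}

INDEX = build_index(PAGES)
```

Looking up INDEX["web"] returns both sample addresses, while INDEX["crawl"] returns only the first. A real engine would also store word positions and ranking data, which this sketch leaves out.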
What's less easy is the task of actually finding all the pages on the Web. Search engines use automated programs called spiders or robots to "crawl" the Web and retrieve pages. Spiders function much like a hyper-caffeinated Web browser - they rely on links to take them from page to page.
Crawling is a resource-intensive operation. It also puts a certain amount of demand on the host computers being crawled. For these reasons, search engines will often limit the number of pages they retrieve and index from any given Web site. It's tempting to think that these unretrieved pages are part of the Invisible Web, but they aren't. They are visible and indexable, but the search engines have made a conscious decision not to index them.
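The link-following behavior and the per-site page limit described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any engine's actual spider: the SITE dictionary stands in for a Web server, and the names LinkFinder, crawl and max_pages are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(fetch, seed, max_pages=100):
    """Breadth-first crawl from seed, retrieving at most max_pages pages."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # nothing could be retrieved at this address
            continue
        visited.append(url)
        finder = LinkFinder()
        finder.feed(html)
        for link in finder.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory site; no network access is involved.
SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/">Home</a> <a href="/c">C</a>',
    "/b": "No links here.",
    "/c": "End of the trail.",
    "/orphan": "No page links to this one.",
}

FULL = crawl(SITE.get, "/")
LIMITED = crawl(SITE.get, "/", max_pages=2)
```

FULL contains "/", "/a", "/b" and "/c" but never "/orphan", because the spider only learns about a page when some other page links to it. LIMITED stops after two pages, which mirrors an engine choosing to retrieve only part of a site.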
In recent months, much has been made of these overlooked pages. Many of the major engines are making serious efforts to include them and make their indexes more comprehensive. Unfortunately, the engines have also discovered through their "deep crawls" that there's a tremendous amount of duplication and spam on the Web. Current estimates put the Web at about 1.2 to 1.5 billion indexable pages. Both Inktomi and AltaVista have claimed that they've spidered most of these documents, but have been forced to cull their indexes to cope with duplicates and spam. Inktomi puts the size of the distilled Web at about 500 million pages; AltaVista at about 350 million.
But these numbers don't include Web pages that can't be indexed, or information that's available via the Web but isn't accessible by the search engines. This is the stuff of the Invisible Web.
Why can't some pages be indexed? The most basic reason is that there are no links pointing to a page that a search engine spider can follow. Or, a page may be made up of data types that search engines don't index - graphics, CGI scripts, Macromedia Flash or PDF files, for example.
But the biggest part of the Invisible Web is made up of information stored in databases. When an indexing spider comes across a database, it's as if it has run smack into the entrance of a massive library with securely bolted doors. Spiders can record the library's address, but can tell you nothing about the books, magazines or other documents it contains.
There are thousands - perhaps millions - of databases containing high-quality information that are accessible via the Web. But in order to search them, you typically must visit the Web site that provides an interface to the database. The advantage to this direct approach is that you can use search tools that were specifically designed to retrieve the best results from the database. The disadvantage is that you need to find the database in the first place, a task the search engines may or may not be able to help you with.
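To make the "bolted doors" concrete, here is a small Python sketch of a database that is reachable only through a search form. Everything in it is hypothetical: the listings table, the sample names and numbers, and the functions open_database, results_page and links_on exist only for illustration.

```python
import sqlite3
from html import escape
from html.parser import HTMLParser

# The only static page the site exposes: a search form with no links.
SEARCH_FORM = (
    "<html><head><title>Directory search</title></head><body>"
    '<form action="/search" method="get">'
    '<input name="q"><input type="submit" value="Search">'
    "</form></body></html>"
)


def open_database():
    """Create a tiny in-memory directory with invented sample records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE listings (name TEXT, phone TEXT)")
    conn.executemany(
        "INSERT INTO listings VALUES (?, ?)",
        [("Ada Example", "555-0100"), ("Bob Sample", "555-0101")],
    )
    return conn


def results_page(conn, query):
    """Assemble a results page on the fly for one visitor's query."""
    rows = conn.execute(
        "SELECT name, phone FROM listings WHERE name LIKE ?",
        (f"%{query}%",),
    ).fetchall()
    items = "".join(
        f"<li>{escape(name)}: {escape(phone)}</li>" for name, phone in rows
    )
    return f"<html><body><ul>{items}</ul></body></html>"


class LinkCounter(HTMLParser):
    """Records every link a spider could follow from a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def links_on(html):
    counter = LinkCounter()
    counter.feed(html)
    return counter.hrefs


DB = open_database()
```

A spider that fetches SEARCH_FORM finds no links to follow (links_on(SEARCH_FORM) is empty) and none of the records appear in the page itself. A person who types "Ada" into the form gets results_page(DB, "Ada"), which exists only for the duration of that request.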
Another problem is that content in some databases isn't designed to be directly searchable. Instead, Web developers are taking advantage of database technology to offer customized content that's often assembled on the fly. Search engine results pages are an example of this type of dynamically generated content - so are services like My Excite and My Yahoo. As Web sites get more complex and users demand more personalization, this trend toward dynamically generated content will accelerate, making it even harder for search engines to create comprehensive Web indexes.
In a nutshell, the Invisible Web is made up of content that search engines either can't or won't index. It's a huge part of the Web, and it's growing. Fortunately, there are several reasonably thorough guides to the Invisible Web.
Gary Price, Reference Librarian at the Gelman Library at George Washington University, is considered one of the foremost authorities on online databases and other invaluable search resources on the Invisible Web. Price has assembled a massive collection of links to Invisible Web resources at his Direct Search page <http://gwis2.circ.gwu.edu/~gprice/direct.htm>.
"A good librarian would not start looking for a phone number (specialized, Invisible Web info) by searching the Encyclopaedia Britannica (general knowledge resource)," says Price. "Both professional and casual searchers should at least be aware that they could be missing some information or wasting time finding what could be found more easily if the right tool for the job is easily accessible. This is very similar to a good reference librarian 'knowing' the major reference tools in his or her collection."
What kinds of databases does Price consider to be essential Invisible Web search tools? He names four as examples:
- The many databases that make up GPO Access. <http://www.access.gpo.gov/su_docs/aces/aaces002.html>
- Any of the telephone directory databases such as Anywho <http://www.anywho.com/>, Switchboard <http://www.switchboard.com/>, and Phone Net U.K. <http://www.bt.com/phonenetuk/>.
And two that are crucial to the business searcher:
- Any of the many flavors of EDGAR, particularly the 10K Wizard. <http://www.tenkwizard.com/>
- The Mercury Center searchable version of the PricewaterhouseCoopers Money Tree Survey of venture capital made available by the San Jose Mercury News. <http://wwdyn.mercurycenter.com/business/moneytree/>
"In addition to text media, the Internet is serving up many other formats. One that interests me a great deal is streaming media. One experimental project that is noteworthy is the Speechbot engine that is being developed and tested by Compaq," says Price. <http://speechbot.research.compaq.com/>
Two other Invisible Web resources Price maintains are his NewsCenter <http://gwis2.circ.gwu.edu/~gprice/newscenter.htm>, which focuses on sources providing up to the minute news stories on any subject imaginable, and his Web Audio Current Awareness Resources page <http://gwis2.circ.gwu.edu/~gprice/audio.htm>, with links to hundreds of live and recorded audio/video news and public affairs programming on the Web.
"By the way, do not mistake an interest in the Invisible Web as a slam on the general search engines because it is NOT," says Price. "General search tools are still 100% essential for accessing material on the Internet."
One of the largest gateways to the Invisible Web is the aptly named Invisibleweb.com <http://www.invisibleweb.com> from Intelliseek. "Invisible Web sources are critical because they provide users with specific, targeted information, not just static text or HTML pages," says Sundar Kadayam, CTO and Co-Founder, Intelliseek. "InvisibleWeb.com is a Yahoo-like directory. It is a high quality, human edited and indexed, collection of highly targeted databases that contain specific answers to specific questions," says Kadayam.
Intelliseek also makes BullsEye, a desktop-based meta search engine that can access many of the sites included in InvisibleWeb.com. More information can be found at <http://www.intelliseek.com/prod/bullseye.htm>.
Other notable Invisible Web resources include:
AlphaSearch <http://www.calvin.edu/library/searreso/internet/as/>
AlphaSearch is an extremely useful directory of "gateway" sites that collect and organize Web sites that focus on a particular subject. Created and maintained by the Hekman Library at Calvin College, it's both searchable and browsable by either subject discipline or descriptor.
The Big Hub <http://www.thebighub.com/>
The Big Hub maintains a directory of over 1,500 subject-specific searchable databases in over 300 categories. Listings for each database feature both annotations and search forms to directly access the database. While these are useful for quick and dirty searches, Big Hub's search forms omit most of the advanced searching features each database offers on its own site.
Infomine Multiple Database Search <http://infomine.ucr.edu/search.phtml>
Infomine might be called an "academic" search engine, focusing on scholarly resource collections, electronic journals and books, online library card catalogs, and directories of researchers. Unlike many Invisible Web search tools, Infomine allows simultaneous searching of multiple databases.
WebData.com <http://www.webdata.com/>
WebData is a database portal, specializing in finding, categorizing and organizing online databases, and providing annotated links with quality rankings.
As fast as the Web has been growing over the past ten years, it's likely that its growth rate is accelerating, perhaps exponentially. Speaking at the NetWorld+Interop conference in May 2000, Inktomi CEO David Peterschmidt said he expected the Web to grow to more than 8 billion documents by the end of the year - more than a fivefold increase from its current size.
The major search engines have done a creditable job of scaling with the visible Web. For the foreseeable future, however, valuable resources that are part of the Invisible Web will be beyond their reach. Fortunately, we have other workmanlike tools that can help us navigate the portion of the Web that the search engines can't see.
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Chris Sherman is the Web Search Guide for About.com, <http://websearch.about.com>. Chris holds an MA from Stanford University in Interactive Educational Technology and has worked in the Internet/Multimedia industry for two decades, currently as President of Searchwise.net, a Web consulting and training firm. He's a frequent contributor to information industry trade publications including Online Magazine and Information Today. His email address is websearch.guide@about.com.
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Related Free Pint links:
* Respond to this article and chat to the author now at the Bar <http://www.freepint.co.uk/bar>
* Read this article online, with activated hyperlinks <http://www.freepint.co.uk/issues/080600.htm#feature>