Web growth outpaces search engines

New estimate places size of public Web at 800 million pages

By Alan Boyle
MSNBC

July 7 — The amount of information on the World Wide Web is outpacing the coverage of the search engines indexing that information, researchers report. They say that may be the root of a phenomenon well known to Internet entrepreneurs: Not all Web sites are created equal.
‘The search engine coverage hasn't increased as quickly as the size of the Web has increased.'
— STEVE LAWRENCE, NEC researcher

AS OF FEBRUARY, the publicly indexable World Wide Web contained 800 million pages on 2.8 million computer servers, comprising 6 trillion bytes of textual information and 3 trillion bytes worth of images, NEC researchers Steve Lawrence and C. Lee Giles report in Thursday's issue of the journal Nature.

That's the good news. The bad news is that some of that information is getting harder to find. According to the study, even the best search engine keeps track of only 16 percent of the Web's pages. Collectively, the top 11 Web search tools index just 42 percent of the Web.

A REAL PROBLEM
The practical issue the study raises is that a majority of the Web's pages apparently aren't indexed by any search engine, and it's getting harder for average home-page builders to get and keep high-ranking spots in search results. The Internet promises instant access to virtually any information, but some information is apparently being left behind or falling through the cracks.

“It's definitely a question of who controls the information,” Lawrence said. “There's no evidence [search engines] abuse that power. But there are issues that come about just by how they work.”

For example, simply putting up a page is no guarantee it will ever make it into any engine's database. Many search engines put a limit on how many Web pages from any individual domain will be indexed — so they give up on free Web hosting services like GeoCities and its reported 34 million home pages. Web authors who want their sites to be indexed are better off going with their own domains, or posting pages with services that have fewer members.

That's just one of the tricks of the trade for making sure a Web site doesn't slip through the cracks — tricks not everyone knows. “The forgotten masses are people who don't bother registering with search engines at all,” says Lawrence. “It's a good idea to register. But even if you do, that's no guarantee.”

The study also raised other questions about equal access to information. For example, non-U.S. sites are less likely to be indexed than U.S. sites, and educational sites are less likely to make an engine's database than commercial sites.

RANDOM SAMPLE
The new Web estimates, based on a random sampling of numerical Internet addresses, come with a lot of asterisks attached. The NEC survey takes in only Web pages that would show up in public searches — which excludes password-protected information and pages hidden behind database-searching forms, as well as audio and video on the Web. The estimates of data content also exclude the Web coding itself and the “white space” in HTML files; throw those in, and the estimate of the total data content rises to 15 trillion bytes, the researchers say. To put it into the proper gee-whiz context, that is the equivalent of 15 million books, or a stack of paper 450 miles high — which is higher than the orbit of the Hubble Space Telescope.

All this means that the true volume of the Web certainly exceeds even the NEC estimate. It also far outstrips an estimate published by the same researchers in the journal Science a little more than a year ago, which contended that the indexable Web contained at least 320 million pages — at the time significantly higher than other estimates.
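The estimates rest on the kind of extrapolation described above: probe a random sample of numerical Internet addresses, see how many answer with a public Web server, scale that rate up to the whole address space, and multiply by an average page count per server. The Python sketch below is only a hedged illustration of that arithmetic; the sample size, hit count and pages-per-server figure are hypothetical stand-ins, not numbers reported by the researchers.

```python
# Back-of-the-envelope illustration of estimating Web size from a random
# sample of numerical Internet addresses. This is NOT the study's actual
# procedure or data; every input below is a hypothetical stand-in.

ADDRESS_SPACE = 2 ** 32        # size of the IPv4 numerical address space
SAMPLE_SIZE = 3_600_000        # hypothetical: addresses probed at random
SERVERS_FOUND = 2_350          # hypothetical: probes answered by a public Web server

# Scale the observed hit rate up to the whole address space to estimate
# how many publicly reachable Web servers exist.
est_servers = ADDRESS_SPACE * SERVERS_FOUND / SAMPLE_SIZE

# Crawling a subsample of those servers gives an average page count,
# which converts the server estimate into a page estimate.
AVG_PAGES_PER_SERVER = 289     # hypothetical mean from a crawl of sampled servers

est_pages = est_servers * AVG_PAGES_PER_SERVER

print(f"estimated public Web servers: {est_servers:,.0f}")
print(f"estimated indexable pages:    {est_pages:,.0f}")
```

With stand-in inputs chosen to land near the article's figures, the same arithmetic turns a few thousand responding addresses into roughly 2.8 million servers and about 800 million pages.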
“The size of the Web has increased to 800 million pages, which is not so amazingly surprising,” Lawrence told MSNBC. “More importantly, the search engine coverage hasn't increased as quickly as the size of the Web has increased.”

SEARCH ENGINE COVERAGE
The real point of the new survey, like the previous one, was to gauge how much of the Web is covered by the major search engines — and whether there were factors that affected which pages got indexed. This isn't just an academic question: An estimated 85 percent of all Web users rely on search engines to locate information. Search sites rank among the most highly trafficked destinations on the Internet and serve as a foundation for highly valued Web portal sites.

The earlier study contended that six major search engines collectively covered about 60 percent of the indexable Web, with the biggest database covering just 34 percent. The new study estimates that 11 selected search engines collectively cover 335 million pages. That amount exceeds the researchers' previous estimate of total Web size in December 1997, but comes to just 42 percent of the much bigger Web base estimated for February 1999.

16 PERCENT INDEXED
The search engine with the widest coverage, Northern Light, indexes 128 million pages — just 16 percent of the estimated total.

Again, there are asterisks: The NEC researchers said their results related to the “relative coverage of the engines for real queries, which can be substantially different from the relative number of pages indexed by each engine.” For example, a search engine may have a larger total database than estimated, but it may place limits on the processing time used for a query, which would effectively reduce its coverage.

The researchers also counted out-of-date links returned by the 11 search engines and sought to measure the lag time for indexing updated Web pages. On average, 5 percent of the results from a search turned out to be invalid, and when identical queries were repeated over time, the average age of newly returned Web links was 186 days.

“Although our results apply only to pages matching the queries performed and not to general Web pages, they do provide evidence that indexing of new or modified pages can take several months or longer,” the researchers wrote.

THE ECONOMICS OF SEARCHES
Why does it seem that search engines index such small slices of the total Web, and take so long to update their databases? The researchers speculate that “there may be a point beyond which it is not economical for them to improve their coverage or timeliness.” With the rise of portal sites, search engine companies may find it more profitable to offer a wider array of services, such as free e-mail, chats and auctions, Lawrence said.

Some search engine companies contend that it's better to return a small number of high-quality results than a large number of results that may or may not be relevant. “Our response to that is that you don't need to choose,” Lawrence said. “You can have both.”

The researchers contended that the search engines did not provide equal access to Web pages. Instead, the databases are biased toward popular sites, they said.

SKEWED TOWARD POPULAR SITES
“A very strong trend can be seen across all engines, where sites with few links to them have a low probability of being indexed, and sites with many links to them have a high probability of being indexed,” they noted. Sites with few or no links to them, in other words, could be left out completely.

“If the engines were indexing all of the Web, you wouldn't have this problem,” Lawrence said. Some of the newer search sites, such as Google and DirectHit, consciously skew their results toward popular sites.
“For ranking based on popularity, we can see a trend where popular pages become more popular, while new, unlinked pages have an increasingly difficult time becoming visible in search-engine listings. This may delay or even prevent the widespread visibility of new high-quality information,” the researchers said.
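The feedback loop the researchers describe, in which popularity-ranked listings steer each new link toward pages that are already well linked, can be illustrated with a toy simulation. The Python sketch below is a hedged illustration of that general rich-get-richer dynamic, not a model from the study; the number of pages, the weights and the number of rounds are arbitrary, hypothetical values.

```python
import random

# Toy illustration of the "rich get richer" dynamic quoted above.
# This is NOT a model from the study; pages, weights and round counts
# are arbitrary, hypothetical values.

random.seed(1)

PAGES = 50     # hypothetical pool of pages competing for visibility
ROUNDS = 1000  # hypothetical linking events driven by search listings
BOOST = 0.1    # tiny baseline chance for a page nobody links to yet

links = [0] * PAGES  # inbound-link counts; every page starts unknown

for _ in range(ROUNDS):
    # A popularity-ranked listing surfaces pages roughly in proportion to
    # the links they already have, so each new link tends to go to a page
    # that is already well linked.
    weights = [count + BOOST for count in links]
    winner = random.choices(range(PAGES), weights=weights, k=1)[0]
    links[winner] += 1

ranked = sorted(links, reverse=True)
print("links held by the five most-linked pages:", ranked[:5])
print("pages that never attracted a link:", links.count(0))
```

Run with these arbitrary settings, a handful of early leaders typically ends up holding most of the links while many pages never attract a single one, which is the kind of visibility gap the researchers say popularity-based ranking can create.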
The trend could have a significant effect on the timely exchange of scientific and educational information, which was housed on 6 percent of the Web servers they sampled, the researchers said. In comparison, they reported that 83 percent of the servers sampled contained commercial content, while 1.5 percent contained pornography.

COMPREHENSIVE INDEXES
Lawrence called for the creation of comprehensive, frequently updated indexes for scientific Web sites, health sites and government sites.

The new study, like the previous one, was funded by the NEC Research Institute with no contributions from any companies that operate search engines.

Although the figures from the two surveys can't be directly compared, due to differences in the survey method, Lawrence said the Web's growth appeared likely to outrun the search engines' ability to keep up in the short term. But he said the trend would probably reverse itself eventually.

“The reason for that is that the rate of increase for computational resources is faster than the rate of increase for the production of information by humans,” he said.