Technology Stocks : The New QLogic (ANCR)

To: Alan Bershtein who wrote (23482), 7/24/1999 11:07:00 AM
From: George Dawson
 
Good article on searching the web for information:

Lawrence S, Giles CL. Accessibility of information on the web. Nature 1999; 400: 107-109.

A few highlights:

1. There are about 4.3 billion (2^32) possible IPv4 addresses, and IPv6 will increase this enormously.

2. The authors sampled 3.6 million random IP addresses and extrapolated an estimate of 16 million web servers, of which about 2.8 million are on the publicly indexable web.

3. They estimated an average of 289 pages per web server, which puts the publicly indexable web at roughly 800 million pages.

4. Using a mean page size of 18.7 KB, this amounts to a total of 15 TB of pages, or 6 TB of text once the HTML tags are removed. Public images account for about another 3 TB.

5. They compared 11 major search engines on how well they cover the web. Relative to the estimated web size, individual engines covered only 2.2% to 16% of it, and the mean age of new matching documents varied from 141 to 240 days. (A quick sketch reproducing this arithmetic follows the list.)
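
For anyone who wants to check the arithmetic, here is a minimal back-of-envelope sketch in Python. The input figures are the paper's published estimates; the variable names, rounding, and decimal KB/TB units are my own.

# Back-of-envelope check of the Lawrence & Giles (1999) estimates.
# All input figures come from the paper; derived values are rounded.

print(f"IPv4 address space: {2**32:,}")      # ~4.3 billion addresses

servers = 2.8e6                  # publicly indexable web servers
pages_per_server = 289           # average pages per server
pages = servers * pages_per_server
print(f"pages: {pages / 1e6:.0f} million")   # ~809 million

page_kb = 18.7                   # mean page size, HTML tags included
total_tb = pages * page_kb / 1e9             # KB -> TB (decimal)
print(f"raw HTML: {total_tb:.0f} TB")        # ~15 TB

for coverage in (0.022, 0.16):               # worst and best engine
    print(f"{coverage:.1%} coverage = {pages * coverage / 1e6:.0f} million pages")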

I thought this article was interesting for a couple of reasons. First, if you are using any of the standard search engines to find pages on FC or Ancor, these statistics show it is a very inefficient way to do it. It also explains why many of us have stumbled onto pages through unorthodox routes that the usual search engines never found. Second, there is a lot of proprietary data out there that dwarfs the publicly accessible web. For example, IBM has a TeraByte Club that currently has 120 members; each member presumably manages at least a terabyte of data, so that one club alone represents well over 120 TB, far more than the authors' combined estimate of 9 TB of public text and images on the web. For additional perspective, the Library of Congress holds about 5 TB of information. Getting library information onto disk will greatly increase the size of the web.
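
To put those storage figures side by side, a rough sketch; the only assumption beyond the numbers quoted above is that each TeraByte Club member holds at least 1 TB, as the club's name implies.

# Rough comparison of the data volumes mentioned above, in terabytes.
# Assumption: each of IBM's 120 TeraByte Club members holds >= 1 TB.
volumes_tb = {
    "Library of Congress (approx.)": 5,
    "Publicly indexable web, text + images": 9,
    "IBM TeraByte Club, lower bound (120 x 1 TB)": 120,
}
for name, tb in volumes_tb.items():
    print(f"{name:45s} {tb:4d} TB")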

George D.