SI - Site Forums : Silicon Investor - Welcome New SI Members!


To: David Lawrence who wrote (17991) 5/30/2003 12:11:23 PM
From: Jon Tara  Respond to of 33005
 
David, correct. Actually, it's not quite as simple as indexing words. Some schemes index word roots. Some allow partial-word searches and use tries (that's a real technical term, not a misspelling) to index down to the letter.
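
For anyone who hasn't run into a trie before, here is a minimal sketch in Python of how one supports partial-word search. The class and method names (TrieNode, Trie, words_with_prefix) are made up for illustration; this is not taken from any particular search product.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        # Walk down one letter at a time, creating nodes as needed.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def words_with_prefix(self, prefix):
        # Follow the prefix letter by letter; give up if a letter is missing.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every stored word in the subtree below the prefix.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for w in ["index", "indexing", "indexes", "pointer", "point"]:
    trie.insert(w)

print(trie.words_with_prefix("index"))
```

Words that share a prefix share the nodes for that prefix, which is why a lookup on part of a word only has to walk as many nodes as the prefix has letters before it starts collecting matches.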

But, in general, the number of index entries will grow roughly linearly with the number of unique words.

Why is the index database actually larger than the documents it is indexing, then?

Because for each word indexed in each document, there must be a pointer to a document, and in some schemes, an offset into a document.

That is, each word appears in the index once, with a list of pointers to where it was found.
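
A rough sketch of that structure in Python, assuming whitespace-separated words and using the word position as the "offset" (a real engine might store byte offsets or something else entirely). The function name build_index and the two sample documents are invented for the example.

```python
from collections import defaultdict


def build_index(documents):
    # word -> list of (document id, word position) pointers
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].append((doc_id, offset))
    return index


documents = {
    1: "the index is larger than the documents",
    2: "the documents contain the words",
}

index = build_index(documents)
unique_words = len(index)                                # one entry per word
total_pointers = sum(len(p) for p in index.values())     # one per occurrence

print(unique_words, total_pointers)
print(index["the"])
```

The number of entries tracks the unique words, but the number of pointers tracks every occurrence of every word, and that is the part that makes the index big.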

Various crafty ways have been devised to minimize the size of these pointers. The best create an index a bit larger than the documents it is indexing.
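
One commonly described trick, shown here as a sketch and not necessarily what any given product does: keep each word's document ids sorted, store only the gap from the previous id, and write each gap in as few bytes as it needs (7 bits per byte, with the high bit meaning "more bytes follow"). The function names and the sample ids are made up for illustration.

```python
def encode_varint(n):
    # Emit 7 bits per byte, low bits first; high bit set means "more to come".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def compress_postings(doc_ids):
    # doc_ids must be sorted ascending so every gap is non-negative.
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decompress_postings(data):
    doc_ids = []
    current = 0   # running total of the gaps decoded so far
    value = 0     # gap currently being assembled
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            doc_ids.append(current)
            value = 0
            shift = 0
    return doc_ids


postings = [18000000, 18000003, 18000010, 18000500]
packed = compress_postings(postings)

print(len(packed), "bytes instead of", 4 * len(postings))
```

Only the first id costs four bytes here; the small gaps after it take one or two bytes each, so the savings are biggest for common words whose documents sit close together.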



To: David Lawrence who wrote (17991) 5/30/2003 1:44:40 PM
From: SI Bob  Respond to of 33005
 
A rolling subset is the first other approach we're going to try if the system just can't handle all messages.

Forgot another thing I have a major problem with in SQL Server's full-text indexing:

You can't apply CONTAINS() to a query-created recordset. It only works against the table itself.

So although it's possible to write a query that looks for a word only in messages 18 million through 19 million, or any other range of a million, the way it actually works at the server is that all rows containing that word are returned, THEN the result is narrowed down to the range you specified.

I spent a long time trying to search in batches of 100k (until 50 results are found) before I finally figured out that each read was still going at the whole table, and found out that's just the way it works.
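
To put numbers on that, here is a toy model in Python of the behavior described above. It is only a simulation under assumptions: fulltext_lookup stands in for the full-text engine and always returns every matching message id, and the range filter is applied afterward. None of these names are SQL Server APIs, and the message ids and batch sizes are invented.

```python
def fulltext_lookup(index, word):
    # Stand-in for the full-text engine: returns ALL matching message ids,
    # with no way to restrict it to a range up front.
    return index.get(word, [])


def search_in_batches(index, word, first_id, last_id, batch_size, wanted):
    found = []
    rows_examined = 0
    start = first_id
    while start <= last_id and len(found) < wanted:
        end = min(start + batch_size - 1, last_id)
        matches = fulltext_lookup(index, word)     # whole table, every batch
        rows_examined += len(matches)
        # The range is only applied after the full match list comes back.
        found.extend(m for m in matches if start <= m <= end)
        start = end + 1
    return found[:wanted], rows_examined


# Hypothetical data: the word appears in messages 0, 10, 20, ..., 990.
index = {"intel": list(range(0, 1000, 10))}

hits, examined = search_in_batches(
    index, "intel", first_id=500, last_id=999, batch_size=100, wanted=30
)

print(len(hits), "hits,", examined, "rows examined")
```

In this model each batch pays for the full match list again, so batching makes things slower instead of faster, which matches what the timing suggested.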

It's what I consider a serious flaw in SQL Server 2000, as is the way you have to dink with it to get anything but a forward-only recordset, especially if you've got more than one SELECT in your query.