SI - Site Forums : Silicon Investor - Welcome New SI Members!


To: David Lawrence who wrote (17976)   5/30/2003 1:30:53 AM
From: Jon Tara
 
David, why "in this case"?

Do SI users use bigger words than others?

The index is generally not a LOT bigger than the underlying text, but it is usually somewhat larger. Maybe, say, a factor of 1.2.

Given the price of disk space, that's not an outrageous cost. Indeed, cheap disk space must be a contributing factor to the ubiquity of site-wide search on web sites of all sizes.



To: David Lawrence who wrote (17976)   5/30/2003 1:11:23 PM
From: SI Bob
 
Yes, the search database is, in my experience, about the same size as the message table. Sometimes larger. SQL Server lets you exclude "noise" words like "the", "and", "or", "for", etc., which helps some. (I think later versions of Windoze all come with a default "noise.eng", because it's really making use of a Windoze function, not a SQL Server one.)
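
To make the noise-word point concrete, here is a rough sketch in Python of a toy inverted index built with and without a noise-word list. This is not how SQL Server's full-text engine is implemented; the function names, the sample messages, and the short NOISE_WORDS set are all made up for illustration (the real "noise.eng" list is much longer).

```python
import re
from collections import defaultdict

# Hypothetical noise-word list, for illustration only.
NOISE_WORDS = {"the", "and", "or", "for", "a", "of", "to"}

def build_index(messages, noise_words=frozenset()):
    """Map each word to the set of message ids that contain it."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            if word not in noise_words:
                index[word].add(msg_id)
    return index

def posting_count(index):
    """Total number of (word, message id) pairs stored in the index."""
    return sum(len(ids) for ids in index.values())

# Made-up sample messages.
messages = {
    1: "The index is larger than the text",
    2: "Search the forum for the word index",
    3: "Disk space is cheap and the index is not huge",
}

full = build_index(messages)
trimmed = build_index(messages, NOISE_WORDS)

# The trimmed index stores fewer postings than the full one.
print(posting_count(full), posting_count(trimmed))
```

The words that get dropped are exactly the ones that appear in nearly every message, so they cost the most index space and are the least useful to search on.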

SQL Server has four functions built in to make use of the full-text indexing, CONTAINS() being the one I use.

It has some very big downsides, though. For example, add a field to a 19-million-row table and populate it, and the equipment will be busy for days (literally) completely rebuilding the search index from scratch. And that's not the only thing that can trigger a complete rebuild. I've also noticed that it's not consistent about how it handles word delimiters other than spaces, such as punctuation. I've found instances where a message didn't show up in a search because of the character immediately following the word I was looking for.
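
The delimiter problem is easy to reproduce in miniature. The Python sketch below is only an illustration of the general failure mode, not a description of SQL Server's actual word breaker; both tokenizer functions and the sample message are invented for the example.

```python
import re

def tokens_whitespace(text):
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.lower().split()

def tokens_word_chars(text):
    """Treat any run of non-alphanumeric characters as a delimiter."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Made-up message where the search word is followed by a slash.
message = "Anyone tried full-text search/indexing? It works (mostly)."

# With whitespace-only splitting, "search" is buried in "search/indexing?".
print("search" in tokens_whitespace(message))

# With punctuation treated as a delimiter, "search" is its own token.
print("search" in tokens_word_chars(message))
```

An indexer that treats some punctuation as a delimiter and some as part of the word would miss messages in just this way, depending on which character follows the word.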

But overall it's really not too bad. I'm just not sure how well it'll perform with 19 million messages. If it can't do it or do it well, I already have some approaches in mind to try before resorting to a home-grown method.