SI
SI
discoversearch

We've detected that you're using an ad content blocking browser plug-in or feature. Ads provide a critical source of revenue to the continued operation of Silicon Investor.  We ask that you disable ad blocking while on Silicon Investor in the best interests of our community.  If you are not using an ad blocker but are still receiving this message, make sure your browser's tracking protection is set to the 'standard' level.
SI - Site Forums : Silicon Investor - Legacy Interface Discussion (2004-2011) -- Ignore unavailable to you. Want to Upgrade?


To: Sam Citron who wrote (2664)11/24/2004 9:19:08 PM
From: SI Bob  Read Replies (2) | Respond to of 6035
 
I simply want to understand why with storage and processing power getting cheaper all the time, it so hard to enable thousands of users to do advanced searches without bringing down the system?

The system is quite scalable.

The weak link is SQL Server's method for enabling full-text searching.

Though I ended up making shadow copies of the message table, segregated by the year in which the messages were written, let's use the whole message table for an example.

If you enable full-text indexing, SQL Server will build a catalog (basically a separate table/index combo) that contains every word of all 20 million messages and a list of all the message numbers containing each of the words. The catalog itself is nearly as large as the message table, which is pretty big.

Now suppose I want to find every message that I wrote in which I used the word "index".

I've written 2618 messages. If the query, in pseudo-code looked like "Of all the 2618 messages this person wrote, find the ones using the word 'index' in them", it should run pretty quickly. It takes less than a second for it to return all messages written by me. It likely would take about that long to find which of those messages have "index" in them.

But SQL Server's full-text search doesn't work that way.

I just tried it and it took 108 seconds to simply count the messages containing "index" in the whole message table. Would take substantially longer to return the messages themselves. And in the whole public message table, there are 209,681 messages containin the word "index".

What would make sense would be for the system to narrow down my 2618 messages to a resultset, then find how many of those messages had the word "index".

SQL Server does it the other way around. It first finds all 209,681 messages containing "index" in them, THEN will determine which ones were written by me. The other criteria on the advanced search screen are nearly zero-cost. The part that actually looks up words is very expensive. Because it returns every message containin the word "index", no matter what other criteria are part of the query.

It's for that reason that I've had to make separate tables for each of the year ranges and have SQL Server full-text index those tables and limit searches to those tables individually. Still, that only helps somewhat.

In 2000, 2,386,656 public messages were written, 34,178 of which have the word "index" in them. None of them were written by me. In fact, I only wrote 556 messages that year.

Imagine how much quicker it would be to narrow it down to the 556 messages I wrote THEN determine how many of that tiny resultset contained "index".

There are 3 ways around it:

1. Write a system like SI's old one that doesn't rely on SQL Server's full-text indexing. That's only a partial solution. Searches were fast on the previous version, but the design of it made for a number of different potential failure points (and I think we all know how frequently failure happened), and part of why it was so fast was that it, too, only worked on a subset of the whole message table.

2. Write Google's software from scratch and throw many computers at the task like they do. I don't remember how many computers are working concurrently to do the job there, but remember it's at least in the thousands.

3. Wait for the new version of SQL Server, which I hear is supposed to do away with this major flaw in full-text searching.

If you have these kind of scalability problems even after 16 months of working on the problem, I can see why you want to ration demand for advanced search.

Ummmmm, I assume you know that very little of that 16 months was spent writing full-text search. A lot of the time was spent writing ASP.NET code from scratch for the interface, and migrating the data from Oracle to SQL Server, which was a LOT harder than one might think. All that this system has in common with the old one is that the data was imported from the old one. Very tediously. One problem was that DTS refused to work correctly on tables with CLOBs in them. Another problems was that I was dealing with a few dozen gig of dynamic data, much of which ended up having to be imported one record at a time using ASP.

But even if Advanced Search were a very inexpensive process to run, I would still limit its access because our business model is to give everyone free access, but the cooler features require subscribing. It's as much a business decision as a technical one.

As far as scalability goes, I never cease to be amazed at how powerful this equipment is and how scalable it all is. It's actually caused some bad habits. For example, normally I'd rather store information in session variables or cookies so I don't have to go to the db for it all the time. But this sucker's so fast, that there are a number of very small data reads I go ahead and just get from the database on every page load just because the system doesn't seem to mind.

it should not be too hard to search a couple of million messages for a boolean expression.

20 million messages. And not just bits and integers. But very long text fields. Imagine using Basic's INSTR() function on a huge string. 20 million times. It's amazing full-text works as well as it does, actually, since INSTR() or LIKE would literally take DAYS for a search.

I haven't timed searches yet, but I think they're typically running 5-10 seconds.

If you're familiar with SQL, but your knowledge of it is outdated, you might not be aware of the CONTAINS() function that's used in full-text. It works, but as I've pointed out above, it's inefficient because it's the first part of your WHERE clause that gets executed, no matter what you do. And, no, you can't use the function on a resultset. It can only be used on a table itself.

But imagine how much worse it could be.

When I took over iHub, full-text search was accomplished by using LIKE (ie LIKE '%index%'), which I'm sure you can imagine isn't terribly efficient. It was so slow, they disabled full-text searching once the database had reached something like 50k messages. LOL

Anyway, scalability we've got. Out the wazoo. Full-text search is a very weak spot, though, but I think this implementation of it is going to be very workable.