SI
SI
discoversearch

We've detected that you're using an ad content blocking browser plug-in or feature. Ads provide a critical source of revenue to the continued operation of Silicon Investor.  We ask that you disable ad blocking while on Silicon Investor in the best interests of our community.  If you are not using an ad blocker but are still receiving this message, make sure your browser's tracking protection is set to the 'standard' level.
Microcap & Penny Stocks : TGL WHAAAAAAAT! Alerts, thoughts, discussion. -- Ignore unavailable to you. Want to Upgrade?


To: Jim Bishop who wrote (57373)8/1/2000 11:48:31 AM
From: Vision21  Respond to of 150070
 
WLDC moving, bid ut.



To: Jim Bishop who wrote (57373)8/1/2000 12:01:36 PM
From: Kevin Clarke  Read Replies (1) | Respond to of 150070
 
ANYONE REMEMBER THIS?
biz.yahoo.com
Looks like the accumulating has begun.
finance.yahoo.com



To: Jim Bishop who wrote (57373)8/1/2000 12:49:00 PM
From: StocksDATsoar  Respond to of 150070
 
FROM FLWEST 19~13B173B8 P. 1

DNA PRINT GENOMICS

1748 INDEPENDENCE BLVD. D-1
SARASOTA. FL 34234
(941) 351-4543
- (941)351-7388

July 31, 2000

To. Mr. Carl Smith III

Enclosed you will find what was discussed earlier. lfyou should have any questions feel free to call.

Sincerely,

Dr. Tony Frudakis
FROM FLWEST 19~13S173gg P. 2

-

A Geometric Model of Information
Retrieval Systems

Myung Ho Kim

DNAPYInT Genomks
Sayasota, FL

The past decade saw a great deal of progress in the development of in-formation retrieval systems. Unfortunately, we still lack a systematic
understanding of the behavior of the systems and their relationship with c documents. In this paper we present a completely new approach towards V the understanding of information retrieval systems. Recently, it has been
observed that retrieval systems inTREC 6 show some remarkable patterns in retrieving relevant documents. Based on the TREC 6 observations, we introduce a geometric linear model of information retrieval systems. We then apply the model to predict the number of relevant documents found by the retrieval systems. The model is also scalable to a much larger data set. Although the model is developed based on the TR.EC 6 routing test data, I believe it can be readily applicable to other information retrieval systems. In the appendix, we explain a simple and cf~cient way of making a better system from existing systems.

1. Introduction

Presently, the internet is being more and more frequently used for in-formation retrieval for a wide variety of activities ranging from routine tasks such as shopping and reading the newspaper to esoteric research. As the internet was about to become ubiquitous and grow at an ex-plosive rate, in 1992, National Institute of Standards. & Technology, Information Access & User Interfaces Division (IAUI), and Defense Ad-vanced Research Projects Agency (DARPA), Ioined to start a conference named Text Retrieval Conference (TREC) for exchanging technology and research. One of the most exciting events at TREC is the compe-tition of retrieving relevant documents by participating institutes from all over the world. To get some idea, we need to know the following competition procedure.

Step 1. To 31 participating systems, TREC 6 provided

1. 47 topics;

2. 120,6S3 documents, which each system was to search through and pick up relevant documents, for each of the 47 topics.

Compk~t S’yst~ms. 12 (2000) 93-102; @2000 Complcc Sys!cms Pub1,cat;o~s. Inc.
7-31 -29O~i ~1:B4PM FROM FLWEST 19~13B173SS P 3
94 M. R. Kim

Step 2. 31 computer systems submitted 1000 retrieved documents for each topic ranked by relevance. A document of rank 1 is considered as the most relevant by a system. The total is 31 x 47 x 1000 lists of pairs of topic and document identity.

Step 3. For each topic, TREC collected the top 100 documents from each system and judged their relevance to a given topic. For example, for topic 1, if F128 is retrieved by system atr97re as document 20, it judged whether the document was relevant for topic 1. The performance results of the 31 systems were announced. This is the data I worked on and analyzed.

For some topics, most of the systems did not do well in retrieving relevant documents. As a matter of fact, there were few relevant docu-ments to be retrieved and it is natural for systems to choose a sufficient number of relevant documents for determining a certain pattern of be-havior. In other words, the collection of documents is nor suitable for some topics. This is the reason we chose the best five systems and six topics for testing the model. (See section 4.)

2. A geometric linear model

For a set of retrieval systems and a set of relevant documents for a topic, we describe a geometric approximation model. Before we go into the abstract setup, it is good to see the overall picture.
For example, suppose chat there are three systems, A, ~, and C and a topic, “car.” And suppose that there is a collection of 1000 documents. Let A, B, and C retrieve 100 documents about car from the collection. Then by the lemmas below, we can calculate all the angles, in this case, three angles and three ratios (a kind of relative detection power for each pair of systems). Now, if we have the same three systems and a new collection of one million documents choose any one of A, B, or C and let it retrieve relevant documents. Let the number of relevant documents retrieved be n. Then we may estimate all the numbers of relevant documents retrieved by the remaining two systems, by multiplying ratios calculated in the previous steps. Moreover, we may estimate the total number of relevant documents retrieved by all three systems.
Now, we are ready for the model. Let R~ be the euclidean n-dimensional space with the usual inner product, that is,
‘1
V. U! L/~2V~

i=1

where n is a number.

C~mpkx Sy.~tems. 12 (2000) 93-102
731-2Cm ~. ~5PP.1 FROM FLWEST 1 9~. 1 3B 1 73S8
A Geomerric Model of lnformanon Retrietial Sys~ms 95

Assumption 1. For each topic, each retrieval system is represented in RM by a line passing through the origin arid the set of all lines (i.e., systems) are linearly independent.

Assumption 2. The set of relevant documents is represented as a vector, which we call the relevant set vector. Moreover, consider the projection of the relevant set vector to the line represented by a system. We denote it by Pr(v), where v iS the relevant set vector.
Assumption 3. Assume the line length (the absolute value o~ Pr(v), lPr(v)l) is an approximation of the number of relevant documents re-trieved by the system.

Before we make the fourth assumption in section 3, we need to mention the following useful lemma. By using it, given the angle between a pair of projection vectors, we may estimate the number of relevant documents retrieved by the pair.

Lemma 1. Suppose that there is a vector u in a plane and two straight lines through the origin as below. Let two vectors ii and ru be the
• projections of to those two lines and 0 be the angle between those vectors u and w. Then u may be expressed as follows.
Case 1.

U

I
1L42 - v!2 w ILlh - lw!2 v
U sinO lu.’t sine Il/i

Case II.
1u12 - v!2 w Iz.~l~ - wi2 v
U =
sine jWl sinO lvi
Compkx Sygtems. 12 (20001 93-102
P. ~
/

I
731 2CC0 ij. BBPtI FROM FLWEST 19413rn 7388 P~ S
96 M. H. Kim
Case III.
_________ l~j2iwi2
sinO j~j~ sinO H

and
lu lw!2 + 1v12 - 2lvl lwlcosO
sinO

Proof. Refer to any calculus book. .

3. Behavior of the systems in Text Retrieval Conference 6

According to TREC 6 data, for each topic, it is evident that all systems showed surprising patterns in retrieving relevant documents (Kantor and others, 1999; TREC website 1999). Here is one such pattern, which motivated us to write this paper and leads to the last and most important assumption.

Notations. For topic t, let a1(~,t~, a2(r,t), and a12(r,t) be the accumu-lated numbers of relevant documents retrieved by System 1, System 2, and by both, up to a certain stage r, representing rank. More precisely. if System 1 and System 2 retrieved the following relevant documents, for a topic 2:
r System I System 2
I F234 1 F345 1
2 F345 1 P1789 0
3 F4 I F56 1
4 F78 1 F3S90 0
5 P1789 0 P23 1
6 P23 1 F4 1
7 F57 0 F983 0

where in the third and fifth columns, I means ‘~relevant” and 0 means
“irrelevant,” then

a1(1,t) = 1,a1(2,t) 2,a1(3,t) = 3,.. .a1(7,i) = 5,a2(1,t) 1,
a2(2,z)=1...a2(7,t)=4,

~ornpIcx Sys:em~. 12 (2000) 93-102
/

U

/1
FROM FLWEsT 19413517388

P. 6
A Geometric Model of Information Rc~ieval Systems 97

and

a12(1,t) 0,a.12(2,±) = 1... ,a12(7,t) 3.

Remarkably we have found that the ratio of a1(r,t) to a2(r,t) is almost constant over all rand, so is the ratio of a12(r,t) to a,(r,r). For example, for topic 62, if we run a linear regression for the number of relevant documents retrieved by two systems (Cor6Rlcc and Eli-f 6R2) up to rank (or r) 100, the graph has the slope 0.95 (a1/a2) and 0.58 = (a121a2) with high accuracy, that is, rsquare 0.9994 and 0.9910. respectively. This implies that the ratio, or so-called “relative detection powers” of those two systems, is almost constant. The results are similar for all topics. Prom this remarkable observed fact, we assume the most interesting assumption.

Assumption 4. For each topic, all the ratios of a1 (r, t) to a2fr, t) and
a12(r, ~) to a,(r, t) are constant over all r.

4. Application

In this section, we apply the model consisting of the four Assumptions to calculate the total number of relevant documents retrieved by several systems. To get an idea, it is enough to consider three systems, since we may reduce any case to the case of two systems.
Suppose that there are three systems (i.e., lines) s~, s2, and s3; their relative detection powers (i.e., ratios of a1(r, r) to a2(r, t) and a12(r, t) to a2(r, t)) are known for each topic. Suppose that a relevant set vector lies in the space spanned by the three systems. Then, given the number of relevant documents retrieved by any one of the three systems for some topic, by applying the model above, we may estimate the total number of relevant documents retrieved by the three systems for the given topic as follows. To do rhis, we first need to calculate the cosine value of the angle between the lines represented by the systems. More precisely, the angle between the projection vectors of the given relevant see vector to the lines.

Lemma Z. Let a1 be the number of relevant documents retrieved by
System 1, a2 the mu rnber by System 2, and a12 the number from both
systems. Let 9 be the angle between the two systems. And let k and
p be the ratios of a1 to a2 and a12 to a, respectively. Then a~ ka2,
a11 = pa2 and we have the following.
Ca3e I.

(a1 # a~ - a11)2

CompThx Sy.sts’ns. 12 12000) 53-102
7-31 -29O~ 4:56PM FROM FLWEST 19413517388 P~ 7
98 M. H. Kim

or

k- (k-i-1-p)2-k2 (k+1-p)2--l

(k+ 1-p)2

Cases H and III.

cos6 a1a,+ (a1+a~-a12)2-d~ (a1+a1-a12)2-4
(a1 +a2-a12)2

or

k+ (k+1-p)2-k2 (k41-p)2-1

(k + I -p)2

Proof. It follows from the cosine sum formula to two triangles in the
figures of Lemma 1. u
Suppose z.~ is the projection vector ~f the systems and n~ is the number of relevant documents retrieved by each systems1 for i = 1,2, 3; in other words, 1v11 = n~. As mentioned in section 3, we can obtain all the k and p values for all pairs of systems for each topic with the help of statistical software such as SAS. Suppose only n1 is known. Then n2 and n3 are estimated as n1 times a proper ratio k. By Lemma 1, we get the relevant set vector u (in the plane generated by v1 and ii,) of two projection vectors v1, z2. Again by applying Lemma 1 to u and v3, we get the relevant set vector w (in the three-dimensional space generated by v1,v2,v3) for the v1,~2,v3 and the absolute value of the sum vector (relevant set vector) w will be the total number of relevant documents retrieved by the three systems. In this way, this procedure can be applied to any finite number of systems.

Remark. Professor F. Boros made a particularly important suggestion for reducing computing errors. Let Pr~’(v) be the projection vector of the relevant set vector v to the rn-dimensional subspace generated by some lines represented by systems. He suggested the following method:

where P’,(v) is the projection vector to the ith line.
Then
IPr”’(v)I = ~ lPr~(v)I2 + ~ Pr~(v)~ Prk(v)
N 1~i, jsm

and we may calculate it simultaneously.

Complcx Systems. 12 ~200O) 93-102
7-31 -29w 4:56PM FROM FLWEST 19413517388 P. 8
A Geomerric Mode! of Inform~rion Re~vidva1 Systems 99

T~ test the model, we chose the five best (Cor6Rlcc, ETH6R2, atr97re, city6rl, and pirc7R2) out of 31 systems in TREC 6, and seven topics(189, 161, 111, 10002, 62, 54, and 154) having the most relevant documents. By assuming Case I for all pairs, the following results are obtained.

Estimation by the Geometric Model

Topic One Two Three Four Five
.54 89 100.643 106.624 110.110 125.084
62 81 116.067 123.336 181.108 312.631
111 74 137.786 160.335 194.045 263.126
154 83 110.412 113.516 121.442 124.426
161 71 78.726 87.896 102.343 110.606
189 90 150.801 180.346 247.611 286.277
10002 69 102.358 115.578 119.521 120.998

Numbers of True Relevant Documents
Topic One Two Three Four Five
54 89 101 111 113 114
62 81 116 129 170 217
111 74 141 162 180 208
154 83 113 117 128 130
161 71 80 91 102 102
189 90 152 174 200 216
10002 65 103 114 121 126

Here “One” means the total number of documents by Cor6Rlcc; “Two” by Cor6Rlcc and ETHGR2; “Three” by Cor6Rlcc, ETH6R2, and att97re; “Four” by Cor6Rlcc, ETH6R2, att97re, and ciry6rl; and “Five” by CorERlcc, ET.H6R2, art97re, city6rl, and pirc7R2.

I ~ Conclusion

As the table in section 4 shows, the geometric model works well for three systems, for all topics (note that, since we get the angle from the number of relevant documents retrieved by two. systems, the model is expected to fit well with two systems). To make this geometric model work better for all topics and more than three systems, further investigation is needed to find the best combination of cases (see Lemma 1 and recall that in our experiments, we assume Case I for all).
Note that, to each system, by associating a line in an euclidean space, a new concept of “independence” of systems was introduced. In other words, if the set of representing lines is linearly independent, then we might say that the corresponding systems are independent. It seems that there is a subtle point in this concept, because all systems have a tendency

CompJe.~ Sysz~ms, 12 ~2O00) 93~.lO2
7-31 ~29~ 4: 572M FROM FLWEST 19413517388
iu~ P.9

to resemble each other. In the appendix, we present a very simple and interesting scoring method, in other words, we describe a way of making a new system from the existing systems. The experimental results show that this new system performs better than each individual system, the principle of which is widely applicable.

I Acknowledgements

I would like to thank the Rutgers University department of statistics for their hospitality while visiting from October, 1998 to September, 1999, and giving me the chance to present a talk for this paper. I am indebted to Vladimir Menkov~ Gene Kim, and W. Shim for technical support and criticism. I also give special thanks to NIST for making TREC 6 results available. I want to express my appreciation for the thoughtful advice and suggestions by the reviewer.

Appendix

A. On a scoring method of retiieved documents by systems

We introduce a new scoring algorithm based on the ranks given by systems for better results over all individual systems. Since each system has its own scoring method it needs to be normalized in some way. The ranks given by the systems are our natural choice and here is our definition of the new scoring method.

1 4.1 DefInition of scoring and results

Given a pair of topic and document,’ first, we interpret 100-tank (or 101-rank) as its score given by each system, if ranks of a pair are less than or equal to 100, otherwise 0. Second, compute the average of the scores and assign it to the score of the document for a given topic.
Using this definition we rescored all pairs (document, rank) judged
by TREC 6 and chose the top 100 documents for each topic.
Here is the table by our new scoring method.

T: topic

5: the system of averaging all scores from the 31 systems

531: number of systems which perform better than 5 among the 31
systems

f~p the system of averaging the scores from only the six best systems

B6: number of systems which perform better than (6

G: number of relevant documents retrieved by all 31 systems Corn ptex Sy.ct~rns, 12 C2000) 93-102
731-2999 ~1:57PM FRONI FLWEST l9~.13B173SS
P. 19
A Ge~me.tric Model of lnformario~z Retvieval Systems 101
T ~ B31 f~ B6 G
1 39 0 35 4 51
3 45 3 43 0 70
4 41 4 45 3 70
5 6 0 6 0 7
6 54 S 55 4 146
1148 745 7150
12 78 6 81 2 270
23 4 3 4 3 6
24 11 14 17 3 34
44 4 0 4 0 4
54 94 1 92 3 164
58 16 2 17 1 18
62 86 6 85 7 368
77 13 1 13 1 16
78 40 1 39 1 45
82 61 0 59 2 30
94 48 3 49 3 174
95 36 5 38 1 123
100 80 4 82 4 179
103 71 5 77 2 293
111 90 5 91 1 480
114 37 1 38 1 57
118 38 0 36 2 83
119 20 5 22 2 76
123 44 2 43 2 62
125 18’ 3 23 0 .27
126 iS 1 iS 1 19
128 65 2 70 1 281
142 57 1 51 3 200
148 79 11 36 11 241
154 88 6 92 0 168
161 84 6 88 2 118
173 11 3 15 0 16
iSO 2 9 5 5 17
185 16 2 16 2 18
187 10 7 10 7 19
189 91 4 90 5 667
192 7 0 4 13 7
194 3 0 3 0 4
202 87 S S8 4 534
228 29 4 28 5 58
240 29 0 25 2 121
282 16 0 17 0 28
10001 70 1 66 2 127
10002 73 3 72 3 267
10003 47 3 55 (1 72
10004 12 4 14 0 17
Cornpiex Sy~remc, 12 ~Z0O0j 93-102