FROM FLWEST 19~13B173B8 P. 1
DNA PRINT GENOMICS 1748 INDEPENDENCE BLVD. D-1 SARASOTA. FL 34234 (941) 351-4543 - (941)351-7388
July 31, 2000
To. Mr. Carl Smith III
Enclosed you will find what was discussed earlier. lfyou should have any questions feel free to call.
Sincerely,
Dr. Tony Frudakis FROM FLWEST 19~13S173gg P. 2
-
A Geometric Model of Information Retrieval Systems
Myung Ho Kim
DNAPYInT Genomks Sayasota, FL
The past decade saw a great deal of progress in the development of in-formation retrieval systems. Unfortunately, we still lack a systematic understanding of the behavior of the systems and their relationship with c documents. In this paper we present a completely new approach towards V the understanding of information retrieval systems. Recently, it has been observed that retrieval systems inTREC 6 show some remarkable patterns in retrieving relevant documents. Based on the TREC 6 observations, we introduce a geometric linear model of information retrieval systems. We then apply the model to predict the number of relevant documents found by the retrieval systems. The model is also scalable to a much larger data set. Although the model is developed based on the TR.EC 6 routing test data, I believe it can be readily applicable to other information retrieval systems. In the appendix, we explain a simple and cf~cient way of making a better system from existing systems.
1. Introduction
Presently, the internet is being more and more frequently used for in-formation retrieval for a wide variety of activities ranging from routine tasks such as shopping and reading the newspaper to esoteric research. As the internet was about to become ubiquitous and grow at an ex-plosive rate, in 1992, National Institute of Standards. & Technology, Information Access & User Interfaces Division (IAUI), and Defense Ad-vanced Research Projects Agency (DARPA), Ioined to start a conference named Text Retrieval Conference (TREC) for exchanging technology and research. One of the most exciting events at TREC is the compe-tition of retrieving relevant documents by participating institutes from all over the world. To get some idea, we need to know the following competition procedure.
Step 1. To 31 participating systems, TREC 6 provided
1. 47 topics;
2. 120,6S3 documents, which each system was to search through and pick up relevant documents, for each of the 47 topics.
Compk~t S’yst~ms. 12 (2000) 93-102; @2000 Complcc Sys!cms Pub1,cat;o~s. Inc. 7-31 -29O~i ~1:B4PM FROM FLWEST 19~13B173SS P 3 94 M. R. Kim
Step 2. 31 computer systems submitted 1000 retrieved documents for each topic ranked by relevance. A document of rank 1 is considered as the most relevant by a system. The total is 31 x 47 x 1000 lists of pairs of topic and document identity.
Step 3. For each topic, TREC collected the top 100 documents from each system and judged their relevance to a given topic. For example, for topic 1, if F128 is retrieved by system atr97re as document 20, it judged whether the document was relevant for topic 1. The performance results of the 31 systems were announced. This is the data I worked on and analyzed.
For some topics, most of the systems did not do well in retrieving relevant documents. As a matter of fact, there were few relevant docu-ments to be retrieved and it is natural for systems to choose a sufficient number of relevant documents for determining a certain pattern of be-havior. In other words, the collection of documents is nor suitable for some topics. This is the reason we chose the best five systems and six topics for testing the model. (See section 4.)
2. A geometric linear model
For a set of retrieval systems and a set of relevant documents for a topic, we describe a geometric approximation model. Before we go into the abstract setup, it is good to see the overall picture. For example, suppose chat there are three systems, A, ~, and C and a topic, “car.” And suppose that there is a collection of 1000 documents. Let A, B, and C retrieve 100 documents about car from the collection. Then by the lemmas below, we can calculate all the angles, in this case, three angles and three ratios (a kind of relative detection power for each pair of systems). Now, if we have the same three systems and a new collection of one million documents choose any one of A, B, or C and let it retrieve relevant documents. Let the number of relevant documents retrieved be n. Then we may estimate all the numbers of relevant documents retrieved by the remaining two systems, by multiplying ratios calculated in the previous steps. Moreover, we may estimate the total number of relevant documents retrieved by all three systems. Now, we are ready for the model. Let R~ be the euclidean n-dimensional space with the usual inner product, that is, ‘1 V. U! L/~2V~
i=1
where n is a number.
C~mpkx Sy.~tems. 12 (2000) 93-102 731-2Cm ~. ~5PP.1 FROM FLWEST 1 9~. 1 3B 1 73S8 A Geomerric Model of lnformanon Retrietial Sys~ms 95
Assumption 1. For each topic, each retrieval system is represented in RM by a line passing through the origin arid the set of all lines (i.e., systems) are linearly independent.
Assumption 2. The set of relevant documents is represented as a vector, which we call the relevant set vector. Moreover, consider the projection of the relevant set vector to the line represented by a system. We denote it by Pr(v), where v iS the relevant set vector. Assumption 3. Assume the line length (the absolute value o~ Pr(v), lPr(v)l) is an approximation of the number of relevant documents re-trieved by the system.
Before we make the fourth assumption in section 3, we need to mention the following useful lemma. By using it, given the angle between a pair of projection vectors, we may estimate the number of relevant documents retrieved by the pair.
Lemma 1. Suppose that there is a vector u in a plane and two straight lines through the origin as below. Let two vectors ii and ru be the • projections of to those two lines and 0 be the angle between those vectors u and w. Then u may be expressed as follows. Case 1.
U
I 1L42 - v!2 w ILlh - lw!2 v U sinO lu.’t sine Il/i
Case II. 1u12 - v!2 w Iz.~l~ - wi2 v U = sine jWl sinO lvi Compkx Sygtems. 12 (20001 93-102 P. ~ /
I 731 2CC0 ij. BBPtI FROM FLWEST 19413rn 7388 P~ S 96 M. H. Kim Case III. _________ l~j2iwi2 sinO j~j~ sinO H
and lu lw!2 + 1v12 - 2lvl lwlcosO sinO
Proof. Refer to any calculus book. .
3. Behavior of the systems in Text Retrieval Conference 6
According to TREC 6 data, for each topic, it is evident that all systems showed surprising patterns in retrieving relevant documents (Kantor and others, 1999; TREC website 1999). Here is one such pattern, which motivated us to write this paper and leads to the last and most important assumption.
Notations. For topic t, let a1(~,t~, a2(r,t), and a12(r,t) be the accumu-lated numbers of relevant documents retrieved by System 1, System 2, and by both, up to a certain stage r, representing rank. More precisely. if System 1 and System 2 retrieved the following relevant documents, for a topic 2: r System I System 2 I F234 1 F345 1 2 F345 1 P1789 0 3 F4 I F56 1 4 F78 1 F3S90 0 5 P1789 0 P23 1 6 P23 1 F4 1 7 F57 0 F983 0
where in the third and fifth columns, I means ‘~relevant” and 0 means “irrelevant,” then
a1(1,t) = 1,a1(2,t) 2,a1(3,t) = 3,.. .a1(7,i) = 5,a2(1,t) 1, a2(2,z)=1...a2(7,t)=4,
~ornpIcx Sys:em~. 12 (2000) 93-102 /
U
/1 FROM FLWEsT 19413517388
P. 6 A Geometric Model of Information Rc~ieval Systems 97
and
a12(1,t) 0,a.12(2,±) = 1... ,a12(7,t) 3.
Remarkably we have found that the ratio of a1(r,t) to a2(r,t) is almost constant over all rand, so is the ratio of a12(r,t) to a,(r,r). For example, for topic 62, if we run a linear regression for the number of relevant documents retrieved by two systems (Cor6Rlcc and Eli-f 6R2) up to rank (or r) 100, the graph has the slope 0.95 (a1/a2) and 0.58 = (a121a2) with high accuracy, that is, rsquare 0.9994 and 0.9910. respectively. This implies that the ratio, or so-called “relative detection powers” of those two systems, is almost constant. The results are similar for all topics. Prom this remarkable observed fact, we assume the most interesting assumption.
Assumption 4. For each topic, all the ratios of a1 (r, t) to a2fr, t) and a12(r, ~) to a,(r, t) are constant over all r.
4. Application
In this section, we apply the model consisting of the four Assumptions to calculate the total number of relevant documents retrieved by several systems. To get an idea, it is enough to consider three systems, since we may reduce any case to the case of two systems. Suppose that there are three systems (i.e., lines) s~, s2, and s3; their relative detection powers (i.e., ratios of a1(r, r) to a2(r, t) and a12(r, t) to a2(r, t)) are known for each topic. Suppose that a relevant set vector lies in the space spanned by the three systems. Then, given the number of relevant documents retrieved by any one of the three systems for some topic, by applying the model above, we may estimate the total number of relevant documents retrieved by the three systems for the given topic as follows. To do rhis, we first need to calculate the cosine value of the angle between the lines represented by the systems. More precisely, the angle between the projection vectors of the given relevant see vector to the lines.
Lemma Z. Let a1 be the number of relevant documents retrieved by System 1, a2 the mu rnber by System 2, and a12 the number from both systems. Let 9 be the angle between the two systems. And let k and p be the ratios of a1 to a2 and a12 to a, respectively. Then a~ ka2, a11 = pa2 and we have the following. Ca3e I.
(a1 # a~ - a11)2
CompThx Sy.sts’ns. 12 12000) 53-102 7-31 -29O~ 4:56PM FROM FLWEST 19413517388 P~ 7 98 M. H. Kim
or
k- (k-i-1-p)2-k2 (k+1-p)2--l
(k+ 1-p)2
Cases H and III.
cos6 a1a,+ (a1+a~-a12)2-d~ (a1+a1-a12)2-4 (a1 +a2-a12)2
or
k+ (k+1-p)2-k2 (k41-p)2-1
(k + I -p)2
Proof. It follows from the cosine sum formula to two triangles in the figures of Lemma 1. u Suppose z.~ is the projection vector ~f the systems and n~ is the number of relevant documents retrieved by each systems1 for i = 1,2, 3; in other words, 1v11 = n~. As mentioned in section 3, we can obtain all the k and p values for all pairs of systems for each topic with the help of statistical software such as SAS. Suppose only n1 is known. Then n2 and n3 are estimated as n1 times a proper ratio k. By Lemma 1, we get the relevant set vector u (in the plane generated by v1 and ii,) of two projection vectors v1, z2. Again by applying Lemma 1 to u and v3, we get the relevant set vector w (in the three-dimensional space generated by v1,v2,v3) for the v1,~2,v3 and the absolute value of the sum vector (relevant set vector) w will be the total number of relevant documents retrieved by the three systems. In this way, this procedure can be applied to any finite number of systems.
Remark. Professor F. Boros made a particularly important suggestion for reducing computing errors. Let Pr~’(v) be the projection vector of the relevant set vector v to the rn-dimensional subspace generated by some lines represented by systems. He suggested the following method:
where P’,(v) is the projection vector to the ith line. Then IPr”’(v)I = ~ lPr~(v)I2 + ~ Pr~(v)~ Prk(v) N 1~i, jsm
and we may calculate it simultaneously.
Complcx Systems. 12 ~200O) 93-102 7-31 -29w 4:56PM FROM FLWEST 19413517388 P. 8 A Geomerric Mode! of Inform~rion Re~vidva1 Systems 99
T~ test the model, we chose the five best (Cor6Rlcc, ETH6R2, atr97re, city6rl, and pirc7R2) out of 31 systems in TREC 6, and seven topics(189, 161, 111, 10002, 62, 54, and 154) having the most relevant documents. By assuming Case I for all pairs, the following results are obtained.
Estimation by the Geometric Model
Topic One Two Three Four Five .54 89 100.643 106.624 110.110 125.084 62 81 116.067 123.336 181.108 312.631 111 74 137.786 160.335 194.045 263.126 154 83 110.412 113.516 121.442 124.426 161 71 78.726 87.896 102.343 110.606 189 90 150.801 180.346 247.611 286.277 10002 69 102.358 115.578 119.521 120.998
Numbers of True Relevant Documents Topic One Two Three Four Five 54 89 101 111 113 114 62 81 116 129 170 217 111 74 141 162 180 208 154 83 113 117 128 130 161 71 80 91 102 102 189 90 152 174 200 216 10002 65 103 114 121 126
Here “One” means the total number of documents by Cor6Rlcc; “Two” by Cor6Rlcc and ETHGR2; “Three” by Cor6Rlcc, ETH6R2, and att97re; “Four” by Cor6Rlcc, ETH6R2, att97re, and ciry6rl; and “Five” by CorERlcc, ET.H6R2, art97re, city6rl, and pirc7R2.
I ~ Conclusion
As the table in section 4 shows, the geometric model works well for three systems, for all topics (note that, since we get the angle from the number of relevant documents retrieved by two. systems, the model is expected to fit well with two systems). To make this geometric model work better for all topics and more than three systems, further investigation is needed to find the best combination of cases (see Lemma 1 and recall that in our experiments, we assume Case I for all). Note that, to each system, by associating a line in an euclidean space, a new concept of “independence” of systems was introduced. In other words, if the set of representing lines is linearly independent, then we might say that the corresponding systems are independent. It seems that there is a subtle point in this concept, because all systems have a tendency
CompJe.~ Sysz~ms, 12 ~2O00) 93~.lO2 7-31 ~29~ 4: 572M FROM FLWEST 19413517388 iu~ P.9
to resemble each other. In the appendix, we present a very simple and interesting scoring method, in other words, we describe a way of making a new system from the existing systems. The experimental results show that this new system performs better than each individual system, the principle of which is widely applicable.
I Acknowledgements
I would like to thank the Rutgers University department of statistics for their hospitality while visiting from October, 1998 to September, 1999, and giving me the chance to present a talk for this paper. I am indebted to Vladimir Menkov~ Gene Kim, and W. Shim for technical support and criticism. I also give special thanks to NIST for making TREC 6 results available. I want to express my appreciation for the thoughtful advice and suggestions by the reviewer.
Appendix
A. On a scoring method of retiieved documents by systems
We introduce a new scoring algorithm based on the ranks given by systems for better results over all individual systems. Since each system has its own scoring method it needs to be normalized in some way. The ranks given by the systems are our natural choice and here is our definition of the new scoring method.
1 4.1 DefInition of scoring and results
Given a pair of topic and document,’ first, we interpret 100-tank (or 101-rank) as its score given by each system, if ranks of a pair are less than or equal to 100, otherwise 0. Second, compute the average of the scores and assign it to the score of the document for a given topic. Using this definition we rescored all pairs (document, rank) judged by TREC 6 and chose the top 100 documents for each topic. Here is the table by our new scoring method.
T: topic
5: the system of averaging all scores from the 31 systems
531: number of systems which perform better than 5 among the 31 systems
f~p the system of averaging the scores from only the six best systems
B6: number of systems which perform better than (6
G: number of relevant documents retrieved by all 31 systems Corn ptex Sy.ct~rns, 12 C2000) 93-102 731-2999 ~1:57PM FRONI FLWEST l9~.13B173SS P. 19 A Ge~me.tric Model of lnformario~z Retvieval Systems 101 T ~ B31 f~ B6 G 1 39 0 35 4 51 3 45 3 43 0 70 4 41 4 45 3 70 5 6 0 6 0 7 6 54 S 55 4 146 1148 745 7150 12 78 6 81 2 270 23 4 3 4 3 6 24 11 14 17 3 34 44 4 0 4 0 4 54 94 1 92 3 164 58 16 2 17 1 18 62 86 6 85 7 368 77 13 1 13 1 16 78 40 1 39 1 45 82 61 0 59 2 30 94 48 3 49 3 174 95 36 5 38 1 123 100 80 4 82 4 179 103 71 5 77 2 293 111 90 5 91 1 480 114 37 1 38 1 57 118 38 0 36 2 83 119 20 5 22 2 76 123 44 2 43 2 62 125 18’ 3 23 0 .27 126 iS 1 iS 1 19 128 65 2 70 1 281 142 57 1 51 3 200 148 79 11 36 11 241 154 88 6 92 0 168 161 84 6 88 2 118 173 11 3 15 0 16 iSO 2 9 5 5 17 185 16 2 16 2 18 187 10 7 10 7 19 189 91 4 90 5 667 192 7 0 4 13 7 194 3 0 3 0 4 202 87 S S8 4 534 228 29 4 28 5 58 240 29 0 25 2 121 282 16 0 17 0 28 10001 70 1 66 2 127 10002 73 3 72 3 267 10003 47 3 55 (1 72 10004 12 4 14 0 17 Cornpiex Sy~remc, 12 ~Z0O0j 93-102 |