Vector Spaces for Document Similarity
Introduction

How can a search engine determine whether two documents are similar? In our post on search engine operation we looked at how search engines can retrieve information on documents given a query string. Vector methods are also used to decide whether documents are related.
Example
Let’s make this a concrete example. Say we have three toy documents which have been parsed into their keywords so that all the unimportant word have been removed. For example, commonly occurring words such as: as, of, the, … and so on are removed. We might be left with:
document 1 = {news, information, campaign, raise, awareness}
document 2 = {news, today’s, world, information, news}
document 3 = {world, news}
We define the term space as all the unique words that occur in all the documents:
term-space = {news,information,campaign, raise, awareness, today’s, world}
Now we can convert the documents into a vector whose dimension is the same as the number of terms in the term-space. We can compare the documents with the term space and where the word occurs in the document we give it a value of 1 for that component in the term space vector.
| news | information | campaign | raise | awareness | today’s | world | |
| d1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| d2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| d3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
The magnitude of the three document in document term space is:
|di|2 = ti12 + ti22 + ti32 + … + tin2
for each of the documents this becomes:
|d1| = √5
|d2| = 2
|d3| = √2
Thinking in a geometric way, these document now represent vectors pointing in 7-dimensional term space. Their similarity, is given by the how closely they they point in the same direction. If they had the same components they would be the same. We determine the similarity between two vectors by calculating the cosine of the angle between them.
Remembering back to your school-days, that the dot product of two vectors is x1 . x2 = |x1||x2| cos θ.
Therefore, cos θ =(x1.x2)/(|x1||x2|)
and x1.x2 = x11x21 + x12 x22 + x13 x23 + … + x1nx2n
Plugging in the numbers,
| d1 | d2 | d3 | |
| d1 | 1.000 | 0.447 | 0.632 |
| d2 | 0.447 | 1.000 | 0.707 |
| d3 | 0.632 | 0.707 | 1.000 |
The method works with a search query q also. Assume, that the search term was “today’s news”, in this case we can write a table in the search term space.
| d1 | d2 | d3 | |
| q | 0.316 | 0.354 | 0.500 |
This shows that document 3 has the greatest relevance. As a test we can change the query to be “world news” and this produces a 100% match with document 3 as would be expected.
| d1 | d2 | d3 | |
| q | 0.316 | 0.354 | 1.000 |
Weighting
If a document contains more occurrences of a word in the search query, then it is fair to say at this basic level that the document has more relevance, we can use different weights to increase the importance of the document with the most occurrences of the word. There are many different ways in which you could weight the results but one obvious example would be to sum the number of times a word appears in the document.
Using the same example documents, in document 2 the word news occurs twice so this is given additional weight.
| d1 | d2 | d3 | |
| d1 | 1.000 | 0.507 | 0.316 |
| d2 | 0.507 | 1.000 | 0.802 |
| d3 | 0.316 | 0.802 | 1.000 |
Formally we write the new equation as cos θ =(Σk=1wik . wjk)/(|xi||xj|)
After the similarity between documents has been determined, the results can be displayed.An Excel spreadsheet calculator is available to explain the vector similarity calculations.
This may be either by assuming a cut off value, documents that have a value which is above the threshold value are shown and the those that are not are discarded. Alternatively, the documents can be ranked in terms of their similarity with the most similar to the least similar.
The idea of converting documents into vectors is very useful one and it allows us to define to perform another useful task. By normalising the documents, they will have the same magnitude and the results will lye on the surface of a sphere in n-dimensional term space. Taking a small area of the surface it will appear to be flat and so we see that documents that are very similar will appear close together as shown in figure. 1.
Looking for a : Garage Door ?
DIY Garage Doors at reasonable prices
If you need Space Saver Stairs come to us.
Ford Garages here
Tags: search engine operation, similarity, term space, vector representation


![Validate my RSS feed [Valid RSS]](http://www.seothegame.com/wp-content/uploads/2008/11/valid-rss.png)