Basics of an information retrieval system: understanding tf-idf scoring with an example
This blog explains the indexing, query handling, searching and retrieval concepts of information retrieval (IR) with the help of a toy example.
Bird’s eye view of an IR system
The workflow of an IR system starts with a store of documents (in this post, the documents are blogs). When the user enters a query, the search system returns the blogs that are relevant to it. Suppose there are 5000 blogs: the search system ranks all of them by their relevance to the user's query, but the user is only interested in the top-ranking documents. So, based on the user's requirement, the search system returns the top K most relevant blogs.
Toy example of an IR system
- Blogs
There are three blogs: Blog0, Blog1 and Blog2. To keep the example simple, I have kept only a single sentence in each blog; in practice, a blog usually contains many sentences.
- Analyzer
The analyzer handles the preprocessing of the blogs. The preprocessing consists of tokenization, stop word removal, stemming and lemmatization; the analyzer turns each blog into a list of tokens. In the case of our example, the analyzer removes the punctuation, lowercases the words and removes the stop word “is” (a small code sketch of such an analyzer follows the vocabulary below). The entire vocabulary now consists of
{today, sunny, berlin, excite}
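Below is a minimal sketch of such an analyzer in Python. The stop word list and the suffix-stripping rule are my own toy assumptions (the post only mentions removing “is” and does not say which stemmer it uses):

```python
import re

# Toy stop word list; the post only mentions removing "is".
STOP_WORDS = {"is", "a", "an", "the"}

def stem(token):
    # Deliberately naive stemming rule, just enough to map "exciting" -> "excite".
    # A real analyzer would use something like the Porter stemmer.
    if token.endswith("ing"):
        return token[:-3] + "e"
    return token

def analyze(text):
    # Lowercase, drop punctuation, split into word tokens,
    # remove stop words and stem what remains.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("Berlin is exciting!"))   # ['berlin', 'excite']
print(analyze("Sunny sunny Berlin!!"))  # ['sunny', 'sunny', 'berlin']
```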
- Indexer
The preprocessed blogs are then passed to the indexer. The main motivation for indexing is to reduce the overall retrieval time. Various index structures can be used, such as B+ trees and hash tables. In our case we use an inverted index (also called a postings list) for each token in the vocabulary. Let me explain how the inverted index entry for the term “berlin” is created; a code sketch that builds the full index follows this walkthrough.
berlin[2:2]->[1:1:[2]]->[2:1:[0]]
Reading this entry field by field:
- The term “berlin” occurs 2 times in the entire corpus (our corpus consists of the 3 blogs); this count is called the total term frequency.
- The term “berlin” occurs in 2 documents, which is known as the document frequency (df).
- Each blog is given an id: Blog0 has id 0, Blog1 has id 1 and Blog2 has id 2. The term “berlin” occurs in the two blogs with ids 1 and 2.
- The term “berlin” occurs 1 time in the blog with id 1 and 1 time in the blog with id 2. This frequency of a term within a particular blog is called the term frequency (tf).
- The term “berlin” occurs at position 2 in the blog with id 1 and at position 0 in the blog with id 2.
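Here is a rough sketch of how such a positional inverted index could be built, reusing the analyze() helper from the earlier snippet. Blog0’s full sentence is not shown in the post, so only its visible prefix is used here, and the field order in describe() simply follows the walkthrough above:

```python
from collections import defaultdict

# Toy corpus; Blog0's sentence is truncated in the post.
blogs = {
    0: "Today is ...",
    1: "Sunny sunny Berlin!!",
    2: "Berlin is exciting!",
}

# postings[term][blog_id] -> list of positions of the term in that blog
postings = defaultdict(lambda: defaultdict(list))

for blog_id, text in blogs.items():
    for position, term in enumerate(analyze(text)):
        postings[term][blog_id].append(position)

def describe(term):
    # Render the entry as "term[total term frequency:document frequency]"
    # followed by one "[blog id:tf:[positions]]" chunk per blog.
    entries = postings[term]
    ttf = sum(len(positions) for positions in entries.values())
    df = len(entries)
    chunks = [f"[{doc}:{len(positions)}:{positions}]" for doc, positions in sorted(entries.items())]
    return f"{term}[{ttf}:{df}]->" + "->".join(chunks)

print(describe("berlin"))  # berlin[2:2]->[1:1:[2]]->[2:1:[0]]
```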
- Scoring
We will use tf-idf scoring to calculate the scores of the blogs with respect to the query “sunny Berlin”. After passing the query through the analyzer, we get the tokens “sunny” and “berlin”. The calculations are as follows:
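There are several tf-idf variants; the snippet below assumes a common textbook formulation, score(q, d) = sum over query terms t of tf(t, d) × ln(N / df(t)), purely to make the numbers concrete:

```python
import math

N = 3  # total number of blogs in the corpus

# Statistics read off the inverted index above.
tf = {                           # term frequency of each query term per blog id
    "sunny":  {1: 2},
    "berlin": {1: 1, 2: 1},
}
df = {"sunny": 1, "berlin": 2}   # document frequency of each query term

query = ["sunny", "berlin"]      # analyzed query "sunny Berlin"

scores = {}
for blog_id in range(N):
    score = 0.0
    for term in query:
        idf = math.log(N / df[term])             # assumed idf definition: ln(N / df)
        score += tf[term].get(blog_id, 0) * idf  # this term's tf-idf contribution
    scores[blog_id] = round(score, 3)

print(scores)  # {0: 0.0, 1: 2.603, 2: 0.405}
```

Whichever tf-idf variant is used, Blog1 comes out on top: it matches both query terms and contains the rarer term “sunny” twice.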
- Ranking and Retrieval
The simplest way is to rank the blogs in descending order of their scores. The user will see the blogs in the following order:
- Sunny sunny Berlin!!
- Berlin is exciting!
- Today is…..
In this case we had just 3 blogs; in a real-world application, however, there will be millions of blogs, and only the top K blogs are shown to the user, as in the small snippet below.
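Continuing the toy sketch, ranking is just a descending sort of the scores, and top-K retrieval keeps only the first K entries; with millions of blogs, a partial selection such as heapq.nlargest avoids sorting the whole collection:

```python
import heapq

K = 2  # number of results to show the user
top_k = heapq.nlargest(K, scores.items(), key=lambda item: item[1])
print(top_k)  # [(1, 2.603), (2, 0.405)] -> Blog1 first, then Blog2
```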