Basics of an Information retrieval system, understanding tf idf scoring with an example

3 min readApr 1, 2021

The following blog explains the indexing, query handling, searching and retrieval concepts of Information retrieval (IR) with the help of a toy example.

Bird’s eye view of an IR system

The overall workflow of the IR system starts by maintaining a storage of documents (in this blog they are blogs). When the user enters a query the relevant blogs will be returned to the user by the search system. Suppose there are 5000 blogs and based on the relevance with respect to the user query the search system will rank the blogs. However, the user will be interested in the top ranking documents. So based on the requirement of the user, the search system provides the top K relevant blogs.

Toy example of an IR system

Blogs

There are three blogs Blog0, Blog1 and Blog2. To keep the example simple, I have only kept single sentence in each blog. However, mostly there are multiple sentences.

Analyzer

The analyzer deals with the preprocessing of the blogs. The preprocessing consists of tokenization, stop word removal, stemming, lemmatization. The analyzer gives each blog a set of tokens. In case of our example, the analyzer removed the punctuation, lowercased the words and removed the stop word “is”. The entire vocabulary now consists of

{today, sunny, berlin, excite}

Indexer

The preprocessed blogs are then passed to the indexer. The main motivation of indexing is to reduce the overall retrieval time. There are various types of indexing that can be used like B+ trees, hash tables. In our case we are using the inverted indexes (also called as postings list) for each token in the vocabulary. Let me explain how the inverted index is created for the term “berlin”.