Information Retrieval using Whoosh in Python and Lucene in Java

Dec 20, 2021

This blog is divided into four parts. The first part explains the data preparation shared by both search engines. The second describes in detail the search engine built with Whoosh in Python, and the third describes the one built with Lucene in Java. The last part compares the two implementations and concludes with some cherry-picked examples.

Data Preparation

Since I built two different search engines, one using Lucene in Java and the other using the Whoosh library in Python, I parsed all the XML files beforehand with the lxml parser in Python.

I only considered the ‘post’ tag and completely ignored the ‘date’ tag. Of the 19,320 blogger files, around 33 were not valid, which left 670,250 of the roughly 680,000 posts in the corpus. Some blog posts were duplicates; after removing them the final count was 602,078.
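The parsing and de-duplication step could look roughly like the sketch below. It is not the original script: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the function names, the directory layout and the choice to treat only exact text matches as duplicates are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_posts(xml_data):
    """Return the stripped text of every <post> element; <date> is ignored."""
    root = ET.fromstring(xml_data)
    return [p.text.strip() for p in root.iter("post") if p.text and p.text.strip()]


def collect_unique_posts(xml_dir):
    """Parse every .xml file in xml_dir, skip invalid files, drop duplicate posts.

    Returns (unique_posts, number_of_invalid_files).
    """
    seen, unique, invalid_files = set(), [], 0
    for path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            posts = extract_posts(path.read_bytes())
        except ET.ParseError:
            invalid_files += 1  # file is not well-formed XML
            continue
        for post in posts:
            if post not in seen:  # assumption: only exact duplicates are removed
                seen.add(post)
                unique.append(post)
    return unique, invalid_files
```

The unique posts can then be written to a single text file, one post per line, so that both engines read the same input.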

Search Engine using Whoosh library in Python

Indexing of the blogs

The text file created during data preparation is read, and blog IDs are assigned sequentially. The index is created only if it is not already present in the current directory of the project.

Retrieving the Blogs

Query Expansion: Words similar to the query terms are retrieved and added to the query. Negated queries are also handled: when the words not, no, nor or neither are present, antonyms are taken instead of synonyms, and the term whose antonyms are taken is left out of the query passed to the query parser.
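The expansion rule can be sketched as follows. The tiny SYNONYMS and ANTONYMS dictionaries are hypothetical stand-ins for a real lexical resource, and two behaviours are assumptions of this sketch rather than facts from the original code: a negation word only affects the term right after it, and a negated term with no known antonym is kept unchanged.

```python
# Hypothetical toy thesaurus; a real system would query a lexical database.
SYNONYMS = {"well": ["good", "easily"], "food": ["nutrient"]}
ANTONYMS = {"well": ["badly", "ill"]}
NEGATIONS = {"not", "no", "nor", "neither"}


def expand_query(query):
    """Add synonyms to each term; after a negation word use antonyms instead."""
    expanded = []
    negate_next = False
    for term in query.lower().split():
        if term in NEGATIONS:
            negate_next = True  # drop the negation word itself
            continue
        if negate_next:
            antonyms = ANTONYMS.get(term, [])
            # The negated term is replaced by its antonyms (kept if none are known).
            expanded.extend(antonyms if antonyms else [term])
            negate_next = False
        else:
            expanded.append(term)
            expanded.extend(SYNONYMS.get(term, []))
    return " ".join(expanded)
```

With this toy thesaurus, "not well" expands to antonyms of "well" only, while "food for thoughts" keeps its original terms and gains the synonym listed for "food".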

The frontend calls the retrieval function with the query and the number of results requested by the user. The function retrieves twice the requested number of results so that they can then be reranked using semantic knowledge from pretrained word embeddings.

The blogs and the query are converted to vectors by averaging the vectors of all the words they contain.
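The averaging and the subsequent reranking can be sketched in plain Python as below. The embeddings passed in are expected to be a word-to-vector dictionary (for example loaded from pretrained GloVe vectors); the function names are hypothetical, and ranking purely by cosine similarity, without mixing in the Whoosh score, is an assumption of this sketch.

```python
import math


def average_vector(text, embeddings):
    """Average the vectors of all words in text that have an embedding."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None  # no known word, so no vector can be built
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query, candidates, embeddings, n_results):
    """Order candidate texts by similarity to the query and keep the top n_results."""
    query_vec = average_vector(query, embeddings)
    scored = []
    for text in candidates:
        doc_vec = average_vector(text, embeddings)
        score = cosine(query_vec, doc_vec) if query_vec and doc_vec else 0.0
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n_results]]
```

Words without an embedding are simply skipped, so a text made up only of unknown words gets a similarity of 0 and sinks to the bottom of the ranking.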

A custom highlighter was created to underline the words in the displayed results that appear in the query entered by the user.
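One way such a highlighter could work is sketched below. It is an illustration only: the function name is hypothetical, the output uses HTML <u> tags on the assumption that the result is rendered in a web page, and only alphabetic tokens are matched, so a purely numeric query term is left unmarked.

```python
import html
import re


def underline_query_terms(text, query):
    """Wrap every word of text that also occurs in query in <u>...</u>."""
    terms = set(re.findall(r"[a-z]+", query.lower()))
    pieces = []
    # Splitting on a capturing group keeps both the words and the separators.
    for token in re.split(r"([A-Za-z]+)", text):
        safe = html.escape(token)  # avoid injecting markup from the post text
        pieces.append(f"<u>{safe}</u>" if token.lower() in terms else safe)
    return "".join(pieces)
```

Matching is case-insensitive and works on whole words, so a query term such as "thoughts" does not underline "thought".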

The frontend was created using flask. The snippet of the frontend is given below:

Blog retrieval UI built with Whoosh and Flask in Python (25-dimensional pretrained GloVe vectors used)

The figure above shows part of the UI after entering the query “food for thoughts”. The expanded query, which consists of the original query terms along with their synonyms, is displayed. The query terms entered by the user are underlined, and the score is given below the text of each post.

Challenges and future work

Whoosh provides highlighting; however, I was unable to use it to highlight the query terms, so I created a custom highlighter.
Train word embeddings or fine-tune the pretrained word embeddings.
Instead of averaging the word embeddings to get the embedding for an entire blog, the embeddings can be concatenated and then reduced with a dimensionality reduction technique.
Instead of averaging the embeddings of the query terms and then computing the cosine similarity, something like the Dual Embedding Space Model can be used.
The postings list from Whoosh can be used to create an entirely semantics-based search engine in Python.
A knowledge graph can be created for all the blogs using relation extraction.
Rerank the results using user relevance feedback.

Apart from Whoosh, there is one more library, Scout, which can be used to get the relevant matches. However, Scout only performs exact matches and has far fewer features than Whoosh for building or customising your own search engine.

Search using Lucene in Java

Indexing of the blogs

Retrieving the Blogs

Query Expansion

Explainability: Lucene's explain shows how the score for each retrieved text is obtained, and the Lucene highlighter highlights the terms that the user entered as the query.

No frontend was created for this part; a snippet of the CLI is shown below:

In the output above, the query terms entered by the user are surrounded by <B></B>.

Comparison of the search engine using Lucene and the search engine using Whoosh

Comparison based on my observation while working with the two libraries

Example Queries and respective results

“1”

using lucene: matches the number 1, but no synonyms are found
using whoosh: the expanded query is 1 single 1 ace ane one unity, but 1 is not highlighted as this case isn’t handled by the custom highlighter

“not well”

using lucene: the synonyms are considerably intimately easily comfortably substantially wellspring advantageously good swell fountainhead, as negated queries are not handled
using whoosh: the expanded query is badly disadvantageously ill

“jjjjjkjkj”

using lucene: no results retrieved
using whoosh: no results retrieved

“food for thoughts”

using lucene: retrieves blogs related to food
using whoosh: Whoosh gives a higher score to the blogs containing the word food; however, as semantics are used for reranking, the topmost document fits the context of the given phrase.

References

GitHub - sj2812/blogretrieve (github.com). Download the blogs from the link https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm Run the GetText.py after setting the path…

Welcome to Lucene Tutorial.com - Lucene Tutorial.com (www.lucenetutorial.com)

Quick start - Whoosh 2.7.4 documentation (whoosh.readthedocs.io)