
Stemming Experiment

For our first experiment, we created a Google Colab Notebook that runs an Elasticsearch instance and compared the standard Elasticsearch analyzer with two built-in stemming methods. To do this, we parsed and indexed these 8 publicly available datasets, provided by the University of Glasgow, with the different analyzers and evaluated them with the Elasticsearch Ranking Evaluation API.
Our Parsing Guide shows how these datasets are structured and how we preprocessed them for indexing. The complete code and workflow used to arrive at the results presented here is available in our Experiment Notebook, where you can reproduce this experiment step by step.

These are the two built-in methods that we compared to the standard Elasticsearch analyzer:

1. Standard Elasticsearch Analyzer#

We ran our searches with the multi-match query and used the default standard analyzer. The multi-match query searches multiple fields of the documents - in our case 'text' and 'title' - and by default ranks each document by the highest matching score of any single field.

We evaluated the ranking on the 20 highest ranked documents returned by each query.

The following is an example request against the Ranking Evaluation API to evaluate a query on the ADI Corpus:

```
GET adi-corpus/_rank_eval
{
    "metric": {
        "recall": {
            "k": 20,
            "relevant_rating_threshold": 1
        }
    },
    "requests": [{
        "id": "Query_1",
        "request": {
            "query": {
                "multi_match": {
                    "query": "How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests?",
                    "fields": ["title", "text"]
                }
            }
        },
        "ratings": [
            { "_index": "adi-corpus", "_id": "17", "rating": 1 },
            { "_index": "adi-corpus", "_id": "46", "rating": 1 },
            { "_index": "adi-corpus", "_id": "62", "rating": 1 }
        ]
    }]
}
```
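The same evaluation can also be triggered from a notebook. The following is a minimal sketch of how such a request could be sent with the official Python client (the client, host, and shortened query text are assumptions; the body mirrors the request shown above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

rank_eval_body = {
    "metric": {"recall": {"k": 20, "relevant_rating_threshold": 1}},
    "requests": [{
        "id": "Query_1",
        "request": {
            "query": {
                "multi_match": {
                    # query text shortened here; see the full request above
                    "query": "How can actually pertinent data be retrieved automatically?",
                    "fields": ["title", "text"]
                }
            }
        },
        "ratings": [
            {"_index": "adi-corpus", "_id": "17", "rating": 1},
            {"_index": "adi-corpus", "_id": "46", "rating": 1},
            {"_index": "adi-corpus", "_id": "62", "rating": 1}
        ]
    }]
}

response = es.rank_eval(index="adi-corpus", body=rank_eval_body)
print(response["metric_score"])  # recall@20 averaged over all requests in the body
```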

For our experiment we evaluated Precision and Recall at k, two of the metrics provided by the Ranking Evaluation API of Elasticsearch (its documentation gives an overview of all available metrics). From precision and recall we then calculated the F1-score, which is commonly used for measuring search performance.
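The Ranking Evaluation API returns one metric per call, so precision@k and recall@k come from two separate evaluations and are combined afterwards. A small helper of the kind one might use for this (not taken verbatim from our notebook):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with the Standard Analyzer values for the ADI corpus (table below)
print(round(f1_score(0.12, 0.537), 3))  # -> 0.196
```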

These were the results without any further processing of the data, using the multi-match query of Elasticsearch:

| Standard Analyzer | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time |
|---|---|---|---|---|---|---|---|---|
| Recall | 0.537 | 0.283 | 0.103 | 0.51 | 0.368 | 0.335 | 0.24 | 0.772 |
| Precision | 0.12 | 0.14 | 0.154 | 0.182 | 0.163 | 0.378 | 0.218 | 0.15 |
| F1-Score | 0.196 | 0.187 | 0.123 | 0.268 | 0.226 | 0.355 | 0.229 | 0.251 |

2. Stemming#

To experiment with stemming, Elasticsearch provides several built-in stemming methods, of which we evaluated two. The first is the algorithmic stemmer, which applies a series of rules to each word to reduce it to its stem; in Elasticsearch this is called the Stemmer Token Filter. The second is the dictionary stemmer, which replaces unstemmed words with stemmed variants from a provided dictionary; the corresponding Elasticsearch built-in is the Hunspell Token Filter.

What is "Stemming"?

The goal of stemming is to remove morphological features from a word, reducing it to a truncated (and possibly ambiguous) stem. To do this, the stemmer strips off suffixes at the end of a word.
For example, learning becomes learn after stemming.
If the stemmer cuts off too much, the word becomes too short and loses its semantic meaning; this is called overstemming.
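To see this on a running Elasticsearch instance, the _analyze API can be called with an inline stemmer definition, so no index is needed yet. A quick sketch using the Python client (client and host are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Analyze a single word with the standard tokenizer, lowercasing,
# and the built-in English stemmer defined inline
response = es.indices.analyze(body={
    "tokenizer": "standard",
    "filter": ["lowercase", {"type": "stemmer", "name": "english"}],
    "text": "learning"
})
print([token["token"] for token in response["tokens"]])  # expected: ['learn']
```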

2.1. Stemmer Token Filter#

The Stemmer Token Filter supports different languages, but since our corpora were all in English we used the English stemmer. Compared to the dictionary approach, algorithmic stemming requires less memory and is significantly faster. For this reason, it is usually better to use the Stemmer Token Filter. The configuration in Elasticsearch is not complicated and can be done quickly.

Irregular words or names can produce strange forms that don't give any helpful information, since the filter is purely rule-based. In spite of this, we expected an improvement over the standard Elasticsearch analyzer. This is how we configured the stemming analyser in Elasticsearch:

```python
stemming_analyser = {
    "filter": {
        "eng_stemmer": {
            "type": "stemmer",
            "name": "english"
        }
    },
    "analyzer": {
        "default": {
            "tokenizer": "standard",
            "filter": ["lowercase", "eng_stemmer"]
        }
    }
}
```


Afterwards, we incorporated the stemming_analyser into the index settings, which we then used to index the documents of each corpus:

```python
settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": stemming_analyser
    }
}
```
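As a rough sketch of how this looks in the Python workflow (index name, document content, and the client calls are placeholders or assumptions), the settings are passed when the index is created, and every document indexed afterwards runs through the default stemming analyzer:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the index with the stemming analyzer from the settings above
es.indices.create(index="adi-corpus", body=settings)

# Documents indexed into it are analyzed with the default (stemming) analyzer
es.index(index="adi-corpus", id="1", body={
    "title": "Placeholder title",
    "text": "Automatic retrieval of pertinent data for information requests"
})
```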


These were the results with the Stemmer Token Filter applied:

| Stemmer Token Filter | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time |
|---|---|---|---|---|---|---|---|---|
| Recall | 0.608 | 0.321 | 0.121 | 0.538 | 0.413 | 0.343 | 0.292 | 0.78 |
| Precision | 0.13 | 0.177 | 0.169 | 0.193 | 0.187 | 0.38 | 0.259 | 0.152 |
| F1-Score | 0.214 | 0.228 | 0.141 | 0.284 | 0.257 | 0.36 | 0.274 | 0.255 |

2.2. Hunspell Token Filter#

The Hunspell Token Filter from Elasticsearch recognizes irregular words and should, therefore, preprocess them better. However, we must keep in mind that words that do not appear in the dictionary are not processed correctly. In addition, a large dictionary will naturally require significantly more time and resources. Since the dictionaries we used were not optimized for the domains of our corpora, we didn't expect a significant change in the results compared to the Stemmer Token Filter.

The Hunspell Token Filter works similarly to lemmatization, but Hunspell does not always reduce words to their lemma, so the meaning of a word is not always fully captured.

What is "Lemmatization"?
Lemmatization tries to reduce words to their root form - their **Lemma** - by removing their morphological features. It's similar to stemming, but instead of simply stripping away suffixes it also takes affixes and plural forms into account.
For example, `mice` would connect to `mouse`, and `froze` & `frozen` to `freeze`.


The procedure is the same as with the *Stemmer Token Filter*. First, we configured the Hunspell Stemming analyser:
```python
# the order of filter and analyser is arbitrary
dictionary_analyser = {
    "filter": {
        "dictionary_stemmer": {
            "type": "hunspell",
            "locale": "en_US",
            "dedup": True  # duplicate tokens are removed from the filter's output
        }
    },
    "analyzer": {
        "default": {
            "tokenizer": "standard",
            "filter": ["lowercase", "dictionary_stemmer"]
        }
    }
}
```


Afterwards, we embedded the analyser into the settings to index our documents with it:
```python
settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": dictionary_analyser
    }
}
```
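One practical detail: the Hunspell Token Filter loads its dictionaries from the node's configuration directory (for example `hunspell/en_US/` containing `en_US.aff` and `en_US.dic`), so these files have to be in place before indexing. The rest of the workflow could then look like this sketch (index name and sample text are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the index with the Hunspell-based analysis settings from above
es.indices.create(index="adi-corpus-hunspell", body=settings)

# Inspect what the dictionary stemmer produces for a sample sentence
response = es.indices.analyze(index="adi-corpus-hunspell", body={
    "analyzer": "default",
    "text": "The libraries indexed thousands of articles"
})
print([token["token"] for token in response["tokens"]])
```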


These were the results of the *Hunspell Token Filter*:
| Hunspell | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time |
|---|---|---|---|---|---|---|---|---|
| Recall | 0.575 | 0.312 | 0.12 | 0.528 | 0.405 | 0.335 | 0.3 | 0.777 |
| Precision | 0.127 | 0.164 | 0.167 | 0.19 | 0.184 | 0.368 | 0.259 | 0.151 |
| F1-Score | 0.208 | 0.215 | 0.14 | 0.28 | 0.253 | 0.351 | 0.278 | 0.252 |

3. Results#

To compare our results properly, we plotted Recall, Precision, and F1-Score for every method on every corpus.
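A grouped bar chart per metric is enough for this comparison. The following is a minimal sketch with matplotlib (not the exact notebook code), using the Recall rows from the three result tables above:

```python
import numpy as np
import matplotlib.pyplot as plt

corpora = ["ADI", "CACM", "CISI", "Cranfield", "LISA", "Medline", "NPL", "Time"]
recall = {
    "Standard Analyzer":     [0.537, 0.283, 0.103, 0.51, 0.368, 0.335, 0.24, 0.772],
    "Stemmer Token Filter":  [0.608, 0.321, 0.121, 0.538, 0.413, 0.343, 0.292, 0.78],
    "Hunspell Token Filter": [0.575, 0.312, 0.12, 0.528, 0.405, 0.335, 0.3, 0.777],
}

# One group of three bars per corpus
x = np.arange(len(corpora))
width = 0.25
for i, (label, values) in enumerate(recall.items()):
    plt.bar(x + (i - 1) * width, values, width, label=label)

plt.xticks(x, corpora, rotation=45)
plt.ylabel("Recall@20")
plt.legend()
plt.tight_layout()
plt.show()
```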

Recall

What is "Recall"?
Recall measures the probability that relevant documents are retrieved. Therefore, the number of all retrieved relevant documents is divided by the number of all documents that are labeled as relevant. For example, if 8 documents in the collection are relevant to our search and 4 of them are retrieved, the Recall is 4/8 = 0.5.

To measure Recall it is necessary to have the relevant documents labeled. Recall only looks at relevant documents that were retrieved and does not take into account any irrelevant documents which may have been retrieved.

[Plot: Recall of each analyzer on all corpora]

Precision

What is "Precision"?
Precision measures the probability that retrieved documents are relevant to the search query. Therefore, the number of all retrieved relevant documents is divided by the number of all retrieved documents. For example, if we retrieve 10 search results and only 5 of them are relevant to our search, the Precision is 5/10 = 0.5.

To measure the Precision it is necessary to have the relevant documents labeled as such. Precision only looks at the documents that are retrieved and does not account for relevant documents which were not retrieved.

[Plot: Precision of each analyzer on all corpora]

F1-Score

What is "F1-Score"?
The F1-Score is the harmonic mean of Precision and Recall: twice the product of Precision and Recall, divided by their sum:
`F1-Score = (2 * Precision * Recall) / (Precision + Recall)`
This is the simplest way to balance Precision and Recall; there are other common variants that weight the two differently.
[Plot: F1-Score of each analyzer on all corpora]
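To make the three metrics concrete, here is a small, self-contained toy computation (not part of the experiment data):

```python
# Toy example: 3 of the 4 relevant documents are among the 10 retrieved results
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
relevant = {"d2", "d5", "d9", "d42"}

true_positives = len(retrieved & relevant)           # 3
precision = true_positives / len(retrieved)          # 3/10 = 0.3
recall = true_positives / len(relevant)              # 3/4  = 0.75
f1 = 2 * precision * recall / (precision + recall)   # ~0.429

print(precision, recall, round(f1, 3))
```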

It is apparent that the algorithmic stemmer offers some improvements over no stemming. The dictionary-based Hunspell stemming performs either better or worse depending on the corpus. For a more detailed analysis check out our Simple Search vs. Stemming Comparison.



Acknowledgements:
Thanks to Dario Alves for implementing the Stemmers inside of Google Colab for Elasticsearch and Kenny Hall for proofreading this article.

Written by Samy Ateia and Miriam Rupprecht, September 2020