Stemming Experiment
For our first experiment, we created a Google Colab Notebook that runs an Elasticsearch instance and compared the standard Elasticsearch analyzer with two built-in stemming methods.
To do this, we parsed and indexed these 8 publicly available datasets, provided by the University of Glasgow, with the different analyzers and evaluated them with the Elasticsearch Ranking Evaluation API.
Our Parsing Guide shows you how these datasets are structured and how we preprocessed them for indexing.
The complete code and workflow used to arrive at the results presented here are available in our Experiment Notebook, where you can reproduce this experiment step by step.
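For orientation, the following is a minimal sketch of this workflow with the official Python client; the host, index name, and document fields are placeholders and not the notebook's exact code:

```python
from elasticsearch import Elasticsearch, helpers

# connect to the Elasticsearch instance running inside the notebook
es = Elasticsearch("http://localhost:9200")

# hypothetical output of the parsing step: one dict per document with a title and a text field
parsed_docs = [
    {"_index": "adi-corpus", "_id": "1",
     "title": "Example document title",
     "text": "Example document body."},
]

# bulk-index the parsed documents; keys without a leading underscore become the document source
helpers.bulk(es, parsed_docs)
```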
Below, we first describe our baseline, the standard Elasticsearch analyzer, and then the two built-in stemming methods that we compared against it:
## 1. Standard Elasticsearch Analyzer

We ran our searches with the multi-match query and used the default standard analyzer. It searches multiple fields of the documents - in our case 'text' and 'title' - and by default uses the highest matching score from one field to rank the document.
We evaluated the ranking on the 20 highest ranked documents returned by each query.
The following is an example request against the Ranking Evaluation API to evaluate a query on the ADI Corpus:
```
GET adi-corpus/_rank_eval
{
  "metric": {
    "recall": {
      "k": 20,
      "relevant_rating_threshold": 1
    }
  },
  "requests": [
    {
      "id": "Query_1",
      "request": {
        "query": {
          "multi_match": {
            "query": "How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests?",
            "fields": ["title", "text"]
          }
        }
      },
      "ratings": [
        { "_index": "adi-corpus", "_id": "17", "rating": 1 },
        { "_index": "adi-corpus", "_id": "46", "rating": 1 },
        { "_index": "adi-corpus", "_id": "62", "rating": 1 }
      ]
    }
  ]
}
```
For our experiment we evaluated Precision and Recall at k, two metrics provided by the Ranking Evaluation API; the Elasticsearch documentation of the Ranking Evaluation API gives an overview of all available metrics. From precision and recall we then calculated the F1-score, which is often used for measuring search performance.
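A minimal sketch of how the two metrics can be requested per corpus and combined into the F1-score with the Python client; the `requests` argument is the list of rated queries shown above, and the function and variable names are ours, not the notebook's:

```python
def evaluate(es, index, requests, k=20):
    """Query the Ranking Evaluation API for precision and recall at k and derive the F1-score."""
    scores = {}
    for metric in ("precision", "recall"):
        response = es.rank_eval(
            index=index,
            body={"metric": {metric: {"k": k, "relevant_rating_threshold": 1}},
                  "requests": requests},
        )
        scores[metric] = response["metric_score"]
    p, r = scores["precision"], scores["recall"]
    scores["f1"] = 2 * p * r / (p + r) if p + r else 0.0
    return scores
```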
These were the results without any further processing of the data, using the multi-match query of Elasticsearch:
Standard Analyzer | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time |
---|---|---|---|---|---|---|---|---|
Recall | 0.537 | 0.283 | 0.103 | 0.51 | 0.368 | 0.335 | 0.24 | 0.772 |
Precision | 0.12 | 0.14 | 0.154 | 0.182 | 0.163 | 0.378 | 0.218 | 0.15 |
F1-Score | 0.196 | 0.187 | 0.123 | 0.268 | 0.226 | 0.355 | 0.229 | 0.251 |
## 2. Stemming

To experiment with stemming, Elasticsearch provides several built-in stemming methods, of which we evaluated two. The first is the algorithmic stemmer, which applies a series of rules to each word to reduce it to its stem; in Elasticsearch this is called the Stemmer Token Filter. The second is the dictionary stemmer, which replaces unstemmed words with stemmed variants from a provided dictionary; the Elasticsearch built-in method is called the Hunspell Token Filter.
What is "Stemming"?
The goal of stemming is to remove the morphological features of a word, leaving a truncated and possibly ambiguous stem. To do this, it strips suffixes from the end of the word.
For example, *learning* changes to *learn* after stemming.
If the stemmer cuts off too much, the word can become too short and lose its semantic meaning; this is called overstemming.
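The effect can be observed in isolation with the `_analyze` API, here called through the Python client from the sketch above; this is an illustration rather than part of the experiment itself:

```python
# run a single word through the standard tokenizer, lowercasing, and the English stemmer
response = es.indices.analyze(body={
    "tokenizer": "standard",
    "filter": ["lowercase", {"type": "stemmer", "name": "english"}],
    "text": "learning",
})
print([t["token"] for t in response["tokens"]])  # ['learn']
```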
### 2.1. Stemmer Token Filter

The Stemmer Token Filter supports different languages, but since our corpora were all in English we used the English stemmer. Compared to the dictionary approach, algorithmic stemming requires less memory and is significantly faster, which is why it is usually the better choice. The configuration in Elasticsearch is not complicated and can be done quickly.
Because the filter is purely rule-based, irregular words or names can produce strange forms that don't give any helpful information. In spite of this, we expected an improvement over the standard Elasticsearch analyzer. This is how we configured the stemming analyser in Elasticsearch:
```python
stemming_analyser = {
    "filter": {
        "eng_stemmer": {
            "type": "stemmer",
            "name": "english"
        }
    },
    "analyzer": {
        "default": {
            "tokenizer": "standard",
            "filter": ["lowercase", "eng_stemmer"]
        }
    }
}
```
Afterwards, we incorporated the `stemming_analyser` into the index settings, which we then used to index the documents of each corpus:
```python
settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": stemming_analyser
    }
}
```
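A minimal sketch of how an index can then be created with these settings, assuming the Python client from above and using the ADI corpus name as an example:

```python
# create the index so that 'title' and 'text' are analysed with the stemming analyser
es.indices.create(index="adi-corpus", body=settings)

# every document indexed afterwards goes through the default (stemming) analyser
es.index(index="adi-corpus", id="1",
         body={"title": "Example title", "text": "Example text."})
```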
These were the results with the Stemmer Token Filter applied:
Stemmer Token Filter | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time |
---|---|---|---|---|---|---|---|---|
Recall | 0.608 | 0.321 | 0.121 | 0.538 | 0.413 | 0.343 | 0.292 | 0.78 |
Precision | 0.13 | 0.177 | 0.169 | 0.193 | 0.187 | 0.38 | 0.259 | 0.152 |
F1-Score | 0.214 | 0.228 | 0.141 | 0.284 | 0.257 | 0.36 | 0.274 | 0.255 |
### 2.2. Hunspell Token Filter

The Hunspell Token Filter from Elasticsearch recognizes irregular words and should therefore preprocess them better. However, we must keep in mind that words that do not appear in the dictionary are not processed correctly. In addition, a large dictionary naturally requires significantly more time and resources. Since the dictionaries we used were not optimized for the domains of our corpora, we didn't expect a significant change in the results compared to the Stemmer Token Filter.
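Note that the hunspell filter does not ship with dictionaries: Elasticsearch expects them in a `hunspell/<locale>` folder inside its config directory, with one `.aff` file and one or more `.dic` files. A sketch of putting the en_US files in place; the install path is an assumption for a Colab setup, so adjust it to your environment:

```python
import os
import shutil

# hypothetical Elasticsearch config directory of the Colab installation
es_config = "/content/elasticsearch-7.9.2/config"

# the hunspell filter looks for dictionaries under <config>/hunspell/<locale>/
locale_dir = os.path.join(es_config, "hunspell", "en_US")
os.makedirs(locale_dir, exist_ok=True)

# copy previously downloaded dictionary files (e.g. from the LibreOffice dictionaries)
shutil.copy("en_US.aff", locale_dir)
shutil.copy("en_US.dic", locale_dir)
```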
The Hunspell Token Filter behaves similarly to lemmatization, but Hunspell does not always reduce a word to its lemma, so the meaning of the word is not always fully taken into account.
What is "Lemmatization"?
Lemmatization reduces a word to its dictionary form, the lemma, taking its meaning into account. For example, `mice` would connect to `mouse`, and `froze` & `frozen` to `freeze`.
The procedure is the same as with the *Stemmer Token Filter*. First, we configured the Hunspell Stemming analyser:
```python
# the order of filter and analyser is arbitrary
dictionary_analyser = {
    "filter": {
        "dictionary_stemmer": {
            "type": "hunspell",
            "locale": "en_US",
            "dedup": True  # duplicate tokens are removed from the filter's output
        }
    },
    "analyzer": {
        "default": {
            "tokenizer": "standard",
            "filter": ["lowercase", "dictionary_stemmer"]
        }
    }
}
```
Afterwards, we embedded the analyser into the settings to index our documents with it:
```python
settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": dictionary_analyser
    }
}
```
These were the results of the *Hunspell Token Filter*:
Hunspell | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time |
---|---|---|---|---|---|---|---|---|
Recall | 0.575 | 0.312 | 0.12 | 0.528 | 0.405 | 0.335 | 0.3 | 0.777 |
Precision | 0.127 | 0.164 | 0.167 | 0.19 | 0.184 | 0.368 | 0.259 | 0.151 |
F1-Score | 0.208 | 0.215 | 0.14 | 0.28 | 0.253 | 0.351 | 0.278 | 0.252 |
## 3. Results

To compare our results properly, we plotted Recall, Precision, and F1-Score for every method on every corpus.
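A minimal sketch of such a comparison plot for the recall values (taken from the result tables above) using matplotlib; the precision and F1 plots follow the same pattern, and this is an illustration rather than the notebook's exact plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt

corpora = ["ADI", "CACM", "CISI", "Cranfield", "LISA", "Medline", "NPL", "Time"]
recall = {  # values from the result tables above
    "Standard Analyzer":     [0.537, 0.283, 0.103, 0.510, 0.368, 0.335, 0.240, 0.772],
    "Stemmer Token Filter":  [0.608, 0.321, 0.121, 0.538, 0.413, 0.343, 0.292, 0.780],
    "Hunspell Token Filter": [0.575, 0.312, 0.120, 0.528, 0.405, 0.335, 0.300, 0.777],
}

x = np.arange(len(corpora))
width = 0.25
for i, (method, values) in enumerate(recall.items()):
    plt.bar(x + (i - 1) * width, values, width, label=method)  # grouped bars per corpus
plt.xticks(x, corpora, rotation=45)
plt.ylabel("Recall@20")
plt.legend()
plt.tight_layout()
plt.show()
```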
Recall
What is "Recall"?
Recall is the fraction of all relevant documents that were actually retrieved: `Recall = (relevant documents retrieved) / (total relevant documents)`. To measure Recall it is necessary to have the relevant documents labeled. Recall only looks at the relevant documents that were retrieved and does not take into account any irrelevant documents which may have been retrieved.
Precision
What is "Precision"?
Precision is the fraction of retrieved documents that are relevant: `Precision = (relevant documents retrieved) / (total documents retrieved)`. To measure Precision it is necessary to have the relevant documents labeled as such. Precision only looks at the documents that were retrieved and does not account for relevant documents which were not retrieved.
F1-Score
What is "F1-Score"?
`F1-Score = (2 * Precision * Recall) / (Precision + Recall)` This is the simplest way to balance Precision and Recall; there are also other common variants that weight the two differently.
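As a quick sanity check, plugging the standard analyzer's ADI values from the first table into this formula gives `(2 * 0.12 * 0.537) / (0.12 + 0.537) ≈ 0.196`, which matches the F1-score reported there.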
It is apparent that the algorithmic stemmer offers some improvement over no stemming. The dictionary-based Hunspell stemming performs better on some corpora and worse on others. For a more detailed analysis, check out our Simple Search vs. Stemming Comparison.
Acknowledgements:
Thanks to Dario Alves for implementing the stemmers for Elasticsearch inside Google Colab, and to Kenny Hall for proofreading this article.