Simple Search vs. Stemming
#
1. What is Stemming?Stemming is the process of removing the alternating endings from words. Words with the same meaning, but different endings can then be matched better. For example, The words fish, fishing, and fished would all be reduced to their common word stem fish. For the usage in a search engine, this stem is then stored in the inverted index instead of the original word at that position. When you search with any of the original words, your query is also stemmed and the search engine can find all documents that contain words with the same stem. This way, more relevant documents are found since the search query does not only rely on the given word forms but also matches the same word with different endings.
There are two common approaches to implement stemming in search engines.
#
Algorithmic StemmingThe Porter stemming algorithm, created by Martin Porter, is one of the most popular algorithmic stemmers and is used by default in the built-in language analyzers in Elasticsearch.
Algorithmic stemmers are very fast and do not require a lot of memory as they only apply a set of rules that represents the common rules on how words are inflected in a specific language.
Because there are always exceptions in natural language, an algorithmic stemmer might produce the same stem for words that have different meanings.
This error is called overstemming.
Another error that can happen is called understemming and happens when two inflected words with different endings but the same meaning and stem are not reduced to the same stem.
#
Dictionary Based StemmingThe second approach to stemming is dictionary-based stemming. Dictionary stemmers look up words in a dictionary before they remove their inflected endings and reduce them to their stem. This approach should work better on irregular words or words that are similar but have different meanings. Dictionary stemmers should be able to reduce over- and understemming and achieve better results than algorithmic stemmers. In practice, they sometimes perform worse than algorithmic stemmers because freely available dictionaries are never complete and are often missing important words or are outdated. Dictionary stemmers also have the drawback that they are slower and require more memory as they have to look up all the words in their dictionary.
#
ComparisonIn this comparison we will answer the following questions:
- How much does the relevancy of search results improve by using a stemmer?
- How well does the Hunspell stemmer that is included in Elasticsearch perform with a free dictionary?
#
2. ResultsThe following results are from our First Experiment. You can reproduce these results yourself if you follow the instructions in our experiment section. We calculated Precision and Recall at k=20 for several datasets and show the resulting F1-score.
F1-Score
What is an "F1-Score"?
The F1-Score measures a harmonic mean between Precision and Recall. Therefore, we multiply Precision and Recall by two and divide it by the sum of Precision and Recall:
F1-Score=(2*Precision*Recall)/(Precision+Recall)
This is the simplest way to balance both Precision and Recall, there are also other common options to weight them differently.
On all evaluation datasets except NPL, the Algorithmic Stemmer outperformed the Standard Analyzer as well as the Hunspell Token Filter. The achieved F1-Score gains - measured in percentage points - with the Standard Analyzer as the baseline can be seen here:
Depending on the use-case, algorithmic stemming can create up to a 22% improvement in the relevance of the search results. Hunspell stemming performs worse than algorithmic stemming on nearly all datasets and can even lead to worse results than no stemming.
The average F1-Score gains across all datasets are:
Using the Stemmer Token Filter on English documents can therefore improve the F1-Score by an average of 11%.
#
3. DiscussionLooking at the F1-Score from our approaches we can see that both stemming approaches perform better than the standard baseline on most of the datasets. The Algorithmic Stemmer performs better than the Standard Analyzer with the largest gains on the CACM and NPL corpora. The Hunspell Stemmer performs worse than the Algorithmic Stemmer on most of the datasets except the NPL dataset, where it performs slightly better. On the Medline corpus, the Hunspell Stemmer performs even worse than the Standard Analyzer.
There are several interesting questions that we can ask based on these results:
- Why does the Algorithmic Stemmer perform better than the Standard Analyzer?
- Why does the Algorithmic Stemmer perform better than the Hunspell Stemmer?
- Why does the Hunspell Stemmer perform worse than the Standard Analyzer on the Medline dataset?
- Why does the Hunspell Stemmer perform better than the Algorithmic Stemmer on the NPL dataset?
We will try to answer these questions by breaking down the results of the individual query and word matches.
You can follow all steps in detail and reproduce the results by running our comparison notebook.
#
3.1 Algorithmic Stemmer improvements over the Standard AnalyzerWe can see from the gains that the Algorithmic Stemmer performs better than the Standard Analyzer on all tested datasets. But how does the Algorithmic Stemmer achieve these results?
We can answer this by looking at the NPL dataset which shows a significant gain of 19.65% for the Stemmer Token Filter over the Standard Analyzer.
We first used our search analysis tool to figure out the queries that demonstrate the largest F1-Score improvements with the Stemmer Token Filter. The three queries with the biggest F1-Score difference are 62, 73, and 52. We will be looking at query 62.
The query text of query 63 is:
FAST TRANSISTOR COUNTERS.
We first looked at the disjoint true positives sets of the query results for both the Algorithmic Stemmer and the Standard Analyzer.
The highest-ranked document for the Algorithmic Stemmer that is not present in the top k=20 results of the Standard Analyzer is document 8341
. For the Standard Analyzer, this document appears at position 34 which is outside the k=20 result set that we used to calculate precision and recall.
The text of the relevant document is:
nonsaturating pulse circuits using two junction transistors junction transistors are found to be fast enough for pulse applications if the collector voltage is prevented from reaching zero switching times can be achieved with available types the required limiting action is effected by introducing diodes which terminate the switching transients by their breakdown a two transistor binary counter is described
For the Algorithmic Stemmer the document is at position 3 and returns the following highlighted text:
nonsaturating pulse circuits using two junction transistors junction transistors are found to be fast (...) a two transistor binary counter is described
The overall document score of 12.49 is the sum of these individual term scores:
Term | Score | Term Frequency | Document Frequency |
---|---|---|---|
fast | 4.31 | 1 | 84 |
transistor | 4.22 | 3 | 640 |
counter | 3.95 | 1 | 127 |
For the Standard Analyzer the same document is at position 34 and returns the following highlighted text:
nonsaturating pulse circuits using two junction transistors junction transistors are found to be fast is effected by introducing diodes which terminate the switching transients by their breakdown a two transistor
The overall document score of 7.10 is the sum of these individual term scores:
Term | Score | Term Frequency | Document Frequency |
---|---|---|---|
fast | 4.31 | 1 | 84 |
transistor | 2.78 | 1 | 479 |
We can see that for the Standard Analyzer the inflected word counters is not matched with the word counter in the document. The occurrences of the word transistors are also not matched with the stem transistor used in the query and, therefore, the term frequency and score of the word are lower. The document frequency is also lower, which normally leads to a higher score, but cannot make up for the lower term frequency in this instance.
#
3.2 Hunspell Stemmer improvements over the Algorithmic Stemmer:The NPL dataset is also the only dataset where our Hunspell Token Filter performed slightly better than the Algorithmic Stemmer. We can use our search analysis tool again to extract relevant examples from the datasets that show how the Hunspell Stemmer achieves better results.
We again started by calculating the F1-Score differences for all queries between the Hunspell stemmer and the Algorithmic Stemmer.
Likewise, we picked the query that showed the biggest F1-Score difference favoring the Hunspell Stemmer, which is query 80
.
The query text of query 80 is:
COULD YOU PLEASE GIVE ME ARTICLES ABOUT THE POSSIBILITIES OF GETTING RECTIFICATION USING METALLIC DEVICES.
If we look at the disjoint set of true positives, we can see that the Hunspell Stemmer finds document 8094
at position 1
, while the Algorithmic Stemmer ranks this document at position 130
.
The text of document 8094
is:
component design trends metallic rectifiers approach infinite life developments in cu se si ge and rectifiers are surveyed new designs give reduced size and longer life together with higher operating temperature output current and reverse voltage ratings
As we can see, there are some strange tokens in the text: cu se si ge
. We double-checked the original files of the dataset to ensure that it is actually in the files and not an artifact of our processing.
The Hunspell Stemmer returns the following highlighted text:
component design trends metallic rectifiers approach infinite life developments in cu se si ge and rectifiers are surveyed new designs give reduced size and longer life together with higher operating temperature'
The overall document score with the Hunspell Stemmer of 16.61 is the sum of these individual term scores:
Term | Score | Term Frequency | Document Frequency |
---|---|---|---|
g or give | 4.74 | 2 | 426 |
rectification or rectify | 5.98 | 2 | 180 |
metallic | 5.88 | 1 | 47 |
The Algorithmic Stemmer on the other hand returns the following highlighted text:
component design trends metallic rectifiers approach infinite life developments in cu se si ge and rectifiers are surveyed new designs give reduced size and longer life together with higher operating temperature
The overall document score with the algorithmic stemmer of 7.27 is the sum of these individual term scores:
Term | Score | Term Frequency | Document Frequency |
---|---|---|---|
give | 3.45 | 1 | 426 |
metal | 3.82 | 1 | 300 |
These results show us that the Algorithmic Stemmer is missing the hit on the search term RECTIFICATION which appears in the document as rectifiers. We also see that the keyword metallic receives a better (lower) document frequency with the Hunspell Stemmer because it is not stemmed to metal.
Why doesn't the Algorithmic Stemmer match rectification with rectifiers? We can use the Analyze API endpoint of Elasticsearch to see how the Algorithmic Stemmer handles these words:
POST /pragmalingu-stemming-npl-corpus/_analyze{ "text": "rectification rectifiers rectify"}# Response:{ "tokens" : [ { "token" : "rectif", "start_offset" : 0, "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "rectifi", "start_offset" : 14, "end_offset" : 24, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "rectifi", "start_offset" : 25, "end_offset" : 32, "type" : "<ALPHANUM>", "position" : 2 } ]}
Here we can see that rectification gets overstemmed to rectif instead of rectifi and is, therefore, not matched with the other tokens.
But how exactly does the Hunspell Stemmer manage to match this keyword? We can also look at the output of the Analyze API endpoint to answer this question:
Request:
POST /pragmalingu-hunspell-npl-corpus/_analyze{ "text": "rectification rectifiers rectify"}
Response:
{ "tokens" : [ { "token" : "rectification", "start_offset" : 0, "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "rectify", "start_offset" : 0, "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "rectify", "start_offset" : 14, "end_offset" : 24, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "rectify", "start_offset" : 25, "end_offset" : 32, "type" : "<ALPHANUM>", "position" : 2 } ]}
How does the Hunspell stemmer find the keyword "rectifiers"?
We can see that the search term rectification gets expanded to the following synonym query: Synonym(text:rectification text:rectify)
.
This means that Elasticsearch is looking up both terms at the same position, combining their term frequency, and using the highest document frequency for the score.
Rectify seems to be the stem that the Hunspell Token Filter found in its dictionary for the word rectification and also rectifiers. If we look at the Hunspell dictionary we find the following entry: rectify/XNDRSZG
,
To understand these letters, we can look at the corresponding suffix rules that enable the Hunspell stemmer to match these words.
The first relevant rule here is N:
SFX N Y 3 SFX N e ion e SFX N y ication y SFX N 0 en [^ey]
There is a good explanation of how these Hunspell dictionaries and affix files work and what the individual columns in the rules mean on the Chromium developer documentation site.
The third line SFX N y ication y
enables the matching of rectification.
But in the text rectifiers are mentioned and not rectify, so we need an additional rule during indexing that matches rectifiers to the stem rectify.
The rule Z has a matching line for that:
SFX Z Y 4 SFX Z 0 rs e SFX Z y iers \[^aeiou\]y SFX Z 0 ers \[aeiou\]y SFX Z 0 ers \[^ey\]
Here, the third line SFX Z y iers [^aeiou]y
would match the word rectifiers and thus enable the token filter to replace it with the base entry in the dictionary.
#
3.3 Algorithmic Stemmer improvements over the Hunspell Stemmer:In most cases, the Algorithmic Stemmer outperforms the Hunspell Stemmer. To find examples that explain this performance, we looked at the CACM and ADI datasets, as they showed the most significant differences.
First, we looked at the CACM dataset to find good examples. We calculated the F1-Score differences for both approaches with our search analysis tool and picked query 26 because it exhibits the greatest F1-Score difference between the two approaches.
The query text of query 26 is:
Concurrency control mechanisms in operating systems
The first document in the disjoint set of true positives that is not found by the Hunspell Stemmer is document 1198
.
The text and the title of the document are:
Title:
Solution of a Problem in Concurrent Programming Control
Text:
A number of mainly independent sequential-cyclic processes with restricted means of communication with each other can be made in such a way that at any moment one and only one of them is engaged in the "critical section" of its cycle.
The document appears at position 3
for the Algorithmic Stemmer and at position 50
for the Hunspell Stemmer.
The Algorithmic Stemmer returns the following highlighted text:
independent sequential-cyclic processes with restricted means of communication with each other can be made in such a way that at any moment one and only one of them is engaged in the "critical section" of its And the following highlighted title: 'Solution of a Problem in Concurrent Programming Control'
The overall document score with the Algorithmic Stemmer of 12.06 is the sum of the term scores in the title field:
Term | Field | Score | Term Frequency | Document Frequency |
---|---|---|---|---|
in | text | 0.40 | 2 | 1236 |
concurr | title | 5.77 | 1 | 8 |
control | title | 4.30 | 1 | 38 |
in | title | 1.98 | 1 | 416 |
Only the matches in the title field are counted for the overall document score. This happens because we are using the default scoring strategy for the multi-match query that is best_field, which takes the highest score of one of the searched fields as the document's overall score.
The Hunspell Stemmer, on the other hand, returns the following highlighted text:
independent sequential-cyclic processes with restricted means of communication with each other can be made in such a way that at any moment one and only one of them is engaged in the "critical section" of its
And the following highlighted title: Solution of a Problem in Concurrent Programming Control
The overall document score with the Hunspell Stemmer of 6.66 is also the sum of the term scores in the title field:
Term | Field | Score | Term Frequency | Document Frequency |
---|---|---|---|---|
in | text | 0.41 | 2 | 1236 |
control | title | 4.62 | 1 | 31 |
in | title | 2.04 | 1 | 416 |
The Hunspell Stemmer could not match the important keyword Concurrency with Concurrent in the document title. We can look at the output of the Analyze API endpoint of Elasticsearch to see how the Hunspell Stemmer handles these two words: Request:
POST /pragmalingu-hunspell-cacm-corpus/_analyze{ "text": "concurrency concurrent"}
Response:
{ "tokens" : [ { "token" : "concurrency", "start_offset" : 0, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "current", "start_offset" : 12, "end_offset" : 22, "type" : "<ALPHANUM>", "position" : 1 } ]}
We can see that concurrency is not stemmed and concurrent seems to be overstemmed to current. We can look at the Hunspell dictionary and the affix file to see which rules lead to this outcome:
For concurrency there is exactly one entry: concurrency
For concurrent
we find the following rule that seems to explain the overstemming to the term current.
current/FAY
The rules FAY in the affix file are:
PFX A Y 1PFX A 0 re .
PFX F Y 1PFX F 0 con .
SFX Y Y 1SFX Y 0 ly .
This means that the words current, recurrent, concurrent, and currently are all stemmed to current.
This seems to indicate that the Hunspell dictionaries we are using, which are freely available for spell-checking, are not optimized for this stemmer use case.
#
3.4 Standard Analyzer improvements over the Hunspell Stemmer:On the Medline dataset, the Hunspell Stemmer performed worse than the default Standard Analyzer. We are again looking at the biggest F1-Score difference favoring the Standard Analyzer to find examples that explain the poor performance of the Hunspell Stemmer.
Query 1 exhibits the biggest F1-Score difference between the Standard Analyzer and the Hunspell Stemmer.
The query text of query 1 is:
the crystalline lens in vertebrates, including humans.
The first document in the disjoint true positive set of query 1 is document 170
.
The text of document 170 is:
identification of species-specific and organ-specific antigens in lens proteins. the species-specific and organ-specific antigens of lens were investigated by gel diffusion and immunoelectrophoresis techniques. it was found that rabbit antiserum to bovine lens showed cross reaction with other bovine tissues. these cross-reacting antigens were the b- - and y-crystallins . there were two major and a minor organ-specific antigen in lens . both the major antigens had a mobility and were identified as the a-crystallin of lens .
The Standard Analyzer returns this document at position 9 with the following highlighted matched keywords:
identification of species-specific and organ-specific antigens in lens proteins. the species-specific and organ-specific antigens of lens were investigated by gel diffusion reaction with other bovine tissues. these cross-reacting antigens were the b- - and y-crystallins there were two major and a minor organ-specific antigen in lens . both the major antigens had a mobility and were identified as the a-crystallin of lens.
The overall document score with the Standard Analyzer was 6.22 and was the sum of the following term scores:
Term | Score | Term Frequency | Document Frequency |
---|---|---|---|
the | 0.022 | 4 | 2020 |
lens | 6.13 | 5 | 41 |
in | 0.07 | 2 | 987 |
The Hunspell Stemmer instead positioned the document 170
at position 25 for query 1 and returned the following highlighted text:
Identification of species-specific and organ-specific antigens in lens proteins. The species-specific and organ-specific antigens of lens were investigated by gel diffusion and immunoelectrophoresis techniques. It was found that rabbit antiserum to bovine lens showed cross reaction with other bovine organ-specific antigen in lens.Both the major antigens had a mobility and were identified as the a-crystallin of lens.
The overall document score with the Hunspell Stemmer was 5.378082 which is slightly lower than the Standard Analyzer score:
Term | Score | Term Frequency | Document Frequency |
---|---|---|---|
the | 0.026 | 4 | 1018 |
l or lens | 5.27 | 8 | 75 |
in | 0.07 | 2 | 986 |
We can see that nearly all the score differences come from the different weighting of the hits for the term lens. The Hunspell Stemmer searches both with the original term lens and with the letter l. This seems to come from the Hunspell dictionary that contains the following entry:
l/SDXTGJ
This entry is combined with the suffix rule X:
SFX X Y 3SFX X e ions eSFX X y ications ySFX X 0 ens [^ey]
Where the fourth line SFX X 0 ens [^ey]
leads to a match with the word lens.
Together, they lead to the dictionary entry l and the entry lens/MS
matching.
The term l has a higher document frequency of 75 than the term lens where the document frequency is 41.
Elasticsearch creates a synonym query for both terms and picks the highest document frequency for the scoring when these terms match.
This is another indication that the freely available LibreOffice dictionaries that we are using for the Hunspell Token Filter are not optimized for the stemming use-case.
#
4. ConclusionFor all our tested datasets, algorithmic stemming showed considerable increases in the F1-Score gains, ranging from 1.4% - 21.9%. As algorithmic stemming is built into the language analyzers provided by Elasticsearch, it is fairly easy to use. Due to its rule-based nature, it does not come with considerable processing or memory overhead.
Dictionary-based stemming, as provided by the Hunspell Token Filter, is not especially easy to use since you need a good dictionary that is tailored to the stemming tasks. The analysis in our discussion section indicated that the freely available Hunspell dictionaries that we used from the LibreOffice project are not really suitable for stemming. Hunspell dictionary stemming improved over the Standard Analyzer in all but one dataset, but it could not beat the Algorithmic Stemmer with the dictionaries we tested.
The overall conclusions and recommendations from our experiment are:
- To get a quick and cheap boost in relevancy for your search use-case, use the Language Analyzers provided by Elasticsearch.
- Only consider the Hunspell Token Filter if you have access to a good dictionary and are sure that it will improve relevancy for your use-case.
- Don't choose the Hunspell Token Filter with the freely available LibreOffice dictionaries over the algorithmic stemmers.
#
Acknowledgements:Thanks to Kenny Hall and Miriam Rupprecht for proofreading this article.