How to Use the Elasticsearch Ranking Evaluation API

1. Introduction#

To evaluate our experiments we used Elasticsearch's built-in Ranking Evaluation API. This guide gives short explanations of the operators and metrics we used during our experiments so that you can adjust the Ranking Evaluation API to your own needs.

Before we start, the Ranking Evaluation API requires three inputs in order to provide accurate results:

  • Document(s) to be evaluated
  • A few typical queries for said document(s)
  • Relevance assessments that represent document ratings regarding the queries

First, we need to save the desired document(s) to an Elasticsearch index. Detailed instructions on how to do this can be found in our Elasticsearch Setup Guide.

Next, you will need to ensure that the search queries are properly formatted for use with the Ranking Evaluation API. Please see our Parsing Guide for detailed instructions on formatting raw data.

After parsing the queries, they should have the following format:

{
  "requests": [
    {
      "id": "Query_1",
      "request": {
        "query": { "match": { "text": "\nwhat are the structural and aeroelastic problems associated with flight\nof high speed aircraft .\n" } }
      },
      "ratings": [
        { "_index": "cranfield-corpus", "_id": "184", "rating": 1 },
        { "_index": "cranfield-corpus", "_id": "29", "rating": 1 },
        { "_index": "cranfield-corpus", "_id": "31", "rating": 1 }
      ]
    },
    {
      "id": "Query_2",
      "request": {
        "query": { "match": { "text": "\nwhat problems of heat conduction in composite slabs have been solved so\nfar .\n" } }
      },
      "ratings": [
        { "_index": "cranfield-corpus", "_id": "12", "rating": 1 }
      ]
    }
  ],
  "metric": {
    "precision": {
      "k": 20,
      "relevant_rating_threshold": 1,
      "ignore_unlabeled": false
    }
  }
}

The Python API documentation for the Ranking Evaluation API calls this structure the “body”, so from here on we will refer to it as the “evaluation body”. After saving the body in a variable called eval_body, we serialize it with the json module and send it to the Ranking Evaluation API:

import json

eval_body = json.dumps(create_function)
# our elasticsearch instance is stored in the variable es
result = es.rank_eval(eval_body, index_we_want_to_search_on)
# print the evaluation in readable format
print(json.dumps(result, indent=4, sort_keys=True))
What is "create_function"?

The argument create_function is a function we wrote to parse our preprocessed data into the evaluation body format. You can find details in our Parsing Guide.

It is important to specify which index you would like the Ranking Evaluation API to search, otherwise it will search all available indexes.
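
Once the call returns, you can inspect the per-query scores from the result variable above. The following is a minimal sketch; the field names (metric_score, details, unrated_docs) follow the Elasticsearch 7.x response documentation and may differ slightly in other versions:

# Inspect the response returned by es.rank_eval(...) above.
# Field names follow the Elasticsearch 7.x response documentation
# and may differ between versions -- adjust as needed.
summary = result.get("rank_eval", result)  # some versions nest the payload under "rank_eval"

print("overall metric score:", summary["metric_score"])

# Per-query breakdown: the score for each query and the retrieved
# documents that had no rating in the evaluation body.
for query_id, detail in summary["details"].items():
    print(query_id,
          "score:", detail["metric_score"],
          "unrated docs:", len(detail["unrated_docs"]))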

You can find further information about the Ranking Evaluation API on the Elasticsearch website.

2. Query Operators#

Which query operator you choose to use will depend on the type of data you wish to process. Below we will provide a brief overview of the three operators we have tested thus far.

2.1. Match Query Operator#

The Match Query operator is the standard search operator in Elasticsearch. It provides a full-text search of the given fields of the document(s) and includes standard tokenization and an analyzer.

To use the Match Query operator in the Ranking Evaluation API, we add a "match" clause to the "query" object of every request and put the query text into its "text" field. The operator then tries to match the string given in "text" against the documents in the given index. Here are some example queries:

{
  "requests": [
    {
      "id": "Query_1",
      "request": {
        "query": { "match": { "text": "\nwhat are the structural and aeroelastic problems associated with flight\nof high speed aircraft .\n" } }
      },
      "ratings": [
        ...
      ]
    },
    {
      "id": "Query_2",
      "request": {
        "query": { "match": { "text": "\nwhat problems of heat conduction in composite slabs have been solved so\nfar .\n" } }
      },
      "ratings": [
        ...
      ]
    }
  ],
  "metric": {
    ...
  }
}

2.2. Simple Query String Operator#

The Simple Query String operator uses a parser with a limited error-tolerant syntax to search for the given query string in the given fields. It splits and analyzes the query terms separately for the search before returning the relevant documents. It ignores invalid parts of the query string.

Here is an example request with the simple query string operator searching for the query in the 'text' and 'title' fields:

{
  "requests": [
    {
      "id": "Query_1",
      "request": {
        "query": { "simple_query_string": {
          "query": "\nwhat are the structural and aeroelastic problems associated with flight\nof high speed aircraft .\n",
          "fields": ["title", "text"]
        } }
      },
      "ratings": [
        ...
      ]
    },
    {
      "id": "Query_2",
      "request": {
        "query": { "simple_query_string": {
          "query": "\nwhat problems of heat conduction in composite slabs have been solved so\nfar .\n",
          "fields": ["title", "text"]
        } }
      },
      "ratings": [
        ...
      ]
    }
  ],
  "metric": {
    ...
  }
}

2.3. Multi-Match Query Operator#

The Multi-Match Query operator builds on the Match Query operator and allows matching across multiple fields. While it works similarly to the Simple Query String operator, it scores the documents somewhat differently.

If no specific fields are provided it will search all available fields. We chose "title" and "text" since they contain the most information:

{
  "requests": [
    {
      "id": "Query_1",
      "request": {
        "query": { "multi_match": {
          "query": "\nwhat are the structural and aeroelastic problems associated with flight\nof high speed aircraft .\n",
          "fields": ["title", "text"]
        } }
      },
      "ratings": [
        ...
      ]
    },
    {
      "id": "Query_2",
      "request": {
        "query": { "multi_match": {
          "query": "\nwhat problems of heat conduction in composite slabs have been solved so\nfar .\n",
          "fields": ["title", "text"]
        } }
      },
      "ratings": [
        ...
      ]
    }
  ],
  "metric": {
    ...
  }
}

This was also the operator we chose for our Standard Search to Stemming Comparison, which you can read about in our First Experiment.
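
Since only the contents of the "query" object differ between the three operators, the request entries can also be generated programmatically. The helper below is a minimal sketch of our own (the function build_request and its parameters are illustrative, not part of the Elasticsearch API); it produces entries in the format shown above:

# A small helper (our own, not part of the Elasticsearch API) that builds
# one entry of the "requests" list in the evaluation body format shown above.
def build_request(query_id, query_text, doc_ratings,
                  operator="multi_match", fields=("title", "text")):
    """doc_ratings is a list of (index, doc_id, rating) tuples;
    operator is "match", "simple_query_string" or "multi_match"."""
    if operator == "match":
        # the match query searches a single field; we use "text" here
        query = {"match": {"text": query_text}}
    else:
        # simple_query_string and multi_match share the same shape
        query = {operator: {"query": query_text, "fields": list(fields)}}
    return {
        "id": query_id,
        "request": {"query": query},
        "ratings": [
            {"_index": index, "_id": doc_id, "rating": rating}
            for index, doc_id, rating in doc_ratings
        ],
    }

# Example: one multi_match request for the Cranfield corpus
request = build_request(
    "Query_2",
    "what problems of heat conduction in composite slabs have been solved so far .",
    [("cranfield-corpus", "12", 1)],
)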

3. Rating Metric#

With our rating metric we are able to rank more relevant documents higher than less relevant ones: for example, if we search for ice cream truck we could rate documents containing ice cream truck higher than documents containing only ice cream or truck, while still labeling all of those documents as relevant. A document’s rating is represented by an integer; the higher the number, the more relevant the document. To make use of this scale in your search you must first set the relevant_rating_threshold parameter to the highest rating you assign. The default is 1, which means every rated document counts as relevant and the distinctions between the different ratings have no effect.

We can only make this adjustment if we have annotated data that provides such graded ratings or if we are willing to rate the relevant documents manually.

Therefore, if we do not have any information other than whether a document is relevant or irrelevant, we can just use the binary scale: 0 is irrelevant and 1 is relevant.

This is an example with higher scaled ratings:

{
  "requests": [
    {
      "id": "Query_1",
      "request": {
        ...
      },
      "ratings": [
        { "_index": "cranfield-corpus", "_id": "184", "rating": 2 },
        { "_index": "cranfield-corpus", "_id": "29", "rating": 3 },
        { "_index": "cranfield-corpus", "_id": "31", "rating": 1 }
      ]
    },
    {
      "id": "Query_2",
      "request": {
        ...
      },
      "ratings": [
        { "_index": "cranfield-corpus", "_id": "12", "rating": 1 }
      ]
    }
  ],
  "metric": {
    "precision": {
      "k": 20,
      "relevant_rating_threshold": 3,
      "ignore_unlabeled": false
    }
  }
}

Since the Cranfield corpus was the only one of our corpora that provided scaled ratings, we used only the binary rating. We also did not rate any documents with 0, since any unrated document receives a 0 by default.

4. Evaluation Metrics#

In addition to the metrics we can adjust by changing the request, there are some metrics that influence how the evaluation results will be computed.

4.1. Precision at k#

By setting the metric to precision and defining k, we can measure how many relevant results are returned in the top k results.

Since a user typically only looks at the first few results of a search query, it is not necessary to set k higher than 20. Reducing the number of top results to 10 or 5 could give an even better idea of the performance.

However, this metric only recognizes documents as relevant if they are labeled as such, so documents that would be relevant but are not labeled can actually worsen the results. The position is not taken into account either: documents at position 1 and position 10 are considered equally important.

We have already explained the relevant_rating_threshold parameter above. The ignore_unlabeled parameter controls how unlabeled documents are handled. The default value is false, which treats unlabeled documents as irrelevant; if set to true, those documents are not counted at all.

An example for the Precision metric:

{
  "requests": [
    ...
  ],
  "metric": {
    "precision": {
      "k": 20,
      "relevant_rating_threshold": 3,
      "ignore_unlabeled": false
    }
  }
}
What is "Precision"?

Precision measures the probability that retrieved documents are relevant to the search query: the number of retrieved relevant documents is divided by the total number of retrieved documents.
For example: we retrieve 10 search results and only 5 are relevant to our search. The Precision measure would be: 5/10 = 0.5
To measure the Precision it is necessary to have the relevant documents labeled as such. Precision only looks at the documents that are retrieved and does not take into account any documents which could have been retrieved.
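
For a quick sanity check, Precision at k can also be computed by hand from a ranked result list. A minimal sketch (function and variable names are our own illustration, not part of the API):

# Illustrative only -- this is not the API's implementation.
def precision_at_k(retrieved_ids, relevant_ids, k=20):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k) if top_k else 0.0

# Mirrors the example above: 10 results retrieved, 5 of them relevant
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d5", "d7", "d9"}
print(precision_at_k(retrieved, relevant, k=10))  # prints 0.5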

4.2. Recall at k#

The recall metric works very similarly to the precision metric: k defines how many of the top results are taken into account when counting the relevant documents that were returned. Like precision, this metric has problems if documents are not labeled. The position is not taken into account either, and retrieved irrelevant documents do not affect the score at all.

The relevant_rating_threshold parameter is explained above.

An example for the Recall metric:

{
  "requests": [
    ...
  ],
  "metric": {
    "recall": {
      "k": 20,
      "relevant_rating_threshold": 3
    }
  }
}
What is "Recall"?

Recall measures the probability that relevant documents are retrieved: the number of retrieved relevant documents is divided by the number of all documents labeled as relevant.
For example: our collection contains 10 documents, 8 of them are relevant and 4 of the relevant ones are retrieved. The Recall measure would be: 4/8 = 0.5
To measure the Recall it is necessary to have the relevant documents labeled. Recall only looks at the documents that could have been retrieved and does not take into account whether any irrelevant documents are retrieved alongside the relevant ones.
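
Recall at k can be checked the same way; the only difference is that the denominator is the number of labeled relevant documents instead of the number of retrieved documents. Again a minimal sketch with illustrative names:

# Illustrative only -- this is not the API's implementation.
def recall_at_k(retrieved_ids, relevant_ids, k=20):
    """Fraction of the labeled relevant documents that appear in the top-k results."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Mirrors the example above: 8 relevant documents, 4 of them retrieved
retrieved = ["d1", "d2", "d3", "d4", "d9", "d10"]
relevant = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
print(recall_at_k(retrieved, relevant, k=10))  # prints 0.5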



Acknowledgements:
Thanks to Kenny Hall for proofreading this article.

Written by Miriam Rupprecht, November 2020