
Sentence Embeddings Experiment

For our second experiment we connected a Colab Notebook to an Elasticsearch instance and compared the standard Elasticsearch full-text search with our first embedding approach. To do this, we first transformed the documents into vector representations using a BERT base model (uncased) for Sentence Embeddings ("bert-base-nli-mean-tokens" from the Sentence Transformers repository) and indexed them. Next, we searched the transformed documents with a vector representation of the query. Since the "bert-base-nli-mean-tokens" model is designed to work on sentences, we searched both the document titles and the entire text, and compared the results.

We have already described how the standard Elasticsearch analyzer works and how we configured it in our first experiment.

Need a short summary of the Multi-Match Query?

The standard Elasticsearch method - Multi-Match Query - is a full-text search which by default uses the highest matching score from the given fields to rank the documents; in our case "text" and "title". If no fields are provided, the Multi-Match Query uses all existing fields as a default. To gather our data, we only looked at the first 20 retrieved documents for any given search.
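As a minimal sketch (not our full search code), a Multi-Match Query body for these two fields might look like this; query_txt['question'] holds the raw query text and follows the naming used in the code snippets further below:

# Sketch: Multi-Match Query on the "title" and "text" fields.
# By default ("best_fields") the highest-scoring field determines the document score.
multi_match_body = {
    "query": {
        "multi_match": {
            "query": query_txt['question'],
            "fields": ['title', 'text']
        }
    }
}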

1. What Is BERT?#

BERT is short for Bidirectional Encoder Representations from Transformers. It is an approach that uses large neural networks, pre-trained on huge text collections, to obtain vector representations (embeddings) of texts. Using these embeddings, we can apply similarity metrics - such as cosine similarity - to compare the meaning of the texts.

(By the way, these networks are frequently used as a backbone or part of models to solve some NLP tasks like Question Answering, Ranking, Named Entity Recognition, etc.)

"I'm brave enough to read the paper on BERT"

We also try to explain BERT in our Guide About Transformers and Embeddings.

2. Preparing The BERT Model#

For our experiment, we decided on a Sentence Transformer BERT model which was optimized to work on Semantic Textual Similarity; "bert-base-nli-mean-tokens". The selected pre-trained BERT model was then used to map the desired search fields from the documents and the search query into embeddings. To get comparable values we ranked the search results based on the cosine similarity between the sentence embeddings and query embeddings.

What is cosine similarity?

Cosine similarity measures how similar two vectors are, irrespective of their magnitude. In NLP these vectors often represent words, sentences, or even whole documents. To get the cosine similarity, the angle between the two vectors - which live in a high-dimensional space - is considered: the smaller the angle, the higher the cosine similarity. A cosine similarity of 1 therefore means the vectors point in exactly the same direction. Read more details on our Basic Definition page.
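As a small, self-contained illustration with toy vectors (not our actual 768-dimensional embeddings), the cosine similarity can be computed like this:

# Toy example: cosine similarity between two short vectors
import numpy as np

a = np.array([0.2, 0.5, 0.1])
b = np.array([0.3, 0.4, 0.0])

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 1.0 would mean the vectors point in exactly the same direction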

Where can you find a fitting BERT model for your project?

There are a lot of different pre-trained BERT models. The Sentence Transformers documentation is a good starting point for finding a fitting model, and even more models can be found on the Hugging Face website.


We then tried to improve the results by re-ranking the top 100 results of a normal multi-match query using the cosine similarity.

Initialize A Pre-trained BERT Model

Before we could start experimenting, we had to choose a pre-trained model from the many available in the Sentence Transformers library (TU Darmstadt), download it, and initialize it. Our first choice was the BERT model "bert-base-nli-mean-tokens" for Semantic Textual Similarity, which was developed to generate sentence embeddings. (For more information on this model you can read the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.)

To initialize the model we ran the following code:

# install the sentence-transformers library (run with "!" in a Colab Notebook cell)
!pip install -U sentence-transformers

# load the pre-trained model for sentence embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

3. Experimenting With Sentence Embeddings From BERT#

After initializing the BERT model, we converted the documents and the search queries into sentence embeddings. Then - similarly to our last experiment - we calculated comparable values using the Ranking Evaluation API. We decided to compare three different approaches:

  • The Multi-Match Query with a standard analyzer
  • A Script Score Query to get the cosine similarity for the BERT embeddings
  • A more complex Rescore Query that combines the first two approaches: re-ranking the top 100 documents retrieved by a Multi-Match Query with the cosine similarity

We ran each of these approaches separately on the "title" and the "text" field. For the re-ranking we also tried a combination of both fields.
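We do not show our full evaluation code here. As a rough sketch only (the index name, query ID and ratings below are placeholders, not our actual data), a single request to Elasticsearch's Ranking Evaluation API could look like this:

# Sketch of a Ranking Evaluation API call; all names, IDs and ratings are placeholders
rank_eval_body = {
    "requests": [
        {
            "id": "query_1",
            "request": {                        # the search request to evaluate
                "query": {
                    "multi_match": {
                        "query": "example search question",
                        "fields": ["title", "text"]
                    }
                }
            },
            "ratings": [                        # relevance judgements from the corpus
                {"_index": "adi_index_title", "_id": "17", "rating": 1}
            ]
        }
    ],
    "metric": {
        "recall": {"k": 20, "relevant_rating_threshold": 1}
    }
}

result = es.rank_eval(index="adi_index_title", body=rank_eval_body)
print(result["metric_score"])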

3.1. Preparing The Documents#

Now that the model had been integrated into our Notebook, we could start transforming our texts with BERT. To be able to compare the documents and the search query using cosine similarity, we first had to convert the desired search fields into embeddings and feed them into Elasticsearch.

Title Embeddings
For our first approach, we tried to calculate the similarity between the search queries and the titles of the documents. To do this, we set title_vector as a container for the computed BERT embedding in addition to the field title, which should be the container for the text of the title:

# Settings for BERT title search
settings_title = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "title_vector": {
                "type": "dense_vector",
                "dims": 768
            }
        }
    }
}

While indexing, the code iterates over the preprocessed corpus and encodes the title for each document using the "bert-base-nli-mean-tokens" model:

# index for BERT title search
for ID, doc_data in adi_txt_data.items():
    es.index(
        index=adi_index_title,
        id=ID,
        body={
            'title_vector': model.encode(adi_txt_data[ID]['title']),
            'title': adi_txt_data[ID]['title'],
        }
    )

Note: Not every corpus has titles, which is why we excluded the Medline, NPL and Time corpora from the title search.

| BERT on title field | ADI | CACM | CISI | Cranfield | LISA |
| --- | --- | --- | --- | --- | --- |
| Recall | 0.459 | 0.109 | 0.039 | 0.24 | 0.139 |
| Precision | 0.459 | 0.07 | 0.089 | 0.082 | 0.056 |
| F1-Score | 0.459 | 0.085 | 0.054 | 0.123 | 0.08 |

Text Embeddings
The procedure for working with the whole text was a bit more complicated, as we first had to split the text into sentences. Otherwise the inputs would have been far too long for the model and would simply have been cut off. In addition, we suspect that individual sentences are more likely to be similar to the title than the whole text is. Therefore, we created a new category inside the corpus dictionaries to store the split sentences for easy access:

# transform text to sentences for BERT text search
import nltk
from nltk import tokenize
nltk.download('punkt')

def text_to_sentences(string):
    sentences = tokenize.sent_tokenize(string)
    return sentences

for ID, doc in adi_txt_data.items():
    text = adi_txt_data[ID]['text']
    adi_txt_data[ID]['sentences'] = text_to_sentences(text)

Since a text is represented by more than one sentence, we needed to create a container to store all of the sentence embeddings. Therefore, we initialized text_vector as seen below, which is a nested container. The whole, unsplit text is stored in the container text:

# Settings for BERT text search
settings_text = {
    "mappings": {
        "properties": {
            "text_vector": {
                "type": "nested",
                "properties": {
                    "vector": {
                        "type": "dense_vector",
                        "dims": 768
                    }
                }
            },
            "text": {
                "type": "keyword"
            }
        }
    }
}

Now we can index the text as a whole and as a list of embeddings:

# index for BERT text search
for ID, doc_data in adi_txt_data.items():
    es.index(
        index=adi_index_text,
        id=ID,
        body={
            'text_vector': [{"vector": model.encode(text)} for text in adi_txt_data[ID]['sentences']],
            'text': adi_txt_data[ID]['text']
        }
    )

Note: Not every document has text, so for some corpora, e.g. the CACM corpus, we needed to check for that while indexing.

| BERT on text field | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Recall | 0.528 | 0.114 | 0.067 | 0.242 | 0.129 | 0.171 | 0.099 | 0.58 |
| Precision | 0.12 | 0.062 | 0.096 | 0.084 | 0.043 | 0.21 | 0.089 | 0.117 |
| F1-Score | 0.196 | 0.081 | 0.079 | 0.124 | 0.064 | 0.189 | 0.094 | 0.195 |

3.2. Search With Cosine Similarity#

Once we had prepared all the documents, the only thing left was to compute the search queries before we could pass them to the Ranking Evaluation API. For the search on the title fields a script_score request for each query was enough to get results. Our script_score calculates the cosine similarity between each document title and the search query before returning the top 20 results:

"query" : {  "script_score": {    "query": {"match_all": {}},    "script": {      "source": "cosineSimilarity(params.query_vector, doc['title_vector']) + 1.0",      "params": {"query_vector": list(model.encode(query_txt['question']).astype(float))}    }  }  }    }

For the search on the text_vector embeddings it was a bit more complicated. Since the embeddings were saved as a list, we had to create a nested query which calculates the cosine similarity between each sentence and the query. The best cosine similarity value becomes the value for the entire document:

"query": {            "nested": {                "path": "text_vector",                "score_mode": "max",                 "query": {                    "function_score": {                        "script_score": {                            "script": {                                "source": "1.0 + cosineSimilarity(params.query_vector, 'text_vector.vector')",                                "params": {"query_vector": list(model.encode(query_txt['question']).astype(float))}                                }                                }                                }                          }                       }                  }                  }

3.3. Search With Re-ranking#

Since the text embeddings did not seem to work as well as the standard Elasticsearch analyzer, we decided to try re-ranking the documents. Re-ranking means taking an initial result list and ordering it again with a different scoring method; in Elasticsearch the feature to achieve this is called Rescoring.

Therefore, we ran a normal multi-match query on the given field ("title" or "text") and rescored the top 100 results from that query with our cosine similarity BERT search.

In this example we use our search on the "text"-field and the cosine similarity search on the sentence embeddings:

{  "query": {    "multi_match" : {                  "query" : query_txt['question'],                  "fields" : ['text']            }  },  "rescore" : {      "window_size" : 100,      "query": {          "rescore_query" : {            "nested": {                "path": "text_vector",                "score_mode": "max",                 "query": {                    "function_score": {                        "script_score": {                            "script": {                                "source": "1.0 + cosineSimilarity(params.query_vector, 'text_vector.vector')",                                "params": {"query_vector": list(model.encode(query_txt['question']).astype(float))}                                }                                }                                }                          }                       }            }                  }}    }

| Re-ranking on title field | ADI | CACM | CISI | Cranfield | LISA |
| --- | --- | --- | --- | --- | --- |
| Recall | 0.472 | 0.185 | 0.059 | 0.405 | 0.306 |
| Precision | 0.111 | 0.11 | 0.094 | 0.147 | 0.141 |
| F1-Score | 0.18 | 0.138 | 0.072 | 0.216 | 0.193 |

| Re-ranking on text field | ADI | CACM | CISI | Cranfield | LISA | Medline | NPL | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Recall | 0.537 | 0.272 | 0.112 | 0.515 | 0.355 | 0.335 | 0.233 | 0.774 |
| Precision | 0.113 | 0.13 | 0.155 | 0.183 | 0.143 | 0.38 | 0.218 | 0.151 |
| F1-Score | 0.187 | 0.176 | 0.13 | 0.27 | 0.204 | 0.356 | 0.225 | 0.253 |

3.4. Combining Those Approaches#

To get even better results, we then tried to combine all the previous approaches into one more complex Elasticsearch query. We searched on both fields - "title" and "text" - and re-ranked afterwards using the cosine similarity of the title embedding and the sentence embeddings of the text. Since the title is more likely to be meaningful, we boosted the value of the cosine similarity for the title embeddings. The boosting can be adjusted by setting the variables "query_weight" and "rescore_query_weight" to float values.
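The weight values themselves are not part of the query template below; as an illustration only (these particular numbers are arbitrary, not the values from our runs), they could be set like this:

# Illustrative values only: query_weight scales the original multi-match score,
# rescore_query_weight scales the cosine-similarity score of the title rescorer.
query_weight = 1.0
rescore_query_weight = 2.0   # larger than query_weight, so the title similarity is boosted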

This is the code we used:

{  "query": {    "multi_match" : {                  "query" : query_txt['question'],                  "fields" : ['text','title']            }  },  "rescore" :  [                {      "window_size" : 100,      "query" : {         "rescore_query" : {              "script_score": {    "query": {"match_all": {}},    "script": {      "source": "cosineSimilarity(params.query_vector, doc['title_vector']) + 1.0",      "params": {"query_vector": list(model.encode(query_txt['question']).astype(float))}    }            }      },         "query_weight" : query_weight,         "rescore_query_weight" : rescore_query_weight   }},                {      "window_size" : 100,      "query": {          "rescore_query" : {            "nested": {                "path": "text_vector",                "score_mode": "max",                 "query": {                    "function_score": {                        "script_score": {                            "script": {                                "source": "1.0 + cosineSimilarity(params.query_vector, 'text_vector.vector')",                                "params": {"query_vector": list(model.encode(query_txt['question']).astype(float))}                                }                                }                                }                          }                       }            }                  }}    ]}

| Complex re-ranking query | ADI | CACM | CISI | Cranfield | LISA |
| --- | --- | --- | --- | --- | --- |
| Recall | 0.556 | 0.294 | 0.105 | 0.516 | 0.383 |
| Precision | 0.126 | 0.145 | 0.159 | 0.184 | 0.167 |
| F1-Score | 0.205 | 0.194 | 0.127 | 0.271 | 0.233 |

4. Results#

In order to get suitable comparison values, we ran the standard analyzer in the same way as the BERT-based search on the title or text field. In addition, we tested how effective re-ranking the BERT-based results can be, and built a more complex Elasticsearch query that combines everything we evaluated before. To compare our results properly we measured Recall, Precision and F1-Score for every method on every corpus. To avoid confusion, we split the results into the search on the "title" field, the search on the "text" field, and the more complex query.

What is "Recall"?
Recall measures the probability that relevant documents are retrieved. The number of retrieved relevant documents is divided by the number of all documents that are labeled as relevant. For example, if we search a collection of 10 documents of which 8 are relevant, and 4 of those relevant documents are retrieved, then the Recall would be 4/8 = 0.5. To measure Recall it is necessary to have the relevant documents labeled. Recall only looks at whether the relevant documents were retrieved and does not take into account any irrelevant documents that may have been retrieved alongside them.
What is "Precision"?
Precision measures the probability that retrieved documents are relevant to the search query. Therefore, the number of all retrieved relevant documents is divided by the number of all retrieved documents. For example, if we retrieve 10 search results and only 5 are relevant for our search, then the Precision measure would be: 5/10 = 0.5. To measure the Precision it is necessary to have the relevant documents labeled as such. Precision only looks at the documents that are retrieved and does not account for relevant documents which were not retrieved.
What is "F1-Score"?
The F1-Score is the harmonic mean of Precision and Recall: we multiply the product of Precision and Recall by two and divide it by the sum of Precision and Recall:
`F1-Score = (2 * Precision * Recall) / (Precision + Recall)` This is the simplest way to balance Precision and Recall; there are also other common variants that weight the two differently.
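
As a small, self-contained sketch (not the code we used; our metrics come from the Ranking Evaluation API), the three measures can be computed from sets of retrieved and relevant document IDs like this:

# Minimal sketch: Recall, Precision and F1-Score from sets of document IDs
def evaluate(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)                       # retrieved AND relevant
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1

# Example from the definitions above: 10 documents retrieved, 5 of them relevant
print(evaluate(range(10), [0, 1, 2, 3, 4]))                # (1.0, 0.5, 0.666...)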

4.1. Search on the title field#

These are the results for the search on the "title" field, which compare the standard analyzer, a BERT-based search, and a re-ranking of the top 100 results using BERT. Since the data sets Medline, NPL and Time do not have any titles, the title search was left out for these corpora, as in the last experiment.

[Figure: Recall on title field]

[Figure: Precision on title field]

[Figure: F1-Score on title field]

4.2. Search on the text field#

These are the results for the search on the "text"-field which compare the standard analyzer, a BERT search and a re-ranking on the top 100 results using BERT.

[Figure: Recall on text field]

[Figure: Precision on text field]

[Figure: F1-Score on text field]

4.3. Search with a complex re-ranking query#

These are the results for the search with the more complex query, which searches on both the "title" and "text" fields and then re-ranks the results by computing the cosine similarity for the BERT vectors of title and text, with the title field boosted during re-ranking. To show the difference, these results are compared with the standard analyzer on the "title" and "text" fields as well as with the BERT-based search on "title" and "text".

[Figure: Recall on complex re-ranking query]

[Figure: Precision on complex re-ranking query]

[Figure: F1-Score on complex re-ranking query]

For a more detailed analysis check out our Embedding Comparison.



Acknowledgements:
Thanks to Pavel Prokopev for the pre-work on the BERT Embeddings and Kenny Hall for proofreading this article.

Written by Miriam Rupprecht, December 2020