Basic Terms
With this guide, we want to provide you with some quick definitions on the basic Natural Language Processing (NLP) terms used in our experiments. If there are terms that you think should be explained as well, please let us know.
NLP - Natural Language Processing
First, let's clarify what NLP is. NLP stands for “Natural Language Processing” and is a sub-field of Computer Science that works with natural language. A natural language (for example English, German, or Japanese) is a language that was naturally created by people to communicate and has evolved through use and repetition. As opposed to artificial languages or computer code, there is no large degree of planning or optimization involved when the rules of a natural language change. Since natural language sometimes appears arbitrary (in comparison with artificial languages) and does not always follow strict rules, it is very difficult to process automatically. NLP therefore began to gain importance in the 1950s as an intersection of artificial intelligence and linguistics. Nowadays, it is a separate field of research.
Corpus (pl. Corpora)
A text corpus (plural: corpora) is a collection of texts that has been gathered to be used in Natural Language Processing. Often a corpus contains not only the texts themselves but also useful additional information, for example about the author or the publication date. A corpus can contain texts from one language (a monolingual corpus) or texts from several languages (a multilingual corpus). Corpora are usually created with a specific application in mind, which is why not every corpus is suitable for every task; 'The Stanford Question Answering Dataset', for example, was collected specifically for training question answering systems. Every corpus is structured differently, which is why just analyzing a corpus can take a lot of time.
Tokens vs. Types
The definition of “word” is not very clear in linguistics. For our purposes we need a better, more specific definition. When we automatically process language and talk about “words”, we usually distinguish between Types and Tokens. The Tokens are all the “words” in a running text, separated by punctuation and spaces. The Types, on the other hand, are the distinct classes of Tokens in a sentence or text. As an example:
Sentence 1: A rose is a rose
This sentence has 3 Types (a, rose, is) and 5 Tokens (a, rose, is, a, rose). Since a and rose each appear twice, they only count as 1 Type each.
If we replace the last token, rose, the number of Types changes:
Sentence 2: A rose is a roses
(Please note: This sentence is ungrammatical on purpose, because we wanted to show how small differences can influence the Type ratio)
Now we have 4 distinct Types (a, rose, is, roses) and still 5 Tokens (a, rose, is, a, roses).
Note that even if a word is the same but has a different ending (for example roses, the plural of rose), it counts as a separate Type.
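The token and type counts above can be reproduced with a few lines of Python. This is only a minimal sketch that assumes lowercased, whitespace-separated text; real tokenizers handle punctuation and much more (see the Tokenization section below).

```python
# Count Tokens and Types in a sentence.
# Minimal sketch: lowercase the text and split on whitespace.
sentence = "A rose is a rose"

tokens = sentence.lower().split()   # ['a', 'rose', 'is', 'a', 'rose']
types = set(tokens)                 # {'a', 'rose', 'is'}

print(len(tokens), "Tokens:", tokens)        # 5 Tokens
print(len(types), "Types:", sorted(types))   # 3 Types
```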
Tokenization
Tokenization is the process of splitting a long sequence of symbols (like a sentence or a text) into Tokens. Tokens are defined in the Tokens vs. Types paragraph above. In many languages, words are separated by spaces and punctuation marks. Therefore, many tokenizers split sentences into words at these markers.
For example, the sentence A rose is a rose. can be split into A, rose, is, a, rose, and the final period (.). Depending on the tokenizer, the dot (.) could also be stripped away.
But some cases make it difficult (or incorrect) to tokenize in this manner. Some symbol sequences do not make sense when they get split up by simply following these rules.
Examples are proper names like U.K. or San Francisco, or words containing an apostrophe, like s'more.
That's why a tokenizer often needs information about the language to split a sentence in a way that makes sense. For some languages, such as Japanese or Chinese, you can't split at spaces at all, because words are not separated by spaces in writing.
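To illustrate, here is a small, hypothetical Python sketch of a naive rule-based tokenizer that splits on whitespace and separates punctuation. It handles the rose example correctly but breaks apart U.K., which shows why language-aware tokenizers are needed.

```python
import re

def naive_tokenize(text):
    # Very naive tokenizer: keep runs of word characters together
    # and treat every other non-space symbol as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("A rose is a rose."))
# ['A', 'rose', 'is', 'a', 'rose', '.']

print(naive_tokenize("Visit the U.K. or San Francisco."))
# ['Visit', 'the', 'U', '.', 'K', '.', 'or', 'San', 'Francisco', '.']
# 'U.K.' is torn apart and 'San Francisco' becomes two tokens,
# which is exactly where language knowledge is needed.
```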
Lemmatization
With lemmatization, we try to connect the words in the text to their basic form, their Lemma. The Lemma is the root - or the dictionary form - of a word. Lemmatization is done by removing prefixes and suffixes according to the morphological rules of the language.
In this way, we reduce the number of distinct Types in a text while the number of Tokens stays the same: a lemmatizer maps all the different forms of a word to their common Lemma.
For example, the lemmatizer will recognize that mice is a form of mouse, and that froze and frozen are forms of freeze.
There are some languages, such as Arabic, which cannot be processed properly without lemmatization.
But since a lemmatizer needs a lot of information about the morphology of a language, it usually slows down the search process significantly and requires a large amount of additional data. However, if lemmatization is well adapted to the data, it can significantly improve the search results.
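As a concrete sketch, NLTK's WordNetLemmatizer can handle the examples above (this assumes the WordNet data has been downloaded, e.g. with nltk.download('wordnet')); other tools such as spaCy work similarly.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("mice"))             # 'mouse'
print(lemmatizer.lemmatize("froze", pos="v"))   # 'freeze'
print(lemmatizer.lemmatize("frozen", pos="v"))  # 'freeze'

# The correct Lemma often depends on the part of speech, which is
# why pos="v" (verb) is passed for 'froze' and 'frozen'.
```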
Stemming
Stemming is a simpler version of lemmatization that does not rely on a full morphological analysis. Depending on the method used, it mainly strips known prefixes or suffixes from the beginning or the end of a word to reduce it to its Stem. In English, the Stem is generally the part of the word that doesn't change when you apply grammatical rules. Stemming is one of the most commonly used methods when dealing with search engines, since it is easier than lemmatization.
The goal of a stemmer is to remove the morphological features from a word, which produces truncated Stems that may be shared by several words and are therefore often ambiguous.
For example, learning changes to learn after stemming.
In most cases, the search query is only improved by a stemmer if the query is not too long. Otherwise, there is a risk that too many irrelevant results will be returned.
For short queries, however, stemming can be very helpful, as small grammatical deviations can still be matched by the search. One must nevertheless be careful with too much stemming. Sometimes stemming produces a part of a word that is not linguistically valid; if the stemmer cuts off too much, the word becomes too short and loses its semantic meaning. This is called overstemming. It occurs, for example, when a stemmer reduces saw to s.
Always keep in mind that the quality of a stemmer varies greatly from language to language because some languages have more morphological derivations than others.
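For illustration, the widely used Porter stemmer (available in NLTK) strips common English suffixes with a fixed set of rules; this is a small sketch, not a full search pipeline.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["learning", "learned", "learns", "roses"]:
    # The Porter algorithm applies rule-based suffix stripping.
    print(word, "->", stemmer.stem(word))

# learning -> learn
# learned  -> learn
# learns   -> learn
# roses    -> rose
```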
Stop Words
In most languages, the most frequent words are also those that do not carry much (or any) meaning. In English, such words include “is”, “a”, “and”, and “the”. To keep only the relevant parts of a search query, we try to remove those words. They are called stop words, since they do not contribute any useful information while still having to be processed. Ignoring stop words is a way to make the search more efficient and to get more relevant results.
To ignore or remove them, we can simply treat most of the 10-100 most frequent words as stop words, or use one of the already existing stop word lists for the language we want to search.
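A minimal filtering step could look like the sketch below; the tiny stop word list is made up for illustration, and in practice one would use a per-language list (for example the ones shipped with NLTK or spaCy).

```python
# Hypothetical, hand-written stop word list for English.
STOP_WORDS = {"a", "an", "and", "is", "or", "the"}

tokens = ["a", "rose", "is", "a", "rose"]

# Keep only the tokens that carry content.
content_tokens = [t for t in tokens if t not in STOP_WORDS]
print(content_tokens)  # ['rose', 'rose']
```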
Cosine Similarity
In NLP, language can be represented as a vector of features, a so-called embedding. Embeddings can represent words, sentences, or even whole documents. To use these embeddings in, for example, Information Retrieval, it is necessary to have a way of computing the similarity between them. The most common way is to compute the cosine similarity, which measures the angle between two vectors. This measure depends only on the angle, not on the magnitude (length) of the vectors, which makes it well suited for NLP tasks, since embeddings rarely have the same magnitude. The smaller the angle between the vectors, the more similar they are: if the angle is 0, the vectors point in the same direction and the cosine similarity is 1.
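Here is a short sketch of how cosine similarity could be computed with NumPy; the vectors are made-up toy examples, while real embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])    # points in the same direction as v1
v3 = np.array([-1.0, 0.5, 0.0])   # orthogonal to v1

print(cosine_similarity(v1, v2))  # ~1.0 (same direction)
print(cosine_similarity(v1, v3))  # 0.0  (no similarity)
```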
Acknowledgements:
Thanks to Kenny Hall and Irina Temnikova for proofreading this article.