Defining an ElasticSearch Schema for Relevant Search

30 May 2019

 

Introduction

ElasticSearch is a powerful search engine that provides a Search Relevance Engineer with many useful tools for presenting appropriate search results to users. Yet it’s also easy to configure the search index’s settings in such a way that a user’s query will not return relevant results, or possibly not return any results at all! In this blog post, I will show you how to create an ElasticSearch schema that yields relevant results, and how to avoid the pitfalls that can prevent users from seeing them. Like Solr, another popular open source search engine, ElasticSearch is built on Lucene.

 

Creating a Search Index

ElasticSearch is easy to use once you get the hang of it. Pretty much anything you need to do can be accomplished by sending an HTTP request. There exist libraries for Python, Java, and most of your favorite programming languages that can be used to interact with the ElasticSearch API, but these are really just convenient wrappers for sending HTTP requests. I find it helps to illustrate these requests using curl, because it is a tool that is familiar to pretty much anyone who has worked in a Unix-like environment.

To create a search index, we can send a request like this (assuming curl is available, e.g. on Linux):

curl -XPUT "http://localhost:9200/my_movie_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "text"
        },
        "overview": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}'

The name “my_movie_index” in the request path is the name of the index we are creating.  The ‘Content-Type: application/json’ header is necessary because, if left out, curl will make the request using ‘Content-Type: application/x-www-form-urlencoded’.  Of course, when using an ElasticSearch library in your favorite programming language, low-level details such as this are already taken care of for you.  Now, turning to the JSON body of the request: under the ‘settings’ key, one can set ‘number_of_shards’ and ‘number_of_replicas’.  These are important settings for ElasticSearch clustering because they tell ElasticSearch how many partitions to split the index into and how many copies of those partitions to keep.  For our purposes, we’re just working with one ElasticSearch instance, or node, so we don’t have to concern ourselves with configuring multiple shards.
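For comparison, a multi-node cluster might split the index into several shards and keep a redundant copy of each.  A hypothetical settings fragment would look like this (the values here are purely illustrative, not recommendations):

```json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```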

Underneath the ‘mappings’ key is where we define our datatypes and our schema settings.  This example defines just one datatype, “movie,” as movies are what we will be ingesting.  And, for this movie datatype, we have defined two fields: “title” and “overview.”  When we add movies to our index, our movie objects can actually contain more than just those two fields, such as “directors,” “cast,” “genres,” etc., but by not defining those in the schema, we are letting ElasticSearch handle those fields in its default manner rather than controlling that behavior ourselves.
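For instance, if an ingested document contains a “directors” field that has no explicit mapping, ElasticSearch’s dynamic mapping will infer one for it.  The inferred mapping for a string field would look roughly like this (an illustrative sketch, not output copied from a live cluster):

```json
"directors": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
```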

How much control do we have over ElasticSearch’s behavior when it comes to ingesting the text in these fields?  For example, take the “overview” field in the movie datatype schema.  It has an “english” analyzer applied.  The “english” analyzer is just shorthand for a custom analyzer definition like this:

{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["exclude-from-stemming"]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

This tells ElasticSearch a lot about what to do with the text in this field at index time.  First, we see that a stopwords filter is defined.  “_english_” is shorthand for ElasticSearch’s default English stopwords list:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

These are mostly articles, conjunctions, pronouns, and prepositions, which are very common in text, in general, and don’t convey much meaning by themselves.  These words are filtered out so that they do not become part of the search index.
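To see the effect concretely, here is a minimal Python sketch of stop-word filtering.  This is an illustration of the concept only, not ElasticSearch’s actual implementation; the function name is my own:

```python
# The default English stop-word list shown above.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def remove_stopwords(tokens):
    """Drop any token that appears in the stop-word set."""
    return [t for t in tokens if t not in ENGLISH_STOPWORDS]

print(remove_stopwords(["the", "airplane", "is", "landing"]))
# -> ['airplane', 'landing']
```

Only “airplane” and “landing” survive to be indexed; “the” and “is” are discarded.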

Next, the english_possessive_stemmer is defined.  This is used to remove ’s from possessive words so that phrases such as “airplane’s landing gear” will be indexed the same as “airplane landing gear.”  This is useful for search relevance because a user who queries the search index for “airplane” doesn’t want to miss relevant results just because the word that appeared in the original text was the possessive variant, “airplane’s.”  For relevance purposes, these two words should be treated as the same token in the search index.

Next, the english_stemmer is defined.  ElasticSearch uses the Porter stemming algorithm for the English language, but this can be overridden.  Stemming, in general, transforms word variants to their common root word.  When run through a stemmer, words like “revive”, “reviving”, and “revival” will be mapped to the same root (“reviv”) and will be entered into the search index as such.  Of course, it is important to note that stemming will also be applied to the query string at query time, because otherwise there wouldn’t be a match between the tokens in the query and the tokens in the search index.
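As a rough illustration of the idea, here is a toy suffix-stripper in Python.  To be clear, this is not the Porter algorithm, which has many more rules and conditions; it only mimics the behavior on the example words above:

```python
def toy_stem(word):
    """Toy stemmer: strip a few common suffixes, keeping a stem of
    at least three characters.  For illustration only."""
    for suffix in ("ing", "al", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["revive", "reviving", "revival"]])
# -> ['reviv', 'reviv', 'reviv']
```

All three variants collapse to the same root, so a query containing any one of them can match documents containing the others.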

Finally, a “keyword_marker” filter is defined.  Any keywords specified here will be ignored by the stemmer.  This can be very important for search relevance because a stemmer that relies on an algorithm rather than an English dictionary can sometimes make mistakes.  In a document about wines, for example, you wouldn’t want a word like “Riesling” to be stemmed to the nonsensical root “riesl” just because the stemming algorithm determined that it looked like a present participle.  So, a keyword marker is used to prevent known keywords from being stemmed accidentally.

The “my_english” analyzer we have defined applies these filters, and a built-in “lowercase” filter, in a particular order: first possessive stemming is applied, then all characters are made lowercase, stopwords are removed, keywords are marked to be ignored by the stemmer, and finally the Porter stemming algorithm is applied.  The tokens that are output from this pipeline will be placed into the search index.
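Putting it all together, here is a hedged Python sketch of the whole filter chain.  The tokenizer, possessive stripping, stop-word set, and stemmer below are simplified stand-ins for ElasticSearch’s real token filters, and the keyword set is an assumption for the example:

```python
# Subset of the default English stop-word list, for brevity.
STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to"}

# Analogous to the keyword_marker filter: words exempt from stemming.
KEYWORDS = {"riesling"}

def toy_stem(word):
    """Toy stemmer standing in for the real english stemmer."""
    for suffix in ("ing", "al", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    tokens = text.split()                                  # stand-in for the standard tokenizer
    tokens = [t[:-2] if t.endswith("'s") else t
              for t in tokens]                             # possessive stemmer
    tokens = [t.lower() for t in tokens]                   # lowercase filter
    tokens = [t for t in tokens if t not in STOPWORDS]     # stop filter
    return [t if t in KEYWORDS else toy_stem(t)
            for t in tokens]                               # keyword marker + stemmer

print(analyze("The airplane's Riesling tasting"))
# -> ['airplan', 'riesling', 'tast']
```

Note that “riesling” passes through unstemmed because it is marked as a keyword; without that marker, the toy stemmer would mangle it to “riesl.”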

 

How Adjusting the Schema Settings Can Affect Search Results

Suppose for a moment that we set up our schema like so:

 "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }

 

Now the pipeline of filters will be applied to the “title” field of the movies we add to the search index.  Suppose now that we add a movie using the following command:

 curl -XPOST "http://localhost:9200/my_movie_index/movie" -H 'Content-Type: application/json' -d'
{
  "title": "This Is It",
  "overview": "A compilation of interviews, rehearsals and backstage footage ...",
  "directors": ["Kenny Ortega"],
  "cast": "Michael Jackson, Kenny Ortega, ...",
  "genres": ["Music", "Documentary"],
  "release_date": "2009-10-28T00:00:00Z"
}'

When we attempt to search for this movie by title, we will not get any results!  The reason we won’t get any results is that the movie’s title, “This Is It,” is composed entirely of stop words.  And, since we are now applying our filter chain to the title field, we have essentially stripped the field of all text, leaving ElasticSearch with nothing to add to the index!  These are the types of pitfalls that a search relevance engineer will want to avoid.  The main takeaway is to always think about the structure of your data, think about the words in it, and set up your analyzers, stemmers, stopwords, etc. appropriately.  Ask yourself about the nature of the text that appears in each field—is it long, is it free-form text with many keywords, or is it just a small amount of text where even stopwords have meaning?
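You can see the pitfall directly with a toy Python approximation of the english analyzer applied to the title field (again, an illustration of the concept, not ElasticSearch itself):

```python
# The default English stop-word list shown earlier in this post.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def analyze_title(text):
    """Toy approximation of the english analyzer applied to a title:
    tokenize on whitespace, lowercase, then drop stopwords."""
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in ENGLISH_STOPWORDS]

print(analyze_title("This Is It"))
# -> []  (nothing left to index!)
```

Every token in the title is a stop word, so the analysis chain produces an empty token list and the title effectively vanishes from the index.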

 

Conclusion

ElasticSearch provides many useful tools for performing the text analysis needed to construct a search index that facilitates retrieval of relevant results for a user’s queries.  However, a Search Relevance Engineer must know the nature of the data being ingested: not only which tools to apply to a particular field to help queries retrieve relevant results, but also which tools not to apply because they would hinder that retrieval.  Will applying a stemming algorithm in my filter chain help, or will it cause too many irrelevant results to be returned?  When eliminating stopwords, will that ensure that only relevant keywords exist in the search index, or will it cause relevant text in some fields to be eliminated entirely?  The answers to these questions must be considered on a case-by-case basis, with a good understanding of the data being ingested.  We hope this provides you with a good introduction to ElasticSearch and its usefulness for search retrieval.  We are planning more search-engine-focused blog posts in the future, since Artemis delivers many enterprise search implementations.  Learn more here!