Optimizing the Elasticsearch Index | Artemis Consulting, Inc.

November 10, 2020
By Sai K.

Elasticsearch (ES) is a highly scalable open-source full-text search and analytics engine used for Enterprise search at many large organizations. The basic building block of Elasticsearch is an index, a collection of documents of similar characteristics.

More often than not, we hear the comment that the “search is slow.” For one of our client projects, we looked into ways to make our search perform better, and that work began with optimizing the Elasticsearch index. After an initial analysis, we identified two issues with our indexing model that were highly disruptive to search performance.

Challenges

Field Mapping: The documents were indexed using dynamic field mapping, so every string field was mapped as both “text” and “keyword.” “Text” fields are tokenized and analyzed in ES, which means more terms have to be added to the inverted index for each document. In turn, more elements have to be looked up during the search. Examples include the UUID, City and State fields, which only need to be searched as a whole. Running the text analyzer and tokenizer on these fields is time consuming in both the indexing phase and the search phase.
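To make the difference concrete, the minimal Java sketch below contrasts the index terms produced for the same value under the two mappings. The `analyzeAsText` method is a hypothetical, rough stand-in for a text analyzer (lowercase, then split on anything that is not a letter or digit); it is an assumption for illustration and not Elasticsearch's actual analysis chain.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TermCountSketch {

    // Hypothetical stand-in for a text analyzer: lowercases the value and
    // splits it on every run of characters that is not a letter or digit.
    static List<String> analyzeAsText(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // A keyword field indexes the whole value as one term.
    static List<String> analyzeAsKeyword(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String city = "New York";
        System.out.println(analyzeAsText(city));    // [new, york]
        System.out.println(analyzeAsKeyword(city)); // [New York]
    }
}
```

For a field that is only ever matched as a whole, the single keyword term is all the search needs.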

Complex Nested Structure: Our project’s document structure can contain a lot of nested child elements. Searching documents with many nested objects is slow because there are a lot of internal “joins” at search time. Also, nested objects are indexed as separate documents; they can only be accessed within the scope of the nested query, the nested/reverse_nested aggregations, or nested inner hits.

Solution

We Made Changes to Field Mappings: The Index mapping was not defined before indexing, so Dynamic field mapping was applied. This means the following:

All string fields were mapped to both “text” and “keyword.”
All whole-number fields were mapped to “long.”

   "city":{
      "type":"text",
      "fields":{
         "keyword":{
            "type":"keyword",
            "ignore_above":256
         }
      }
   },
   ...
   "organizationId":{
      "type":"long"
   },
   ...
   "foreignAddress":{
      "type":"boolean"
   },

We made changes in the Index Mapping, which included:

Mapping some string fields as “keyword,” since not all string data needs to be of type “text.” “Text” fields are tokenized and analyzed in ES, so more terms have to be added to the inverted index for each document, which, in turn, means that more elements have to be looked up during search operations.
Mapping numeric fields to short, integer, or long based on the maximum value permitted for that field. Using the smallest sufficient type improved indexing and search efficiency.
Mapping numeric fields that are IDs, and are therefore never used in range queries such as <, > or between, to “keyword.” ES optimizes numeric fields for “range” queries, whereas these ID fields are always queried by term.
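The rules above can be sketched in Java as a small helper that builds the "properties" section of an explicit mapping as plain maps. This is only an illustration under stated assumptions: the `employeeCount` field and its maximum of 50,000 are hypothetical, values are assumed to be non-negative, and the helper also considers Elasticsearch's `byte` type, which is narrower than the types listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MappingSketch {

    // Returns the narrowest Elasticsearch integer type that can hold
    // every value from 0 up to maxValue (assumed non-negative).
    static String smallestIntegerType(long maxValue) {
        if (maxValue <= Byte.MAX_VALUE) return "byte";
        if (maxValue <= Short.MAX_VALUE) return "short";
        if (maxValue <= Integer.MAX_VALUE) return "integer";
        return "long";
    }

    // Builds a one-entry field definition such as {"type": "keyword"}.
    static Map<String, Object> field(String type) {
        Map<String, Object> definition = new LinkedHashMap<>();
        definition.put("type", type);
        return definition;
    }

    static Map<String, Object> buildProperties() {
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("city", field("keyword"));           // searched as a whole
        properties.put("organizationId", field("keyword")); // an ID, never range-queried
        properties.put("foreignAddress", field("boolean"));
        // Hypothetical numeric field with an assumed maximum of 50,000.
        properties.put("employeeCount", field(smallestIntegerType(50_000)));
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(buildProperties());
    }
}
```

The resulting map could be serialized to JSON and supplied as the mapping when the index is created, before any documents are indexed.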

We Flattened the Document: The performance issue with nested structure can be reduced by flattening the document before indexing. There are two approaches to flattening the document.

Example: consider the following JSON object.

{
  "field1": "RF1",
  "field2": "RF2",
  "interObject": [
    {
      "io-field1": "IOF11",
      "io-field2": "IOF12"
    },
    {
      "io-field1": "IOF21",
      "io-field2": "IOF22"
    }
  ]
}

Flattening Approach 1

Flatten each child into separate fields, so that the association between the fields of the same child object is maintained.

{
  "field1": "RF1",
  "field2": "RF2",
  "interObject[0].io-field1": "IOF11",
  "interObject[0].io-field2": "IOF12",
  "interObject[1].io-field1": "IOF21",
  "interObject[1].io-field2": "IOF22"
}

Pro: The object structure is preserved: the index x in interObject[x] shows which child object each field belongs to.

Con: The number of fields will increase based on the number of child elements and more specifically the size of any array within the document. Per Elastic’s guidelines, too many fields in an index can cause a mapping explosion, which can then cause out-of-memory errors and difficult situations to recover from.
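As an illustration of Approach 1, here is a minimal Java sketch that flattens the example document while keeping array positions in the key. It is not the project's code: it assumes the JSON has already been parsed into plain `Map` and `List` objects, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexedFlattenSketch {

    // Flattens nested Maps and Lists into one level, keeping array
    // positions in the key (for example "interObject[0].io-field1").
    static void flatten(String path, Object node, Map<String, Object> out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                String key = path.isEmpty() ? e.getKey().toString() : path + "." + e.getKey();
                flatten(key, e.getValue(), out);
            }
        } else if (node instanceof List) {
            List<?> list = (List<?>) node;
            for (int i = 0; i < list.size(); i++) {
                flatten(path + "[" + i + "]", list.get(i), out);
            }
        } else {
            out.put(path, node);
        }
    }

    static Map<String, Object> child(String first, String second) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("io-field1", first);
        m.put("io-field2", second);
        return m;
    }

    // The example document from this article, built by hand.
    static Map<String, Object> sampleDoc() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("field1", "RF1");
        doc.put("field2", "RF2");
        doc.put("interObject", Arrays.asList(child("IOF11", "IOF12"), child("IOF21", "IOF22")));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flatten("", sampleDoc(), flat);
        System.out.println(flat.size() + " fields: " + flat.keySet());
    }
}
```

With two children the flattened document has six fields; every additional element in `interObject` adds two more distinct field names, which is how the mapping grows with array size.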

Flattening Approach 2

Flatten each child by path and accumulate all values found under the same path into an array.

{
  "field1": "RF1",
  "field2": "RF2",
  "interObject.io-field1": [
    "IOF11",
    "IOF21"
  ],
  "interObject.io-field2": [
    "IOF12",
    "IOF22"
  ]
}

Pro: The number of fields is predictable: it depends only on the number of distinct field paths in the document, not on the size of its arrays.

Con: The relationships between values that belonged to the same child object are lost with this approach.
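The lost relationship can be shown with a short Java sketch. It holds the Approach 2 form of the example document in plain collections and checks a two-field condition against it; the `matchesAll` helper is hypothetical and only imitates a conjunction of term matches, it is not an Elasticsearch query.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CrossObjectMatchSketch {

    // The Approach 2 form of the example document: one list of values per path.
    static Map<String, List<String>> flattened() {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("interObject.io-field1", Arrays.asList("IOF11", "IOF21"));
        doc.put("interObject.io-field2", Arrays.asList("IOF12", "IOF22"));
        return doc;
    }

    // True when every requested path contains the requested value,
    // regardless of which child object the value originally came from.
    static boolean matchesAll(Map<String, List<String>> doc, Map<String, String> criteria) {
        for (Map.Entry<String, String> c : criteria.entrySet()) {
            List<String> values = doc.get(c.getKey());
            if (values == null || !values.contains(c.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> criteria = new LinkedHashMap<>();
        criteria.put("interObject.io-field1", "IOF11"); // value from the first child
        criteria.put("interObject.io-field2", "IOF22"); // value from the second child
        System.out.println(matchesAll(flattened(), criteria)); // true
    }
}
```

The document matches even though no single child object contains both IOF11 and IOF22, so this form only suits searches that do not depend on values belonging to the same child.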

Given that the requirement was to be able to search across all the fields, and that maintaining object relationships was not as important, we picked Flattening Approach 2.

Approach 2 Implementation Details

To flatten the document, we implemented a recursive approach in the code.

“addKeys” is a recursive method that accepts a node, the current path, the result Map<String, Object>, and a flag that indicates whether the node sits inside an array. The initial call passes the root node, an empty path and an empty map. As control traverses each node, there are three cases:

If the node is a value node, then the value is added to the Map with the path as the key and the node value as the value. Values found inside an array are accumulated in a set under the same path.
If the node is a JSON object then:
- Update the path with the child's field name
- Call “addKeys” recursively for each child element
If the node is an array, then call “addKeys” recursively for each element (the path is not updated)

With this approach, the complete nested JSON is flattened so that each key represents the parent-child hierarchy, and each value is either a single value (such as a String or Integer) or a collection of values.

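import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.fasterxml.jackson.databind.node.ValueNode;

// Processing begins here.
// The method accepts a JSON String and returns a Map in which each key
// is the dotted path of an element in the JSON.
public Map<String, Object> flattenToMap(String json) {
   Map<String, Object> map = new LinkedHashMap<>();
   try {
     // Call the recursive method, providing the root element
     addKeys("", new ObjectMapper().readTree(json), map, false);
   } catch (IOException e) {
     // Invalid JSON: log the error and return whatever was collected
     e.printStackTrace();
   }
   return map;
}

// Recursive method that builds the flattened Map
@SuppressWarnings("unchecked")
private void addKeys(String currentPath, JsonNode jsonNode,
    Map<String, Object> map, boolean isParentArray)
{
   if (jsonNode.isObject()) {
     // Object: extend the path with each field name and recurse
     ObjectNode objectNode = (ObjectNode) jsonNode;
     Iterator<Entry<String, JsonNode>> iter = objectNode.fields();
     String pathPrefix = currentPath.isEmpty() ? "" : currentPath + ".";
     while (iter.hasNext()) {
       Entry<String, JsonNode> entry = iter.next();
       addKeys(pathPrefix + entry.getKey(), entry.getValue(), map, isParentArray);
     }
   }
   else if (jsonNode.isArray()) {
     // Array: recurse into each element without changing the path
     ArrayNode arrayNode = (ArrayNode) jsonNode;
     for (int i = 0; i < arrayNode.size(); i++) {
       addKeys(currentPath, arrayNode.get(i), map, true);
     }
   }
   else if (jsonNode.isValueNode()) {
     ValueNode valueNode = (ValueNode) jsonNode;
     if (map.containsKey(currentPath)) {
       // The path was already seen inside an array: add to its set
       ((Set<Object>) map.get(currentPath)).add(valueNode);
     }
     else if (isParentArray) {
       // First value for this path inside an array: start a new set
       Set<Object> temp = new LinkedHashSet<>();
       temp.add(valueNode);
       map.put(currentPath, temp);
     }
     else {
       map.put(currentPath, valueNode);
     }
   }
}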
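To see what this traversal produces for the example document without pulling in a JSON library, here is a minimal, dependency-free sketch of the same idea. It is an adaptation rather than the code above: it swaps the Jackson node types for plain `Map` and `List` objects, and the class name and `sampleDoc` helper are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PathFlattenSketch {

    // Same three cases as addKeys above, applied to Map/List/scalar nodes.
    static void addKeys(String currentPath, Object node,
                        Map<String, Object> map, boolean isParentArray) {
        if (node instanceof Map) {
            String prefix = currentPath.isEmpty() ? "" : currentPath + ".";
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                addKeys(prefix + e.getKey(), e.getValue(), map, isParentArray);
            }
        } else if (node instanceof List) {
            for (Object element : (List<?>) node) {
                addKeys(currentPath, element, map, true); // path is not updated
            }
        } else if (isParentArray) {
            // Values under an array accumulate in one set per path.
            @SuppressWarnings("unchecked")
            Set<Object> values = (Set<Object>) map.computeIfAbsent(
                    currentPath, k -> new LinkedHashSet<Object>());
            values.add(node);
        } else {
            map.put(currentPath, node);
        }
    }

    static Map<String, Object> child(String first, String second) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("io-field1", first);
        m.put("io-field2", second);
        return m;
    }

    // The example document from this article, built by hand.
    static Map<String, Object> sampleDoc() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("field1", "RF1");
        doc.put("field2", "RF2");
        doc.put("interObject", Arrays.asList(child("IOF11", "IOF12"), child("IOF21", "IOF22")));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> flat = new LinkedHashMap<>();
        addKeys("", sampleDoc(), flat, false);
        System.out.println(flat);
        // {field1=RF1, field2=RF2, interObject.io-field1=[IOF11, IOF21], interObject.io-field2=[IOF12, IOF22]}
    }
}
```

Because the values are collected in a set, a value that repeats under the same path is stored only once.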

In summary, nested objects cause a performance problem when using Elasticsearch at a large scale, so flattening the nested objects makes both indexing and searching faster. We were able to improve the performance of our search by optimizing the Elasticsearch index: defining the mapping explicitly, using the keyword type for fields that will not be used for full-text searches, and flattening the documents as described above. We will cover other enterprise search tuning techniques in future blogs.