Troubleshooting Hard-To-Find Solr Search Engine Scalability Errors

16 July 2019

While working on a web application that provides digitized versions of old classical song collections, we discovered a memory leak that was taking our web page down several times a day.  For this large audio collection, we are migrating expanded data sets and have introduced new records into our Solr system.  Although we use Solr to retrieve the most relevant search results, we did not understand the cause of the error.  The confusion arose because the Solr Out-of-Memory error occurred on a record with fewer than 60 rows, whereas records with more than 200 rows worked perfectly fine. We decided to document our step-by-step troubleshooting process to illustrate how we resolved the problem.

We started by looking at the Solr logs, which didn’t help much. With fields of 200 records running successfully, it was odd that the error appeared on a field of only 66 records. Our initial thought was that it might have resulted from the facet fields we had recently added to our existing set of facets, since we noticed the problem soon after.

To resolve it, we brainstormed different approaches and decided to work through each troubleshooting task until one succeeded.  We started with the standard trial-and-error method: make one change at a time, then observe its impact. We deleted all facets and checked whether the error still occurred, which it did. Then we added one facet back at a time to see whether the problem returned after each addition.  We also tried different permutations and combinations of those facets by limiting which facets were applied. First, we tried this during the extract step of the ETL process; however, since this collection contained more than 10,000 items, each extract took more than seven hours.  To save time, we switched to trying it during the pre-transform phase, which takes less than 30 minutes. All of these efforts failed, and we were unable to find the solution this way.
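For readers unfamiliar with how facets are toggled on a Solr request, here is a minimal sketch of the kind of facet-limited query we were iterating on. The core name, field names, and limits are illustrative placeholders, not our actual schema.

```python
import requests

SOLR_URL = "http://localhost:8983/solr/songs/select"  # hypothetical core name

params = {
    "q": "*:*",
    "rows": 10,                # keep the result window small while testing
    "facet": "true",
    # enable only a subset of facet fields per run to isolate the culprit
    "facet.field": ["composer", "collection"],  # illustrative field names
    "facet.limit": 50,         # cap the number of facet buckets returned
    "facet.mincount": 1,
    "wt": "json",
}

resp = requests.get(SOLR_URL, params=params, timeout=30)
resp.raise_for_status()
print(resp.json()["facet_counts"]["facet_fields"])
```

Re-running a request like this with one facet field enabled at a time is the same change-one-thing-and-observe loop described above, just against the live query instead of the ETL pipeline.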

Next, we used Siege, an open-source load and regression testing tool, to try to reproduce the error. It takes one or more URLs and a number of simulated users as input and stresses those URLs concurrently. The output includes the number of hits, bytes transferred, URL response time, concurrency, and the HTTP return status code.  Every time we ran Siege with different permutations and combinations, it came back with the status code “200 OK,” taking us back to where we started.  Our next approach was to use Bender, a tool written by one of our technical leads that uses a robot to continuously crawl an application and interact with it as needed. The end goal is to generate a continuous level of background noise so the monitoring/metrics stack is exercised, triggering alerts for errors, system load, and so on.  Bender is built on top of Selenium, the automated testing tool, so full-page resources such as CSS, JavaScript, and images are loaded and executed. This particularly useful interaction is sporadic at best, so logs are also generated only sporadically. Once again, using Bender did not get us any closer to fixing the problem in our application.
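Siege itself is a command-line tool, but the idea is easy to approximate. The sketch below, written in Python purely for illustration, fires a batch of concurrent requests at a search URL and tallies the status codes, which is roughly what we watched for in the Siege reports. The URL and user counts are placeholders.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests

SEARCH_URL = "http://localhost:8983/solr/songs/select?q=*:*&facet=true"  # placeholder
USERS = 25          # simulated concurrent users
HITS_PER_USER = 40  # requests each simulated user sends

def hit(_):
    """Send one request and return its HTTP status code (or 'error')."""
    try:
        return requests.get(SEARCH_URL, timeout=30).status_code
    except requests.RequestException:
        return "error"

with ThreadPoolExecutor(max_workers=USERS) as pool:
    codes = Counter(pool.map(hit, range(USERS * HITS_PER_USER)))

# In our runs, the equivalent Siege report kept coming back all "200 OK".
print(codes)
```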

In the end, to resolve the issue, we increased the Java heap size from six gigabytes to eight gigabytes, with twelve gigabytes being the outer limit.  Increasing the Java heap size reduced the overhead of memory swapping and improved our performance. Request-response times have also improved significantly, and we are able to communicate with the server more smoothly. While this is a short-term fix, we want to put a better, permanent solution in place. Our current plan is to implement DocValues, a feature of Apache Solr.  DocValues stores the values needed for faceting, sorting, and grouping at index time, avoiding the need to hold them on the heap.  It is one of the priority tasks lined up in our pipeline. This would reduce the overhead on the heap and return our search results faster.
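For reference, the heap size on a standard Solr install is usually set via the SOLR_HEAP (or SOLR_JAVA_MEM) variable in solr.in.sh. As a rough sketch of what enabling DocValues might look like, the snippet below uses Solr’s Schema API to define a facet field with docValues turned on; the core and field names are hypothetical, and the same change could instead be made directly in the managed-schema file.

```python
import requests

SCHEMA_URL = "http://localhost:8983/solr/songs/schema"  # hypothetical core name

# Define a facet field with docValues enabled so facet/sort/group data
# lives in on-disk DocValues structures instead of the Java heap.
payload = {
    "add-field": {
        "name": "composer_facet",  # illustrative field name
        "type": "string",
        "indexed": True,
        "stored": False,
        "docValues": True,
        "multiValued": True,
    }
}

resp = requests.post(SCHEMA_URL, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```

Note that existing documents would need to be reindexed for DocValues to take effect on data that is already in the index.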

We wanted to share the process our team followed for the Solr Out-of-Memory error we faced on this project.  Every organization takes a different approach to the problems it encounters; our goal is to show one of many possible methods and offer some ideas for tackling similar issues.