Complex searches with Python and Solr


As Python-Solr users, we care about powerful, efficient, and relevant search results, high-performance indexing, and appropriate links. Our Python client library lets us add documents to our Solr instances, perform queries, and gather search results from Solr using Python. We then use Python to model our single- and multi-valued collection fields and facets as field-value dictionaries; when we present multiple values to Solr, the dictionary value is given as a list. Most of our values are strings, booleans, and dates. Our Python SearchHandler provides access to named searches; in each search, we pass querying, faceting, and highlighting parameters to the select method. We also make use of Solr’s “inverted index”, which looks up keywords in the index rather than scanning the actual content.
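As a minimal sketch of this field-value modeling (the field names here are illustrative, not our actual schema), a record becomes a dictionary whose multi-valued fields are plain Python lists:

```python
from datetime import datetime, timezone

def to_solr_doc(record):
    """Model a collection record as a Solr field-value dictionary.

    Single-valued fields map to plain values; multi-valued fields
    map to Python lists. Field names here are hypothetical.
    """
    return {
        "id": record["id"],
        "title": record["title"],                       # string, single-valued
        "subject": list(record.get("subjects", [])),    # multi-valued
        "is_public": bool(record.get("public", True)),  # boolean
        "date_added": record.get("added") or datetime.now(timezone.utc),
    }

doc = to_solr_doc({
    "id": "item-42",
    "title": "Coastal survey photographs",
    "subjects": ["photography", "coastline"],
})
# doc["subject"] is the list ["photography", "coastline"]
```

When such a document is added through the client, each list element is sent as a separate value for the multi-valued field.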

Solr is a popular enterprise search platform used for searching and indexing large volumes of content. It lies at the core of our project’s search implementation. We use Solr’s data, search, suggestion, faceting, and recommended links/top picks services, which are in turn used by the Django web application serving the front end.

We are able to unlock powerful features of Solr while building a fast search application and scaling it appropriately for heavy use. We use solrpy as the Python client for Solr, which helps us read and write JSON. Solrpy allows us to add documents to a Solr instance, to perform queries, and to gather search results from Solr using Python. We make use of Solr’s replication features by replicating our data across our own instances of Solr, called Lite and Web:
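The add-and-query round trip looks roughly like the sketch below. The connection URL and field names are placeholders, and `index_records` only assumes the `add(**fields)`/`commit()` interface that solrpy's `SolrConnection` exposes:

```python
def index_records(conn, records):
    """Add field-value dictionaries to Solr and commit once at the end.

    `conn` only needs the add(**fields) / commit() interface that
    solrpy's SolrConnection provides.
    """
    for rec in records:
        conn.add(**rec)
    conn.commit()

# Typical use against a running Solr (URL is a placeholder):
#   import solr
#   conn = solr.SolrConnection("http://localhost:8983/solr")
#   index_records(conn, [{"id": "1", "title": "First doc"}])
#   for hit in conn.query("title:first").results:
#       print(hit["id"], hit["title"])
```

Committing once per batch, rather than per document, keeps indexing fast.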

  • Lite: A schema-less version of Solr for quick access. We use this as our first staging server for incoming content. We index documents, store submitted fields, and use the staging index to enable certain types of enhancements and do-overs without having to re-crawl or re-load from source data.
  • Web: Two running instances, “offline” and “live”. The “offline” instance is updated with the replicated Web index, warmed with warming queries, and then swapped with the “live” instance through modification of the proxies. Our schema index contains the fields and data used for search and display.
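Our swap happens at the proxy layer, but Solr's own CoreAdmin API offers a comparable SWAP action for promoting a freshly warmed core. A sketch of building that request (the host and core names are hypothetical):

```python
from urllib.parse import urlencode

def swap_cores_url(admin="http://localhost:8983/solr/admin/cores",
                   offline="web_offline", live="web_live"):
    """Build the CoreAdmin request that swaps the freshly warmed
    offline core with the live one. Host and core names are
    placeholders for an actual deployment."""
    return admin + "?" + urlencode(
        {"action": "SWAP", "core": offline, "other": live})

# Issue with e.g. urllib.request.urlopen(swap_cores_url()) after the
# offline core has been replicated and warmed.
```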

In addition to Lite and Web, we have implemented an ingest process using solrpy that reads records from the staging index, performs data scrubbing and normalization tasks, and posts records to the offline core of the multi-core Web index. We also improve our search engine optimization with HTML mappings, using meta and link tags from our HTML pages.
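The scrubbing and normalization step can be sketched as a pure function applied to each staged record before it is posted to the offline core. The specific rules below (trimming whitespace, dropping empty values, lower-casing subject terms) are illustrative rather than our exact ruleset:

```python
def scrub(record):
    """Scrub and normalize one staged record: trim whitespace, drop
    empty values, lower-case subject terms. Illustrative rules only."""
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        elif isinstance(value, list):
            value = [v.strip() for v in value
                     if isinstance(v, str) and v.strip()]
        if value not in ("", [], None):
            cleaned[field] = value
    if "subject" in cleaned:
        cleaned["subject"] = [s.lower() for s in cleaned["subject"]]
    return cleaned

# In the ingest loop, each record read from the staging index passes
# through scrub() and is then posted to the offline core, e.g.
#   offline_conn.add(**scrub(rec)); offline_conn.commit()
```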

We use facets (a.k.a. filters) for field sources. Every collection item is tied to its relevant facets. We make use of our “ispartof” Python function to tie the source data and collection framework together, displaying each item with the appropriate facets. Our Solr fields are described in our schema; during the Transform phase of ETL processing, the fields are populated, queried, and displayed in our web application.
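A faceted select request boils down to a small set of Solr parameters. The sketch below builds the raw parameter dict (the field names are placeholders); with solrpy these map to keyword arguments on the query/select call:

```python
def facet_params(query, facet_fields, filters=None):
    """Raw Solr parameters for a faceted select request.
    Field names are placeholders, not our actual schema."""
    params = {
        "q": query,
        "facet": "true",
        "facet.field": list(facet_fields),  # one entry per facet field
        "facet.mincount": 1,                # hide empty facet buckets
    }
    if filters:
        params["fq"] = list(filters)        # filter queries narrow results
    return params
```

Each selected facet value becomes an `fq` filter query on the next request, which is how clicking a filter narrows the result set without changing relevance scoring.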

In the near future, we would like the project to handle Solr’s deep paging problem. Deep paging refers to the slow responses Solr returns when search results are requested with a high “start” value; it becomes particularly bad once the start value climbs into the millions. Though it might be difficult for us to provide stateless dynamic paging, the cursorMark approach may be appropriate for our large-scale processing, giving faster responses to requests from the UI layer. Along with handling deep paging, the project’s goals include improving our relevancy ranking system.
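A cursorMark loop, in outline, keeps requesting pages until the cursor stops advancing; it requires a sort that includes the uniqueKey field. The `fetch` callable and the `id asc` sort below are assumptions standing in for a real solrpy request:

```python
def cursor_pages(fetch, query, rows=100):
    """Stream every matching document using Solr's cursorMark paging.

    `fetch(params)` is a caller-supplied function that issues the
    select request and returns the decoded JSON response. The sort
    must include the uniqueKey field ("id" here) for cursors to work.
    """
    cursor = "*"                       # "*" starts a new cursor
    while True:
        resp = fetch({"q": query, "rows": rows,
                      "sort": "id asc", "cursorMark": cursor})
        yield from resp["response"]["docs"]
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:      # unchanged cursor => no more pages
            break
        cursor = next_cursor
```

Unlike a large `start` offset, each cursor request costs roughly the same regardless of how deep into the result set it reaches.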

We hope that this gives you a glimpse into how to build a highly scalable, fault-tolerant search platform that can be easily configured, monitored, and optimized for high-volume traffic.

Contact Artemis if your organization needs a scalable, enterprise, open-source search solution.