Table of Contents

    As published on Medium.

    Every Data Engineer who uses Elasticsearch as a documents store, knows that there are many parameters that affect the queries latency, throughput, and eventually the Queries Per Second (AKA — QPS).

    In one of our Projects at Explorium, we have an Elasticsearch cluster, hosted in AWS with 14 nodes of m5.4xlarge.elasticsearch.

    In the cluster we have around 20M documents which are 10 GB.

    This cluster receives tons of queries, rapidly and very frequently, and requires a very high concurrency of queries.

    While the usage of the database got bigger and bigger, we noticed two things:

    • Some queries were rejected by our Elasticsearch cluster due to a full internal Elasticsearch queue.
    • Our QPS was around 100 and we wanted to achieve a much higher rate.

    We decided to get deeper into Elasticsearch internals in order to solve these two issues.

    In this article I will introduce the changes we have made in order to increase our QPS by 7X and even support higher concurrency.


    In our architecture, when new data is ingested, it is inserted into a dedicated daily index.
    We decided to go with this strategy since we want to be able to rollback new indexed data.

    If we would index the data into a single index, it would require us to submit hard deletes over the index in order to rollback a new indexed data (an action which we want to avoid).

    During the performance tests, we had 20 daily indexes.


    Behind each Elasticsearch index there are Shards. Each shard is a single Lucene index and it holds part of the index data.

    When you create an Elasticsearch index, you can specify how many shards it will contain.

    Most people are looking for a magic number that will work the best for them, but actually, there isn’t.

    The amount of wanted shards is affected by multiple parameters, such as:

    • Index size
    • Document size
    • Amount of documents
    • Queries patterns
    • Data Structure

    In our use case, we had 10 shards per index.

    Read Replicas

    A read replica is a copy of the original Primary shard.

    When creating an Elasticsearch index you can specify how many read replicas you want to create along with the index.

    Read replicas are used in case of hardware failures and are also useful with serving the data (searches) in a high concurrency.

    In our case, we had 1 read replica per shard.



    As mentioned above, we had 20 daily indexes. Let’s multiply this by 10 (the original amount of shards per index).

    This means that in each search, we submitted the query to 200 shards!

    It made us think that we should decrease the number of shards that are being queried.

    Since we had small indexes, we decided to start with reducing the number of indexes in our alias.

    We implemented a mechanism that merges daily indexes into a large single index.

    At the end of the process we ended with 1 Large index with 10 shards.


    When we talk about a single “query” over Elasticsearch, we need to split it into two tasks, query and fetch.

    The query part contains sending the request from the client to the coordinating node and then sending the query to each shard in the index.

    The fetch part is when each data node executes the query locally(among its shards), returns the results to the coordinating node and then the final result is returned to the client.

    We found out that most of the time each submitted query to Elasticsearch is being spent on query time and not on fetch time.

    This means that most of the time is being spent on sending the queries to the shards and not on the actual search.

    You can find your index stats by using the _stats api:

    GET index_name/_stats

    You will get a document that includes statistics over the requested index.

    If you look at the “_all_ field and then “search” field you will find a JSON in this structure:

    Here you can see the “fetch_time_in_millis” and the “query_time_in_millis” so you can have an idea where most of the time is being spent in your queries over the index.

    We realized that we should take immediate action: Reducing the number of shards per index to 6.

    Why 6? We tried 3, 6, and 10, and the results of 6 were the best, both in terms of QPS and indexing latency.

    Read Replicas

    As I wrote above, having Read Replicas can help when trying to serve data in a high concurrency.

    We decided to increase our Read Replicas to 5 per index (as I said before, there are no magic numbers, you have to test your use case and find the best number that suits it).

    After applying the new changes, our QPS was more than 700 and we were able to support 3X concurrent queries to our Cluster.

    If you have an Elasticsearch cluster and you want to maximize the QPS and concurrency, I strongly recommend that you run a benchmark test that includes increasing and decreasing the number of shards per index / read replicas/indexes.

    Ron Flomin is an expert data and machine learning engineer who is a member of the data team of Explorium. Explorium offers the industry’s first automated External Data Platform for Advanced Analytics and Machine Learning. Explorium empowers data scientists and business leaders to drive decision-making by eliminating the barrier to acquiring and integrating the right external data and dramatically decreasing the time to superior predictive power. Learn more at