I have a dataset containing 128 million records stored on a small system hosting an instance of MongoDB.

The records are all approximately 3 kB in size, and they are all of the same shape. (They all contain the same fields.)

I am trying to query the dataset to obtain all the records where a particular field has a particular value. I estimate this will return about 1 % of the total number of records. That is about 25 GB of data.

I do not yet have any indexes defined on the dataset. I do not think this affects the query performance significantly; I will explain why later.

What is likely more important is that the 25 GB of records to be returned does not fit into the server's memory. So these records must presumably be cached to disk somewhere before being returned.

I should note that I know nothing at all about the internal workings of MongoDB. I can only make educated guesses at its architecture.

The system querying the dataset is a Python (pymongo) script. The query looks like this:

# 'collection' is a pymongo Collection handle obtained elsewhere
query = {'some_flag': False}
documents = collection.find(query)

I can’t use a projection to reduce the amount of data returned.

If I understand correctly, this code doesn't return any data to the system initiating the find query until all the data has been found. In other words, it behaves synchronously: the query parameters are sent to MongoDB; MongoDB then queries its dataset and collects the results to be returned; it then returns those results, in this case over a network connection. Only once all the results have been returned can data processing continue.

If I added an index on this flag, then presumably MongoDB would be able to find all the records with a particular true or false value more quickly. However, I do not expect the difference compared to a full scan to be significant, because the flag is a boolean value and the number of matching records is large. (Estimated at around 1 %, though it might be a bit higher.) In other words, the records with some_flag=False are not rare.
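For what it's worth, here is how I believe one could check what the query is actually doing, using pymongo's explain support (a sketch only; I have not run this against the full dataset):

# explain() reports the winning query plan: without an index this
# should show a COLLSCAN (full collection scan), while with an
# index on some_flag it should show an IXSCAN.
plan = collection.find({'some_flag': False}).explain()
print(plan['queryPlanner']['winningPlan'])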

This leads me to suspect that the main performance bottleneck is the query operating synchronously: all the data has to be ready before it can be sent, and only then can processing of the data begin.

Is there a way to do this query differently so that individual records are returned as soon as they are found, allowing the client side to start processing them rather than waiting for the full result set?

The client-side processing will take much longer than the time needed to find new records satisfying the query. This is another reason why having an index is unlikely to make a significant difference to the overall "whole system" performance.

2 Answers


  1. Chosen as BEST ANSWER

    There is a Python package called "motor" which provides asynchronous I/O support for MongoDB.

    https://motor.readthedocs.io/en/stable/

    Here's a short example which I managed to put together.

    import asyncio

    import motor.motor_asyncio


    def main():
        asyncio.run(async_driver())


    async def async_driver():
        client = motor.motor_asyncio.AsyncIOMotorClient('mongodb://192.168.0.10:27017')

        database = client['db_name']
        collection = database['collection_name']

        query = {'some_flag': False}
        # Each document is handled as soon as its batch arrives,
        # rather than after the entire result set has been collected.
        async for document in collection.find(query):
            process_document(document)

    This seems to do exactly what is required. Rather than waiting 30 minutes for the records to arrive in a single batch, the script can process batches of records asynchronously as they are found by the MongoDB instance.
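    One caveat: if process_document is slow and synchronous, calling it directly blocks the event loop while it runs. A possible refinement, sketched below and untested at scale, is to hand each document to a thread pool; whether this meaningfully overlaps retrieval with processing depends on how Motor schedules the next batch fetch.

    query = {'some_flag': False}
    loop = asyncio.get_running_loop()
    async for document in collection.find(query):
        # Run the slow, synchronous handler in a worker thread so
        # the event loop stays responsive while it executes.
        await loop.run_in_executor(None, process_document, document)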

    Adding an index on the flag may now increase the throughput further, as all the records with some_flag=False will already be indexed by MongoDB.

    However, going async is the most substantial improvement: even with an index, the network still needs to stream millions of records, and the async approach lets processing overlap with that transfer.


  2. Your understanding of how the find operation works in MongoDB needs one correction: the find method in PyMongo returns a cursor by default, and the data is fetched lazily, in batches, as you iterate over it. In your case, with no indexes and the dataset not fitting into memory, MongoDB has to perform a collection scan to find the relevant documents, which can be a slow process for large datasets.

    Adding an index on the some_flag field can significantly improve the query performance, especially when dealing with a large dataset. While you mentioned that some_flag is a boolean value and the number of matching records is large, an index on a boolean field can still provide performance benefits: MongoDB uses a B-tree structure for indexes, and if only about 1 % of documents carry the queried value, the index lets the server skip the other 99 % instead of scanning the entire collection.
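    For example, creating such an index from PyMongo is a one-liner; note that building an index over 128 million documents will itself take some time:

    # Build a single-field ascending index on the boolean flag.
    collection.create_index('some_flag')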

    Now, regarding your concern about the synchronous behavior and wanting to process records as soon as they are available: you can achieve this by iterating over the cursor returned by the find method. By default, the cursor fetches a batch of documents at a time, which lets you start processing results before the entire result set has been retrieved. Here's an example:

    query = {'some_flag': False}
    cursor = collection.find(query)

    for document in cursor:
        # Process each document individually as its batch is fetched;
        # replace process_document with your own processing logic.
        process_document(document)

    This way, you process each document as it’s retrieved, rather than waiting for the entire result set. Keep in mind that the performance of this approach will still be influenced by the time it takes for MongoDB to retrieve the data based on your query conditions.
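    If the default batch sizing is a poor fit, you can also set the batch size explicitly; the value 1000 below is purely illustrative:

    # Ask the server to return up to 1000 documents per round trip.
    cursor = collection.find(query).batch_size(1000)
    for document in cursor:
        process_document(document)  # your processing logic here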

    While adding an index on the some_flag field is likely to improve performance, you can also consider strategies like pagination to further optimize the retrieval process and manage memory usage, especially when dealing with large result sets.
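    To illustrate the pagination idea, a common pattern is range-based paging on _id. This is only a sketch: the page size is arbitrary, it assumes the default ObjectId _id, and process_document stands in for your own logic.

    page_size = 10000
    last_id = None
    while True:
        page_query = {'some_flag': False}
        if last_id is not None:
            # Resume after the last _id seen on the previous page.
            page_query['_id'] = {'$gt': last_id}
        page = list(collection.find(page_query).sort('_id', 1).limit(page_size))
        if not page:
            break
        for document in page:
            process_document(document)
        last_id = page[-1]['_id']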
