
My app has a User collection. Each document in the collection averages about 0.04 MB. In the worst case, a document may slightly exceed 0.1 MB. Needless to say, these are small documents. However, each document could potentially have over 1000 fields.

There are 3 types of read queries on this User collection:

  1. findOne by _id
  2. findOne by email
  3. findMany users by a set of filters coming from an arbitrary combination of the 1000 fields.

The third type of query is slow. So the plan is to divide and conquer, i.e. partition the User collection into shards, where each shard will hold an average of 10,000 documents. So if my app scales to 100 million users, there will be 10,000 shards, each having about 10,000 documents for this User collection. To ensure even distribution, the _id will be used as the shard key. The only time a findOne by email is used is during user login, which happens infrequently. The findOne by _id will be targeted.
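For reference, a minimal sketch of how that shard key could be declared in mongosh. The namespace is hypothetical, and a hashed key is assumed, since ranged sharding on monotonically increasing ObjectIds would send all new inserts to the same shard:

    // Hypothetical namespace; a hashed _id spreads inserts evenly
    sh.enableSharding("appdb");
    sh.shardCollection("appdb.users", { _id: "hashed" });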

The third type of query, findMany by arbitrary combination of filters, will be scatter and gather. And that is precisely the intention. The only way to scale the 3rd type of query is to divide and conquer.

Questions:

  1. Does this strategy seem correct? Am I on the right or wrong path? There are many shards (10,000), but each shard only contains 10,000 small documents.
  2. Will the third type of query, findMany by arbitrary combination of 1000 filters, scale as I explain it? What am I missing if it won’t scale?
  3. Does the overhead of the mongos orchestrating communication between 10,000 nodes undo the speed increase from each machine only querying a collection of 10,000 documents?

2 Answers


  1. I read your question and your proposed strategy, and I am impressed with how you’re thinking about it. Your proposed solution seems solid, but there are a few things you should watch out for.

    The findMany users by a set of filters can work, but it depends on the nature of your data and on the filters coming from an arbitrary combination of those fields.

    I think that if the filters are highly selective and can efficiently narrow down the result set, then the scatter-and-gather approach should scale reasonably well. However, if the filters are not selective enough or if they result in a large number of documents being examined across multiple shards, you may encounter performance issues due to the overhead of gathering and processing data from multiple shards.

    Don’t forget to analyze the types of queries your application will be performing to be sure that the sharding strategy aligns with the expected workload.

    Additionally, consider optimizing your query patterns and indexes to minimize the number of documents that need to be examined across shards.
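    One way to check this is explain’s execution stats, which a sharded cluster breaks down per shard (the filter fields here are placeholders):

    // Compare totalDocsExamined with nReturned per shard to gauge
    // how selective the combined filter really is
    db.users.find({ field1: "x", field2: { $gt: 5 } }).explain("executionStats");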

    Ok, the next thing on my mind is the concern you raised regarding the overhead of the mongos orchestrating communication between 10,000 nodes.

    The overhead of mongos orchestrating communication between shards can be significant, especially with a large number of shards. I know this because I have experienced this problem in a large project.

    I solved this problem by keeping the individual shard sizes relatively small and the queries well-targeted, which made the overhead manageable. I was also forced to optimize my queries, index appropriately, and leverage caching mechanisms, all of which helped mitigate this overhead.

    // to check shard distribution and balancing status
    sh.status();
    
    // chunk rebalancing is handled by the balancer; verify it is
    // enabled and start it if needed
    sh.getBalancerState();
    sh.startBalancer();
    
    /*
     Hey man, don't forget to regularly monitor the 
     distribution of data across shards and manually 
     rebalance them if necessary to ensure even data 
     distribution and optimal performance.
    */
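    To see how the data is actually spread, mongosh also has a per-collection helper (assuming your collection is db.users):

    // Print per-shard data size and document counts for the collection
    db.users.getShardDistribution();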
    

    I want to talk about indexing next but I know you already know this.

    Index. You can’t do this enough. Indexing is the quickest way to solve the performance problems inherent in distributed computing.

    Analyze the typical queries performed by each call and create indexes that efficiently cover the fields involved in those queries.

    Sparse indexes can also be useful for queries that only target a subset of fields within your documents.
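    A sketch of one (the field name is a placeholder); documents that lack the field are simply left out of the index, keeping it small:

    // Sparse index: only documents containing field3 are indexed
    db.users.createIndex({ field3: 1 }, { sparse: true });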

    Don’t forget to regularly review and optimize your indexes as your application evolves to ensure optimal query performance across all shards.

    // you can create indexes on fields you suspect will 
    // be used commonly
    db.users.createIndex({ field1: 1 });
    
    db.users.createIndex({ field2: 1 });
    

    The last point is personal for me because I paid a steep price for neglecting it or at least for not implementing it properly.

    MONITORING

    You must monitor and optimize continuously to maintain the performance and scalability of your sharded cluster.

    Please be sure to track key performance metrics: things like query latency, throughput, and resource utilization across all shards. You can then use this data to identify performance bottlenecks, hotspots, and areas for optimization.

    Regularly review and analyze query execution plans, index usage, and system metrics to identify opportunities for improvement. You should also stay up to date on releases and best practices for MongoDB sharding, because the team at MongoDB is always improving the system. Here’s a little something for you. You may or may not need it.

    // Enable profiling to capture query performance data
    db.setProfilingLevel(2);
    
    // Analyze slow queries to identify performance bottlenecks
    db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 }).limit(10);
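    One caveat: level 2 profiles every operation, which can itself add load on a busy system. Level 1 with a slowms threshold is a lighter option that only records slow operations:

    // Only profile operations slower than 100 ms
    db.setProfilingLevel(1, { slowms: 100 });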
    

    Ok. Not sure if this helps but those are my thoughts.

  2. You may review your data design. As usual you don’t provide any valid samples, so we can just guess. Assume you plan to write some dating app. Each user has a set of common fields, e.g. name, birthday and gender. Then a user may define up to 1000 additional fields; some users have more, some have less (otherwise you would not ask questions like `$exists: true` appears to be unable to use index). Put these special field names into an array and create an index for it. A user document may look like this:

    {
       name: "Wernfried Domscheit",
       birthday: ISODate("2020-03-23"),
       gender: "male",
       specialFields: ["fetish", "zodiac"],
       fetish: "pircings",
       zodiac: "libra"
    }
    

    Then create indexes on the common fields, e.g. { birthday: 1 }, and/or on { specialFields: 1 }.

    Only a few users will define "fetish" in their profile, thus a filter { specialFields: "fetish" } will return only the few users who defined this attribute. Scanning these few documents on { fetish: "wet socks" } will be fast.
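    Putting that together in mongosh (index and field names taken from the sample document above):

    // Index the array of defined special-field names
    db.users.createIndex({ specialFields: 1 });

    // The index narrows the candidates to the few users who defined
    // "fetish"; matching its value then scans only that small subset
    db.users.find({ specialFields: "fetish", fetish: "wet socks" });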

    Don’t try to make an index that always returns a unique result. Use the index to return a small subset of all documents. Performance-wise, it does not matter if you need to do a full scan over a small subset of documents.
