
Here is a snapshot of my code.

exports.getTestCases = async (bucketName, startTimeStamp, endTimeStamp) => {
  let results = [];
  let startTime = parseInt(startTimeStamp);
  let endTime = parseInt(endTimeStamp);
  const maxDataLimit = 1500;
  const maxYearsBack = 10;
  const oneYearInMilliseconds = 31536000000;

  let query = {
    executiondate: { $gte: startTime, $lte: endTime } // --> 'executiondate' is indexed in DB
  };

  const projection = { /* some of my required fields */ };

  const DynamicModel = getModelForCollection(bucketName);

  while (results.length < maxDataLimit) {
    console.log(`Fetching data: Start Time: ${new Date(startTime)}, End Time: ${new Date(endTime)}`);
    try {
      results = await DynamicModel.find(query, projection)
        .sort({ executiondate: -1 })
        .limit(maxDataLimit)
        .lean()
        .exec();
      console.log(`Total records fetched: ${results.length}`);
      if (results.length >= maxDataLimit) break;
    } catch (error) {
      console.error("Failed to fetch data:", error);
      throw error;
    }

    // NOTE: despite its name, endTime is reused here as the moving lower
    // ($gte) bound of the query window, pushed back one year per pass.
    endTime -= oneYearInMilliseconds;
    query.executiondate.$gte = endTime;

    if ((startTime - endTime) > maxYearsBack * oneYearInMilliseconds) {
      break;
    }
  }

  return results.slice(0, maxDataLimit);
};

What this code does is fetch up to 10 years of data if the max limit is not met, widening the window one year at a time. The problem is that it takes at least 7-8 seconds to fetch, say, 600 records because of the iteration. bucketName is the name of the MongoDB collection.

Is there any way to improve the fetch? Maybe a different approach to rewrite the whole logic? I cannot think of an alternative to the loop, which I am guessing is the culprit. I am already using lean(), which gave a slight improvement but not a satisfactory one. I am new to Node.js and MongoDB, so any suggestion will be helpful.

2 Answers


  1. Your query on executiondate is doing a full scan of the collection every time and then sorting the results. Think about it: how could the engine do the comparisons without looking at every document?
    The best-practice sequence is "ESR": Exact match (equality) first, then Sort, then Range. You are doing the sort after the range.
    Also, if you haven't done so, create an index for executiondate (a sketch follows). I hope this helps.
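    A minimal sketch of creating that index, assuming the Mongoose schema behind getModelForCollection() is accessible (the schema variable name below is made up for illustration):

      // Hypothetical schema variable; descending matches the sort direction used in the query.
      testCaseSchema.index({ executiondate: -1 });

      // Or as a one-off in the mongo shell (substitute your collection name):
      // db.getCollection("<bucketName>").createIndex({ executiondate: -1 })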

  2. That query is properly indexed, but the loop seems a bit inefficient.

    If we consider a run that requires reading back to 10 years, what will happen is:

    1. request the most recent 1 year worth of data
    2. server consults the index, and fetches the appropriate documents
    3. server applies the projection so only the selected fields are present
    4. entire result set is transferred over the network to the client
    5. client checks the array length, finds it lacking

    Then

    1. request the most recent 2 years worth of data
    2. server consults the index, and fetches the appropriate documents
    3. server applies the projection so only the selected fields are present
    4. entire result set is transferred over the network to the client
    5. client checks the array length, finds it lacking

    Then repeat, again, 8 more times.

    This means that if you have 100 documents per year for the last 10 years, this full process will return 1000 documents, but the documents for the most recent year will have been fetched and transferred over the network 10 times, the second year 9 times, and so on, for a total of 100 × (10 + 9 + … + 1) = 5,500 documents fetched and transferred over the network.

    I can think of 2 different ways to improve this loop:

    • build the result set incrementally
      Initialize results to an empty array before the loop starts, then inside each iteration query only for the current year's worth of data, i.e. $gte: endTime, $lt: lastEndTime, and append the returned documents to the results array (see the first sketch after this list)
    • count first, then fetch
      Counting documents with a query on an indexed field can use a very efficient COUNT_SCAN, so instead of returning documents in the loop, just ask the server for the count of documents in the range. A COUNT_SCAN does not need to read any documents and returns only a number over the network, so 10 count commands might even complete faster than a single fetch. After the loop you will know which start time yields enough documents to satisfy the requirement, so a single find query after the loop can get all the necessary documents in one go (see the second sketch after this list)
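    A rough sketch of the incremental option, reusing the names from the question (simplified for illustration: the original startTimeStamp lower bound is ignored here):

      exports.getTestCases = async (bucketName, startTimeStamp, endTimeStamp) => {
        const maxDataLimit = 1500;
        const maxYearsBack = 10;
        const oneYearInMilliseconds = 31536000000;
        const projection = { /* required fields */ };
        const DynamicModel = getModelForCollection(bucketName);

        let results = [];
        let upper = parseInt(endTimeStamp, 10);             // $lte bound of the current one-year slice
        let lower = upper - oneYearInMilliseconds;          // $gte bound of the current one-year slice
        const oldest = upper - maxYearsBack * oneYearInMilliseconds;

        while (results.length < maxDataLimit && upper > oldest) {
          // Each one-year slice is fetched exactly once, so no document crosses the network twice.
          const slice = await DynamicModel.find(
            { executiondate: { $gte: lower, $lte: upper } },
            projection
          )
            .sort({ executiondate: -1 })
            .limit(maxDataLimit - results.length)
            .lean()
            .exec();

          results = results.concat(slice);

          // Move the window one year further into the past.
          upper = lower - 1;
          lower = upper - oneYearInMilliseconds;
        }

        return results;
      };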
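    And a sketch of the count-first option (again illustrative; it assumes the count can be resolved from the executiondate index, which a plain range filter allows):

      exports.getTestCases = async (bucketName, startTimeStamp, endTimeStamp) => {
        const maxDataLimit = 1500;
        const maxYearsBack = 10;
        const oneYearInMilliseconds = 31536000000;
        const projection = { /* required fields */ };
        const DynamicModel = getModelForCollection(bucketName);

        const endTime = parseInt(endTimeStamp, 10);
        let lowerBound = parseInt(startTimeStamp, 10);
        const oldestAllowed = endTime - maxYearsBack * oneYearInMilliseconds;

        // Walk the lower bound back one year at a time, asking only for counts.
        while (lowerBound > oldestAllowed) {
          const count = await DynamicModel.countDocuments({
            executiondate: { $gte: lowerBound, $lte: endTime }
          });
          if (count >= maxDataLimit) break;
          lowerBound -= oneYearInMilliseconds;
        }

        // One find over the widest range that was needed; the documents travel only once.
        return DynamicModel.find(
          { executiondate: { $gte: lowerBound, $lte: endTime } },
          projection
        )
          .sort({ executiondate: -1 })
          .limit(maxDataLimit)
          .lean()
          .exec();
      };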