I have a collection with 10M records, with the following schema:
{
  "_id": ObjectId,
  "uid": string
}
uid is a random string for testing purposes, generated with unique values.
There are 2 indexes: {"_id": 1} and {"uid": 1}.
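For reference, a minimal reproduction sketch in the mongosh shell (the data-generation helper makeRandomUid is hypothetical — any generator of unique random strings will do):

```javascript
// Insert 10M documents in batches; makeRandomUid() is a hypothetical helper
// standing in for any unique random-string generator.
const batch = [];
for (let i = 0; i < 10_000_000; i++) {
  batch.push({ uid: makeRandomUid() });
  if (batch.length === 10_000) {
    db.test.insertMany(batch);
    batch.length = 0;
  }
}

// The {_id: 1} index exists automatically; add the secondary index on uid.
db.test.createIndex({ uid: 1 });
```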
The following two queries produce very inefficient query plans, which scan/fetch all 10M documents and take several seconds to run:
db.test.find({}).sort({"uid": 1, "_id": 1}).limit(10).explain()
(...)
"queryPlanner" : {
"namespace" : "test.test",
"indexFilterSet" : false,
"parsedQuery" : {
},
"queryHash" : "EEBF6C62",
"planCacheKey" : "BC5E0BD8",
"maxIndexedOrSolutionsReached" : false,
"maxIndexedAndSolutionsReached" : false,
"maxScansToExplodeReached" : false,
"winningPlan" : {
"stage" : "SORT",
"sortPattern" : {
"uid" : 1,
"_id" : 1
},
"memLimit" : 104857600,
"limitAmount" : 10,
"type" : "simple",
"inputStage" : {
"stage" : "COLLSCAN",
"direction" : "forward"
}
},
"rejectedPlans" : [ ]
},
(...)
db.test.find({"uid": {"$gt": "a"}}).sort({"uid": 1, "_id": 1}).limit(10).explain()
(...)
"queryPlanner" : {
"namespace" : "test.test",
"indexFilterSet" : false,
"parsedQuery" : {
"uid" : {
"$gt" : "a"
}
},
"queryHash" : "F4847212",
"planCacheKey" : "8F79D766",
"maxIndexedOrSolutionsReached" : false,
"maxIndexedAndSolutionsReached" : false,
"maxScansToExplodeReached" : false,
"winningPlan" : {
"stage" : "SORT",
"sortPattern" : {
"uid" : 1,
"_id" : 1
},
"memLimit" : 104857600,
"limitAmount" : 10,
"type" : "simple",
"inputStage" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"uid" : 1
},
"indexName" : "uid_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"uid" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"uid" : [
"(\"a\", {})"
]
}
}
}
},
"rejectedPlans" : [ ]
},
(...)
The analogous data model and queries in PostgreSQL produce very efficient query plans that run in <1 ms:
explain (analyze, buffers)
select * from test
order by uid, id
limit 10;
Limit (cost=0.65..1.87 rows=10 width=69) (actual time=0.477..0.480 rows=10 loops=1)
Buffers: shared hit=15
-> Incremental Sort (cost=0.65..1221670.34 rows=10000000 width=69) (actual time=0.477..0.478 rows=10 loops=1)
Sort Key: uid, id
Presorted Key: uid
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 25kB Peak Memory: 25kB
Buffers: shared hit=15
-> Index Scan using uid on test (cost=0.56..771670.34 rows=10000000 width=69) (actual time=0.085..0.190 rows=11 loops=1)
Buffers: shared hit=15
Planning Time: 0.400 ms
Execution Time: 0.836 ms
explain (analyze, buffers)
select * from test
where uid > 'a'
order by uid, id
limit 10;
Limit (cost=0.65..1.90 rows=10 width=69) (actual time=0.289..0.292 rows=10 loops=1)
Buffers: shared hit=15
-> Incremental Sort (cost=0.65..1246670.34 rows=10000000 width=69) (actual time=0.287..0.288 rows=10 loops=1)
Sort Key: uid, id
Presorted Key: uid
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 25kB Peak Memory: 25kB
Buffers: shared hit=15
-> Index Scan using uid on test (cost=0.56..796670.34 rows=10000000 width=69) (actual time=0.062..0.229 rows=11 loops=1)
Index Cond: ((uid)::text > 'a'::text)
Buffers: shared hit=15
Planning Time: 4.809 ms
Execution Time: 0.347 ms
Intuitively, since there is an index on uid, it should be very straightforward to fetch N rows already sorted by uid, and then sort also by _id within each uid (which in this example is useless, since uid is unique). This seems to be exactly what PostgreSQL is doing.
Are the inefficient query plans just a limitation on the MongoDB query planner, or am I doing something wrong?
2 Answers
I am not exactly sure whether to term it a limitation, but in the first case, for the query with the empty filter, MongoDB doesn't use any index: the filter object is empty, and no single index is sorted in the expected order {"uid": 1, "_id": 1}. That's why you see the COLLSCAN stage. If you run the same query sorting on only one of the indexed fields, for example sort({"uid": 1}) or sort({"_id": 1}), you'll see the IXSCAN stage come into play, because an existing index can satisfy that sort order directly. Adding a filter on uid is another way to bring the uid index into play.

For the second query, MongoDB behaves the same as PostgreSQL: it uses the uid index to filter the documents, and then sorts them by _id, as the index is already sorted by uid.

For the best-case scenario, if you need to sort by both uid and _id, in most cases you can simply create a compound index consisting of both uid and _id. After creating this index, both of your queries will use it, regardless of whether a filter is present or not.
The fact that MongoDB is not doing an incremental sort like PostgreSQL appears to be a limitation of the system. It doesn’t look like you are doing anything directly wrong in that regard. There are still two observations/recommendations that can be made here.
Compound Index
As @Charchit Kapoor mentions, a compound index on {uid: 1, _id: 1} will allow MongoDB to efficiently perform this query as written (and avoid a blocking sort altogether).

Use a Single Field
You mention the following:

"uid is a random string for testing purposes, generated with unique values."

This leads to two questions: do you really need both the uid and _id fields, and does the sort need to include _id at all (given that each uid is associated with a single _id value)?

Looking at the latter question, if the uid field is unique then appending _id to the sort does not change the results in any way. There is only a single document with a given uid value, so there is no additional ordering that the trailing _id field provides.

Assuming that you want to keep both fields, then simply dropping _id from the sort should keep the result set the same while allowing MongoDB to efficiently use the { uid: 1 } index to perform the operation.
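A sketch of the simplified query in the mongosh shell, which can use the existing { uid: 1 } index without a blocking sort:

```javascript
// With uid unique, this returns the same 10 documents in the same order as
// sorting on { uid: 1, _id: 1 }, but streams directly off the uid_1 index.
db.test.find({ uid: { $gt: "a" } }).sort({ uid: 1 }).limit(10);
```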