I am trying to benchmark MongoDB using berlinMOD data and queries. To represent a single vehicle’s trip (trajectory), I’m using multiple documents (point based representation). a Sample:
[{
"_id": 3,
"tripid": 11,
"geom_lnglat": {
"type": "Point",
"coordinates": [
4.306468999999999,
50.873466799999974
]
},
"vehid": 2,
}, ...,
{
"_id": 0,
"tripid": 1,
"geom_lnglat": {
"type": "Point",
"coordinates": [
4.4579241,
50.88904930000001
]
},
"vehid": 1,
}]
The query I’m trying to answer is: "Find minimum distance between places, where a vehicle with a licence from querylicences
and a vehicle with a licence from querylicences
have been". Meaning compare trajectory of one vehicle with the trajectory of all the other vehicles and find the minimum distance between them.
Output would documents which have ids of 2 vehicle and the minimum distance between them E.g:
{
vehid: 3
vehid2: 1
distance: 9586.571537357966
}
The Issue: I’m facing is that I can ran a query successfully on a small subset of the data but when I run it on a collection with 2.5 million documents the query behaves abnormally. The query progressively takes more storage, and after 4 days it took around 1 TB, exhausting the storage, and then crashed giving storage error. PS. I had 1 TB storage drive.
The query I currently have is:
[{ $lookup: {
// self-join with tmp_trips
from: "tmp_trips",
let: {
cur_tripid: "$tripid",
cur_vehid: "$vehid",
cur_point: "$geom_lnglat",
},
as: "result",
pipeline: [
{
$geoNear: {
near: "$$cur_point",
distanceField: "dist_from_point",
query: {
$expr: {
// Exclude documents with the same vehid
$lt: ["$vehid", "$$cur_vehid"],
},
},
spherical: true,
}, }, ], },
},
{ $unwind: {
path: "$result",
preserveNullAndEmptyArrays: false,
},
},
{ $sort:
{
"result.dist_from_point": 1,
vehid: 1,
"result.vehid": 1,
},
},
{ $group:
{
_id: {
vehid: "$vehid",
vehid2: "$result.vehid",
},
distance: {
$first: "$result.dist_from_point",
}, }, },]
It works on the small collection: tmp_trips
but when I move it to a bigger collection it doesn’t converge.
I ran another version of the query where the constraint was to find cars within 10 KMs of each other, and that query returned results after 26 minutes. In the this modified query I specify the field maxDistance: 10
in the $geoNear
operator. My guess is that this dramatically reduces the collections that need to be returned in the $results
array of the $lookup
operator.
So my question is can I improve this query somehow or make it work in someway where the entire query is managed by the database?
I think maybe if I write some for loop in python I might be able to do this query, but is there a better way?
For reference an equivalent Postgres query, with PostGIS extension is:
SELECT L1.Licence AS Licence1, L2.Licence AS Licence2,
MIN(ST_Distance(T1.geom_point, T2.geom_point)) AS MinDist
FROM trip_postgis T1 INNER JOIN Querylicences L1 ON T1.vehid = L1.vehid,
trip_postgis T2 INNER JOIN Querylicences L2 ON T2.vehid = L2.vehid
WHERE L1.licence < L2.licence
GROUP BY L1.Licence, L2.Licence
ORDER BY L1.Licence, L2.Licence;
2
Answers
The query that finally gave me the required result is:
The following image shows the output that I was looking for. Image showing the result of the query.
As @WernfriedDomscheit suggested the Pythagorean method could also be used instead of $geoNear. In my case the $geoNear looks cleaner but I'll try both to see which performs better.
I needed group because the requirement is to get the shortest distance between different sets of vehicles, simply using $limit in the $lookup returns shortest distance with only 1 vehicle, using $group on
vehid
fixes this issue.Below is playground with 15 data points and the query using
square_distance
, cuz $geoNear doesn't work without indexes.https://mongoplayground.net/p/38Lb4kDPcwd
Maybe try this one. In the lookup, there is no need to calculate the exact distance between two points, you just need the shortest distance, thus Pythagorean is sufficient. Then select the document with the shortest distance. By this you don’t need any
$unwind
nor$group
If needed calculate the actual distance in Meters
Or limit the number of documents already inside the
$lookup
:Mongo Playground
Or with
$geoNear
: