skip to Main Content

I am trying to benchmark MongoDB using berlinMOD data and queries. To represent a single vehicle’s trip (trajectory), I’m using multiple documents (point based representation). a Sample:

[{
  "_id": 3,
  "tripid": 11,
  "geom_lnglat": {
    "type": "Point",
    "coordinates": [
      4.306468999999999,
      50.873466799999974
    ]
  },
  "vehid": 2,
}, ...,
{
  "_id": 0,
  "tripid": 1,
  "geom_lnglat": {
    "type": "Point",
    "coordinates": [
      4.4579241,
      50.88904930000001
    ]
  },
  "vehid": 1,
}]

The query I’m trying to answer is: "Find minimum distance between places, where a vehicle with a licence from querylicences and a vehicle with a licence from querylicences have been". Meaning compare trajectory of one vehicle with the trajectory of all the other vehicles and find the minimum distance between them.

Output would documents which have ids of 2 vehicle and the minimum distance between them E.g:

{
vehid: 3
vehid2: 1
distance: 9586.571537357966
}

The Issue: I’m facing is that I can ran a query successfully on a small subset of the data but when I run it on a collection with 2.5 million documents the query behaves abnormally. The query progressively takes more storage, and after 4 days it took around 1 TB, exhausting the storage, and then crashed giving storage error. PS. I had 1 TB storage drive.

The query I currently have is:

[{ $lookup: {
     // self-join with tmp_trips
      from: "tmp_trips",
      let: {
        cur_tripid: "$tripid",
        cur_vehid: "$vehid",
        cur_point: "$geom_lnglat",
      },
      as: "result",
      pipeline: [
        {
          $geoNear: {
            near: "$$cur_point",
            distanceField: "dist_from_point",
            query: {
              $expr: {
                // Exclude documents with the same vehid
                $lt: ["$vehid", "$$cur_vehid"],
              },
            },
            spherical: true,
          }, }, ], },
  },
  { $unwind: {
      path: "$result",
      preserveNullAndEmptyArrays: false,
    },
  },
  { $sort:
      {
        "result.dist_from_point": 1,
        vehid: 1,
        "result.vehid": 1,
      },
  },
  { $group:
      {
        _id: {
          vehid: "$vehid",
          vehid2: "$result.vehid",
        },
        distance: {
          $first: "$result.dist_from_point",
        },  },  },]

It works on the small collection: tmp_trips but when I move it to a bigger collection it doesn’t converge.
I ran another version of the query where the constraint was to find cars within 10 KMs of each other, and that query returned results after 26 minutes. In the this modified query I specify the field maxDistance: 10 in the $geoNear operator. My guess is that this dramatically reduces the collections that need to be returned in the $results array of the $lookup operator.
So my question is can I improve this query somehow or make it work in someway where the entire query is managed by the database?

I think maybe if I write some for loop in python I might be able to do this query, but is there a better way?

For reference an equivalent Postgres query, with PostGIS extension is:

SELECT L1.Licence AS Licence1, L2.Licence AS Licence2, 
    MIN(ST_Distance(T1.geom_point, T2.geom_point)) AS MinDist
FROM trip_postgis T1 INNER JOIN Querylicences L1 ON T1.vehid = L1.vehid,
    trip_postgis T2 INNER JOIN Querylicences L2 ON T2.vehid = L2.vehid
WHERE L1.licence < L2.licence
GROUP BY L1.Licence, L2.Licence
ORDER BY L1.Licence, L2.Licence;

2

Answers


  1. Chosen as BEST ANSWER

    The query that finally gave me the required result is:

    db.tmp_trips.aggregate([
      {
        $lookup: {
          from: "tmp_trips",
          as: "result",
          let: {
            cur_vehid: "$vehid",
            coordinates: "$geom_lnglat.coordinates",
            cur_point: "$geom_lnglat",
          },
          pipeline: [
            {
              $geoNear: {
                near: "$$cur_point",
                distanceField: "dist_from_point",
                query: {
                  $expr: {
                    // Exclude documents with the same vehid
                    $lt: ["$vehid", "$$cur_vehid"],
                  },
                },
                spherical: true,
              },
            },
            {
              // for each distinct vehicleID in the other collection
              // only retain the closet point.
              $group: {
                _id: {
                  vehid: "$vehid",
                },
                distance: {
                  $first: "$dist_from_point",
                },
              },
            },
          ],
        },
      },
      {
        $unwind: {
          path: "$result",
        },
      },
      {
        $sort: {
          vehid: 1,
          "result.distance": 1,
        },
      },
      {
        $group: {
          _id: {
            vehid: "$vehid",
            vehid2: "$result._id.vehid",
          },
          distance: {
            $first: "$result.distance",
          },
        },
      },
    ])
    

    The following image shows the output that I was looking for. Image showing the result of the query.

    As @WernfriedDomscheit suggested the Pythagorean method could also be used instead of $geoNear. In my case the $geoNear looks cleaner but I'll try both to see which performs better.

    I needed group because the requirement is to get the shortest distance between different sets of vehicles, simply using $limit in the $lookup returns shortest distance with only 1 vehicle, using $group on vehid fixes this issue.

    Below is playground with 15 data points and the query using square_distance, cuz $geoNear doesn't work without indexes.

    https://mongoplayground.net/p/38Lb4kDPcwd


  2. Maybe try this one. In the lookup, there is no need to calculate the exact distance between two points, you just need the shortest distance, thus Pythagorean is sufficient. Then select the document with the shortest distance. By this you don’t need any $unwind nor $group

    If needed calculate the actual distance in Meters

    db.collection.aggregate([
      {
        $lookup: {
          from: "collection",
          as: "result",
          let: {
            vehid: "$vehid",
            coordinates: "$geom_lnglat.coordinates"
          },
          pipeline: [
            {
              $match: {
                $expr: {
                  $ne: [ "$vehid", "$$vehid" ]
                }
              }
            },
            {
              $set: {
                square_distance: {
                  // (lon_1 - lon_2)² + (lat_1 - lat_2)²
                  $add: [
                    {
                      $pow: [
                        {
                          $subtract: [
                            { $first: "$$coordinates" },
                            { $first: "$geom_lnglat.coordinates" }
                          ]
                        },
                        2
                      ]
                    },
                    {
                      $pow: [
                        {
                          $subtract: [
                            { $last: "$$coordinates" },
                            { $last: "$geom_lnglat.coordinates" }
                          ]
                        },
                        2
                      ]
                    }
                  ]
                }
              }
            }
          ]
        }
      },
      // Take the shortest one
      {
        $set: {
          result: {
            $first: {
              $sortArray: {
                input: "$result",
                sortBy: { square_distance: 1 }
              }
            }
          }
        }
      },
      // some cosmetic, e.g. calculate exact distance using Haversine formula as shown in my comment
      {
          $set: {
             distance: {
                $let: {
                   vars: {
                      dlon: { $degreesToRadians: { $subtract: [{ $first: "$geom_lnglat.coordinates" }, { $first: "$result.geom_lnglat.coordinates" }] } },
                      dlat: { $degreesToRadians: { $subtract: [{ $last: "$geom_lnglat.coordinates" }, { $last: "$result.geom_lnglat.coordinates" }] } },
                      lat1: { $degreesToRadians: { $last: "$geom_lnglat.coordinates" } },
                      lat2: { $degreesToRadians: { $last: "$result.geom_lnglat.coordinates" } }
                   },
                   in: {
                      // Haversine formula: sin²(dLat / 2) + sin²(dLon / 2) * cos(lat1) * cos(lat2);
                      $add: [
                         { $pow: [{ $sin: { $divide: ["$$dlat", 2] } }, 2] },
                         { $multiply: [{ $pow: [{ $sin: { $divide: ["$$dlon", 2] } }, 2] }, { $cos: "$$lat1" }, { $cos: "$$lat2" }] }
                      ]
                   }
                }
             }
          }
       },
       {
          $set: {
             distance: {
                // Distance in Meters given by "6378.1 * 1000"
                $multiply: [6378.1, 1000, 2, { $asin: { $sqrt: "$distance" } }]
             }
          }
       }
    ])
    

    Or limit the number of documents already inside the $lookup:

    db.collection.aggregate([
      {
        $lookup: {
          from: "collection",
          as: "result",
          let: {
            vehid: "$vehid",
            coordinates: "$geom_lnglat.coordinates"
          },
          pipeline: [
            {
              $match: {
                $expr: {
                  $ne: [ "$vehid", "$$vehid" ]
                }
              }
            },
            {
              $set: {
                square_distance: {
                  // (lon_1 - lon_2)² + (lat_1 - lat_2)²
                  $add: [
                    {
                      $pow: [
                        {
                          $subtract: [
                            { $first: "$$coordinates" },
                            { $first: "$geom_lnglat.coordinates" }
                          ]
                        },
                        2
                      ]
                    },
                    {
                      $pow: [
                        {
                          $subtract: [
                            { $last: "$$coordinates" },
                            { $last: "$geom_lnglat.coordinates" }
                          ]
                        },
                        2
                      ]
                    }
                  ]
                }
              }
            },
            { $sort: { square_distance: 1 } },
            { $limit: 1 }
          ]
        }
      },
      {
        $set: {
          result: { $first: "$result" }
        }
      }
      // same cosmetic as above
    ])
    

    Mongo Playground

    Or with $geoNear:

    db.collection.aggregate([
      {
        $lookup: {
          from: "tmp_trips",
          let: {
            cur_vehid: "$vehid",
            cur_point: "$geom_lnglat"
          },
          as: "result",
          pipeline: [
            {
              $geoNear: {
                near: "$$cur_point",
                distanceField: "dist_from_point",
                query: {
                  $expr: {
                    $ne: [ "$vehid", "$$cur_vehid" ]
                  }
                },
                spherical: true
              }
            },
            //{ $sort: { dist_from_point: 1 } }, -> not needed, distances are already sorted
            { $limit: 1 }
          ]
        }
      },
      {
        $set: {
          result: {
            $first: "$result"
          }
        }
      }
    ])
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search