
Currently, I’m using a small child document to reference other collections:

{ name: 'xxx', id: 'xxx'}

.. instead of a manual reference.

So, for a forum post, I might have this:

{
    title: 'Some title',
    creator: { name: 'jgauffin', id: ObjectId(123445455) },
    posts: [...]
}

The reason is that I do not have to do any lookups in other collections each time I fetch a document, and names hardly change.

But since I’m new to this, are there any established design patterns for this? Or are you supposed to do $lookup even for a single field in other collections?

2 Answers


  1. Quick first thing: "{ name: 'xxx', id: 'xxx'} .. instead of a manual reference."
    That is a manual reference, and that is the recommended way, i.e. a field which holds an ObjectId, or whichever type you use as the _id in the target collection:

    A manual reference is the practice of including one document’s _id field in another document. The application can then issue a second query to resolve the referenced fields as needed.
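
    A minimal sketch of that second query, assuming the users live in a users collection and the posts in forum_posts:

    // fetch the post, then resolve the embedded manual reference when more user fields are needed
    const post = db.forum_posts.findOne({ title: "Some title" });
    const creator = db.users.findOne({ _id: post.creator.id });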

    The other option is a DBRef, for which the docs say to use manual refs unless you need to reference multiple collections.

    Are there any established design patterns for this?

    Yes, the main rule is "data which is accessed together should be stored together". So if you display forum posts with the title, username and link to their profile and this happens often, then that is the correct option.

    I would add the full name (if displayed) and the profile link if it’s not being auto-generated in the HTML/JS directly from the username as /user/<name>, or from the ObjectId as /profile/<id.toString()>.

    As for handling a user changing their username or fullname, that’s to be done in application code. So you’d update the user and then do an updateMany() in all affected collections with:

    db.forum_posts.updateMany(
      { "creator.id": ObjectId(12345) },
      { $set: { "creator.name": "newname" } }
    )
    

    As you’ve already said, "names hardly change", which is what makes it okay.

    In case of huge volumes or an event-driven architecture, you could handle those updates in serverless functions or separate backend services, and in chunks.
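
    A hedged sketch of doing that propagation in bounded chunks from mongosh (the batch size and the placeholder variables are assumptions):

    const userId = ObjectId();        // hypothetical user _id
    const newName = "newname";
    let batch;
    do {
      // grab up to 1000 posts that still carry the old name
      batch = db.forum_posts.find(
        { "creator.id": userId, "creator.name": { $ne: newName } },
        { _id: 1 }
      ).limit(1000).toArray();
      if (batch.length > 0) {
        db.forum_posts.updateMany(
          { _id: { $in: batch.map(d => d._id) } },
          { $set: { "creator.name": newName } }
        );
      }
    } while (batch.length > 0);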

    And do the $lookup when you need it, but avoid patterns where you need it often; used that way it’s okay, as long as it’s not over-used.
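
    When you do reach for it, a minimal $lookup sketch could look like this (the users collection name is an assumption):

    db.forum_posts.aggregate([
      { $lookup: {
          from: "users",              // collection holding the referenced user documents
          localField: "creator.id",
          foreignField: "_id",
          as: "creatorDoc"
      } },
      { $unwind: "$creatorDoc" }      // one creator per post, so flatten the resulting array
    ])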

    Arrays of Sub-documents:

    The next part is the posts: [...] field. Having unbounded arrays is an anti-pattern. The recommendation is to limit arrays of sub-documents to 200, and keep the rest in a separate collection such as more_posts, with 200 per document. There are variations based on usage: you could keep everything in all_posts as 200-sized docs and store only the "hottest" 50, or the most recent 20, in the forum collection, depending on how you actually use it (sketch below).
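
    As a rough sketch of keeping the embedded array bounded while every post still lands in a separate collection (the collection names and placeholder variables are assumptions; the cap of 200 is the recommendation above):

    const threadId = ObjectId();                                // hypothetical thread _id
    const newPost = { body: "hello", createdAt: new Date() };   // hypothetical post
    db.all_posts.insertOne(newPost);                            // full history kept in its own collection
    db.forum_posts.updateOne(
      { _id: threadId },
      { $push: { posts: { $each: [newPost], $slice: -200 } } }  // embed only the 200 most recent posts
    );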

    Schema as per Usage, not as per Data

    In MongoDB & other NoSQL DBs, your schema should match your usage pattern, not the "ideal normalised form". This includes duplicating some data in multiple places if it’s frequently accessed together. If you have RDBMS/SQL experience, it can be very tempting to go for 3NF/5NF etc. right away; this is the most common anti-pattern. The other is the opposite: storing all data together regardless of how it’s accessed.

    Official Documentation

    I would recommend the official docs since opinion-based answers just like this one 😂 can become inconsistent:

    1. Building with Patterns: A Summary
      • links to official posts about each of those patterns
    2. Data Model Examples and Patterns
    3. Avoid Unbounded Arrays
    4. Storing data which is accessed together & Array size (youtube)
      • from that time stamp, the vid mentions limiting array sizes and the 200 sub-docs recommended limit
      • looking for official written material which mentions the 200 recommendation
    5. Practical MongoDB Aggregations Book, online/ebook mentioned frequently by the MongoDB team
  2. Short answer:
    Yes, this approach is in line with the guiding principle that data accessed together should be stored together. The principle is realised by creating data aggregates, where a single data aggregate wholly represents a single transaction. The sample document you have shared is a clear instance of modelling such a data aggregate.

    Detailed answer:
    Normalised data is easier to maintain but harder to access; denormalised data works the opposite way, harder to maintain but easier to access. Which approach to adopt therefore depends entirely on the context. If the context favours denormalised data, there is more to discuss.

    How far can you denormalise data in a relational data model? There is certainly a limit: a row can store only simple values and cannot contain nested data structures, and a row does not represent a real-world transaction, which usually involves many rows of data. These inherent limitations of the relational model lead to aggregate-oriented data models.

    To get the best out of denormalisation, you need to create the best possible data aggregates. A data aggregate is a complete representation of a real-world transaction: a single read makes all the data related to that transaction available. As mentioned in the beginning, the guiding principle for creating a data aggregate is that data accessed together is stored together. So, in place of a row or a set of rows, a single data aggregate becomes the unit of data, and it is the same unit for storage, retrieval, and even distribution or sharding. The atom of data becomes the user-defined data aggregate. The document and column-family data models are both based on this aggregation approach.
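
    For instance, with the document from the question, a single read returns the whole aggregate, so rendering a thread needs no joins (a sketch reusing the question's collection name):

    db.forum_posts.findOne({ title: "Some title" })
    // => { title: 'Some title', creator: { name: 'jgauffin', id: ... }, posts: [...] }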

    Data aggregates also remove the impedance mismatch, since developers see and interact with data in the same form in which it is presented to end users; in that sense, aggregates should reduce the overhead of using ORM tools. They also make cluster computing practical, enabling horizontal scaling of computing resources.

    In a nutshell, with the data aggregation approach the atom of data becomes the user-defined data aggregate.

