
Consider the following table, recipe, in a PostgreSQL database.

 Column      | Type    | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
 name        | text    |           | not null |
 date        | date    |           | not null |
 ingredients | text[]  |           | not null |
 calories    | integer |           | not null |

Assume a user wants to add a new recipe to the table and does not want duplicates in the table.

What is the best practice for inserting a new row into this table without duplicates?

Solution 1

Introduce an identification scheme before adding any recipes: say that any two recipes with the same name and ingredients are duplicates. The goal is to prevent duplicates from arising, so a unique ID based on the recipe name and ingredients is generated (somehow) and stored in a new column on the table.

 Column      | Type    | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
 id          | text    |           | not null |
 name        | text    |           | not null |
 date        | date    |           | not null |
 ingredients | text[]  |           | not null |
 calories    | integer |           | not null |

We can then use this ID as the unique primary key of the table.

One possible (admittedly crude) way of generating this ID is to concatenate the name and ingredients of a new recipe and hash the result, producing an effectively unique 256-bit string.

If I find a recipe online, try to insert the "new" recipe into my table, and find that the ID already exists, I know it is not actually a new recipe and I can safely abort the insertion.
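
A minimal sketch of what that could look like, assuming the table is created from scratch and a PostgreSQL version with the built-in sha256() function (11 or later); the 'Pancakes' row is purely an illustrative example:

    -- id holds a hash of name + ingredients and acts as the primary key
    CREATE TABLE recipe (
        id          text    PRIMARY KEY,
        name        text    NOT NULL,
        date        date    NOT NULL,
        ingredients text[]  NOT NULL,
        calories    integer NOT NULL
    );

    -- Derive the id from the name and ingredients; if it already exists,
    -- the primary key rejects the insert and we know it is a duplicate.
    INSERT INTO recipe (id, name, date, ingredients, calories)
    VALUES (
        encode(sha256(convert_to('Pancakes' || array_to_string(ARRAY['flour','egg','milk'], ','), 'UTF8')), 'hex'),
        'Pancakes', CURRENT_DATE, ARRAY['flour','egg','milk'], 350
    );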

Solution 2

Create a composite primary key made up of the name and ingredients columns.

The following example is pretty bad but bear with me.

If I search online for recipes and add them to my table, recipes that are already present but whose calorie counts have since been corrected simply have their calories updated rather than being inserted again.
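
One way to sketch this in PostgreSQL (the 'Pancakes' values are again just an illustrative example) is to declare the composite key and use ON CONFLICT to refresh the calories when the same recipe shows up again:

    CREATE TABLE recipe (
        name        text    NOT NULL,
        date        date    NOT NULL,
        ingredients text[]  NOT NULL,
        calories    integer NOT NULL,
        PRIMARY KEY (name, ingredients)
    );

    -- Re-inserting an existing recipe just updates its calories.
    INSERT INTO recipe (name, date, ingredients, calories)
    VALUES ('Pancakes', CURRENT_DATE, ARRAY['flour','egg','milk'], 340)
    ON CONFLICT (name, ingredients)
    DO UPDATE SET calories = EXCLUDED.calories;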

Solution 3

Upon every insert to the table, check the new recipe against the name and ingredients of all existing recipes. If there’s a collision, update the calories column of the recipe in the table.
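
A sketch of that check-then-write logic, with no constraint on the table and the same illustrative 'Pancakes' values as above:

    -- Look for an existing recipe with the same name and ingredients...
    SELECT 1
    FROM recipe
    WHERE name = 'Pancakes'
      AND ingredients = ARRAY['flour','egg','milk'];

    -- ...and on a collision update its calories,
    UPDATE recipe
    SET calories = 340
    WHERE name = 'Pancakes'
      AND ingredients = ARRAY['flour','egg','milk'];

    -- otherwise insert it as a new row.
    INSERT INTO recipe (name, date, ingredients, calories)
    VALUES ('Pancakes', CURRENT_DATE, ARRAY['flour','egg','milk'], 340);

Without an index on (name, ingredients), each of these statements has to scan the whole table, which is exactly the inefficiency raised below.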

If my recipe example is scaled up to thousands or millions of rows, solutions 1 and 3 break.

Risks and Constraints

Solution 1 breaks because hashing collisions become a real possibility at large scale.

Solution 3 "breaks" comparing every single new entry against every single row in a very large table would be horribly inefficient.

3 Answers


  1. Which fields make up a "unique identity" is a business requirement, not a technical one. For example, it would be reasonable and customary to impose that the Name of a Recipe must be unique, and that's it: no need for more complication, and no sense in saying two recipes are distinct just because I wrote 815 calories and you wrote 810, the rest being the same. Then, to create a variant or slightly different version of a recipe that is already in, just give it a new, unique name; that is enough to distinguish what is what when a user is browsing the list of Recipe names in the catalogue.

    Or, for a Customers entity in an accounting system, one would maybe want a VAT number if those are nationally unique (they aren't in all countries, and then some added element is needed); while for an ecommerce site an email is usually enough, since that suffices for individual legal responsibility should the case arise, and a person with multiple accounts (under different emails) is not per se a problem. And so on; those are just examples of the line of reasoning.

    OTOH, the technical problem of how to implement any such uniqueness constraint is in fact just that: either make a primary key out of those fields, or (more commonly) put an ID or a GUID in the table as the PK and also put a uniqueness constraint on the field or group of fields that must be unique.

    Finally, on the performance concerns: if you use the DBMS appropriately there are no such problems; the DBMS implements this functionality in a highly optimized way. In particular, a uniqueness constraint is backed by an underlying unique index (similarly for a PK), and no full scan of the table's rows is involved in enforcing these constraints (with possibly some caveats that depend on the specific DBMS).
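
    As a rough sketch of that second pattern (the surrogate key setup and constraint on name are assumptions for illustration, not something given in the question):

        -- Surrogate key as the PK, plus a uniqueness constraint on the
        -- field(s) the business rules declare to be the identity (here: name).
        CREATE TABLE recipe (
            id          bigint  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            name        text    NOT NULL UNIQUE,
            date        date    NOT NULL,
            ingredients text[]  NOT NULL,
            calories    integer NOT NULL
        );

        -- A duplicate name is rejected by the unique index backing the
        -- constraint; no full table scan is involved.
        INSERT INTO recipe (name, date, ingredients, calories)
        VALUES ('Pancakes', CURRENT_DATE, ARRAY['flour','egg','milk'], 350)
        ON CONFLICT (name) DO NOTHING;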

  2. In an RDBMS, unique indexes are the way to go to detect duplication. They have the following benefits: (1) no programmer involvement, (2) guaranteed to work since the DB does it, and (3) they keep your code clean and easier to maintain. One thing you need to worry about is case sensitivity. You need two things to get this to work: (A) have NOT NULL on each participating column of the index and (B) ensure that the index is built with consistent case. You can use upper() or lower() (as you like) in the index create statement. This key will now be unique.
    The problem with this approach is that it breaks first normal form. However, in my opinion it is a harmless trade-off, since it does not lead to anomalies as long as you use the concatenated column in FKs and not parts of it. Another point to be aware of is that long text indexes are not as efficient if you have huge tables; I believe that is rarely a concern in your case.
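
    A hedged sketch of such an index on the original recipe table (the index name is made up, and lower() handles the case sensitivity mentioned above):

        -- Case-insensitive unique index over the columns that define a duplicate.
        CREATE UNIQUE INDEX recipe_identity_idx
            ON recipe (lower(name), ingredients);

        -- This insert raises a unique_violation if an equivalent row
        -- (ignoring the case of the name) already exists.
        INSERT INTO recipe (name, date, ingredients, calories)
        VALUES ('PANCAKES', CURRENT_DATE, ARRAY['flour','egg','milk'], 350);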

  3. Assuming that you don't need to keep the recipe name unique, but rather want to identify whether the recipe name and the ingredients together match an existing row, you have two options.

    1. If your ingredients list is not too long

    Create the recipes table and create an index on the name and ingredients columns.

    You can search the indexed table to see whether a recipe already exists and then add only the ones that don't. Recent Postgres versions can also combine multiple indexes to make such queries faster.

    2. If your ingredients list is larger

    Create a recipes table, an ingredients table, and a recipe_ingredients table for the many-to-many relationship.

    The recipe_ingredients table can be built with either foreign keys or regular indexes; it's up to your preference (see the sketch at the end of this answer).

    If you are using indexes properly, searching against the database before you insert a new row will not be a hurdle.

    If you just want to keep the name unique and do not care about ingredients, then just make the name field unique and that will suffice.
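
    For the second option above, a minimal sketch of the normalized layout might look like this (table and column names are just assumptions for illustration):

        CREATE TABLE recipes (
            id       bigint  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            name     text    NOT NULL UNIQUE,
            date     date    NOT NULL,
            calories integer NOT NULL
        );

        CREATE TABLE ingredients (
            id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            name text   NOT NULL UNIQUE
        );

        -- Junction table for the many-to-many relationship; the composite PK
        -- doubles as the index used when checking for existing combinations.
        CREATE TABLE recipe_ingredients (
            recipe_id     bigint NOT NULL REFERENCES recipes (id),
            ingredient_id bigint NOT NULL REFERENCES ingredients (id),
            PRIMARY KEY (recipe_id, ingredient_id)
        );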
