skip to Main Content

What is better, performance and convention wise, for checking if a value exists when inserting into a database.

I want to be able to insert a unique "random" string in with every record as a pseudo id of sorts. My first thought was to query the database for the string to determine if it exists first, but at a scale of 200,000,000 records at the moment, this has become quite CPU intensive.
This random string is generated in Python and passed to Postgres

(Records are processed at 100s a second, so you end up with many reads on the table per second)

My second thought was to just allow the unique constraint on the column to fail and if it fails, regenrate a new random string. Failures are not expected to be frequent, but it seems kind of dirty to do it that way.

Does anyone with more experience know what would be the best, fastest, or most correct way to do this?

2

Answers


  1. allow the unique constraint on the column to fail and if it fails, regenrate a new random string. Failures are not expected to be frequent, but it seems kind of dirty to do it that way.

    I don’t think there’s anything inherently dirty about unique violation exception handling, or using exception handling to accommodate your app logic in general. Things like explicit locks (select..for update nowait, pg_try_advisory_lock()) and serializable commits are even designed so that you catch and re-try.

    That’s not to say I’m completely disregarding the DontUseExceptionsForFlowControl idea – my point here is it has its place when you communicate between two separate systems, your app and the db.

    That being said, you might consider insert..on conflict do nothing and reading the status/command tag/label of the result which should tell you how many rows were successfully inserted – if it says 0, generate a new one and try again. Logically, it’s the same as handling the exception, but it saves the overhead of an actual exception so it should perform a bit better.
    If you don’t want to deal with the command tag, you can count(*) rows you get from insert..returning to get the row count, in an actual result row:
    demo at db<>fiddle

    with inserted as (
      insert into your_table (your_unique_column)
      values (value_from_the_app)
      on conflict do nothing
      returning 1)
    select count(*) from inserted;
    

    As @Belayer pointed out, probing with select will effectively cost you pretty much the same as letting an insert fail (except the cost of raising the exception, if you don’t just read the 0 off the command tag).
    If the select finds no conflicting value and gives you the green light to run your insert, that subsequent insert will re-check it anyways – so you might as well let the insert handle both.

    Login or Signup to reply.
  2. What is better, performance wise…

    If you check then insert PostgreSQL has to check twice. Once when you select, and once to check for uniqueness when you insert. However, this is using an index so the size of the table does not matter. And you’re doing two queries vs one, queries come with a lot of overhead.

    But, there’s a more important consideration than performance. Read on.

    My second thought was to just allow the unique constraint on the column to fail and if it fails, regenrate a new random string. Failures are not expected to be frequent, but it seems kind of dirty to do it that way.

    That’s exactly how you do it, except you use a random number big enough to basically be impossible to have a conflict.


    Querying then inserting invites a race condition. Consider what happens when two processes try to query and insert the same ID at the same time.

    Process 1 Process 2
    select x
    select x
    not found
    not found
    insert x
    insert x
    success
    conflict

    It’s better to simply insert the value, let the database check for uniqueness, and try again if it fails with a unique constraint violation. You want checking and inserting to be one single operation so nobody else can do anything between checking and inserting.


    Incidentally, if you want a unique identifier that’s all but guaranteed not to conflict, use a UUIDv4. This is basically a 122 bit random number or 5316911983139663491615228241121378304 possibilities. It’s so large a conflict is essentially impossible.

    PostgreSQL has a UUID type and a function to make a UUIDv4. Use that for your primary key, id uuid primary key default gen_random_uuid().

    This is better than inserting a random number from Python, now anyone working with the database can safely insert rows.

    Demonstration.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search