What is better, performance- and convention-wise, for checking whether a value exists when inserting into a database?
I want to be able to insert a unique "random" string with every record as a pseudo-ID of sorts. My first thought was to query the database for the string to determine whether it exists first, but at the current scale of 200,000,000 records, this has become quite CPU-intensive.
This random string is generated in Python and passed to Postgres.
(Records are processed at hundreds per second, so you end up with many reads on the table per second.)
My second thought was to just let the unique constraint on the column fail and, if it fails, regenerate a new random string. Failures are not expected to be frequent, but it seems kind of dirty to do it that way.
Does anyone with more experience know what would be the best, fastest, or most correct way to do this?
2 Answers
I don’t think there’s anything inherently dirty about unique violation exception handling, or about using exception handling to accommodate your app logic in general. Things like explicit locks (`select..for update nowait`, `pg_try_advisory_lock()`) and `serializable` commits are even designed so that you catch and re-try. That’s not to say I’m completely disregarding the DontUseExceptionsForFlowControl idea – my point here is that it has its place when you communicate between two separate systems, your app and the db.
That being said, you might consider `insert..on conflict do nothing` and reading the status/command tag/label of the result, which should tell you how many rows were successfully inserted – if it says `0`, generate a new one and try again. Logically, it’s the same as handling the exception, but it saves the overhead of an actual exception, so it should perform a bit better.

If you don’t want to deal with the command tag, you can `count(*)` the rows you get from `insert..returning` to get the row count in an actual result row: demo at db<>fiddle
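On the application side, the retry loop around that `insert..on conflict do nothing` could look like the following sketch. The `try_insert` callable is a hypothetical stand-in for your real database call (execute the `insert`, return the affected-row count that the command tag reports); here it is simulated with an in-memory set so the logic is visible on its own:

```python
import secrets

def insert_with_retry(try_insert, max_attempts=5):
    """Generate a random token and insert it, retrying on conflict.

    try_insert(token) stands in for running
    INSERT .. ON CONFLICT DO NOTHING and returning how many rows
    were actually inserted (the count the command tag reports).
    """
    for _ in range(max_attempts):
        token = secrets.token_hex(16)  # 32 hex chars, 128 random bits
        if try_insert(token) == 1:     # 1 row inserted: no conflict
            return token
        # 0 rows inserted: token already existed, loop and try again
    raise RuntimeError(f"no free token after {max_attempts} attempts")

# Simulated "table" with a unique column, for demonstration only
existing = {"deadbeef"}

def fake_insert(token):
    if token in existing:
        return 0
    existing.add(token)
    return 1

token = insert_with_retry(fake_insert)
```

With 128 random bits per token, the retry branch should almost never fire; the loop is there for correctness, not as a hot path.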
As @Belayer pointed out, probing with a `select` will effectively cost you pretty much the same as letting an `insert` fail (except the cost of raising the exception, if you don’t just read the `0` off the command tag). If the `select` finds no conflicting value and gives you the green light to run your `insert`, that subsequent `insert` will re-check it anyway – so you might as well let the `insert` handle both.
handle both.If you check then insert PostgreSQL has to check twice. Once when you select, and once to check for uniqueness when you insert. However, this is using an index so the size of the table does not matter. And you’re doing two queries vs one, queries come with a lot of overhead.
But, there’s a more important consideration than performance. Read on.
That’s exactly how you do it, except you use a random number big enough that a conflict is basically impossible.
Querying then inserting invites a race condition. Consider what happens when two processes try to query and insert the same ID at the same time.
It’s better to simply insert the value, let the database check for uniqueness, and try again if it fails with a unique constraint violation. You want checking and inserting to be one single operation so nobody else can do anything between checking and inserting.
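The race is easy to reproduce deterministically by writing out one possible interleaving by hand. In this toy model a plain Python list plays the role of the table (no real database code; there is no unique constraint to save you):

```python
table = []          # stands in for the table
token = "abc123"

# Process A runs its SELECT: the token is not there yet
a_sees_conflict = token in table
# Process B runs its SELECT before A inserts: it also sees nothing
b_sees_conflict = token in table

if not a_sees_conflict:
    table.append(token)   # A inserts
if not b_sees_conflict:
    table.append(token)   # B inserts the same value: a duplicate

assert table.count(token) == 2  # both checks passed, uniqueness is broken
```

A unique constraint closes exactly this window: the second `insert` fails instead of silently creating the duplicate.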
Incidentally, if you want a unique identifier that’s all but guaranteed not to conflict, use a UUIDv4. This is basically a 122-bit random number, or 5316911983139663491615228241121378304 possibilities. It’s so large a conflict is essentially impossible.
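For illustration, if you do want to generate the value client-side, Python’s standard library produces one directly:

```python
import uuid

new_id = uuid.uuid4()   # version-4 UUID: 122 random bits
print(new_id)           # canonical 8-4-4-4-12 hex form, different every run
assert new_id.version == 4
assert len(str(new_id)) == 36
```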
PostgreSQL has a UUID type and a function to generate a UUIDv4. Use that for your primary key: `id uuid primary key default gen_random_uuid()`. This is better than inserting a random number from Python: now anyone working with the database can safely insert rows.
Demonstration.