The problem
There is currently a system that reads unique identifiers (`id`) and their attached datasets from an SQL table into an unordered map. These datasets used to start with `id` 1, but datasets are added and deleted roughly every ~10 ms. Note that not all datasets are loaded into RAM at all times.
When the program starts, it runs `SELECT MAX(id)` against the database and from then on increments a counter variable, which it uses as the `id` for any added dataset. The `id`s of deleted datasets are never used again. This inevitably leads to gaps in the `id` sequence and, someday, to an overflow.
I am looking for a performant way to fill the lowest gap in the `id` sequence while remaining consistent with both the only partly loaded SQL table and the program’s memory. The data in memory might be saved back to the SQL table immediately, or not for a few minutes.
The idea
One solution I came up with is to keep running expensive queries at runtime, one for every created dataset, to find the lowest gap in the SQL table, check whether that `id` already exists as a value in the unordered map, and fall back to the counter variable to avoid endlessly querying for a free `id`. But this works for exactly one `id`: it is still free in SQL yet already taken in the unordered map until the memory is saved again.
I’ve also brainstormed querying for a list of free `id`s into a vector and using those as `id`s for new datasets until the vector is empty, then (or periodically) running a new query for more. But I can’t think of a query that finds X gaps in a table whose `id` column may or may not start at 1.
I’ve come across How do I find a "gap" in running counter with SQL?, but I have two issues with the top answer: apparently it only finds a single gap, while I would need many, and I can’t make sense of the `mo` and `mi` it uses.
Let’s say I have a table named `userdata` with the columns `id` and `dataset`, both 32-bit signed `INT`. How could I find an array of gaps in the `id` column? As an example, I want the query to return 2,3,4,5,8 when the `id`s in the table are 1,6,7,9.
Any pointers to an approach or a possible solution are appreciated as well.
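One possible shape for such a query, as a sketch (assuming MySQL 8+, which has window functions and recursive CTEs; untested elsewhere):

```sql
-- Option 1: report each gap as a range, one row per gap.
SELECT prev_id + 1 AS gap_start,
       id - 1      AS gap_end
FROM (
    SELECT id, LAG(id, 1, 0) OVER (ORDER BY id) AS prev_id
    FROM userdata
) AS t
WHERE id > prev_id + 1;

-- Option 2: list every missing id individually
-- (returns 2,3,4,5,8 for ids 1,6,7,9). The recursion runs up to
-- MAX(id), so it may hit cte_max_recursion_depth on large tables.
WITH RECURSIVE seq (n, max_id) AS (
    SELECT 1, (SELECT MAX(id) FROM userdata)
    UNION ALL
    SELECT n + 1, max_id FROM seq WHERE n < max_id
)
SELECT n
FROM seq
WHERE n NOT IN (SELECT id FROM userdata)
ORDER BY n;
```

Option 1 scales better, since it returns one row per gap rather than one row per missing `id`.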
2 Answers
If you have one database change every 10 ms, then you have 100 changes per second. A signed `int` can hold about 2,147,483,648 values, or 21,474,836 seconds’ worth, which is about 8 months. After that, no new ID is available.

The first solution is using a 64-bit type instead of an `int`. At 100 IDs per second, a signed 64-bit counter lasts roughly 2.9 billion years, which seems enough 🙂

Another solution is having a vector with all possible IDs, storing a `bool` per ID (used/unused). Requesting a new ID is done by travelling the vector up to the first position marked as unused. This vector uses a lot of RAM, although `std::vector` has a specialized version for `bool` that packs the flags into bits and requires far less RAM.
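Something like this, as a rough sketch (the class name `IdPool` and the fixed capacity are illustrative, not part of any existing API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitmap of all possible IDs; std::vector<bool> packs one bit per ID.
class IdPool {
    std::vector<bool> used;  // used[i] == true means ID i is taken
public:
    // maxIds must be at least 1; slot 0 is reserved so IDs start at 1.
    explicit IdPool(std::size_t maxIds) : used(maxIds, false) {
        used[0] = true;
    }
    // Walk to the first unused slot, i.e. the lowest gap; -1 if exhausted.
    std::int32_t acquire() {
        for (std::size_t i = 1; i < used.size(); ++i) {
            if (!used[i]) {
                used[i] = true;
                return static_cast<std::int32_t>(i);
            }
        }
        return -1;
    }
    // Mark an ID as reusable again.
    void release(std::int32_t id) {
        used[static_cast<std::size_t>(id)] = false;
    }
};
```

The linear scan is O(n) in the worst case, but the bit-packed vector is cache-friendly, so in practice it stays fast for moderate ID ranges.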
A third solution is handling a linked list (maybe doubly linked) which stores the deleted (read: re-usable) IDs.

When a new ID is requested, the list offers its head, or the size of the table if the list is empty.

When a dataset is deleted, its ID is inserted at its proper place in the list, so the list is always sorted.

When an ID is reused, it is deleted from the list.

Deleting the last record in the table may also make the last nodes in the list useless (the case ID > table size), so they can be deleted as well. That’s why I suggested using a doubly linked list: the last nodes can be removed quickly.

So this list is frequently running `new` and `delete` on its nodes, and also frequently walking up and down (hence doubly linked) to insert new nodes. This is a bit slow, but I expect the list not to grow too large, so the time required should not be that bad.

Also notice that this list gives you exactly the array of gaps you’re asking for; a sketch follows below.
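A rough sketch of that free list, using `std::list` (a doubly linked list) from the standard library; the names `FreeList` and `nextId` are illustrative:

```cpp
#include <cstdint>
#include <list>

class FreeList {
    std::list<std::int32_t> gaps;  // re-usable IDs, kept in ascending order
    std::int32_t nextId = 1;       // next never-used ID (the counter variable)
public:
    // Reuse the lowest gap first; otherwise extend the sequence.
    std::int32_t acquire() {
        if (!gaps.empty()) {
            std::int32_t id = gaps.front();
            gaps.pop_front();
            return id;
        }
        return nextId++;
    }
    // A deleted dataset returns its ID here, inserted in sorted position
    // so the head of the list is always the lowest gap.
    void release(std::int32_t id) {
        auto it = gaps.begin();
        while (it != gaps.end() && *it < id) ++it;
        gaps.insert(it, id);
    }
};
```

Iterating `gaps` front to back also yields exactly the sorted array of gaps the question asks for.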
Regarding `SELECT MAX(id)`: this technique will not be reliable if multiple concurrent processes are using it. Let `AUTO_INCREMENT` do the work for you; it is faster, more efficient, and safe for concurrency (a sketch follows below).

Regarding the "gaps in the id": so what?
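A minimal sketch of that setup, assuming MySQL and the question’s `userdata` table (the column definitions are illustrative):

```sql
CREATE TABLE userdata (
    id      INT NOT NULL AUTO_INCREMENT,  -- the database assigns ids, safe under concurrency
    dataset INT NOT NULL,
    PRIMARY KEY (id)
);

-- Insert without an explicit id; each session reads back its own value.
INSERT INTO userdata (dataset) VALUES (42);
SELECT LAST_INSERT_ID();
```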
Plan A: Use `BIGINT`; that has a limit so large that it would take many centuries to reach, even with the fastest machine today (a sketch follows after Plan C).

Plan B: Let’s discuss the dataflow. There may be other techniques to avoid the gaps or to deal with them. For example, here is code to avoid "burning ids" for batch normalization.

Plan C: As you say, filling in the gaps is too costly to even consider.
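For Plan A, a sketch assuming MySQL and the table definition above:

```sql
-- Widen the id column so overflow stops being a practical concern;
-- a signed BIGINT tops out at 9,223,372,036,854,775,807.
ALTER TABLE userdata
    MODIFY id BIGINT NOT NULL AUTO_INCREMENT;
```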