
The problem
There is an existing system that reads unique identifiers (id) and their attached datasets from an SQL table into an unordered map. These datasets originally started with id 1, but datasets are added and deleted roughly every 10 ms. Note that not all datasets are loaded into RAM at all times.

When the program starts, it runs SELECT MAX(id) against the database and from then on adds +1 to a counter variable, which it uses as the id for any added dataset. The ids of deleted datasets are never used again. This inevitably leads to gaps in the id sequence and, someday, to an overflow.

I am looking for a performant way to fill the lowest gap in the id sequence while remaining consistent with both the only partly loaded SQL table and the program’s memory. The data in memory might be saved to the SQL table immediately, or only after a few minutes.

The idea
One solution I came up with is to run an expensive query at runtime for every created dataset to find the lowest gap in the SQL table, check whether that id already exists as a value in the unordered map, and fall back to the counter variable to avoid endless queries for a free id. But this works for exactly one id: it is still free in SQL but taken in the unordered map until the memory is saved again.

I’ve also brainstormed querying lists of free ids into a vector and using them as ids for new datasets until the vector is empty, then (or periodically) querying for more. But I can’t think of a query that finds X gaps in a table whose id column may or may not start at 1.

I’ve come across How do I find a "gap" in running counter with SQL?, but I have two issues with the top answer: apparently it only finds a single gap, while I need many, and I can’t make sense of the mo and mi aliases it uses.

Let’s say I have a table named userdata with the columns id and dataset, both 32-bit signed INT. How could I find an array of gaps in the id column? As an example, I want the query to return 2,3,4,5,8 when the ids in the table are 1,6,7,9.
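For reference, one way to get the gap ranges is a window-function query. This is only a sketch, assuming MySQL 8+ or MariaDB 10.2+ (where LEAD is available) and the userdata table named above; it is not tested against your schema:

```sql
-- Sketch: one row per gap range between consecutive existing ids.
SELECT id + 1      AS gap_start,
       next_id - 1 AS gap_end
FROM (
    SELECT id, LEAD(id) OVER (ORDER BY id) AS next_id
    FROM userdata
) AS t
WHERE next_id > id + 1;
```

For the ids 1,6,7,9 this returns the ranges (2,5) and (8,8), i.e. the gaps 2,3,4,5 and 8. Note that it cannot report a gap below MIN(id), so a table starting at, say, 3 would need an extra check against 1.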

Any pointers to an approach with a possible solution are appreciated as well.

2 Answers


  1. If you have one database change every 10 ms, then you have 100 changes per second. A signed int can hold about 2,147,483,648 values, which at that rate lasts about 21,474,836 seconds, or roughly 8 months. After that, no new id is available.

    The first solution is to use a 64-bit type instead of an int. A signed 64-bit counter at 100 ids per second lasts roughly 2.9 billion years, which seems enough 🙂


    Another solution is a vector over all possible ids, storing one bool per id (used/unused). Requesting a new id means walking the vector to the first position marked unused.

    This vector uses a lot of RAM, although std::vector has a specialized version for bool that packs the flags into bits, reducing the footprint considerably.
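    A minimal sketch of this bitmap approach, assuming a small illustrative capacity (covering all positive 32-bit signed ids would still need about 256 MiB even bit-packed); the class and method names are made up for illustration:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Bitmap of used ids: std::vector<bool> packs one bit per entry.
    class IdPool {
        std::vector<bool> used;  // one bit per possible id
    public:
        explicit IdPool(std::size_t capacity) : used(capacity, false) {}

        // Linear scan for the first unused id: O(capacity) worst case.
        std::int64_t acquire() {
            for (std::size_t i = 0; i < used.size(); ++i) {
                if (!used[i]) {
                    used[i] = true;
                    return static_cast<std::int64_t>(i) + 1;  // ids start at 1
                }
            }
            return -1;  // pool exhausted
        }

        void release(std::int64_t id) {
            used[static_cast<std::size_t>(id) - 1] = false;
        }
    };

    int main() {
        IdPool pool(16);              // illustrative capacity
        assert(pool.acquire() == 1);
        assert(pool.acquire() == 2);
        pool.release(1);              // id 1 becomes a gap
        assert(pool.acquire() == 1);  // lowest gap is handed out first
        std::cout << "ok\n";
    }
    ```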


    A third solution is to maintain a linked list (possibly doubly linked) that stores the deleted (read: reusable) ids.

    When a new ID is requested, the list offers its head, or the size of the table if the list is empty.

    When a dataset is deleted, its id is inserted at its proper position, so the list is always sorted.

    When an id is reused, it is removed from the list.

    Deleting the last record in the table may also make the last nodes of the list useless (any node with id > table size can be dropped). That’s why I suggested a doubly linked list: the last nodes can be deleted quickly from the tail.

    So this list frequently calls new and delete on its nodes, and is also frequently traversed (in both directions, hence doubly linked) to insert new nodes in sorted position.

    This is a bit slow, but the list should not grow too large, so the time required is acceptable.

    Also note that this list directly gives you the array of gaps you’re asking for.
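    The free list above can be sketched with std::set instead of a hand-rolled doubly linked list: it stays sorted on its own, begin() is the smallest reusable id, and insert/erase are O(log n). The names are illustrative, not taken from the question’s code:

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <iostream>
    #include <set>

    class IdAllocator {
        std::set<std::int64_t> free_ids;  // deleted, reusable ids (kept sorted)
        std::int64_t next_id = 1;         // next never-used id
    public:
        std::int64_t acquire() {
            if (!free_ids.empty()) {                  // fill the lowest gap first
                std::int64_t id = *free_ids.begin();
                free_ids.erase(free_ids.begin());
                return id;
            }
            return next_id++;
        }

        void release(std::int64_t id) {
            if (id == next_id - 1)
                --next_id;           // deleted the last id: shrink instead of listing
            else
                free_ids.insert(id); // record the gap
        }
    };

    int main() {
        IdAllocator alloc;
        assert(alloc.acquire() == 1);
        assert(alloc.acquire() == 2);
        assert(alloc.acquire() == 3);
        alloc.release(2);               // creates the gap {2}
        assert(alloc.acquire() == 2);   // gap is refilled before new ids
        assert(alloc.acquire() == 4);
        std::cout << "ok\n";
    }
    ```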

  2. SELECT MAX(id): this technique is not reliable when multiple concurrent processes use it. Let AUTO_INCREMENT do the work for you; it is faster, more efficient, and safe for concurrency.

    "gaps in the id" — So?

    Plan A: Use BIGINT; its limit is so large that it would take many centuries to reach, even with the fastest machine today.

    Plan B: Let’s discuss the dataflow. There may be other techniques to avoid the gaps or deal with them. For example, here is code to avoid "burning ids" for batch Normalization

    Plan C: As you say, filling in the gaps is too costly to even consider.
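    A sketch of Plans A and B combined, in MySQL syntax (the table definition is illustrative, not taken from the question):

    ```sql
    -- Let the database hand out ids; BIGINT postpones overflow for eons.
    CREATE TABLE userdata (
        id BIGINT NOT NULL AUTO_INCREMENT,
        dataset INT NOT NULL,
        PRIMARY KEY (id)
    );

    INSERT INTO userdata (dataset) VALUES (42);  -- no id supplied
    SELECT LAST_INSERT_ID();                     -- id the server just assigned
    ```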
