I am still a novice to SQL, using PostgreSQL 16.2.
create table users(
    user_id serial primary key
);
create table task_pool(
    task_id serial primary key,
    order_id integer,
    clicker_id integer
);
create index idx_order_id on task_pool (order_id);
create table task_dispersal(
    id serial primary key,
    order_id integer,
    to_disp integer
);
There are 100,000 rows in users (so 100,000 user_ids);
1,000 rows in task_dispersal, each with to_disp = 10;
and 0 rows in task_pool to start.
I want to iterate through the rows in task_dispersal. For each row:

- Get to_disp random user_ids from the users table that are not already in task_pool with the iterated row's related order_id.

For example, if a row in task_dispersal had order_id 1 and to_disp 10, it would check task_pool for all rows that have order_id 1, get the user_ids from those rows, filter them out of the users table, and then choose 10 random remaining user_ids to insert into task_pool.
I am using this query:
INSERT into task_pool(clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us
    WHERE Us.user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY RANDOM()
    LIMIT TD.to_disp
) U;
It works, and is relatively fast, but I am still hoping there is a way to optimize it: it takes 26 seconds on my (very weak) hosted database, and I am aiming for a few seconds.
Here is the query plan:
Insert on public.task_pool (cost=6586.23..674875.51 rows=0 width=0) (actual time=26540.859..26540.860 rows=0 loops=1)
Buffers: shared hit=32262 dirtied=17 written=18
-> Nested Loop (cost=6586.23..674875.51 rows=500000 width=24) (actual time=1929.910..26299.971 rows=1000 loops=1)
Output: nextval('task_pool_task_id_seq'::regclass), td.order_id, us.user_id, CURRENT_TIMESTAMP, 0
Buffers: shared hit=28619
-> Seq Scan on public.task_dispersal td (cost=0.00..3.00 rows=100 width=8) (actual time=0.007..0.140 rows=100 loops=1)
Output: td.id, td.order_id, td.reset_time, td.disp_remaining, td.daily_disp, td.disp_interval, td.next_disp_time, td.expired_tasks, td.to_disp
Buffers: shared hit=2
-> Limit (cost=6586.23..6598.73 rows=5000 width=12) (actual time=249.542..249.543 rows=10 loops=100)
Output: us.user_id, (random())
Buffers: shared hit=27607
-> Sort (cost=6586.23..6711.23 rows=50000 width=12) (actual time=249.538..249.539 rows=10 loops=100)
Output: us.user_id, (random())
Sort Key: (random())
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=27607
-> Index Only Scan using users_pkey on public.users us (cost=2.96..2181.57 rows=50000 width=12) (actual time=3.846..138.345 rows=100000 loops=100)
Output: us.user_id, random()
Filter: (NOT (hashed SubPlan 1))
Heap Fetches: 0
Buffers: shared hit=27607
SubPlan 1
-> Index Scan using idx_order_id on public.task_pool task_pool_1 (cost=0.15..2.63 rows=16 width=4) (actual time=0.003..0.003 rows=0 loops=100)
Output: task_pool_1.clicker_id
Index Cond: (task_pool_1.order_id = td.order_id)
Buffers: shared hit=106
Settings: effective_io_concurrency = '200', random_page_cost = '1.1', effective_cache_size = '192MB', max_parallel_workers = '1', max_parallel_workers_per_gather = '1', work_mem = '1703kB'
Query Identifier: 3069500943293269296
Planning:
Buffers: shared hit=18 read=2
Planning Time: 0.275 ms
JIT:
Functions: 22
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.054 ms, Inlining 135.482 ms, Optimization 1098.077 ms, Emission 399.930 ms, Total 1634.542 ms
Execution Time: 26541.848 ms
I don't really know how to read the query plan, but I am pretty sure the most costly part is the ORDER BY RANDOM() over the user_ids that are not yet in task_pool with the given order_id? (The plan seems to confirm this: the Index Only Scan on users returns 100,000 rows per loop, 100 times over, and each loop is fully sorted by random() just to keep the top 10.)
I don't think I can use TABLESAMPLE BERNOULLI to get random users from the users table, though: some or all of the sampled rows might already be in task_pool, so it would insert fewer than the required amount (to_disp) of user_ids. (This happens fairly often after performing the query a few times.)
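One possible workaround (a sketch, untested on this schema; the 1% sample fraction is my assumption) is to oversample: sample far more rows than to_disp, filter out the already-used ids, and only then apply LIMIT. With at most a handful of used ids per order_id against 100,000 users, a shortfall becomes very unlikely, though TABLESAMPLE still cannot strictly guarantee to_disp rows:

```sql
INSERT INTO task_pool (clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us TABLESAMPLE BERNOULLI (1)  -- ~1,000 of 100,000 rows
    WHERE Us.user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY RANDOM()  -- now sorts only ~1,000 sampled rows, not 100,000
    LIMIT TD.to_disp
) U;
```

The ORDER BY RANDOM() is kept because BERNOULLI returns sampled rows in physical order, so taking the first 10 without it would bias toward rows early in the table.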
If I could use TABLESAMPLE BERNOULLI on the RESULT of
WHERE user_id NOT IN (
    SELECT clicker_id
    FROM task_pool
    WHERE order_id = TD.order_id
)
that would be awesome; unfortunately, it can only be applied to tables, temp tables, or materialized views, and can't be used on the result of a WHERE clause.
I have tried making temp tables of the unused user_ids per row and then using TABLESAMPLE BERNOULLI, but that was very slow when each row has up to 100,000 unused user_ids.
At this point, I don't think there is any optimization left other than upgrading the server. Any better ideas?
2 Answers
A random rearrangement to avoid LATERAL may help a bit.

An INDEX on task_pool (order_id, clicker_id).

You may try to remove one subquery:
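The answer's actual query was not captured here. As one reading of those hints (my sketch, not the answerer's original code), the composite index plus a NOT EXISTS rewrite of the NOT IN subquery might look like this; the index name is my invention:

```sql
-- Composite index so membership of (order_id, clicker_id) can be
-- checked directly from the index:
CREATE INDEX idx_order_clicker ON task_pool (order_id, clicker_id);

-- NOT EXISTS instead of NOT IN: PostgreSQL can plan this as an
-- anti-join, and it also behaves sanely if clicker_id is ever NULL.
INSERT INTO task_pool (clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us
    WHERE NOT EXISTS (
        SELECT 1
        FROM task_pool TP
        WHERE TP.order_id = TD.order_id
          AND TP.clicker_id = Us.user_id
    )
    ORDER BY RANDOM()
    LIMIT TD.to_disp
) U;
```

This still keeps the LATERAL join; the LATERAL-free rearrangement the answerer alludes to was not preserved in this copy.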