I am still a novice to SQL, using PostgreSQL 16.2.
create table users(
    user_id serial primary key
);
create table task_pool(
    task_id serial primary key,
    order_id integer,
    clicker_id integer
);
create index idx_order_id on task_pool (order_id);
create table task_dispersal(
    id serial primary key,
    order_id integer,
    to_disp integer
);
There are 100,000 rows in users (so 100,000 user_ids);
1,000 rows in task_dispersal, each with to_disp = 10;
and 0 rows in task_pool to start.
I want to iterate through the rows in task_dispersal. For each row:

- Get to_disp random user_ids from the users table that are not already in task_pool with the iterated row's related order_id.

For example, if a row in task_dispersal had order_id 1 and to_disp 10, it would check task_pool for all rows that have order_id 1, get the user_ids from those rows, filter them out of the users table, and then choose 10 random remaining user_ids to insert into task_pool.
I am using this query:
INSERT into task_pool(clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us
    WHERE Us.user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY RANDOM()
    LIMIT TD.to_disp
) U;
It works, and is relatively fast, but I am still hoping there is a way to optimize it: it takes 26 seconds on my (very weak) hosted database, and I am aiming for a few seconds.
Here is the query plan:
Insert on public.task_pool (cost=6586.23..674875.51 rows=0 width=0) (actual time=26540.859..26540.860 rows=0 loops=1)
Buffers: shared hit=32262 dirtied=17 written=18
-> Nested Loop (cost=6586.23..674875.51 rows=500000 width=24) (actual time=1929.910..26299.971 rows=1000 loops=1)
Output: nextval('task_pool_task_id_seq'::regclass), td.order_id, us.user_id, CURRENT_TIMESTAMP, 0
Buffers: shared hit=28619
-> Seq Scan on public.task_dispersal td (cost=0.00..3.00 rows=100 width=8) (actual time=0.007..0.140 rows=100 loops=1)
Output: td.id, td.order_id, td.reset_time, td.disp_remaining, td.daily_disp, td.disp_interval, td.next_disp_time, td.expired_tasks, td.to_disp
Buffers: shared hit=2
-> Limit (cost=6586.23..6598.73 rows=5000 width=12) (actual time=249.542..249.543 rows=10 loops=100)
Output: us.user_id, (random())
Buffers: shared hit=27607
-> Sort (cost=6586.23..6711.23 rows=50000 width=12) (actual time=249.538..249.539 rows=10 loops=100)
Output: us.user_id, (random())
Sort Key: (random())
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=27607
-> Index Only Scan using users_pkey on public.users us (cost=2.96..2181.57 rows=50000 width=12) (actual time=3.846..138.345 rows=100000 loops=100)
Output: us.user_id, random()
Filter: (NOT (hashed SubPlan 1))
Heap Fetches: 0
Buffers: shared hit=27607
SubPlan 1
-> Index Scan using idx_order_id on public.task_pool task_pool_1 (cost=0.15..2.63 rows=16 width=4) (actual time=0.003..0.003 rows=0 loops=100)
Output: task_pool_1.clicker_id
Index Cond: (task_pool_1.order_id = td.order_id)
Buffers: shared hit=106
Settings: effective_io_concurrency = '200', random_page_cost = '1.1', effective_cache_size = '192MB', max_parallel_workers = '1', max_parallel_workers_per_gather = '1', work_mem = '1703kB'
Query Identifier: 3069500943293269296
Planning:
Buffers: shared hit=18 read=2
Planning Time: 0.275 ms
JIT:
Functions: 22
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.054 ms, Inlining 135.482 ms, Optimization 1098.077 ms, Emission 399.930 ms, Total 1634.542 ms
Execution Time: 26541.848 ms
I don't really know how to read the query plan, but I am pretty sure the most costly part is the ORDER BY RANDOM() over the user_ids that are not yet in task_pool with the given order_id? (The plan seems to confirm this: the Index Only Scan on users returns 100,000 rows per loop, 100 times over, and each loop is fully sorted by random() just to keep the top 10.)
I don't think I can use TABLESAMPLE BERNOULLI to get random users from the users table, though: some or all of the sampled rows might already be in task_pool, so it would insert fewer than the required amount (to_disp) of user_ids. (This happens fairly often after performing the query a few times.)
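One possible workaround (a sketch, untested on this schema; the 1% sample fraction is my assumption) is to oversample: sample far more rows than to_disp, filter out the already-used ids, and only then apply LIMIT. With at most a handful of used ids per order_id against 100,000 users, a shortfall becomes very unlikely, though TABLESAMPLE still cannot strictly guarantee to_disp rows:

```sql
INSERT INTO task_pool (clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us TABLESAMPLE BERNOULLI (1)  -- ~1,000 of 100,000 rows
    WHERE Us.user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY RANDOM()  -- now sorts only ~1,000 sampled rows, not 100,000
    LIMIT TD.to_disp
) U;
```

The ORDER BY RANDOM() is kept because BERNOULLI returns sampled rows in physical order, so taking the first 10 without it would bias toward rows early in the table.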
If I could use TABLESAMPLE BERNOULLI on the RESULT of
WHERE user_id NOT IN (
    SELECT clicker_id
    FROM task_pool
    WHERE order_id = TD.order_id
)
that would be awesome; unfortunately, it can only be applied to tables, temp tables, or materialized views, and can't be used on the result of a WHERE clause.
I have tried making temp tables of the unused user_ids per row and then using TABLESAMPLE BERNOULLI, but that was very slow when each row has up to 100,000 unused user_ids.
At this point, I don't think there is any optimization left other than upgrading the server. Any better ideas?
2 Answers
A random rearrangement to avoid LATERAL may help a bit.

An INDEX on task_pool (order_id, clicker_id).

You may try to remove one subquery:
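The answer's actual query was not captured here. As one reading of those hints (my sketch, not the answerer's original code), the composite index plus a NOT EXISTS rewrite of the NOT IN subquery might look like this; the index name is my invention:

```sql
-- Composite index so membership of (order_id, clicker_id) can be
-- checked directly from the index:
CREATE INDEX idx_order_clicker ON task_pool (order_id, clicker_id);

-- NOT EXISTS instead of NOT IN: PostgreSQL can plan this as an
-- anti-join, and it also behaves sanely if clicker_id is ever NULL.
INSERT INTO task_pool (clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us
    WHERE NOT EXISTS (
        SELECT 1
        FROM task_pool TP
        WHERE TP.order_id = TD.order_id
          AND TP.clicker_id = Us.user_id
    )
    ORDER BY RANDOM()
    LIMIT TD.to_disp
) U;
```

This still keeps the LATERAL join; the LATERAL-free rearrangement the answerer alludes to was not preserved in this copy.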