I have a table of 4.7 billion records on which I want to perform a GROUP BY count query in Postgres, using pgAdmin 4. Obviously this is going to take a lot of time, and I want to speed the process up as much as possible.
Example query:

UPDATE Target_Table TT
SET qty = AA.cnt
FROM (
    SELECT col_1, count(col_1) AS cnt
    FROM Very_Large_Table
    GROUP BY col_1
) AS AA
WHERE AA.col_1 = TT.col_1
  AND AA.cnt <> TT.qty;
I have freshly created and analyzed indexes on the column col_1, yet the process still takes 2 hours. I tried parallel hints by adding /*+ PARALLEL(very_large_table 6) */ to the SELECT, but it seems a different syntax is required, as the explain plan still shows 2 workers.
Creating partitions on the table is not an option. Any help is greatly appreciated, as I am out of ideas now. This is choking the system, and other applications are getting impacted.
Edit: Thanks everyone for all the help, but I am looking for ideas to mitigate the problem, as I am quite sure nothing I write or change directly in pgAdmin will help me here.
2 Answers
Sometimes there are situations where the database has no built-in functionality or capability to solve the problem for us. In these cases we have to think of logical workarounds. For example, suppose we need the number of records in a table. If we don't need an exact count and an approximate count is enough, we can get very high performance by reading this count from the information schema. So, where calculating the count of a table with 4 billion records might take 1-5 minutes, the information schema can give it to us in about 1 millisecond.
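In PostgreSQL specifically, that estimate lives in the pg_class catalog rather than the information schema; a minimal sketch, using the table name from the question:

SELECT reltuples::bigint AS approx_count
FROM pg_class
WHERE relname = 'very_large_table';

The figure is the planner's estimate, so it is only as fresh as the last ANALYZE or autovacuum run.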
Now the information schema will not help with the problem you describe, because it only gives the record count for the entire table; the grouping blocks us here. But you can use a materialized view, if you don't need exact counts: refresh it on a schedule every night, and query it during the day.
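A minimal sketch of that idea, with the table and column names from the question (the view name is made up):

CREATE MATERIALIZED VIEW col_1_counts AS
SELECT col_1, count(col_1) AS cnt
FROM Very_Large_Table
GROUP BY col_1;

CREATE UNIQUE INDEX ON col_1_counts (col_1);

-- run nightly from a scheduler such as cron or pg_cron
REFRESH MATERIALIZED VIEW CONCURRENTLY col_1_counts;

REFRESH ... CONCURRENTLY requires the unique index, but in exchange it lets readers keep querying the view while the refresh runs.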
Again, after getting to know the issue in depth and the business logic in detail, we might think of other alternative solutions. All I can say is that in every DB, count is a slow-running operation on very large tables.
From your question, I guess you need to do queries on your Target_Table where each row of your result set shows how many rows from Very_Large_Table have the same value of col_1 as the present row. Your plan seems to be populating your qty column with that number.

With respect, your present approach is impractical.
- Populating all the qty values takes hours, as you have seen.
- You must lock both tables while updating qty, otherwise insertions, updates and deletions will interfere with the UPDATE query in your question. The locks interfere with other workloads on your DBMS.
- The qty values will become stale between times you run your query. Therefore you must treat them as approximate values in your application.

There are other ways to generate the qty value you need, when querying. For example, a query like the one below retrieves your qty at the time of your query.
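A sketch of that per-query count, assuming the names from the question; something stands in for whatever filter your application actually uses:

SELECT TT.*,
       (SELECT count(*)
        FROM Very_Large_Table V
        WHERE V.col_1 = TT.col_1) AS qty
FROM Target_Table TT
WHERE something;

The correlated subquery only counts the col_1 values that survive the filter, which is what makes the next observation matter.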
It seems very likely that you always, or almost always, use a WHERE something filter clause when querying. I suspect it's very rare to query the whole table. The presence of a filter, any filter, reduces the number of col_1 values for which you need qty. If the filter is selective enough, it reduces that number to a workable value. This approach has the benefit that qty is always up to date.

This index accelerates the query I propose.
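A sketch of such an index (the name is made up); a plain B-tree index on the counted column lets each per-row count be satisfied from the index, often as an index-only scan, instead of scanning the table:

CREATE INDEX very_large_table_col_1_idx
    ON Very_Large_Table (col_1);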
There’s another, more elaborate, approach. If you decide to go this direction, give it a try and ask another question if you need help.
Create a separate table Col_1_Counts_From_Very_Large_Table with col_1 and qty columns. Put indexes on (col_1, qty) and (qty, col_1). Populate it, during some system downtime, with a query like the one in your question.
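A sketch of that table and its one-time population; the column type is an assumption, so match it to Very_Large_Table.col_1:

CREATE TABLE Col_1_Counts_From_Very_Large_Table (
    col_1 bigint PRIMARY KEY,  -- one row per distinct value; the trigger sketch below upserts on this
    qty   bigint NOT NULL
);

CREATE INDEX ON Col_1_Counts_From_Very_Large_Table (col_1, qty);
CREATE INDEX ON Col_1_Counts_From_Very_Large_Table (qty, col_1);

-- one-time load during downtime
INSERT INTO Col_1_Counts_From_Very_Large_Table (col_1, qty)
SELECT col_1, count(col_1)
FROM Very_Large_Table
GROUP BY col_1;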
Create triggers for insert, update, and delete on your Very_Large_Table. Have those triggers update the qty values in the appropriate rows in Col_1_Counts_From_Very_Large_Table. This keeps your counts table up-to-date whenever somebody changes rows in Very_Large_Table.
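A sketch of the insert-side trigger; the function and trigger names are made up, and the delete and update cases follow the same pattern (decrement the old value's count, increment the new one's):

CREATE FUNCTION bump_col_1_count() RETURNS trigger AS $$
BEGIN
    INSERT INTO Col_1_Counts_From_Very_Large_Table AS c (col_1, qty)
    VALUES (NEW.col_1, 1)
    ON CONFLICT (col_1) DO UPDATE SET qty = c.qty + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER very_large_table_count_ins
AFTER INSERT ON Very_Large_Table
FOR EACH ROW EXECUTE FUNCTION bump_col_1_count();  -- EXECUTE PROCEDURE on PostgreSQL 10 and older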
Then when you run your query, do it like this:
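A sketch of that lookup, joining the counts table instead of aggregating the big one (something again stands in for your real filter):

SELECT TT.*, CC.qty
FROM Target_Table TT
JOIN Col_1_Counts_From_Very_Large_Table CC
  ON CC.col_1 = TT.col_1
WHERE something;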
This approach will be accurate and fast at query time, at the cost of trigger complexity and small slowdowns on inserts, updates, and deletes to your very large table. It also lets you do things like find MAX(qty) or ORDER BY qty DESC without dimming the lights in your data center.