
I have a table with 10 TB of data.

I want to run COUNT(1)/COUNT(*) to compare the EXACT row counts of the main table and the archive table.

I tried the following SQL statements but didn't find an acceptable solution; the queries keep running for more than 15 minutes and counting…

My attempts:

Try 1: Did not work; the query keeps running.

SELECT COUNT(*) FROM large_table WHERE column_date <= '2023-01-01 00:00:00';

Try 2: This returns quickly, but how do I apply a condition? I'm also not sure it gives the EXACT count.

SELECT reltuples AS estimate FROM pg_class WHERE relname = 'large_table';

Try 3: This also keeps running; I didn't get a result after more than 10 minutes. It uses EXPLAIN ANALYZE to get the exact count.

do $$
declare 
  r record; 
  count bigint; 
begin
  FOR r IN EXECUTE 'EXPLAIN ANALYZE SELECT * FROM large_table WHERE column_date <= ''2023-01-01 00:00:00'';'
  LOOP
    -- the "rows=N loops=..." part of the plan line holds the actual row count
    count := substring(r."QUERY PLAN" FROM ' rows=([[:digit:]]+) loops');
    EXIT WHEN count IS NOT NULL;
  END LOOP;
  raise info '%', count;
end;
$$;

2 Answers


  1. There is no solution for this problem.

    If you want to get the exact result set count, you have to compute the result set and count it. That can take a long time. EXPLAIN (ANALYZE) executes the statement, so it won’t be faster than a query that explicitly counts the rows.

    My recommendation is that you don't display the exact result set count. Either use the approximation provided by EXPLAIN (without ANALYZE), or at least don't display the result set count automatically. Rather, show the estimate and provide an "exact count" button; whoever is ready to wait can push it.
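
    For illustration, a minimal sketch of that estimate-only approach applied to the filtered query from the question (EXPLAIN without ANALYZE only plans the statement, so it returns almost instantly; the rows= figure in the top plan line is the planner's estimate, not an exact count):

    -- plans the query without executing it; read the estimate from "rows=..."
    EXPLAIN SELECT * FROM large_table
    WHERE column_date <= '2023-01-01 00:00:00';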

  2. As already established by @Laurenz Albe, Postgres unfortunately does need to scan and count all of that. If you're not satisfied with the estimates found in pg_class and really need an exact count, the sad answer is you'll have to wait, or collect your own counts.

    You could use tally tables to speed this up, at the price of some write overhead on the large table. Even if it's hard to predict which ranges of column_date you'll need to query, tallies let you retrieve exact counts for most of the range in near constant time, down to the resolution of your tally tables, and only count the remainder below that.

    E.g. to count everything between two microsecond-precise timestamps, you can easily sum the yearly/monthly/daily counts for the whole periods in between, then add the records between each bound and midnight of its date.
    demo at db<>fiddle

    create table large_table_tally_daily as 
    select date_trunc('day',column_date) as column_date
          ,count(*) 
    from large_table
    group by 1;
    
    -- one statement-level trigger function handles both DELETE and INSERT: it
    -- adds (or, for deletes, subtracts) the per-day counts found in the
    -- transition table. Note it only touches days already present in the tally;
    -- a day that first appears after the tally was built would also need an
    -- insert/upsert.
    create function trgf_large_table_tally_daily_delete_insert() returns trigger as $f$
    begin 
      with cte as(
         select date_trunc('day',column_date) as column_date
               ,(case when TG_OP='DELETE' then -1 else 1 end) * count(*) as diff
         from difftab
         group by 1)
      update large_table_tally_daily as a
      set count=a.count+cte.diff
      from cte
      where a.column_date=cte.column_date;
      return null;
    end $f$ language plpgsql;
    
    create trigger trg_large_table_tally_daily_delete
    after delete on large_table
    referencing old table as difftab
    for each statement 
    execute function trgf_large_table_tally_daily_delete_insert();
    
    create trigger trg_large_table_tally_daily_insert
    after insert on large_table
    referencing new table as difftab
    for each statement 
    execute function trgf_large_table_tally_daily_delete_insert();
    /* similar triggers are needed for UPDATE and TRUNCATE as well; a sketch follows */
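
    -- What follows is only an illustrative sketch (not part of the original
    -- demo; the names are made up) of the UPDATE and TRUNCATE handling hinted
    -- at above. An UPDATE trigger can reference both transition tables and
    -- apply the per-day difference between new and old rows:
    create function trgf_large_table_tally_daily_update() returns trigger as $f$
    begin
      with diff as(
         select column_date, sum(d) as diff
         from ( select date_trunc('day',column_date) as column_date, -count(*) as d
                from oldtab group by 1
                union all
                select date_trunc('day',column_date), count(*)
                from newtab group by 1 ) s
         group by 1)
      update large_table_tally_daily as a
      set count=a.count+diff.diff
      from diff
      where a.column_date=diff.column_date;
      return null;
    end $f$ language plpgsql;
    
    create trigger trg_large_table_tally_daily_update
    after update on large_table
    referencing old table as oldtab new table as newtab
    for each statement 
    execute function trgf_large_table_tally_daily_update();
    
    -- TRUNCATE triggers cannot use transition tables, but after a truncate the
    -- tally can simply be reset:
    create function trgf_large_table_tally_daily_truncate() returns trigger as $f$
    begin
      update large_table_tally_daily set count=0;
      return null;
    end $f$ language plpgsql;
    
    create trigger trg_large_table_tally_daily_truncate
    after truncate on large_table
    for each statement 
    execute function trgf_large_table_tally_daily_truncate();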
    
    create view large_table_tally_yearly as
    select date_trunc('year',column_date) as column_date
         , sum(count) 
    from large_table_tally_daily 
    group by 1 
    order by 1;
    
    table large_table_tally_yearly;
    
    column_date             |    sum
    ------------------------+-------
    2019-01-01 00:00:00+00  |   7554
    2020-01-01 00:00:00+00  | 140586
    2021-01-01 00:00:00+00  | 139782
    2022-01-01 00:00:00+00  | 140437
    2023-01-01 00:00:00+00  | 139971
    2024-01-01 00:00:00+00  | 131670
    -- insert 100 rows; no columns are listed, so every column (including
    -- column_date) gets its default value
    insert into large_table select from generate_series(1,100);
    
    table large_table_tally_yearly;
    
    column_date             |    sum
    ------------------------+-------
    2019-01-01 00:00:00+00  |   7555
    2020-01-01 00:00:00+00  | 140613
    2021-01-01 00:00:00+00  | 139803
    2022-01-01 00:00:00+00  | 140451
    2023-01-01 00:00:00+00  | 139988
    2024-01-01 00:00:00+00  | 131690
    -- remove up to 200 randomly sampled rows
    delete from large_table 
    where ctid in(select ctid 
                  from large_table tablesample bernoulli(0.046)repeatable(.42) 
                  limit 200);
    
    table large_table_tally_yearly;
    
    column_date             |    sum
    ------------------------+-------
    2019-01-01 00:00:00+00  |   7552
    2020-01-01 00:00:00+00  | 140564
    2021-01-01 00:00:00+00  | 139765
    2022-01-01 00:00:00+00  | 140418
    2023-01-01 00:00:00+00  | 139949
    2024-01-01 00:00:00+00  | 131652

    With daily resolution, the demo shows this ugly thing gets you the answer about 100x faster than a direct count(*) on 700k rows, 0.9ms vs. 98.9ms (about 4x faster with yearly resolution):

    -- full days strictly inside the range come from the daily tally; the
    -- partial first and last days are counted directly in large_table. As
    -- written, this assumes the two bounds fall on different days and neither
    -- lies exactly at midnight; boundary cases like those would need extra care.
    with cte as(select '2020-06-07 12:34:56.123456'::timestamptz as lower_bound
                      ,'2023-01-02 01:23:45.678901'::timestamptz as upper_bound)
    select (select sum(count)                       -- full days covered by the tally
            from large_table_tally_daily cross join cte
            where column_date>=lower_bound
              and column_date+'1 day'::interval<upper_bound)
          +(select count(*)                         -- partial day at the upper bound
            from large_table cross join cte
             where column_date>=greatest( date_trunc('day',upper_bound)
                                         ,lower_bound)
               and column_date<upper_bound )
          +(select count(*)                         -- partial day at the lower bound
            from large_table cross join cte
             where column_date>=lower_bound
               and column_date<least( date_trunc('day',lower_bound)+'1 day'::interval
                                     ,upper_bound) );
    

    Counts for the 937 full days in between are retrieved in near constant time, and the queries adding the two remainders are highly selective.

    It's handy if you're dealing with things like logs or sensor readings that just keep flowing in and never get updated. It's not really worth it if the entire time range in those 10TB sustains lots of random traffic where latency is important. And the more columns you need to take into account, the less maintainable it becomes.


    As a side note, you can use EXPLAIN (FORMAT JSON) if you plan to parse its results:

    do $d$
    declare  r json; 
             count bigint; 
    begin
      EXECUTE 'EXPLAIN (format json) SELECT * FROM large_table WHERE column_date <= ''2023-01-01 00:00:00'';' INTO r;
      -- "Plan Rows" of the top plan node is the planner's row estimate
      count := r#>>'{0,Plan,Plan Rows}';
      raise info '%', count;
    end $d$;
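
    If you do want the exact count from the plan, as in your third attempt, the same parsing approach works with ANALYZE; here is a small sketch along the same lines (note that ANALYZE actually executes the statement, so this is no faster than counting directly; "Actual Rows" of the top plan node then holds the real count):

    do $d$
    declare  r json;
             count bigint;
    begin
      EXECUTE 'EXPLAIN (analyze, format json) SELECT * FROM large_table WHERE column_date <= ''2023-01-01 00:00:00'';' INTO r;
      -- with ANALYZE, "Actual Rows" is the number of rows the scan really returned
      count := r#>>'{0,Plan,Actual Rows}';
      raise info '%', count;
    end $d$;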
    