
I have a table with 10 TB of data.

I want to run COUNT(1)/COUNT(*) to compare the EXACT row counts of the main table and the archive table.

I tried the following SQL statements but didn't find an acceptable solution; the queries keep running for more than 15 minutes and counting…

My attempts:

Try 1: Did not work; the query keeps running.

SELECT COUNT(*) FROM large_table WHERE column_date <= '2023-01-01 00:00:00';

Try 2: This returns quickly, but how do I apply a condition? I'm also not sure it gives the EXACT count.

SELECT reltuples AS estimate FROM pg_class WHERE relname = 'large_table';

Try 3: This also keeps running; I didn't get a result after more than 10 minutes. It uses EXPLAIN ANALYZE to get the exact count.

do $$
declare 
  r record; 
  count bigint; 
begin
  FOR r IN EXECUTE 'EXPLAIN ANALYZE SELECT * FROM large_table WHERE column_date <= ''2023-01-01 00:00:00'';'
  LOOP
    -- the "rows=N loops=..." part of the plan line holds the actual row count
    count := substring(r."QUERY PLAN" FROM ' rows=([[:digit:]]+) loops');
    EXIT WHEN count IS NOT NULL;
  END LOOP;
  raise info '%', count;
end;
$$;

2 Answers


  1. There is no solution for this problem.

    If you want to get the exact result set count, you have to compute the result set and count it. That can take a long time. EXPLAIN (ANALYZE) executes the statement, so it won’t be faster than a query that explicitly counts the rows.

    My recommendation is that you don't display the exact result set count. Either use the approximation provided by EXPLAIN (without ANALYZE), or at least don't display the result set count automatically. Rather, show the estimate and provide an "exact count" button; whoever is ready to wait can push it.
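
    For illustration, a minimal sketch of that estimate-only approach applied to the filtered query from the question (EXPLAIN without ANALYZE only plans the statement, so it returns almost instantly; the rows= figure in the top plan line is the planner's estimate, not an exact count):

    -- plans the query without executing it; read the estimate from "rows=..."
    EXPLAIN SELECT * FROM large_table
    WHERE column_date <= '2023-01-01 00:00:00';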

  2. As already established by @Laurenz Albe, Postgres unfortunately does need to scan and count all of that. If you're not satisfied with the estimates found in pg_class and really need an exact count, the sad answer is you'll have to wait, or collect your own counts.

    You could use tally tables to speed this up, at the price of some write overhead on the large table. Even if it's hard to predict which ranges of column_date you'll need to query, tallies let you retrieve exact counts for most of the range in near constant time, down to the resolution of your tally tables, and only count the remainder below that.

    E.g. to count everything between two microsecond-precise timestamps, you can easily sum the yearly/monthly/daily counts for the whole periods in between, then add the records between each bound and midnight of its date.
    demo at db<>fiddle

    create table large_table_tally_daily as 
    select date_trunc('day',column_date) as column_date
          ,count(*) 
    from large_table
    group by 1;
    
    -- one statement-level trigger function handles both DELETE and INSERT: it
    -- adds (or, for deletes, subtracts) the per-day counts found in the
    -- transition table. Note it only touches days already present in the tally;
    -- a day that first appears after the tally was built would also need an
    -- insert/upsert.
    create function trgf_large_table_tally_daily_delete_insert() returns trigger as $f$
    begin 
      with cte as(
         select date_trunc('day',column_date) as column_date
               ,(case when TG_OP='DELETE' then -1 else 1 end) * count(*) as diff
         from difftab
         group by 1)
      update large_table_tally_daily as a
      set count=a.count+cte.diff
      from cte
      where a.column_date=cte.column_date;
      return null;
    end $f$ language plpgsql;
    
    create trigger trg_large_table_tally_daily_delete
    after delete on large_table
    referencing old table as difftab
    for each statement 
    execute function trgf_large_table_tally_daily_delete_insert();
    
    create trigger trg_large_table_tally_daily_insert
    after insert on large_table
    referencing new table as difftab
    for each statement 
    execute function trgf_large_table_tally_daily_delete_insert();
    /* similar triggers are needed for UPDATE and TRUNCATE as well; a sketch follows */
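
    -- What follows is only an illustrative sketch (not part of the original
    -- demo; the names are made up) of the UPDATE and TRUNCATE handling hinted
    -- at above. An UPDATE trigger can reference both transition tables and
    -- apply the per-day difference between new and old rows:
    create function trgf_large_table_tally_daily_update() returns trigger as $f$
    begin
      with diff as(
         select column_date, sum(d) as diff
         from ( select date_trunc('day',column_date) as column_date, -count(*) as d
                from oldtab group by 1
                union all
                select date_trunc('day',column_date), count(*)
                from newtab group by 1 ) s
         group by 1)
      update large_table_tally_daily as a
      set count=a.count+diff.diff
      from diff
      where a.column_date=diff.column_date;
      return null;
    end $f$ language plpgsql;
    
    create trigger trg_large_table_tally_daily_update
    after update on large_table
    referencing old table as oldtab new table as newtab
    for each statement 
    execute function trgf_large_table_tally_daily_update();
    
    -- TRUNCATE triggers cannot use transition tables, but after a truncate the
    -- tally can simply be reset:
    create function trgf_large_table_tally_daily_truncate() returns trigger as $f$
    begin
      update large_table_tally_daily set count=0;
      return null;
    end $f$ language plpgsql;
    
    create trigger trg_large_table_tally_daily_truncate
    after truncate on large_table
    for each statement 
    execute function trgf_large_table_tally_daily_truncate();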
    
    create view large_table_tally_yearly as
    select date_trunc('year',column_date) as column_date
         , sum(count) 
    from large_table_tally_daily 
    group by 1 
    order by 1;
    
    table large_table_tally_yearly;
    
    column_date             |    sum
    ------------------------+-------
    2019-01-01 00:00:00+00  |   7554
    2020-01-01 00:00:00+00  | 140586
    2021-01-01 00:00:00+00  | 139782
    2022-01-01 00:00:00+00  | 140437
    2023-01-01 00:00:00+00  | 139971
    2024-01-01 00:00:00+00  | 131670
    -- insert 100 rows; no columns are listed, so every column (including
    -- column_date) gets its default value
    insert into large_table select from generate_series(1,100);
    
    table large_table_tally_yearly;
    
    column_date             |    sum
    ------------------------+-------
    2019-01-01 00:00:00+00  |   7555
    2020-01-01 00:00:00+00  | 140613
    2021-01-01 00:00:00+00  | 139803
    2022-01-01 00:00:00+00  | 140451
    2023-01-01 00:00:00+00  | 139988
    2024-01-01 00:00:00+00  | 131690
    -- remove up to 200 randomly sampled rows
    delete from large_table 
    where ctid in(select ctid 
                  from large_table tablesample bernoulli(0.046)repeatable(.42) 
                  limit 200);
    
    table large_table_tally_yearly;
    
    column_date             |    sum
    ------------------------+-------
    2019-01-01 00:00:00+00  |   7552
    2020-01-01 00:00:00+00  | 140564
    2021-01-01 00:00:00+00  | 139765
    2022-01-01 00:00:00+00  | 140418
    2023-01-01 00:00:00+00  | 139949
    2024-01-01 00:00:00+00  | 131652

    With daily resolution, the demo shows this ugly thing gets you the answer about 100x faster than a direct count(*) on 700k rows, 0.9ms vs. 98.9ms (about 4x faster with yearly resolution):

    -- full days strictly inside the range come from the daily tally; the
    -- partial first and last days are counted directly in large_table. As
    -- written, this assumes the two bounds fall on different days and neither
    -- lies exactly at midnight; boundary cases like those would need extra care.
    with cte as(select '2020-06-07 12:34:56.123456'::timestamptz as lower_bound
                      ,'2023-01-02 01:23:45.678901'::timestamptz as upper_bound)
    select (select sum(count)                       -- full days covered by the tally
            from large_table_tally_daily cross join cte
            where column_date>=lower_bound
              and column_date+'1 day'::interval<upper_bound)
          +(select count(*)                         -- partial day at the upper bound
            from large_table cross join cte
             where column_date>=greatest( date_trunc('day',upper_bound)
                                         ,lower_bound)
               and column_date<upper_bound )
          +(select count(*)                         -- partial day at the lower bound
            from large_table cross join cte
             where column_date>=lower_bound
               and column_date<least( date_trunc('day',lower_bound)+'1 day'::interval
                                     ,upper_bound) );
    

    Counts for the 937 full days in between are retrieved in near constant time, and the queries adding the two remainders are highly selective.

    It's handy if you're dealing with things like logs or sensor readings that just keep flowing in and never get updated. It's not really worth it if the entire time range in those 10TB sustains lots of random traffic where latency is important. And the more columns you need to take into account, the less maintainable it becomes.


    As a side note, you can use EXPLAIN (FORMAT JSON) if you plan to parse its results:

    do $d$
    declare  r json; 
             count bigint; 
    begin
      EXECUTE 'EXPLAIN (format json) SELECT * FROM large_table WHERE column_date <= ''2023-01-01 00:00:00'';' INTO r;
      -- "Plan Rows" of the top plan node is the planner's row estimate
      count := r#>>'{0,Plan,Plan Rows}';
      raise info '%', count;
    end $d$;
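
    If you do want the exact count from the plan, as in your third attempt, the same parsing approach works with ANALYZE; here is a small sketch along the same lines (note that ANALYZE actually executes the statement, so this is no faster than counting directly; "Actual Rows" of the top plan node then holds the real count):

    do $d$
    declare  r json;
             count bigint;
    begin
      EXECUTE 'EXPLAIN (analyze, format json) SELECT * FROM large_table WHERE column_date <= ''2023-01-01 00:00:00'';' INTO r;
      -- with ANALYZE, "Actual Rows" is the number of rows the scan really returned
      count := r#>>'{0,Plan,Actual Rows}';
      raise info '%', count;
    end $d$;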
    