I have a table with 10TB of data.
I want to run COUNT(1)/COUNT(*)
to check that the EXACT COUNT matches between the main table and the archive table.
I tried the following SQL statements but didn't find an acceptable solution; the queries keep running for more than 15 minutes and counting…
My attempts:
Try 1: Did not work; the query keeps running.
SELECT COUNT(*) FROM large_table WHERE column_date <= '2023-01-01 00:00:00';
Try 2: This works, but how do I apply a condition? I'm also not sure it gives the EXACT count.
SELECT reltuples AS estimate FROM pg_class WHERE relname = 'large_table';
Try 3: This too keeps running; I didn't get a result after more than 10 minutes. It uses EXPLAIN ANALYZE to get the exact count.
DO $$
DECLARE
    r   record;
    cnt bigint;
BEGIN
    -- EXPLAIN ANALYZE runs the query and reports the actual row count per node.
    FOR r IN EXECUTE 'EXPLAIN ANALYZE SELECT * FROM large_table
                      WHERE column_date <= ''2023-01-01 00:00:00'''
    LOOP
        -- Grab the actual "rows=N loops=..." figure from the first node that has one.
        cnt := substring(r."QUERY PLAN" FROM ' rows=([[:digit:]]+) loops');
        EXIT WHEN cnt IS NOT NULL;
    END LOOP;
    RAISE INFO '%', cnt;
END;
$$;
2 Answers
There is no solution for this problem.
If you want to get the exact result set count, you have to compute the result set and count it. That can take a long time. EXPLAIN (ANALYZE) executes the statement, so it won't be faster than a query that explicitly counts the rows.
My recommendation is that you don't display the exact result set count. Either use the approximation provided by EXPLAIN (without ANALYZE), or at least don't display the result set count automatically. Rather, show the estimate and provide a button "exact count". Whoever is ready to wait can push the button.
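To make that concrete, one way to wrap the planner estimate could look like the sketch below (the estimated_count function name is made up here; plain EXPLAIN only plans the query, so it returns almost instantly even on a 10TB table):

CREATE OR REPLACE FUNCTION estimated_count(query text) RETURNS bigint AS $$
DECLARE
    r        record;
    estimate bigint;
BEGIN
    -- Plain EXPLAIN (no ANALYZE) does not execute the query; the rows= figure
    -- in the top plan node is the planner's estimate.
    FOR r IN EXECUTE 'EXPLAIN ' || query LOOP
        estimate := substring(r."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN estimate IS NOT NULL;
    END LOOP;
    RETURN estimate;
END;
$$ LANGUAGE plpgsql;

-- Fast estimate for the filtered query; keep the exact COUNT(*) behind the button.
SELECT estimated_count(
    'SELECT * FROM large_table WHERE column_date <= ''2023-01-01 00:00:00''');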
As already established by @Laurenz Albe, Postgres unfortunately does need to really scan and count all that. If you're not satisfied with the estimates found in pg_class and really need an exact count, the sad answer is you'll have to wait or collect your own.
You could use tally tables to speed this up, at the price of some overhead on that table. Even if it's hard to predict the ranges of column_date you'll need to query, that'd allow you to retrieve exact counts for most of the range in near constant time, down to the resolution of your tally tables, then only count the remainder below that.
E.g. to count everything between two microsecond-precise timestamps, you can easily sum the yearly/monthly/daily counts in between those, then add the records between the lower/upper bound and midnight on their dates.
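For illustration, a minimal sketch of a daily tally table could look like this (the table, trigger and function names are made up for this example and aren't from the linked demo; it assumes PostgreSQL 11+ for EXECUTE FUNCTION, append-only inserts, and an index on column_date so the one-day remainder count stays cheap):

-- Daily tally, maintained by a trigger on the big table.
CREATE TABLE large_table_daily_counts (
    day date PRIMARY KEY,
    cnt bigint NOT NULL DEFAULT 0
);

CREATE FUNCTION bump_daily_count() RETURNS trigger AS $$
BEGIN
    INSERT INTO large_table_daily_counts AS d (day, cnt)
    VALUES (NEW.column_date::date, 1)
    ON CONFLICT (day) DO UPDATE SET cnt = d.cnt + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER large_table_bump_daily_count
AFTER INSERT ON large_table
FOR EACH ROW EXECUTE FUNCTION bump_daily_count();

-- Exact count up to an arbitrary timestamp: sum the full days from the tally,
-- then count only the boundary day's remainder in the big table itself.
SELECT (SELECT coalesce(sum(cnt), 0)
        FROM   large_table_daily_counts
        WHERE  day < '2023-01-01'::date)
     + (SELECT count(*)
        FROM   large_table
        WHERE  column_date >= '2023-01-01'::date
        AND    column_date <= '2023-01-01 00:00:00') AS exact_count;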
demo at db<>fiddle
With daily resolution, the demo shows this ugly thing can get you the answer about 100x faster (4x with yearly), in 0.9ms on 700k rows compared to 98.9ms using a direct count(*). Counts for the 937 days in between are retrieved in near constant time, and the queries adding the remainders have very high specificity.
It’s handy if you’re dealing with things like logs or sensor readings that just keep flowing in and never update. It’s not really worth it if the entire time range in those 10TB sustains lots of random traffic where latency is important. With more and more columns you need to take into account, it’s less and less maintainable.
As a side note, you can explain(format json) if you plan to parse its results:
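For example, a parser-friendly estimate function could look like this sketch (the function name is made up; "Plan Rows" on the top plan node holds the planner's row estimate):

CREATE OR REPLACE FUNCTION estimated_count_json(query text) RETURNS bigint AS $$
DECLARE
    plan jsonb;
BEGIN
    -- FORMAT JSON returns the whole plan as one value, so no regex is needed.
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    -- "Plan Rows" is the planner's estimate for the top node.
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql;

SELECT estimated_count_json(
    'SELECT * FROM large_table WHERE column_date <= ''2023-01-01 00:00:00''');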