My table is:
create table transactions (
transaction_id integer not null,
transaction_timestamp integer not null,
input_index smallint,
output_index smallint not null,
from_id integer,
to_id integer not null,
input_value real,
output_value real not null,
constraint unique_transactions
unique (transaction_id, from_id, to_id)
);
with the following indexes:
create index idx_transactions_from_id_block_timestamp
on transactions (from_id asc, transaction_timestamp desc);
create index idx_transactions_to_id_block_timestamp
on transactions (to_id asc, transaction_timestamp desc);
create index idx_transactions_transaction_id
on transactions (transaction_id);
create index idx_transactions_block_timestamp
on transactions (transaction_timestamp desc);
The query I would ideally want is:
select distinct on (transaction_id,output_index) *
from transactions
where to_id = 1000
and transaction_timestamp between 1691193600 and 1711929600
order by transaction_timestamp desc
limit 10
This should give me the most recent 10 unique (transaction_id, output_index) pairs (I don't care which from_id and input_index are kept for each pair).
This straightforward way doesn't work, since Postgres requires the ORDER BY to start with the DISTINCT ON expressions:
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Doing so reorders my rows, selecting the first 10 by transaction_id rather than the most recent 10, which I do not want.
Is there an efficient way to do this that exploits the small LIMIT, so it hopefully does not have to go over the millions of rows in the table?
I have attempted the following queries; all took too long because they had to process every qualifying row instead of making use of the small LIMIT 10.
Query 1:
WITH RankedTransactions AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY transaction_id, output_index ORDER BY transaction_timestamp DESC) AS rn
FROM transactions
WHERE to_id = 1000
  AND transaction_timestamp BETWEEN 1691193600 AND 1711929600
)
SELECT transaction_id,
input_index,
output_index,
transaction_timestamp,
from_id,
to_id,
input_value,
output_value
FROM RankedTransactions
WHERE rn = 1
ORDER BY transaction_timestamp DESC
LIMIT 10;
Query 2:
SELECT *
FROM (
SELECT DISTINCT ON (transaction_id, output_index) *
FROM transactions
WHERE to_id = 1000
  AND transaction_timestamp BETWEEN 1691193600 AND 1711929600
ORDER BY transaction_id, output_index DESC
) AS latest_transactions
ORDER BY transaction_timestamp DESC
LIMIT 10;
2 Answers
This works:
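A sketch of that query, reconstructed from the description below: Query 2 from the question with the DESC dropped from output_index, plus transaction_timestamp DESC as a tie-breaker so each pair is represented by its most recent row.

SELECT *
FROM  (
   SELECT DISTINCT ON (transaction_id, output_index) *
   FROM   transactions
   WHERE  to_id = 1000
   AND    transaction_timestamp BETWEEN 1691193600 AND 1711929600
   ORDER  BY transaction_id, output_index, transaction_timestamp DESC
   ) sub
ORDER BY transaction_timestamp DESC
LIMIT 10;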
I removed the distracting DESC from output_index in the sort order.
The best query (and index) depend on how many rows (and duplicates within those) to expect per basic selection (after filtering by WHERE to_id = 1000 AND transaction_timestamp BETWEEN 1691193600 AND 1711929600).
Backed by your (existing) index on (to_id, transaction_timestamp DESC), this query may just be it.
With a large number of qualifying rows and/or a large number of duplicates on (transaction_id, output_index), things get more complicated. Especially since you (1) already have a range condition in the base filter, and (2) want a mixed sort order for (transaction_id ASC, output_index DESC), which makes emulating an index skip scan hard …
Recursive query
If there are many qualifying rows, but not very many duplicates on (transaction_id, output_index) within that set, and while your LIMIT is very small (like LIMIT 10), this recursive query should be very fast.
I collect the (few!) selected combinations of (transaction_id, output_index) in an array and use that as a filter going forward. Create a matching composite type for the purpose once, then run the recursive query; both are sketched below.
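The type name tx_out is an assumption, and the query itself is a sketch of the pattern described here: start from the most recent qualifying row, repeatedly fetch the next-most-recent row whose (transaction_id, output_index) pair has not been collected yet, and let the outer LIMIT stop the recursion.

CREATE TYPE tx_out AS (transaction_id int, output_index int2);

Then, roughly:

WITH RECURSIVE cte AS (
   (  -- most recent qualifying row starts the chain
   SELECT t.*, ARRAY[(t.transaction_id, t.output_index)::tx_out] AS seen
   FROM   transactions t
   WHERE  t.to_id = 1000
   AND    t.transaction_timestamp BETWEEN 1691193600 AND 1711929600
   ORDER  BY t.transaction_timestamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT t.*, c.seen || (t.transaction_id, t.output_index)::tx_out
   FROM   cte c
   CROSS  JOIN LATERAL (
      -- next-most-recent row whose (transaction_id, output_index) is still unseen
      SELECT t.*
      FROM   transactions t
      WHERE  t.to_id = 1000
      AND    t.transaction_timestamp BETWEEN 1691193600 AND c.transaction_timestamp
      AND    (t.transaction_id, t.output_index)::tx_out <> ALL (c.seen)
      ORDER  BY t.transaction_timestamp DESC
      LIMIT  1
      ) t
)
SELECT transaction_id, transaction_timestamp, input_index, output_index,
       from_id, to_id, input_value, output_value
FROM   cte
LIMIT  10;  -- stops the recursion after 10 distinct pairs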
This still only needs your existing index on (to_id, transaction_timestamp DESC) (where the DESC does not matter).

What I've done in cases like this is use a cursor to read the sorted data into the client (or app server, whatever is connected to the database) incrementally, and populate a hash table with the rows until it reaches the size I want, then close the cursor. Of course this only works if duplicates (on the hash key) are rare enough that you reach the target size without reading too much of the sorted data. You could probably do this all server-side with a set-returning user-defined function if you wanted. There is no fundamental reason this could not be implemented automatically, but I wouldn't hold my breath for someone to do that. For one thing, it isn't obvious what SQL syntax would be used to express this desired output clearly.
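For the server-side variant, a rough sketch of such a set-returning function (the function name, parameter names, and the text array standing in for the hash table are illustrative assumptions, not from the original answer):

CREATE OR REPLACE FUNCTION recent_distinct_transactions(
   _to_id   int,
   _from_ts int,
   _to_ts   int,
   _limit   int DEFAULT 10)
RETURNS SETOF transactions
LANGUAGE plpgsql AS
$$
DECLARE
   r    transactions;
   seen text[] := '{}';  -- stands in for the hash table of already-seen keys
   key  text;
BEGIN
   -- The FOR loop reads from an implicit cursor, so rows are fetched
   -- incrementally and we stop as soon as enough distinct pairs are found.
   FOR r IN
      SELECT *
      FROM   transactions
      WHERE  to_id = _to_id
      AND    transaction_timestamp BETWEEN _from_ts AND _to_ts
      ORDER  BY transaction_timestamp DESC
   LOOP
      key := r.transaction_id || ':' || r.output_index;
      IF key <> ALL (seen) THEN
         seen := seen || key;
         RETURN NEXT r;
         EXIT WHEN cardinality(seen) >= _limit;
      END IF;
   END LOOP;
END
$$;

-- usage: SELECT * FROM recent_distinct_transactions(1000, 1691193600, 1711929600);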
Another approach I’ve used is to read a defined number of rows in order (say 1000), then remove the duplicates, then put them back in order again and take the first 10. The problem is that there is no way to know that 1000 input rows is enough to lead to 10 output rows, as it is possible for the duplication rate to be very high. So the notion that 1000 is sufficient is a matter of your judgment, not something which can be determined mechanically.
Concretely, this would be:
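A sketch of that query; the 1000-row inner limit is the judgment call discussed above:

SELECT *
FROM  (
   SELECT DISTINCT ON (transaction_id, output_index) *
   FROM  (
      -- read a fixed budget of the most recent qualifying rows
      SELECT *
      FROM   transactions
      WHERE  to_id = 1000
      AND    transaction_timestamp BETWEEN 1691193600 AND 1711929600
      ORDER  BY transaction_timestamp DESC
      LIMIT  1000
      ) buffered
   ORDER  BY transaction_id, output_index, transaction_timestamp DESC
   ) deduped
ORDER BY transaction_timestamp DESC
LIMIT 10;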
This should be efficient with your existing index on (to_id, transaction_timestamp DESC).