My table is:
create table transactions (
transaction_id integer not null,
transaction_timestamp integer not null,
input_index smallint,
output_index smallint not null,
from_id integer,
to_id integer not null,
input_value real,
output_value real not null,
constraint unique_transactions
unique (transaction_id, from_id, to_id)
);
with the following indexes:
create index idx_transactions_from_id_block_timestamp
on transactions (from_id asc, transaction_timestamp desc);
create index idx_transactions_to_id_block_timestamp
on transactions (to_id asc, transaction_timestamp desc);
create index idx_transactions_transaction_id
on transactions (transaction_id);
create index idx_transactions_block_timestamp
on transactions (transaction_timestamp desc);
The query I would ideally want is:
select distinct on (transaction_id,output_index) *
from transactions
where to_id = 1000
and transaction_timestamp between 1691193600 and 1711929600
order by transaction_timestamp desc
limit 10
This should give me the most recent 10 unique (transaction_id, output_index) pairs (I don't care which from_id and input_index are kept for each pair).
This straightforward way doesn't work, since Postgres requires the ORDER BY to start with the DISTINCT ON expressions:
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Doing so reorders my rows, selecting the first 10 by transaction_id rather than the most recent 10, which I do not want.
Is there an efficient way to do this that exploits the small LIMIT, so it hopefully does not have to go over the millions of rows in the table?
I have attempted the following queries; all took too long because they had to process every qualifying row instead of making use of the small LIMIT 10.
Query 1:
WITH RankedTransactions AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY transaction_id, output_index ORDER BY transaction_timestamp DESC) AS rn
FROM transactions
WHERE to_id = 1000
  AND transaction_timestamp BETWEEN 1691193600 AND 1711929600
)
SELECT transaction_id,
input_index,
output_index,
transaction_timestamp,
from_id,
to_id,
input_value,
output_value
FROM RankedTransactions
WHERE rn = 1
ORDER BY transaction_timestamp DESC
LIMIT 10;
Query 2:
SELECT *
FROM (
SELECT DISTINCT ON (transaction_id, output_index) *
FROM transactions
WHERE to_id = 1000
  AND transaction_timestamp BETWEEN 1691193600 AND 1711929600
ORDER BY transaction_id, output_index DESC
) AS latest_transactions
ORDER BY transaction_timestamp DESC
LIMIT 10;
2 Answers
This works:
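A sketch of that query, reconstructed from the description below: Query 2 from the question with the DESC dropped from output_index, plus transaction_timestamp DESC as a tie-breaker so each pair is represented by its most recent row.

SELECT *
FROM  (
   SELECT DISTINCT ON (transaction_id, output_index) *
   FROM   transactions
   WHERE  to_id = 1000
   AND    transaction_timestamp BETWEEN 1691193600 AND 1711929600
   ORDER  BY transaction_id, output_index, transaction_timestamp DESC
   ) sub
ORDER BY transaction_timestamp DESC
LIMIT 10;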
I removed the distracting DESC from output_index in the sort order.
The best query (and index) depend on how many rows (and duplicates within those) to expect per basic selection (after filtering by WHERE to_id = 1000 AND transaction_timestamp BETWEEN 1691193600 AND 1711929600).
Backed by your (existing) index on (to_id, transaction_timestamp DESC), this query may just be it.
With a large number of qualifying rows and/or a large number of duplicates on (transaction_id, output_index), things get more complicated. Especially since you (1) already have a range condition in the base filter, and (2) want a mixed sort order for (transaction_id ASC, output_index DESC), which makes emulating an index skip scan hard …
Recursive query
If there are many qualifying rows, but not very many duplicates on (transaction_id, output_index) within that set, and while your LIMIT is very small (like LIMIT 10), this recursive query should be very fast.
I collect the (few!) selected combinations of (transaction_id, output_index) in an array and use that as a filter going forward. Create a matching composite type for the purpose once, then run the recursive query; both are sketched below.
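The type name tx_out is an assumption, and the query itself is a sketch of the pattern described here: start from the most recent qualifying row, repeatedly fetch the next-most-recent row whose (transaction_id, output_index) pair has not been collected yet, and let the outer LIMIT stop the recursion.

CREATE TYPE tx_out AS (transaction_id int, output_index int2);

Then, roughly:

WITH RECURSIVE cte AS (
   (  -- most recent qualifying row starts the chain
   SELECT t.*, ARRAY[(t.transaction_id, t.output_index)::tx_out] AS seen
   FROM   transactions t
   WHERE  t.to_id = 1000
   AND    t.transaction_timestamp BETWEEN 1691193600 AND 1711929600
   ORDER  BY t.transaction_timestamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT t.*, c.seen || (t.transaction_id, t.output_index)::tx_out
   FROM   cte c
   CROSS  JOIN LATERAL (
      -- next-most-recent row whose (transaction_id, output_index) is still unseen
      SELECT t.*
      FROM   transactions t
      WHERE  t.to_id = 1000
      AND    t.transaction_timestamp BETWEEN 1691193600 AND c.transaction_timestamp
      AND    (t.transaction_id, t.output_index)::tx_out <> ALL (c.seen)
      ORDER  BY t.transaction_timestamp DESC
      LIMIT  1
      ) t
)
SELECT transaction_id, transaction_timestamp, input_index, output_index,
       from_id, to_id, input_value, output_value
FROM   cte
LIMIT  10;  -- stops the recursion after 10 distinct pairs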
This still only needs your existing index on (to_id, transaction_timestamp DESC) (where the DESC does not matter).

What I've done in cases like this is use a cursor to read the sorted data into the client (or app server, whatever is connected to the database) incrementally, and populate a hash table with the rows until it reaches the size I want, then close the cursor. Of course this only works if duplicates (on the hash key) are rare enough that you reach the target size without reading too much of the sorted data. You could probably do this all server-side with a set-returning user-defined function if you wanted. There is no fundamental reason this could not be implemented automatically, but I wouldn't hold my breath for someone to do that. For one thing, it isn't obvious what SQL syntax would be used to express this desired output clearly.
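For the server-side variant, a rough sketch of such a set-returning function (the function name, parameter names, and the text array standing in for the hash table are illustrative assumptions, not from the original answer):

CREATE OR REPLACE FUNCTION recent_distinct_transactions(
   _to_id   int,
   _from_ts int,
   _to_ts   int,
   _limit   int DEFAULT 10)
RETURNS SETOF transactions
LANGUAGE plpgsql AS
$$
DECLARE
   r    transactions;
   seen text[] := '{}';  -- stands in for the hash table of already-seen keys
   key  text;
BEGIN
   -- The FOR loop reads from an implicit cursor, so rows are fetched
   -- incrementally and we stop as soon as enough distinct pairs are found.
   FOR r IN
      SELECT *
      FROM   transactions
      WHERE  to_id = _to_id
      AND    transaction_timestamp BETWEEN _from_ts AND _to_ts
      ORDER  BY transaction_timestamp DESC
   LOOP
      key := r.transaction_id || ':' || r.output_index;
      IF key <> ALL (seen) THEN
         seen := seen || key;
         RETURN NEXT r;
         EXIT WHEN cardinality(seen) >= _limit;
      END IF;
   END LOOP;
END
$$;

-- usage: SELECT * FROM recent_distinct_transactions(1000, 1691193600, 1711929600);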
Another approach I’ve used is to read a defined number of rows in order (say 1000), then remove the duplicates, then put them back in order again and take the first 10. The problem is that there is no way to know that 1000 input rows is enough to lead to 10 output rows, as it is possible for the duplication rate to be very high. So the notion that 1000 is sufficient is a matter of your judgment, not something which can be determined mechanically.
Concretely, this would be:
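A sketch of that query; the 1000-row inner limit is the judgment call discussed above:

SELECT *
FROM  (
   SELECT DISTINCT ON (transaction_id, output_index) *
   FROM  (
      -- read a fixed budget of the most recent qualifying rows
      SELECT *
      FROM   transactions
      WHERE  to_id = 1000
      AND    transaction_timestamp BETWEEN 1691193600 AND 1711929600
      ORDER  BY transaction_timestamp DESC
      LIMIT  1000
      ) buffered
   ORDER  BY transaction_id, output_index, transaction_timestamp DESC
   ) deduped
ORDER BY transaction_timestamp DESC
LIMIT 10;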
This should be efficient with your existing index on (to_id, transaction_timestamp DESC).