skip to Main Content

I’m hoping to get some help on how to speed up an SQL query.
To give you some context, here’s an outline of the related db structure:

  1. table instruments – instruments have isin, trading_venue, start_trading_date, end_trading_date;
    one and the same instrument can trade on multiple venues, with varying start and end trading dates;
    has an index on isin and the trading dates;

  2. view instrument_start_end_dates – in this view, for each instrument isin I get the earliest start_trading_date and the latest end_trading date, across all trading venues;

  3. table trades – each trade has a date_time column with a timestamp and the isin of the traded instrument; has an index on the date_time column and the isin column;

What I’m trying to accomplish – get all trades, where the date of the trade is between the start and end trading date for the respective instrument.

Here is my query:

SELECT *
FROM execs_prop e
JOIN isin_start_end_dates ised on ised.isin = e.isin_cusip
WHERE DATE(e.date_time AT TIME ZONE 'UTC') = '2024-03-04'
  AND DATE(e.date_time AT TIME ZONE 'UTC') >= ised.start_trading_date 
  AND DATE(e.date_time AT TIME ZONE 'UTC') <= ised.end_trading_date
ORDER BY e.date_time;

The query is returning the desired results, but it’s slow – usually it takes around 8 minutes.

Here is the EXPLAIN ANALYZE:

"Gather Merge  (cost=1348478.40..1431695.31 rows=713238 width=159) (actual time=199969.600..199999.190 rows=26784 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  ->  Sort  (cost=1347478.38..1348369.92 rows=356619 width=159) (actual time=199727.735..199728.562 rows=8928 loops=3)"
"        Sort Key: e.date_time"
"        Sort Method: quicksort  Memory: 2664kB"
"        Worker 0:  Sort Method: quicksort  Memory: 2949kB"
"        Worker 1:  Sort Method: quicksort  Memory: 2359kB"
"        ->  Hash Join  (cost=1191253.51..1258520.93 rows=356619 width=159) (actual time=189405.742..199719.816 rows=8928 loops=3)"
"              Hash Cond: ((e.isin_cusip)::bpchar = instruments.isin)"
"              Join Filter: ((date(timezone('UTC'::text, e.date_time)) >= (min(instruments.first_trdg_date))) AND (date(timezone('UTC'::text, e.date_time))   Parallel Seq Scan on execs_prop e  (cost=0.00..53271.59 rows=3742 width=138) (actual time=0.094..279.061 rows=14837 loops=3)"
"                    Filter: (date(timezone('UTC'::text, date_time)) = '2024-03-04'::date)"
"                    Rows Removed by Filter: 583789"
"              ->  Hash  (cost=1147913.49..1147913.49 rows=2360642 width=21) (actual time=189392.680..189392.683 rows=9132958 loops=3)"
"                    Buckets: 65536 (originally 65536)  Batches: 256 (originally 64)  Memory Usage: 3585kB"
"                    ->  GroupAggregate  (cost=0.56..1124307.07 rows=2360642 width=21) (actual time=1.361..184799.653 rows=9132958 loops=3)"
"                          Group Key: instruments.isin"
"                          ->  Index Only Scan using idx_instruments_isin_first_last_date on instruments  (cost=0.56..996661.48 rows=13871889 width=21) (actual time=1.156..179268.392 rows=13498015 loops=3)"
"                                Heap Fetches: 26547906"
"Planning Time: 0.409 ms"
"Execution Time: 200001.209 ms"

Any help or guidance would be greatly appreciated.

2

Answers


  1. I did notice one thing right away.

    Whenever you have a cast from DATETIME to DATE for a comparison and there was an index for the DATETIME column, you should explore whether you can write the query to avoid the cast. This is because the values resulting from the cast no longer match the values stored in the index, making the index effectively worthless for the query.

    For example, we see this:

     DATE(e.date_time AT TIME ZONE 'UTC') = '2024-03-04'
    

    But it will almost certainly be better to write the comparison to check a range covering the full day, using a "half open" exclusive upper bound for the next day, like this:

    e.date_time >= '2024-03-04' AND e.date_time < '2024-03-05'
    

    Now the original values in the column are unmodified, so the query can still use the index.

    The exception is I don’t have knowledge on how your values are stored relative to the AT TIME ZONE 'UTC' clause (and this is one reason you should generally store the values in the database in UTC in the first place… to avoid forcing yourself to write potentially index-voiding expressions).

    Login or Signup to reply.
  2. Please try my code :-

    SELECT e.*
    FROM execs_prop e
    JOIN (
        SELECT isin, MIN(start_trading_date) AS min_start_trading_date, MAX(end_trading_date) AS max_end_trading_date
        FROM instrument_start_end_dates
        GROUP BY isin
    ) AS ised ON ised.isin = e.isin_cusip
    WHERE DATE(e.date_time AT TIME ZONE 'UTC') = '2024-03-04'
      AND e.date_time AT TIME ZONE 'UTC' >= ised.min_start_trading_date
      AND e.date_time AT TIME ZONE 'UTC' <= ised.max_end_trading_date
    ORDER BY e.date_time;
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search