skip to Main Content

Within my db I have table prediction_fsd with about 5 million entries. The site table contains approx 3 million entries. I need to execute queries that look like

SELECT prediction_fsd.id AS prediction_fsd_id, 
       prediction_fsd.site_id AS prediction_fsd_site_id, 
       prediction_fsd.html_hash AS prediction_fsd_html_hash, 
       prediction_fsd.prediction AS prediction_fsd_prediction, 
       prediction_fsd.algorithm AS prediction_fsd_algorithm, 
       prediction_fsd.model_version AS prediction_fsd_model_version,
       prediction_fsd.timestamp AS prediction_fsd_timestamp, 
       site_1.id AS site_1_id, 
       site_1.url AS site_1_url, 
       site_1.status AS site_1_status 
  FROM prediction_fsd
  LEFT OUTER JOIN site AS site_1
         ON site_1.id = prediction_fsd.site_id 
 WHERE 95806 = prediction_fsd.site_id
   AND prediction_fsd.algorithm = 'xgboost'
 ORDER BY prediction_fsd.timestamp DESC 
 LIMIT 1

at the moment this query takes about ~4 seconds. I’d like to reduce that by introducing an index. Which tables and fields should I include in that index. I’m having troubles properly understanding the EXPLAIN ANALYZE output of Postgres

enter image description here

2

Answers


  1. Chosen as BEST ANSWER
    CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
        ON public.prediction_fsd USING btree
        (site_id, algorithm, "timestamp" DESC)
        TABLESPACE pg_default;
    

    By introducing a combined index as suggested by Frank Heikens I was able to bring down the query execution time to 0.25s


  2. These three SQL lines point to a possible BTREE index to help you.

    WHERE 95806 = prediction_fsd.site_id
       AND prediction_fsd.algorithm = 'xgboost'
     ORDER BY prediction_fsd.timestamp DESC 
    

    You’re filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.

    CREATE INDEX site_alg_ts ON prediction_fsd 
          (site_id, algorithm, timestamp DESC);
    

    This BTREE index lets PostgreSQL random-access it to the first eligible row, which happens also to be the row you want with your ORDER BY ... LIMIT 1 clause.

    The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.

    On the other table, it appears that you already look up rows in it via the primary key id. So you don’t need any other index for that one.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search