I am dealing with a PostgreSQL (v14) query of this kind

SELECT
    ...,
    EXISTS (
        SELECT 1
        FROM table2 t2
        WHERE t2.fk = t1.id
          AND LOWER(t2.table_name) = 't1'
    ) AS t2_record_exists
FROM table1 t1;

and was hoping to extract the logic into a function to be used in a LATERAL join, in order to compute that field in a more readable way (as it is to be used in a view).

This is the resulting query

SELECT
    ...,
    t2_record_exists.t2_record_exists
FROM table1 t1
    LEFT JOIN LATERAL does_t2_record_exist(t1.id, 't1') t2_record_exists(t2_record_exists) ON TRUE;

That makes use of this function

CREATE OR REPLACE FUNCTION does_t2_record_exist(object_id int8, _table_name text)
    RETURNS bool
    LANGUAGE plpgsql
AS $function$
BEGIN
    RETURN EXISTS (
        SELECT 1
        FROM table2 t2
        WHERE t2.fk = object_id
            AND LOWER(t2.table_name) = _table_name
    );
END $function$;

The second query suffers a severe performance loss, as it executes in about 6000ms, while the first gets it done in 300ms.

I don’t know why this happens, as I naively assumed that the very same operation (the EXISTS subquery) would be executed the same number of times (once per row).

What is going wrong here? How can one foresee such performance issues beforehand?

EDIT: Here are the query plans (obtained with EXPLAIN ANALYZE) for my specific case

Query 1 (with subquery):

Index Only Scan using cos_table1 on table1 t1 (cost=0.43..3723311.61 rows=1481535 width=9) (actual time=56.299..247.674 rows=1477585 loops=1)
  Heap Fetches: 46760
  SubPlan 2
    ->  Seq Scan on table2 t2  (cost=0.00..15.27 rows=2 width=8) (actual time=11.300..11.454 rows=372 loops=1)
          Filter: (lower(table_name) = 't1'::text)
          Rows Removed by Filter: 113
Planning Time: 0.085 ms
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 0.769 ms, Inlining 17.544 ms, Optimization 24.557 ms, Emission 13.652 ms, Total 56.524 ms
Execution Time: 276.285 ms

Query 2 (with LATERAL JOIN):

Nested Loop Left Join  (cost=0.68..56512.74 rows=1481535 width=9) (actual time=0.039..5978.865 rows=1477585 loops=1)
  ->  Index Only Scan using cos_table1 on table1 t1 (cost=0.43..26881.79 rows=1481535 width=8) (actual time=0.011..179.682 rows=1477585 loops=1)
        Heap Fetches: 46760
  ->  Function Scan on does_t2_record_exist t2_record_exist (cost=0.25..0.26 rows=1 width=1) (actual time=0.004..0.004 rows=1 loops=1477585)
Planning Time: 0.065 ms
Execution Time: 6024.267 ms

3 Answers


  1. When the EXISTS operation is directly in the query, PostgreSQL can transform the subquery into a semantically equivalent JOIN; when the EXISTS is hidden behind a function call, however, the planner no longer has the information it needs to perform such optimizations. Running EXPLAIN (ANALYZE, BUFFERS, VERBOSE) on each query should make the reason for the performance difference evident: wrapping the EXISTS in a function causes the subquery to be executed once per row instead of being replaced by a more efficient JOIN.
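
    For example, a quick way to compare the two variants side by side (a sketch against the tables from the question, with the column list shortened to t1.id; the resulting plans will differ from system to system):

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    SELECT t1.id,
           EXISTS (
               SELECT 1
               FROM table2 t2
               WHERE t2.fk = t1.id
                 AND LOWER(t2.table_name) = 't1'
           ) AS t2_record_exists
    FROM table1 t1;

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    SELECT t1.id,
           does_t2_record_exist(t1.id, 't1') AS t2_record_exists
    FROM table1 t1;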

  2. You did not show actual query plans (output of EXPLAIN (ANALYZE, BUFFERS, SETTINGS)), but there are several performance traps in your function. Rewrite as:

    CREATE OR REPLACE FUNCTION does_t2_record_exist(_object_id int8, _table_name text)
      RETURNS bool
      LANGUAGE sql STABLE PARALLEL SAFE AS  -- !
    $func$
    SELECT EXISTS (
       SELECT FROM table2 t2
       WHERE  t2.fk = _object_id
       AND    lower(t2.table_name) = _table_name
       );
    $func$;
    

    And your query then is simply:

    SELECT ...
         , does_t2_record_exist(t1.id, 't1') AS t2_record_exists
    FROM   table1 t1
    

    No need for the LATERAL join, once you have wrapped the correlated subquery into the function.

    The expression lower(t2.table_name) is not sargable as such. Be sure to have a (default B-tree) multicolumn expression index on table2 (fk, lower(table_name)). (Or an index on just one of the two columns, if that is selective enough on its own.)
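
    A minimal sketch of such an index (the index name is made up; pick columns and order according to your data's selectivity):

    CREATE INDEX table2_fk_lower_table_name_idx
        ON table2 (fk, lower(table_name));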

    You don’t need PL/pgSQL for such a simple function. Switching to LANGUAGE sql possibly allows function inlining.

    Add the function labels PARALLEL SAFE and STABLE (because they both actually apply!) to allow various performance optimizations.
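
    If in doubt whether the labels are in place, the function's catalog entry shows them (a quick check; provolatile = 's' means STABLE, proparallel = 's' means PARALLEL SAFE):

    SELECT proname, provolatile, proparallel
    FROM   pg_proc
    WHERE  proname = 'does_t2_record_exist';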

    Still, if anything, the function can only be slower than putting the EXISTS expression with a correlated subquery into the SELECT list directly. Very little, though, if you do everything right.

    That said, a LATERAL join, or your function running a correlated subquery, has to be executed once per qualifying row in the query. If that ends up processing the majority of the rows that a plain subquery would have to visit anyway, a join (not LATERAL) is still the fastest option. (Or your original query, which may result in the same query plan.) Like:

    SELECT ...
         , t2.fk IS NOT NULL AS t2_record_exists
    FROM   table1 t1
    LEFT   JOIN (
       SELECT DISTINCT t2.fk  -- do we even need DISTINCT?
       FROM   table2 t2
       WHERE  lower(t2.table_name) = 't1'
       ) t2 ON t2.fk = t1.id
    

    The added query plans reveal:

    Your table table2 only holds a few hundred rows (372 match the filter, another 113 are removed by it). That’s tiny. You won’t need an index! Postgres gets it done with a sequential scan in one fell swoop for the original query.
    Wrapping the functionality into the black box of a PL/pgSQL function forced Postgres to execute it loops=1477585 times. That’s bad.

    My rewritten version may fix this. Or revert to your original query. (Or my appended alternate.)

  3. This is not about the lateral join as such. The function acts as an optimization barrier. The function is being called once for each row, instead of being optimized away to be processed in bulk using hash tables or hash joins.

    (LATERAL joins can also act as optimization barriers, but they do not in this particular case.)

    If the view is generally used with a very restrictive WHERE clause that gets pushed down into table1, this could actually turn out to be a benefit, as the bulk operation can be inefficient when only a few rows are needed.

    You could use some kind of macro system to simplify repetitive code. PostgreSQL does not provide such a thing; it would be up to your code repository, framework, or whatever you use to manage your code. One lightweight option is sketched below.
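
    For example, a minimal sketch using psql's client-side variable interpolation at deploy time (the view name and variable name are made up; any templating step in your migration tooling would do the same job):

    -- In a psql deployment script: define the repeated predicate once ...
    \set t2_exists 'EXISTS (SELECT 1 FROM table2 t2 WHERE t2.fk = t1.id AND lower(t2.table_name) = \'t1\')'

    -- ... and splice it into the view definition; the view stores the expanded SQL.
    CREATE OR REPLACE VIEW table1_with_flag AS
    SELECT t1.*,
           :t2_exists AS t2_record_exists
    FROM   table1 t1;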
