
I am running a query on a very large data set using WITH (CTE) syntax, and it seems to take a while. I read online that temp tables can be faster in these cases. Can someone advise me which direction to go? In the CTE we join a lot of tables, and then we filter on the CTE result.

Only interested in PostgreSQL answers.

2 Answers


  1. What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than versions 12 and above.

    In PostgreSQL 11 and older, a CTE is an optimization fence: restrictions from the outer query are not pushed down into it. The database evaluates the query inside the CTE, materializes (caches) the result, and only then applies the outer WHERE clause. On a large table this means every row is read and stored before any filtering happens, which results in horrible performance. To avoid this, put as many filters as possible in the WHERE clause inside the CTE:

    WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
    SELECT * FROM UserRecord;
    

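    PostgreSQL 12 addresses this problem. A CTE that is non-recursive, has no side effects, and is referenced only once is now inlined into the outer query by default, and two new keywords let us control whether the CTE is materialized or not: MATERIALIZED, NOT MATERIALIZED.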

    WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
    SELECT * FROM AllUsers WHERE Id = 100;
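
    To see which behavior a given query actually gets, one option is to compare the plans with EXPLAIN. The sketch below reuses the Users table from the example above and assumes PostgreSQL 12 or later; a materialized CTE appears in the plan as a separate "CTE Scan" node, while an inlined one does not.

    ```sql
    -- Force materialization: the plan contains a "CTE Scan" node,
    -- and the filter on Id is applied to the stored CTE result.
    EXPLAIN
    WITH AllUsers AS MATERIALIZED (SELECT * FROM Users)
    SELECT * FROM AllUsers WHERE Id = 100;

    -- Inline the CTE: the filter on Id can be applied directly to Users,
    -- so an index on Id (if one exists) becomes usable.
    EXPLAIN
    WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
    SELECT * FROM AllUsers WHERE Id = 100;
    ```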
    

    Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL

    Summary:
    PostgreSQL 11 and older: Use Subquery

    PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
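
    As a rough sketch of the subquery option for PostgreSQL 11 and older, again using the Users example: a subquery in FROM is not an optimization fence, so the planner is free to push the outer filter down into it.

    ```sql
    -- Same result as the CTE version, but the planner can apply
    -- the Id = 100 filter while scanning Users.
    SELECT *
    FROM (SELECT * FROM Users) AS AllUsers
    WHERE Id = 100;
    ```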

  2. My follow-up is more than I can fit in a comment, so understand that this may not be an answer to the OP per se.

    Take the following query, which uses a CTE:

    with sales as (
      select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
      from sales_data
      where country = 'USA'
      group by item
    ),
    inventory as (
      select item, sum (on_hand_qty) as inventory_qty
      from inventory_data
      where country = 'USA' and on_hand_qty != 0
      group by item
    )
    select
      a.item, a.description, s.sales_qty, s.sales_revenue,
      i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
    from
      all_items a
      left join sales s on
        a.item = s.item
      left join inventory i on
        a.item = i.item
    

    There are times when I cannot explain why the query runs slower than I would expect. Sometimes, simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:

    drop table if exists sales;
    drop table if exists inventory;
    
    create temporary table sales as
      select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
      from sales_data
      where country = 'USA'
      group by item;
    
    create temporary table inventory as
      select item, sum (on_hand_qty) as inventory_qty
      from inventory_data
      where country = 'USA' and on_hand_qty != 0
      group by item;
    
    select
      a.item, a.description, s.sales_qty, s.sales_revenue,
      i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
    from
      all_items a
      left join sales s on
        a.item = s.item
      left join inventory i on
        a.item = i.item;
    

    Suddenly all is right in the world.
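
    One possible contributor to the difference, offered as a guess rather than a diagnosis: the planner sees the real size of a temp table, whereas for a CTE it works from estimates. Autovacuum does not process temporary tables, so if the plan still looks off, running ANALYZE by hand after loading them gives the planner column statistics as well. A minimal sketch using the two tables above:

    ```sql
    -- Collect statistics on the freshly loaded temp tables
    -- before running the final join.
    analyze sales;
    analyze inventory;
    ```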

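    In PostgreSQL, temp tables do not persist across sessions: both the data and the table definitions are dropped automatically when the session ends. They do stay around for the rest of the session, though, so if I rerun the script on the same connection the tables will still exist, which is why to be safe I always drop: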

    drop table if exists sales;
    

    And use "if exists" to avoid any errors about the object not existing.
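
    If all the work happens inside a single transaction, another option is to let PostgreSQL do the cleanup with ON COMMIT DROP. This is a sketch of that variant for the sales table only, not a replacement for the approach above:

    ```sql
    begin;

    -- The table is removed automatically when the transaction ends.
    create temporary table sales on commit drop as
      select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
      from sales_data
      where country = 'USA'
      group by item;

    -- Queries against sales must run here, inside the same transaction.
    select count(*) from sales;

    commit;
    ```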

    I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can’t give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:

    create procedure sales_and_inventory()
    language plpgsql
    as
    $BODY$
      BEGIN
        create temp table sales...
        
        insert into sales_inventory
        select ...
        
        drop table sales;
      END;  
    $BODY$
    

    Hopefully this helps.

    Also, to answer your question on indexes… typically I don’t, but nothing says that’s always the right answer. If I put data into a temp table, I assume I’m going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.
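
    If you do decide an index is worthwhile, a minimal sketch on the sales temp table from above might look like this. Indexing item is only an assumption based on the join column; pick whatever columns your repeated queries actually filter or join on.

    ```sql
    -- Index the join column, then refresh statistics so the
    -- planner can judge whether the index is worth using.
    create index on sales (item);
    analyze sales;
    ```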
