skip to Main Content

I’ve been trying to make a query more efficient by bringing together counts from multiple tables. I have a schema similar to this:

create table "impressions" (
    id serial primary key,
    ip inet,
    ad_id integer references ad(id),
    user uuid references user(id),
    created_at timestamp default now()
)

create table "purchases" (
    id serial primary key,
    ip inet,
    ad_id integer references ad(id),
    user uuid references user(id),
    // ... few other columns
    created_at timestamp default now()
)

create table "clicks" (
    id serial primary key,
    ip inet,
    ad_id integer references ad(id),
    user uuid references user(id),
    created_at timestamp default now()
)

And I’ve been trying to get a count(*) of impressions, purchases and clicks grouped by user. And later by ad_id. I’m currently doing 3 separate queries to the database and aggregating in application code, but wondered if there is a more efficient way of doing this.

2

Answers


  1. You can combine all three counts into a single query to make it a single round-trip to the DB server, which saves some overhead, and reduces I/O.

    OTOH, joining rows adds some cost (depending on the number of distinct users). But if you are going to do that anyway, it’s typically much cheaper in Postgres than in the client. Postgres is optimized for the task and can work with original data types.

    SELECT u.id, i.impressions_count, p.purchases_count, c.clicks_count
    FROM   "user" u
    LEFT   JOIN (
       SELECT "user" AS id, count(*) AS impressions_count
       FROM   impressions
       GROUP  BY 1
       ) i USING (id)
    LEFT   JOIN (
       SELECT "user" AS id, count(*) AS purchases_count
       FROM   purchases
       GROUP  BY 1
       ) p USING (id)
    LEFT   JOIN (
       SELECT "user" AS id, count(*) AS clicks_count
       FROM   clicks
       GROUP  BY 1
       ) c USING (id);
    

    By basing the query on the "user" table, users without any entries in the three subsidiary tables are listed, too. If you don’t want those, combine the subqueries with FULL [OUTER] JOIN instead:

    SELECT id, i.impressions_count, p.purchases_count, c.clicks_count
    FROM  (
       SELECT "user" AS id, count(*) AS impressions_count
       FROM   impressions
       GROUP  BY 1
       ) i
    FULL   JOIN (
       SELECT "user" AS id, count(*) AS purchases_count
       FROM   purchases
       GROUP  BY 1
       ) p USING (id)
    FULL   JOIN (
       SELECT "user" AS id, count(*) AS clicks_count
       FROM   clicks
       GROUP  BY 1
       ) c USING (id)
    

    Either query reports null where no rows are found. You may want to convert that to 0 with COALESCE(i.impressions_count, 0) AS impressions_count etc.

    If table rows are wide (and your tables are vacuum’ed enough) you can gain performance from an index ("user") on each subsidiary table. You probably have that anyway.

    Aside: "user" is a fully reserved word in SQL. I’d rather not use it as identifier, even if that’s allowed with double-quotes. Too error-prone.

    Login or Signup to reply.
  2. Maybe I’ve misunderstood, but is this what you want?

    WITH cte AS
    (
      SELECT
        _user, COUNT(_user) AS cnt
      FROM impressions
      GROUP BY _user
      UNION ALL
      SELECT
        _user, COUNT(_user)
      FROM purchases  
      GROUP BY _user
      UNION ALL
      SELECT
        _user, COUNT(_user)
      FROM clicks
      GROUP BY _user
    )
    SELECT _user, SUM(cnt) 
    FROM cte
    GROUP BY _user;
    

    See https://dbfiddle.uk/yGX0Pv0k

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search