skip to Main Content

I have a PostgreSQL query that aggregates data from multiple tables and generates a report. Currently it works perfectly fine and it is quite simple. Here is the query I’m using:

SELECT
  a.name AS account_name,
  c.name AS campaign_name,
  ad.name AS ad_name,
  g.cpm AS cpm,
  SUM(g.impressions) AS impressions,
  SUM(g.clicks) AS clicks,
  SUM(g.conversions) AS conversions,
  SUM(g.impressions) * g.cpm AS cost,
  SUM(g.revenue) AS order_revenue
FROM google_analytics g
JOIN ad ad ON ad.id = g.ad_id
JOIN campaign c ON c.id = ad.campaign_id
JOIN account a ON a.id = c.account_id
WHERE g.created_at >= ? AND g.created_at <= ?
GROUP BY 1, 2, 3, 4
ORDER BY 2

And my table schema:

CREATE TABLE public.google_analytics (
  created_at TIMESTAMP WITHOUT TIME ZONE NULL,
  ad_id INTEGER NULL,
  impressions INTEGER NULL,
  clicks INTEGER NULL,
  conversions INTEGER NULL,
  ecpm INTEGER NULL,
  commission SMALLINT NULL,
  revenue INTEGER NULL
);

The current report aggregates data properly with the SUM and GROUP BY clauses, and works as expected. However, the application code extends this report by handling additional calculations and aggregations, such as rollups, averaging certain fields, and building a hierarchical structure.

For example, the application generates a report like this:

account_name campaign_name ad_name cpm impressions clicks conversions cost order_revenue
Account A 1.75 CPM 2200 20 2 1.45 $150
Campaign A Ad 1 1.50 CPM 1000 10 1 0.50 $100
Campaign A Ad 2 2.00 CPM 1200 10 1 2.40 $50

In this structure:

  • Impressions, clicks, and revenue are summed
  • CPM and cost are averaged
  • The data is grouped hierarchically by account, campaign, and ad

The application code handling these calculations is very large and complex, making it difficult to maintain. I am planning to replace this application logic with a SQL-based solution to simplify maintenance and improve performance.

I have experimented with the ROLLUP and CUBE features in PostgreSQL, and I got excited about the capabilities they offer. These features seem to perform much of the grunt work that the application currently does. My initial results were very promising and I got the very same tree-like structure for summing up all the metrics.

However, I am unsure if PostgreSQL’s built-in rollup supports averaging and summing at different levels of the hierarchy. My goal is to use these features to replace the application code and handle the calculations directly within the database.

For instance:

  • I was unable to calculate the average instead of the sum: how can the rollup use average for a column, and sum for another column? It seems to always sum things
  • I was unable to calculate the sum of cost, since cost is already a calculated field

In order to contextualize even more, and sake of understanding the tree, this is the final JSON I am planning to generate:

[
  {
    "name": "Account A",
    "impressions": 236797,
    "cpm": 2.48,
    "children": [
      {
        "name": "Campaign A",
        "impressions": 236797,
        "cpm": 2.48,
        "children": [
          {
            "name": "Ad 1",
            "impressions": 236797,
            "cpm": 2.48,
            "children": []
          }
        ]
      }
    ]
  }
]

2

Answers


  1. I am unsure if PostgreSQL’s built-in rollup supports averaging and summing at different levels of the hierarchy.

    It does: there’s a grouping function that tells you which "level" you’re on, and based on that you can alternate between the AVG and SUM the way you wanted: demo at db<>fiddle

    SELECT
      grouping(a.name, c.name, ad.name) as grouping,
      grouping(a.name, c.name, ad.name)::int::bit(3) as grouping_bitmask,
      a.name AS account_name,
      c.name AS campaign_name,
      ad.name AS ad_name,
      CASE grouping(a.name, c.name, ad.name)<>0
        WHEN true THEN AVG(g.cpm)
        ELSE SUM(g.cpm) 
      END AS cpm,
      AVG(g.cpm) AS avg_cpm,
      SUM(g.cpm) AS sum_cpm,
      SUM(g.impressions) AS impressions,
      SUM(g.clicks) AS clicks,
      SUM(g.conversions) AS conversions,
      CASE grouping(a.name, c.name, ad.name)<>0
        WHEN true THEN AVG(g.impressions*g.cpm)
        ELSE SUM(g.impressions*g.cpm)
      END AS cost,
      AVG(g.impressions*g.cpm) AS avg_cost,
      SUM(g.impressions*g.cpm) AS sum_cost,
      SUM(g.revenue) AS order_revenue
    FROM google_analytics g
    JOIN ad ad ON ad.id = g.ad_id
    JOIN campaign c ON c.id = ad.campaign_id
    JOIN account a ON a.id = c.account_id
    WHERE g.created_at >= (now()-'1 month'::interval) 
      AND g.created_at <= (now()-'2 weeks'::interval) 
    GROUP BY ROLLUP(account_name, campaign_name, ad_name)
    ORDER BY account_name, campaign_name, ad_name;
    

    The avg_cpm and sum_cpm next to cpm (same for cost) shows you how that case alternates between them on different levels based on the grouping:

    grouping grouping_bitmask account_name campaign_name ad_name cpm avg_cpm sum_cpm impressions clicks conversions cost avg_cost sum_cost order_revenue
    0 000 account_1 campaign_1_account_1 ad_1_campaign_1_account_1 1600 1600.0 1600 7 19 1 11200 11200.0 11200 1000
    1 001 account_1 campaign_1_account_1 null 1600.0 1600.0 1600 7 19 1 11200.0 11200.0 11200 1000
    0 000 account_1 campaign_2_account_1 ad_1_campaign_2_account_1 800 800.0 800 15 11 19 12000 12000.0 12000 300
    0 000 account_1 campaign_2_account_1 ad_2_campaign_2_account_1 400 400.0 400 0 5 13 0 0.00000 0 200
    1 001 account_1 campaign_2_account_1 null 600.0 600.0 1200 15 16 32 6000.0 6000.0 12000 500
    3 011 account_1 null null 933.33 933.33 2800 22 35 33 7733.33 7733.33 23200 1500
    0 000 account_2 campaign_2_account_2 ad_1_campaign_2_account_2 2500 2500.0 2500 18 19 7 45000 45000.0 45000 400
    1 001 account_2 campaign_2_account_2 null 2500.0 2500.0 2500 18 19 7 45000.0 45000.0 45000 400
    3 011 account_2 null null 2500.0 2500.0 2500 18 19 7 45000.0 45000.0 45000 400
    7 111 null null null 1325.0 1325.0 5300 40 54 40 17050.0 17050.0 68200 1900

    That being said, it seems that you could just plug in a /count(*) to your SUMs: in individual rows, that’ll be 1, so you’ll keep getting the sum. On higher levels, it’ll jump up to how many rows got aggregated, and SUM divided by COUNT will result in an AVG:

    SELECT
      a.name AS account_name,
      c.name AS campaign_name,
      ad.name AS ad_name,
      SUM(g.cpm)/count(*) AS cpm,
      SUM(g.impressions) AS impressions,
      SUM(g.clicks) AS clicks,
      SUM(g.conversions) AS conversions,
      SUM(g.impressions*g.cpm)/count(*) AS cost,
      SUM(g.revenue) AS order_revenue
    FROM google_analytics g
    JOIN ad ad ON ad.id = g.ad_id
    JOIN campaign c ON c.id = ad.campaign_id
    JOIN account a ON a.id = c.account_id
    WHERE g.created_at >= (now()-'1 month'::interval) 
      AND g.created_at <= (now()-'2 weeks'::interval) 
    GROUP BY ROLLUP(account_name, campaign_name, ad_name)
    ORDER BY account_name, campaign_name, ad_name;
    
    Login or Signup to reply.
  2. Creating a SQL-based solution that involves different aggregation strategies such as summing some fields and averaging others while also producing hierarchical data is challenging.

    PostgreSQL’s ROLLUP can simplify some parts of this, but doesn’t directly support applying different aggregation functions on different columns within a single query.

    We can approach to accomplish the desired output:

    1. Using CTEs to calculate metrics separately:
    2. Combining results to handle hierarchical aggregation:
    WITH base_data AS (
      SELECT
        a.name AS account_name,
        c.name AS campaign_name,
        ad.name AS ad_name,
        g.cpm AS cpm,
        g.impressions,
        g.clicks,
        g.conversions,
        (g.impressions * g.cpm) AS cost,
        g.revenue AS order_revenue
      FROM google_analytics g
      JOIN ad ad ON ad.id = g.ad_id
      JOIN campaign c ON c.id = ad.campaign_id
      JOIN account a ON a.id = c.account_id
      WHERE g.created_at >= '2022-01-01' AND g.created_at <= '2022-12-31'
    ),
    agg_data AS (
      SELECT
        account_name,
        campaign_name,
        ad_name,
        SUM(impressions) AS impressions,
        SUM(clicks) AS clicks,
        SUM(conversions) AS conversions,
        AVG(cpm) AS avg_cpm,
        SUM(cost) AS total_cost,
        SUM(order_revenue) AS order_revenue
      FROM base_data
      GROUP BY ROLLUP(account_name, campaign_name, ad_name)
    )
    
    SELECT
      account_name,
      campaign_name,
      ad_name,
      avg_cpm,
      impressions,
      clicks,
      conversions,
      total_cost,
      order_revenue
    FROM agg_data
    ORDER BY account_name, campaign_name, ad_name;
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search