skip to Main Content
ID Amount Brand
1 10 NULL
1 20 NULL
2 30 Mazada
2 NULL BMW
3 40 NULL
3 40 KIA
4 NULL Honda
4 NULL Honda

Above is the original table, and my goal is to find out any difference within the same ID for each column.

In the same ID group, if there is any difference in the column, creating a new column and then specify which column has difference.

The outcome should be like this

ID Difference
1 Amount
2 Amount
2 Brand
3 Brand

My code for PostgreSQL

SELECT
    ID,
    'Amount' AS Difference
FROM
    table
GROUP BY
    ID
HAVING
    COUNT(DISTINCT amount) > 1
    OR (COUNT(amount) != COUNT(*) AND COUNT(DISTINCT amount) > 0)

UNION ALL 

SELECT
    ID,
    'Brand' AS Difference
FROM
    table
GROUP BY
    ID
HAVING
    COUNT(DISTINCT brand) > 1
    OR (COUNT(brand) != COUNT(*) AND COUNT(DISTINCT brand) > 0)

Be careful that there is NULL value in the record, so that the aggregation function would not count the NULL value. The NULL vs NULL view it as same, but NULL vs any value view it as difference.

Is there any better or faster query for this solution? Like lateral join or window function? Thanks

2

Answers


  1. Counting distinct values — including nulls — is easy. Add 1 to distinct count if you have at least one null. That being said, your query could as follows, making it easier to expand for multiple columns without doing union:

    with cte as (
        select id
             , count(distinct amount) + max(case when amount is null then 1 else 0 end) > 1 as a_is_different
             , count(distinct brand)  + max(case when brand  is null then 1 else 0 end) > 1 as b_is_different
        from t
        group by id
    )
    select id, column_name
    from cte, lateral (values
        ('amount', a_is_different),
        ('brand', b_is_different)
    ) as v(column_name, is_different)
    where is_different
    

    DB<>Fiddle

    Login or Signup to reply.
  2. A single pass over the table should be quite a bit faster:

    SELECT id, col
    FROM  (
       SELECT id
            , min(amount) <> max(amount) AS a_diff, bool_or(true) FILTER (WHERE amount IS NULL) AS a_null
            , min(brand)  <> max(brand)  AS b_diff, bool_or(true) FILTER (WHERE brand  IS NULL) AS b_null
       FROM   tbl
       GROUP  BY id
       ) t
    JOIN   LATERAL (
       VALUES
         ('Amount', a_diff, a_null)
       , ('Brand' , b_diff, b_null)
       ) x(col, diff, has_null) ON (diff OR NOT diff AND has_null)
    ORDER  BY id, col
    

    fiddle

    Plus, count(DISTINCT ...) is notoriously expensive. And we don’t actually need that count. We just need to know if the min value differs from the max (or does not, but there is also at least one NULL).

    Very similar case with more explanation:

    Looks like you have seen that one, but not yet applied to your case.
    Like I mentioned over there: there are faster ways for many rows per group …

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search