Postgresql - Inexact GROUP BY aggregation

AthanClark
August 4, 2023
100 views
1 vote
2 Answers

I’m trying to figure out a way to reason about a fairly simple problem:

A table foo has two columns, id with unique identifiers, and value with a random numeric value. Given a constant threshold value, find the groups of values where their value is within that threshold – the count of that group, and the group’s average value.

Let’s set the stage – here is the foo table (in PostgreSQL):

CREATE TABLE foo (
    id serial PRIMARY KEY,
    value numeric NOT NULL
);

The query should return counts and averages – entries cannot be double-counted, so I’d like the rows to be "cut into groups" based on some threshold "near-ness" value – you can think of the threshold value as a radius:

SELECT AVG(foo.value), COUNT(foo.id)
  FROM foo
  GROUP BY foo.value; -- where I'm stuck

The query above will only apply the aggregate functions to rows in which foo.value has exact duplicates – what I’d like is something like the following pseudo-SQL:

SELECT AVG(foo.value), COUNT(foo.id)
  FROM foo
  GROUP BY
   (row_being_grouped.value <= foo.value + threshold)
   AND
   (row_being_grouped.value >= foo.value - threshold);

I’m not sure if this makes any sense. I am wondering if I can get away with this without doing a sub-query – maybe "bucket" the rows somehow before finding the average within the buckets?

Answers

- ErwinBrandstetter
- August 4, 2023 at 12:51 am
- 0 votes
0
If "threshold" can be understood as the maximum gap allowed between values of the same group, then it’s a well defined gaps-and-islands problem.
Here is a solution:
```
SELECT grp, count(*) AS grp_count, round(avg(value), 2) AS grp_avg
FROM  (
   SELECT count(gap) OVER (ORDER BY value) AS grp, *
   FROM  (
      SELECT value
           , value - lag(value) OVER (ORDER BY value) > 150 OR null AS gap
      FROM   foo
      ) sub1
   ) sub2
GROUP  BY grp
ORDER  BY grp;
```
fiddle (with step-by-step demo)

Explanation and links to more:
- How to group timestamps into islands (based on arbitrary gap)?
For convenience and short code I use Boolean logic (true OR null → true, false OR null → null), and the fact that count() ignores null values. See:
- For absolute performance, is SUM faster or COUNT?
If this is more of a raster / cluster / granulation / grid problem, you need to define exactly which rows to pick as "focal points" or, alternatively, the exact nature of the independent raster / grid.
Login or Signup to reply.

I suppose that treshold value is 123.45… Then :

WITH 
threshold AS 
(SELECT 123.45 AS threshold_value, MIN(val) AS MI, MAX(val) AS MA, 
 FROM foo),
slices AS
(SELECT threshold_value, generate_series AS boundary  
 FROM   GENERATE_SERIES(MI - threshold_value / 2.0, MA + threshold_value / 2.0) AS boundaries
        CROSS JOIN threshold),
places AS 
(SELECT id, val, boundary
 FROM   foo
        JOIN slices ON val >= boundary AND val < boundary + threshold_value)
SELECT  COUNT(val) AS COUNT_VAL, boundary AS BOUND_LOW, boundary + threshold_value AS BOUND_HIGH
FROM    places
GROUP   BY boundary, boundary + threshold_value;

Not tested…

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Inexact GROUP BY aggregation

Answers