skip to Main Content

I’m trying to figure out a way to reason about a fairly simple problem:

A table foo has two columns, id with unique identifiers, and value with a random numeric value. Given a constant threshold value, find the groups of values where their value is within that threshold – the count of that group, and the group’s average value.

Let’s set the stage – here is the foo table (in PostgreSQL):

CREATE TABLE foo (
    id serial PRIMARY KEY,
    value numeric NOT NULL
);

The query should return counts and averages – entries cannot be double-counted, so I’d like the rows to be "cut into groups" based on some threshold "near-ness" value – you can think of the threshold value as a radius:

SELECT AVG(foo.value), COUNT(foo.id)
  FROM foo
  GROUP BY foo.value; -- where I'm stuck

The query above will only apply the aggregate functions to rows in which foo.value has exact duplicates – what I’d like is something like the following pseudo-SQL:

SELECT AVG(foo.value), COUNT(foo.id)
  FROM foo
  GROUP BY
   (row_being_grouped.value <= foo.value + threshold)
   AND
   (row_being_grouped.value >= foo.value - threshold);

I’m not sure if this makes any sense. I am wondering if I can get away with this without doing a sub-query – maybe "bucket" the rows somehow before finding the average within the buckets?

2

Answers


  1. If "threshold" can be understood as the maximum gap allowed between values of the same group, then it’s a well defined problem.
    Here is a solution:

    SELECT grp, count(*) AS grp_count, round(avg(value), 2) AS grp_avg
    FROM  (
       SELECT count(gap) OVER (ORDER BY value) AS grp, *
       FROM  (
          SELECT value
               , value - lag(value) OVER (ORDER BY value) > 150 OR null AS gap
          FROM   foo
          ) sub1
       ) sub2
    GROUP  BY grp
    ORDER  BY grp;
    

    fiddle (with step-by-step demo)

    Explanation and links to more:

    For convenience and short code I use Boolean logic (true OR nulltrue, false OR nullnull), and the fact that count() ignores null values. See:

    If this is more of a raster / cluster / granulation / grid problem, you need to define exactly which rows to pick as "focal points" or, alternatively, the exact nature of the independent raster / grid.

    Login or Signup to reply.
  2. I suppose that treshold value is 123.45… Then :

    WITH 
    threshold AS 
    (SELECT 123.45 AS threshold_value, MIN(val) AS MI, MAX(val) AS MA, 
     FROM foo),
    slices AS
    (SELECT threshold_value, generate_series AS boundary  
     FROM   GENERATE_SERIES(MI - threshold_value / 2.0, MA + threshold_value / 2.0) AS boundaries
            CROSS JOIN threshold),
    places AS 
    (SELECT id, val, boundary
     FROM   foo
            JOIN slices ON val >= boundary AND val < boundary + threshold_value)
    SELECT  COUNT(val) AS COUNT_VAL, boundary AS BOUND_LOW, boundary + threshold_value AS BOUND_HIGH
    FROM    places
    GROUP   BY boundary, boundary + threshold_value;
    

    Not tested…

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search