I’m trying to figure out a way to reason about a fairly simple problem:
A table
foo
has two columns,id
with unique identifiers, andvalue
with a randomnumeric
value. Given a constantthreshold
value, find the groups of values where theirvalue
is within that threshold – the count of that group, and the group’s average value.
Let’s set the stage – here is the foo
table (in PostgreSQL):
CREATE TABLE foo (
id serial PRIMARY KEY,
value numeric NOT NULL
);
The query should return counts and averages – entries cannot be double-counted, so I’d like the rows to be "cut into groups" based on some threshold "near-ness" value – you can think of the threshold
value as a radius:
SELECT AVG(foo.value), COUNT(foo.id)
FROM foo
GROUP BY foo.value; -- where I'm stuck
The query above will only apply the aggregate functions to rows in which foo.value
has exact duplicates – what I’d like is something like the following pseudo-SQL:
SELECT AVG(foo.value), COUNT(foo.id)
FROM foo
GROUP BY
(row_being_grouped.value <= foo.value + threshold)
AND
(row_being_grouped.value >= foo.value - threshold);
I’m not sure if this makes any sense. I am wondering if I can get away with this without doing a sub-query – maybe "bucket" the rows somehow before finding the average within the buckets?
2
Answers
If "threshold" can be understood as the maximum gap allowed between values of the same group, then it’s a well defined gaps-and-islands problem.
Here is a solution:
fiddle (with step-by-step demo)
Explanation and links to more:
For convenience and short code I use Boolean logic (
true OR null
→true
,false OR null
→null
), and the fact thatcount()
ignoresnull
values. See:If this is more of a raster / cluster / granulation / grid problem, you need to define exactly which rows to pick as "focal points" or, alternatively, the exact nature of the independent raster / grid.
I suppose that treshold value is 123.45… Then :
Not tested…