Assign an age to a person based on known population average but no Date of birth - Postgresql

user2772056
January 6, 2023
165 views
1 vote
2 Answers

I would like to use Postgres SQL to assign an age category to a list of househoulds, where we don’t know the date of birth of any of the family members.

Dataset looks like:

household_id	household_size
x1	5
x2	1
x3	8
…	…

I then have a set of percentages for each age group with that dataset looking like:

age_group	percentage
0-18	30
19-30	40
31-100	30

I want the query to calculate overall what will make the whole dataset as close to the percentages as possible and if possible similar at the household level(not as important). the dataset will end up looking like:

household_id	household_size	0-18	19-30	31-100
x1	5	2	2	1
x2	1	0	1	0
x3	8	3	3	2
…	…	…	….	…

I have looked at the ntile function but any pointers as to how I could handle this with postgres would be really helpful.

Tags: postgresql sql

Answers

- Luuk
- January 6, 2023 at 5:34 pm
- 0 votes
0
A minimum start could be:
```
SELECT
  household_id,
  MIN(household_size) as size,
  ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
  ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
  ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
FROM (
SELECT 
  h.household_id,
  h.household_size,
  p.agegroup_from,
  p.percentage/100.0 * h.household_size as g
FROM households h
CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;
```
output:

household_id size g1 g2 g3

x1 5 1.5 2.0 1.5

x2 1 0.3 0.4 0.3

x3 8 2.4 3.2 2.4

see: DBFIDDLE

Notes:
- Of course you should round the columns g to whole numbers, taking into account the complete split (g1+g2+g3 = total)
- Because g1,g2 and g3 are based on percentages, their values can change (as long as the total split is OK…. (see, for more info: Return all possible combinations of values on columns in SQL )
Login or Signup to reply.

household_id	size	g1	g2	g3
x1	5	1.5	2.0	1.5
x2	1	0.3	0.4	0.3
x3	8	2.4	3.2	2.4

I didn’t want to post an answer with just a link so I figured I’ll give it a shot and see if I can simplify depesz weighted_random to plain sql. The result is this slower, less readable, worse version of it, but in shorter, plain sql:

CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] ) 
  RETURNS ANYELEMENT language sql as $$
select  choice 
from
  ( select  case when (sum(weight) over (rows UNBOUNDED PRECEDING)) >= hit
            then choice end as choice
    from ( select unnest(p_choices) as choice, 
                  unnest(p_weights) as weight ) inputs, 
         ( select sum(weight)*random() as hit
           from unnest(p_weights) a(weight) ) as random_hit
    ) chances
where choice is not null
limit 1 
$$;

It’s not inlineable because of aggregate and window function calls. It’s faster if you assume weights will only be probabilities that sum up to 1.

The principle is that you provide any array of choices and an equal length array of weights (those can be percentages but don’t have to, nor do they have to sum up to any specific number):

update test_area t
set ("0-18",
     "19-30",
     "31-100")
=   (with cte AS (
      select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}') 
                as age_group
      from generate_series(1,household_size,1))
     select count(*) filter (where age_group='0-18')   as "0-18",
            count(*) filter (where age_group='19-30')  as "19-30",
            count(*) filter (where age_group='31-100') as "31-100"
     from cte)
returning *;

Online demo showing that both his version and mine are statistically reliable.

Please signup or login to give your own answer.

Click here to cancel reply.

Assign an age to a person based on known population average but no Date of birth – Postgresql

Answers