skip to Main Content

I would like to use Postgres SQL to assign an age category to a list of househoulds, where we don’t know the date of birth of any of the family members.

Dataset looks like:

household_id household_size
x1 5
x2 1
x3 8

I then have a set of percentages for each age group with that dataset looking like:

age_group percentage
0-18 30
19-30 40
31-100 30

I want the query to calculate overall what will make the whole dataset as close to the percentages as possible and if possible similar at the household level(not as important). the dataset will end up looking like:

household_id household_size 0-18 19-30 31-100
x1 5 2 2 1
x2 1 0 1 0
x3 8 3 3 2
….

I have looked at the ntile function but any pointers as to how I could handle this with postgres would be really helpful.

2

Answers


  1. A minimum start could be:

    SELECT
      household_id,
      MIN(household_size) as size,
      ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
      ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
      ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
    FROM (
    SELECT 
      h.household_id,
      h.household_size,
      p.agegroup_from,
      p.percentage/100.0 * h.household_size as g
    FROM households h
    CROSS JOIN PercPerAge p) x
    GROUP BY household_id
    ORDER BY household_id;
    

    output:

    household_id size g1 g2 g3
    x1 5 1.5 2.0 1.5
    x2 1 0.3 0.4 0.3
    x3 8 2.4 3.2 2.4

    see: DBFIDDLE

    Notes:

    • Of course you should round the columns g to whole numbers, taking into account the complete split (g1+g2+g3 = total)
    • Because g1,g2 and g3 are based on percentages, their values can change (as long as the total split is OK…. (see, for more info: Return all possible combinations of values on columns in SQL )
    Login or Signup to reply.
  2. I didn’t want to post an answer with just a link so I figured I’ll give it a shot and see if I can simplify depesz weighted_random to plain sql. The result is this slower, less readable, worse version of it, but in shorter, plain sql:

    CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] ) 
      RETURNS ANYELEMENT language sql as $$
    select  choice 
    from
      ( select  case when (sum(weight) over (rows UNBOUNDED PRECEDING)) >= hit
                then choice end as choice
        from ( select unnest(p_choices) as choice, 
                      unnest(p_weights) as weight ) inputs, 
             ( select sum(weight)*random() as hit
               from unnest(p_weights) a(weight) ) as random_hit
        ) chances
    where choice is not null
    limit 1 
    $$;
    

    It’s not inlineable because of aggregate and window function calls. It’s faster if you assume weights will only be probabilities that sum up to 1.

    The principle is that you provide any array of choices and an equal length array of weights (those can be percentages but don’t have to, nor do they have to sum up to any specific number):

    update test_area t
    set ("0-18",
         "19-30",
         "31-100")
    =   (with cte AS (
          select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}') 
                    as age_group
          from generate_series(1,household_size,1))
         select count(*) filter (where age_group='0-18')   as "0-18",
                count(*) filter (where age_group='19-30')  as "19-30",
                count(*) filter (where age_group='31-100') as "31-100"
         from cte)
    returning *;
    

    Online demo showing that both his version and mine are statistically reliable.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search