I would like to use Postgres SQL to assign an age category to a list of househoulds, where we don’t know the date of birth of any of the family members.
Dataset looks like:
household_id | household_size |
---|---|
x1 | 5 |
x2 | 1 |
x3 | 8 |
… | … |
I then have a set of percentages for each age group with that dataset looking like:
age_group | percentage |
---|---|
0-18 | 30 |
19-30 | 40 |
31-100 | 30 |
I want the query to calculate overall what will make the whole dataset as close to the percentages as possible and if possible similar at the household level(not as important). the dataset will end up looking like:
household_id | household_size | 0-18 | 19-30 | 31-100 |
---|---|---|---|---|
x1 | 5 | 2 | 2 | 1 |
x2 | 1 | 0 | 1 | 0 |
x3 | 8 | 3 | 3 | 2 |
… | … | … | …. | … |
I have looked at the ntile function but any pointers as to how I could handle this with postgres would be really helpful.
2
Answers
A minimum start could be:
output:
see: DBFIDDLE
Notes:
g
to whole numbers, taking into account the complete split (g1+g2+g3 = total)I didn’t want to post an answer with just a link so I figured I’ll give it a shot and see if I can simplify depesz
weighted_random
to plain sql. The result is this slower, less readable, worse version of it, but in shorter, plain sql:It’s not inlineable because of aggregate and window function calls. It’s faster if you assume weights will only be probabilities that sum up to
1
.The principle is that you provide any array of choices and an equal length array of weights (those can be percentages but don’t have to, nor do they have to sum up to any specific number):
Online demo showing that both his version and mine are statistically reliable.