I have a table with a column named id that looks like the one below:
id | value 1 | value 2 | value 3 |
---|---|---|---|
1 | 244 | 550 | 1000 |
1 | 251 | 551 | 700 |
1 | 540 | 60 | 1200 |
… | … | … | … |
2 | 19 | 744 | 2000 |
2 | 10 | 903 | 100 |
2 | 44 | 231 | 600 |
2 | 120 | 910 | 1100 |
… | … | … | … |
I want to take 50 sample rows per id, but if fewer than 50 rows exist for a group, simply take the entire set of data points for that group. For example, I would like a maximum of 50 data points randomly selected from id = 1, id = 2, etc.
I cannot find any previous questions similar to this, but I have taken a stab at at least logically working through a solution where I could iterate over the ids, union all the queries, and limit each to 50:

SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;

But it is obvious that this type of solution will not work, because UNION ALL requires combining the output of one id with the next, and I do not have a list of id values to use in place of X in tbl.id = X.

Is there a way to accomplish this by gathering that list of unique id values and unioning all the results, or is there a more optimal way this could be done?
3 Answers
In case you are looking to get 50 (or fewer) rows from each group of IDs, then you can use windowing –

From the question – "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."

Query –
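A minimal sketch of the windowing approach, assuming the schema.table placeholder from the question and a DBMS that supports window functions (e.g. PostgreSQL or MySQL 8+):

```sql
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY t.id) AS rn   -- numbers the rows within each id
    FROM schema.table AS t                               -- replace with your actual table
) AS numbered
WHERE rn <= 50;   -- at most 50 rows per id; smaller groups keep all their rows
```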
Fiddle.
If you want to select a random sample for each id, then you need to randomize the rows somehow. Here is a way to do it, limiting to 3 and including a row number for each id so you can see the selection randomness:
Your description of trying to get the UNION ALL without specifying all the branches ahead of time is aiming for a LATERAL join. And that is one way to solve the problem. But unless you have a table of all distinct ids, you would have to compute one on the fly. For example (using the same fiddle as Pankaj used):
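A sketch of that LATERAL approach in PostgreSQL syntax, with the list of distinct ids computed on the fly (schema.table again standing in for the real table):

```sql
SELECT picked.*
FROM (SELECT DISTINCT id FROM schema.table) AS ids   -- distinct ids computed on the fly
CROSS JOIN LATERAL (
    SELECT t.*
    FROM schema.table AS t
    WHERE t.id = ids.id
    ORDER BY random()                                 -- omit if an arbitrary (non-random) 50 is enough
    LIMIT 50
) AS picked;
```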
This could be either slower or faster than the window function method, depending on your system, your data, and your indexes. In my hands it was quite a bit faster, even with the need to dynamically compute the list of distinct ids.