We have a table where we want to divide the data in train, test and split. It should split like per type top 20% data in validation, next 20% data in test and remaining 60% in train.
For example, we have 10 rows of type 6069. We have to split like 20% rows i.e., first 2 rows of type 6069 in validation, 2 rows in test and 6 rows in train.
When there are odd count of types then we can round up and split accordingly.
Is there a way we can do that in SQL?
text | type |
---|---|
6131484974365 | 6069 |
6219352421449 | 6069 |
6088998688036 | 6069 |
6090518587777 | 6069 |
6093295054628 | 6069 |
6131484976666 | 6069 |
6219352421449 | 6069 |
608899867190 | 6069 |
6090518587635 | 6069 |
6093295054628 | 6069 |
6055705684714 | 5012 |
6064031434087 | 5012 |
6066932173638 | 5012 |
6058029881152 | 5012 |
6131484974333 | 6070 |
6219352421398 | 6070 |
6088998680253 | 6070 |
6090518587333 | 6022 |
6093295052838 | 6022 |
6055705682323 | 6022 |
6064031431111 | 6022 |
6066932173600 | 6022 |
6058029881444 | 6022 |
Expected output:
text | type | data |
---|---|---|
6131484974365 | 6069 | test |
6219352421449 | 6069 | test |
6088998688036 | 6069 | validation |
6090518587777 | 6069 | validation |
6093295054628 | 6069 | train |
6131484976666 | 6069 | train |
6219352421449 | 6069 | train |
608899867190 | 6069 | train |
6090518587635 | 6069 | train |
6093295054628 | 6069 | train |
6055705684714 | 5012 | test |
6064031434087 | 5012 | validation |
6066932173638 | 5012 | train |
6058029881152 | 5012 | train |
6131484974333 | 6070 | test |
6219352421398 | 6070 | validation |
6088998680253 | 6070 | train |
6090518587333 | 6022 | test |
6093295052838 | 6022 | validation |
6055705682323 | 6022 | train |
6064031431111 | 6022 | train |
6066932173600 | 6022 | train |
6058029881444 | 6022 | train |
2
Answers
You can select all the 6069 using where clause with keyword TOP and PERCENT in the select clause
And then select the TOP (20) PERCENT and remove the TOP (10) PERCENT by using DISTINCT (assuming they are different) which leave out the middle 10%.
Or you can do ROW_NUMBER over PARTITION BY type. And then select ROW_NUMBER from certain range in the where clause.
You could try to use this query
See demo here