skip to Main Content

We have a table where we want to divide the data in train, test and split. It should split like per type top 20% data in validation, next 20% data in test and remaining 60% in train.
For example, we have 10 rows of type 6069. We have to split like 20% rows i.e., first 2 rows of type 6069 in validation, 2 rows in test and 6 rows in train.
When there are odd count of types then we can round up and split accordingly.

Is there a way we can do that in SQL?

text type
6131484974365 6069
6219352421449 6069
6088998688036 6069
6090518587777 6069
6093295054628 6069
6131484976666 6069
6219352421449 6069
608899867190 6069
6090518587635 6069
6093295054628 6069
6055705684714 5012
6064031434087 5012
6066932173638 5012
6058029881152 5012
6131484974333 6070
6219352421398 6070
6088998680253 6070
6090518587333 6022
6093295052838 6022
6055705682323 6022
6064031431111 6022
6066932173600 6022
6058029881444 6022

Expected output:

text type data
6131484974365 6069 test
6219352421449 6069 test
6088998688036 6069 validation
6090518587777 6069 validation
6093295054628 6069 train
6131484976666 6069 train
6219352421449 6069 train
608899867190 6069 train
6090518587635 6069 train
6093295054628 6069 train
6055705684714 5012 test
6064031434087 5012 validation
6066932173638 5012 train
6058029881152 5012 train
6131484974333 6070 test
6219352421398 6070 validation
6088998680253 6070 train
6090518587333 6022 test
6093295052838 6022 validation
6055705682323 6022 train
6064031431111 6022 train
6066932173600 6022 train
6058029881444 6022 train

2

Answers


  1. You can select all the 6069 using where clause with keyword TOP and PERCENT in the select clause

    SELECT TOP (10) PERCENT
    text
    FROM TABLE
    WHERE type = 6069
    

    And then select the TOP (20) PERCENT and remove the TOP (10) PERCENT by using DISTINCT (assuming they are different) which leave out the middle 10%.

    Or you can do ROW_NUMBER over PARTITION BY type. And then select ROW_NUMBER from certain range in the where clause.

    Login or Signup to reply.
  2. You could try to use this query

    type_count AS
         (SELECT 
                type, COUNT(*) AS total
          FROM data
          GROUP BY type)
    SELECT d.text, d.type,
           CASE
             WHEN ROW_NUMBER() OVER (PARTITION BY d.type) <= CEIL(0.2*c.total) THEN 'test' 
             WHEN ROW_NUMBER() OVER (PARTITION BY d.type) <= CEIL(0.4*c.total) THEN 'validation'
             ELSE 'train'
           END AS data_type 
    FROM  data d INNER JOIN type_count c ON d.type = c.type
    

    See demo here

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search