skip to Main Content

Lets say there is a table with three columns and a million lines. How can I write a SQL statement where I get 20.000 lines, but with the restriction that a column (the second column) never has the same values?

Lets say this is the table:

Name – bpchar(17) Bday – date Wedding – date
Paul 1990-01-01 2010-01-01
Susi 1995-06-01 2020-01-01
Luis 1995-06-01 2022-06-15

Now lets assume there are a million rows in this table.

What I want to have now is:

  • up to 20.000 examples out of the million entries
  • but make sure that Bday is always unique in the result
  • values in the first and third column can appear multiple times in the result

3

Answers


  1. To get 20,000 rows from a table where each row has a unique value in a specific column (let’s call it column2).
    First, we give each row a unique number based on the values in column2. This ensures that each value in column2 gets a number, starting from 1.
    From these numbered rows, we keep only the first row for each unique value in column2. Finally, we select 20,000 rows from the results, making sure each row has a unique value in column2.

    SQL:

    WITH RankedRows AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY column2 
    ORDER BY (SELECT NULL)) AS rn 
    FROM your_table), UniqueRows AS (SELECT * FROM RankedRows WHERE rn = 1) 
    SELECT * FROM UniqueRows LIMIT 20000;
    
    Login or Signup to reply.
  2. You could use Row_Number() Over() analytic function to enumerate rows per Bday(s) then select just the rows having Bday_rn = 1 and LIMIT number of rows returned:

    SELECT   Name, Bday, Wedding
    FROM   ( Select Name, Bday, Wedding, 
                    Row_Number() Over(Partition By Bday Order By Wedding) as Bday_rn 
             From   tbl
           )
    WHERE    Bday_rn = 1
    LIMIT    200000
    

    … OR using another Row_Number() Over() analytic function to control number of rows returned instead of LIMIT.

    SELECT   Name, Bday, Wedding
    FROM   ( SELECT   Row_Number() Over(Order By Bday, Wedding) as rn,
                      Name, Bday, Wedding
             FROM   ( Select Name, Bday, Wedding, 
                             Row_Number() Over(Partition By Bday Order By Wedding) as Bday_rn 
                      From   tbl
                    )
             WHERE    Bday_rn = 1
           )
    WHERE  rn <= 20000
    

    See the fiddle here.

    Login or Signup to reply.
  3. Postgres provides a distinct on() (see documentation) operation for specifically this type request. Something like: (see demo)

    select distinct on(bdate)
           *
      from <table>
     order by bdate;
    

    NOTE: You should avoid the type bpchar. This type specification stands for blank padded character. Using this type a query containing the predicate

    where  'Paul' = name
    

    would not find a match in he above data. Typically just use text or if you need to limit size varchar(17).

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search