skip to Main Content

I have input data

# | column1 | column2
--+---------+---------
1 |  test1  |  test2
2 |  test3  |  test4
3 |  test5  |  test4
4 |  test3  |  test2
5 |  test5  |  test2
6 |  test5  |  test6
7 |  test3  |  test6

and I need SQL query (Postgres) that finds all matches and uses each value only once on both sides.

For above example correct answer is:

# | column1 | column2
--+---------+---------
1 |  test1  |  test2
2 |  test3  |  test4
6 |  test5  |  test6

or

# | column1 | column2
--+---------+---------
1 |  test1  |  test2
3 |  test5  |  test4
7 |  test3  |  test6

With window functions, initial idea is sorting records by number of potential candidates to allow rows with fewer candidates to find a match.

select count(*) over (partition by column2), * 
from my_table
order by count(*) over (partition by column2);

I don’t know how to pass information that value from column2 is already used to next frames. Any ideas?

2

Answers


  1. You can use DENSE_RANK on both columns and fetch those rows where the two values equal.

    The function counts up the occurence of distinct values in a column. So it will increase whenever a value appears for the first time, otherwise the previous value will be set. Only in case both "rankings" equal, in both columns a new value appears. Then those rows should be selected.

    This query…

    SELECT 
      id,
      column1,
      column2,
      DENSE_RANK() OVER(ORDER BY column1) AS rn1,
      DENSE_RANK() OVER(ORDER BY column2) AS rn2
    FROM my_table;
    

    …produces following result for your sample data:

    id column1 column2 rn1 rn2
    1 test1 test2 1 1
    2 test3 test4 2 2
    4 test3 test2 2 1
    7 test3 test6 2 3
    3 test5 test4 3 2
    5 test5 test2 3 1
    6 test5 test6 3 3

    So following entire query…

    WITH sub AS
    (SELECT 
      id,
      column1,
      column2,
      DENSE_RANK() OVER(ORDER BY column1) AS rn1,
      DENSE_RANK() OVER(ORDER BY column2) AS rn2
    FROM my_table)
    SELECT 
      id, 
      column1, 
      column2 
    FROM sub 
    WHERE rn1 = rn2
    ORDER BY 
      column1, 
      column2;
    

    … will produce this result:

    id column1 column2
    1 test1 test2
    2 test3 test4
    6 test5 test6

    See this db<>fiddle with your sample data.

    Login or Signup to reply.
  2. You can find distinct of each column i.e column1 and column2 and then assign a rank to them.

    These ranks are then joined and id is fetched from the original input checking the rows that matched.

    Fiddle

    WITH distinct_column1 AS (
        SELECT column1, row_number() OVER (ORDER BY column1) AS rn
        FROM (SELECT DISTINCT column1 FROM sample_data) AS distinct_values1
    ),
    distinct_column2 AS (
        SELECT column2, row_number() OVER (ORDER BY column2) AS rn
        FROM (SELECT DISTINCT column2 FROM sample_data) AS distinct_values2
    )
    SELECT t.id AS id, d1.column1, d2.column2
    FROM distinct_column1 d1
    JOIN distinct_column2 d2
      ON d1.rn = d2.rn
    JOIN sample_data t
      ON t.column1 = d1.column1 AND t.column2 = d2.column2
    ORDER BY id; 
    

    Output

    id column1 column2
    1 test1 test2
    2 test3 test4
    6 test5 test6
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search