skip to Main Content

I have a table zipnames with four columns: id, zipcode, name_fi and name_sv. Both of the name_* columns are nullable. For each possible zipcode, I might have up to four rows: both names present, only name_fi present, only name_sv present, or both null.

How can I write a SELECT for this table so that I can get only one row back for each zipcode, prioritizing rows with both languages present, then rows with only one language, and only returning nulls for both languages if we really don’t have a row with either name?

If it makes a difference, I’m working with PostgreSQL v14, and using SQLAlchemy in Python. Feel free to answer either in plain SQL or as SQLAlchemy code.

Edit.
Since it was requested, here is a sample of the data in the table:

| id | zipcode | name_fi  | name_sv     |
|----|---------|----------|-------------|
|  1 | 00000   |          |             |
|  2 | 00100   | HELSINKI | HELSINGFORS |
|  3 | 00100   | HELSINKI |             |
|  4 | 00100   |          | HELSINGFORS |
|  5 | 00100   |          |             |
|  6 | 50100   | MIKKELI  |             |
|  7 | 50100   |          |             |
|  8 | 64250   |          | PJELAX      |

And here is what I would like the query to return:

| id | zipcode | name_fi  | name_sv     |
|----|---------|----------|-------------|
|  1 | 00000   |          |             |
|  2 | 00100   | HELSINKI | HELSINGFORS |
|  6 | 50100   | MIKKELI  |             |
|  8 | 64250   |          | PJELAX      |

2

Answers


  1. You can use DISTINCT ON combined with ORDER BY.

    Thus all other rows except the "most important" one per zipcode will be excluded:

    SELECT
      DISTINCT ON (zipcode) 
      id, zipcode,
      name_fi,
      name_sv
    FROM
      zipnames
    ORDER BY
      zipcode,
      name_fi,
      name_sv;
    

    NULL values will be sorted last after other rows per default. You can explicit add this option anyway if you prefer it for a better readability:

    SELECT
      DISTINCT ON (zipcode) 
      id, zipcode,
      name_fi,
      name_sv
    FROM
      zipnames
    ORDER BY
      zipcode,
      name_fi NULLS LAST,
      name_sv NULLS LAST;
    

    Both queries will produce following result for your sample data:

    id zipcode name_fi name_sv
    1 00000 null null
    2 00100 HELSINKI HELSINGFORS
    6 50100 MIKKELI null
    8 64250 null PJELAX

    See this db<>fiddle

    Note: Above queries will always return one single row per zipcode only. If multiple rows with both names != NULL appear having the same zipcode, only the first one according to the ORDER BY clause will be fetched. If this is not intended, please explain what to do in this case.

    Login or Signup to reply.
  2. You can go in two steps
    Step 1. Add Priority column which is exactly as you want: if both are non-nulls then 1 (highest), 2 and 3 for only one non-NULL and 4 – for both NULLs:

    SELECT id, zipcode, name_fi, name_sv,
        CASE 
        WHEN name_fi IS NOT NULL AND name_sv IS NOT NULL THEN 1
        WHEN name_fi IS NOT NULL AND name_sv IS NULL     THEN 2
        WHEN name_fi IS     NULL AND name_sv IS NOT NULL THEN 3
        ELSE 4
        END as [Priority]
    FROM zipnames
    

    Step 2. Use the first SELECT as a sub-select (I do not know the syntax in Postgre) choosing minimal priority for each group. Lets call the result of this query sub-select2:

    SELECT zipcode, MIN ([Priority]) AS [Priority]
    FROM sub-select-table
    GROUP BY zipcode
    
    

    Step 3. Get what you need finally

    SELECT t.id, t.zipcode, t.name_fi, t.name_sv
    FROM sub-select-table t
    JOIN sub-select2 t2 ON t2.zipcode = t.zipcode AND t2.[Priority] = t.[Priority]
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search