skip to Main Content

Below, I’m using the following code to query two databases and retrieve information.

select p.id, p.first, p.last, p.employer, a.city, a.state, a.county 
from "Person" p 
  LEFT JOIN "Address" a ON p.id = a.person 
  LEFT JOIN "Calllist" cl on p.id = cl.person

This is great and works well, with the exception that it will duplicate rows occasionally. This happens in the event where a person object is associated with multiple address objects — each duplicate will represent the identical person name with a different address. A much simpler example is below.

Name State
George Virginia
Alex Virginia
Alex Maryland
Sam West Virginia

Instead of that being the result, where there are two lines of output for Alex, I want the results to aggregate by id and only show one output line per id.

Name State
George Virginia
Alex Virginia
Sam West Virginia

I do not care which city or state shows up in Alex’s result. All I want is for his name to only show up once.

3

Answers


  1. You can limit the results of the joins further to only include a single match which will prevent the duplication on the prior joins.

    If you use tables like this

    CREATE TABLE person (
      id INT PRIMARY KEY,
      name VARCHAR(25) NOT NULL
    );
    
    CREATE TABLE address (
      id INT PRIMARY KEY,
      person INT NOT NULL,
      state VARCHAR(25) NOT NULL
    );
    
    INSERT INTO person VALUES
    (1, 'George'),
    (2, 'Alex'),
    (3, 'Sam');
    
    INSERT INTO address VALUES
    (1, 1, 'Virginia'), /* George */
    (2, 2, 'Virginia'), /* Alex */
    (3, 2, 'Maryland'), /* Alex */
    (4, 3, 'West Virginia'); /* Sam */
    

    Then you can limit the addresses results by doing something like this

    SELECT
      p.name, a.state
    FROM person AS p
    LEFT JOIN address AS a
      /* only include the first address */
      ON a.id = (SELECT MIN(id) FROM address WHERE person = p.id)
    ;
    

    Which will give the results

    | George | Virginia      |
    | Alex   | Virginia      |
    | Sam    | West Virginia |
    
    Login or Signup to reply.
  2. You can use PostgreSQL’s brilliant distinct on (you may see the answer by MatBailie).

    select DISTINCT ON (p.id)
           p.id, p.first, p.last, p.employer, a.city, a.state, a.county
    from "Person" p 
    LEFT JOIN "Address" a ON p.id = a.person 
    LEFT JOIN "Calllist" cl on p.id = cl.person;
    
    Login or Signup to reply.
  3. Don’t multiply rows in the outer SELECT to begin with. That’s expensive, especially when done repeatedly. See:

    De-duplicate early:

    SELECT p.id, p.first, p.last, p.employer, a.*
    FROM   "Person" p
    LEFT   JOIN (
       SELECT DISTINCT ON (1)
              a.person AS id, a.city, a.state, a.county 
       FROM  "Address" a   
       ) a USING (id)
    LEFT   JOIN "Calllist" cl ON p.id = cl.person;  -- seems unused?
    

    That’s assuming entries in "Calllist" are unique per person. Else, that table gets the same treatment.

    About DISTINCT ON:

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search