skip to Main Content

I have some tables with properties of Name and Result each.
i want to join tables by name similarity but limit the result arrived from each group of similarity into max 2 results.

CREATE TABLE data_a(
Id serial primary key,
Name VARCHAR(70) NOT NULL,
Result INT4 NOT NULL);

CREATE TABLE data_b(
Id serial primary key,
Name VARCHAR(70) NOT NULL,
Result INT4 NOT NULL);

INSERT INTO data_a
    (Name, Result)
VALUES
    ('Todd', 2),
    ('John', 5);

INSERT INTO data_b
    (Name, Result)
VALUES
    ('Johns', 5),
  ('Todi', 3),
    ('Tod', 4),
    ('Todd', 5),
    ('John', 1),
('Jon', 1),
('Johny', 1),
('Johnny', 1),
('Johni', 1);

i would like to run a query that join both tables by name similarity and limit results to up to 2 results

SELECT da.Name as Name_a,db.Name , similarity(da.Name,da.Name) > 0.5
FROM data_a da 
JOIN data_b db 
ON  da.Name % db.Name
GROUP BY da.Name,db.Name
ORDER BY similarity 
LIMIT 2

and recive

|Name_a|Name_b|similarity|
|------|------|----------|
|Todd  | Todd | 1        |
|------|------|----------|
|Todd  | Tod  |0.8       |
|------|------|----------|
|John  | John |1         |
|------|------|----------|
|John  | Johny|0.76      |
|------|------|----------|

currently i get

|Name_a|Name_b|similarity|
|------|------|----------|
|Todd  | Todd | 1        |
|John  | John |1         |

it seems that i’m not using correctly in group by , how can i group by

2

Answers


  1. You can use lateral join to map each name to a name from the first table, then truncate the result to 2 values for each name.

    with data AS (
      select 
        da.name da_name, 
        db.name db_name, 
        similarity(da.name::text, db.name::text) similarity 
      from 
        data_a da 
        left join lateral (
          select 
            data_b.Name 
          from 
            data_b
        ) db ON da.Name like '%' || db.Name || '%' 
        or db.Name like '%' || da.Name || '%'
    ) 
    select 
      * 
    from 
      (
        select 
          row_number() over (
            partition by da_name 
            order by 
              similarity
          ) r, 
          t.* 
        FROM 
          data t
      ) x 
    WHERE 
      x.r <= 2;
    

    Demo in sql<>daddy.io

    Login or Signup to reply.
  2. If you want to apply the LIMIT separately per da.Name, you would have to do it in a LATERAL subquery:

    SELECT da.Name as Name_a,db.Name , similarity(da.Name,db.Name)
    FROM data_a da CROSS JOIN LATERAL 
    (
        SELECT db.Name
        FROM data_b db
        WHERE  da.Name % db.Name
        ORDER BY similarity(da.Name,db.Name) DESC 
        LIMIT 2
    ) db;
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search