skip to Main Content

I have a text array field in postgres we can call items. I want to create a query that will tell me how many times each of these items co-occurs with one another.

For an example set of rows:

items
-----
{'a', 'c'}
{'a', 'b', 'c'}
{'a', 'c'}
{'a', 'b', 'c'}

Here is example output using a : to separate the item name and the count of co-occurrences:

item|co_occurrences
-------------------
a   |{c:4,b:2}
b   |{a:2,c:2}
c   |{a:4,b:2}

The item column lists the individual items. The co_occurences column is an array of text elements combining the co-occurring item and the count. What query will produce this?

2

Answers


  1. Convert it into normal relational form first.

    This code assigns arbitrary row id values:

    with create_ids as (
      select row_number() over (order by items) as id,
             items
        from item_groups
    ), normalize as (
      select i.id, u.item
        from create_ids i
             cross join lateral unnest(i.items) as u(item)
    ), correlate as (
      select a.item, b.item as coitem, count(b.item) as occurence_count
        from normalize a
             left join normalize b
               on b.id = a.id and b.item != a.item
       group by a.item, b.item
    )fiddle
    select item, jsonb_object_agg(coitem, occurence_count) as co_occurences
      from correlate 
     group by item;
    

    Working fiddle

    If the objective does not have to be JSON, then we can construct a string. This allows ordering the values in the string:

    with create_ids as (
      select row_number() over (order by items) as id,
             items
        from item_groups
    ), normalize as (
      select i.id, u.item
        from create_ids i
             cross join lateral unnest(i.items) as u(item)
    ), correlate as (
      select a.item, b.item as coitem, count(b.item) as occurence_count
        from normalize a
             left join normalize b
               on b.id = a.id and b.item != a.item
       group by a.item, b.item
    )
    select item, array_agg(coitem||':'||occurence_count order by occurence_count desc) as co_occurences
      from correlate 
     group by item;
    

    Updated fiddle

    Login or Signup to reply.
  2. Instead of generating row ids for the self-join like @MikeOrganek, I’d just unnest twice to produce a relation with duplicate tuples:

    SELECT x AS item, json_object_agg(y, count) AS co_occurences
    FROM (
      SELECT x, y, count(*)
      FROM example, unnest(items) AS x, unnest(items) AS y
      WHERE x <> y
      GROUP BY x, y
    ) tmp
    GROUP BY x;
    

    (online demo)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search