Postgresql - Calculate correlation between two words

AlecMather
August 17, 2022
153 views
0 votes
2 Answers

Let’s say I have a table in Postgres that stores a column of strings like this.

animal
cat/dog/bird
dog/lion
bird/dog
dog/cat
cat/bird

What I want to do, is calculate how "correlated" any two animals are to each other in this column, and store that as its own table so that I can easily look up how often "cat" and "dog" show up together.

For example, "cat" shows up a total of 3 times in all of these strings. Of those instances, "dog" shows up in the same string 2 out of the three times. Therefore, the correlation from cat -> dog would be 66%, and the number of co-occurrence instances (we’ll call this instance_count) would be 2.

According to the above logic, the resulting table from this example would look like this.

base_animal	correlated_animal	instance_count	correlation
cat	cat	3	100
cat	dog	2	66
cat	bird	2	66
cat	lion	0	0
dog	dog	4	100
dog	cat	2	50
dog	bird	2	50
dog	lion	1	25
bird	bird	3	100
bird	cat	2	66
bird	dog	2	66
bird	lion	0	0
lion	lion	1	100
lion	cat	0	0
lion	dog	1	100
lion	bird	0	0

I’ve come up with a working solution in Python, but I have no idea how to do this easily in Postgres. Anybody have any ideas?

Edit:

Based off Erwin’s answer, here’s the same idea, except this answer doesn’t make a record for animal combinations that never intersect.

with flat as (
  select t.id, a
  from (select row_number() over () as id, animal from animals) t,
    unnest(string_to_array(t.animal, '/')) a
), ct as (select a, count(*) as ct from flat group by 1)

select
  f1.a as b_animal,
  f2.a as c_animal,
  count(*) as instance_count,
  round(count(*) * 100.0 / ct.ct, 0) as correlation
from flat f1
join flat f2 using(id)
join ct on f1.a = ct.a
group by f1.a, f2.a, ct.ct

Answers

Idea is to split the data into rows (using unnest(string_to_array())) and then cross-join same to get all permutations.

with data1 as (
select * 
from corr_tab), data2 as (
    select distinct un as base_animal, x.correlated_animal 
    from corr_tab, unnest(string_to_array(animal,'/')) un,
    (select distinct un as correlated_animal 
    from corr_tab, unnest(string_to_array(animal,'/')) un) X)
select base_animal, correlated_animal, 
(case 
when 
    data2.base_animal = data2.correlated_animal
then 
    (select count(*) from data1 where substring(animal,data2.base_animal) is not NULL)
else 
    (select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
    and substring(animal,data2.correlated_animal) is not NULL)
end) instance_count,
(case 
when 
    data2.base_animal = data2.correlated_animal
then 
    100
else 
    ceil(
        (select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
        and substring(animal,data2.correlated_animal) is not NULL) * 100 /
        (select count(*) from data1 where substring(animal,data2.base_animal) is not NULL) )
end) correlation
from data2
order by base_animal

Refer to fiddle here.

- ErwinBrandstetter
- August 18, 2022 at 3:09 am
- 0 votes
0
Won’t get much simpler or faster than this:
```
WITH flat AS (
   SELECT t.id, a
   FROM  (SELECT row_number() OVER () AS id, animal FROM tbl) t
        , unnest(string_to_array(t.animal, '/')) a
   )
, ct AS (SELECT a, count(*) AS ct FROM flat GROUP BY 1)
SELECT a AS base_animal
     , b AS corr_animal
     , COALESCE(xc.ct, 0) AS instance_count
     , COALESCE(round(xc.ct * 100.0 / x.ct), 0) AS correlation
FROM  (
   SELECT a.a, b.a AS b, a.ct
   FROM   ct a, ct b
   ) x
LEFT   JOIN (
   SELECT f1.a, f2.a AS b, count(*) AS ct
   FROM   flat f1
   JOIN   flat f2 USING (id)
   GROUP  BY 1,2
   ) xc USING (a,b)
ORDER  BY a, instance_count DESC;
```
db<>fiddle here

Produces your desired result, except for …
1. added consistent sort order
2. rounded correctly
This assumes distinct animals per row in the source data. Else it’s unclear how to count the same animal in the same row exactly …

Setp-by-step

CTE flat attaches an arbitrary row number as unique id. (If you have a PRIMARY KEY, use that instead and skip the subquery t.) Then unnest animals to get one pet per row (& id).

CTE ct gets the list of distinct animals & their total count.

The outer SELECT builds the complete raster of animal pairs (a / b) in subquery x, plus total count for a. LEFT JOIN to the actual pair count in subquery xc. Two steps are needed to keep pairs that never met in the result. Finally, compute and round the "correlation" smartly. See:
- Look for percentage of characters in a word/phrase within a block of text
Updated task

If you don’t need pairs that never met, and pairing with self, either, this could be your query:
```
-- updated task excluding pairs that never met and same pairing with self
WITH flat AS (
   SELECT t.id, a, count(*) OVER (PARTITION BY a) AS ct
   FROM  (SELECT row_number() OVER () AS id, animal FROM tbl) t
        , unnest(string_to_array(t.animal, '/')) a
   )
SELECT f1.a AS base_animal
     , f1.ct AS base_count
     , f2.a AS corr_animal
     , count(*) AS instance_count
     , round(count(*) * 100.0 / f1.ct) AS correlation
FROM   flat f1
JOIN   flat f2 USING (id)
JOIN  (SELECT a, count(*) AS ct FROM flat GROUP BY 1) ct ON ct.a = f1.a
WHERE  f1.a <> f2.a  -- exclude pairing with self
GROUP  BY f1.a, f1.ct, f2.a
ORDER  BY f1.a, instance_count DESC;
```
db<>fiddle here

I added the total occurrence count of the base animal as base_count.

Most notably, I dropped the additional CTE ct, and get the base_count from the first CTE with a window function. That’s about the same cost by itself, but we then don’t need another join in the outer query, which should be cheaper overall.
You can still use that if you include pairs with self. Check the fiddle.

Oh, and we don’t need COALESCE any more.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Calculate correlation between two words

Answers

Setp-by-step

Updated task