Let’s say I have a table in Postgres that stores a column of strings like this.
animal |
---|
cat/dog/bird |
dog/lion |
bird/dog |
dog/cat |
cat/bird |
What I want to do, is calculate how "correlated" any two animals are to each other in this column, and store that as its own table so that I can easily look up how often "cat" and "dog" show up together.
For example, "cat" shows up a total of 3 times in all of these strings. Of those instances, "dog" shows up in the same string 2 out of the three times. Therefore, the correlation from cat -> dog would be 66%, and the number of co-occurrence instances (we’ll call this instance_count
) would be 2.
According to the above logic, the resulting table from this example would look like this.
base_animal | correlated_animal | instance_count | correlation |
---|---|---|---|
cat | cat | 3 | 100 |
cat | dog | 2 | 66 |
cat | bird | 2 | 66 |
cat | lion | 0 | 0 |
dog | dog | 4 | 100 |
dog | cat | 2 | 50 |
dog | bird | 2 | 50 |
dog | lion | 1 | 25 |
bird | bird | 3 | 100 |
bird | cat | 2 | 66 |
bird | dog | 2 | 66 |
bird | lion | 0 | 0 |
lion | lion | 1 | 100 |
lion | cat | 0 | 0 |
lion | dog | 1 | 100 |
lion | bird | 0 | 0 |
I’ve come up with a working solution in Python, but I have no idea how to do this easily in Postgres. Anybody have any ideas?
Edit:
Based off Erwin’s answer, here’s the same idea, except this answer doesn’t make a record for animal combinations that never intersect.
with flat as (
select t.id, a
from (select row_number() over () as id, animal from animals) t,
unnest(string_to_array(t.animal, '/')) a
), ct as (select a, count(*) as ct from flat group by 1)
select
f1.a as b_animal,
f2.a as c_animal,
count(*) as instance_count,
round(count(*) * 100.0 / ct.ct, 0) as correlation
from flat f1
join flat f2 using(id)
join ct on f1.a = ct.a
group by f1.a, f2.a, ct.ct
2
Answers
Idea is to split the data into rows (using
unnest(string_to_array())
) and then cross-join same to get all permutations.Refer to fiddle here.
Won’t get much simpler or faster than this:
db<>fiddle here
Produces your desired result, except for …
This assumes distinct animals per row in the source data. Else it’s unclear how to count the same animal in the same row exactly …
Setp-by-step
CTE
flat
attaches an arbitrary row number as uniqueid
. (If you have aPRIMARY KEY
, use that instead and skip the subqueryt
.) Then unnest animals to get one pet per row (&id
).CTE
ct
gets the list of distinct animals & their total count.The outer
SELECT
builds the complete raster of animal pairs (a
/b
) in subqueryx
, plus total count fora
.LEFT JOIN
to the actual pair count in subqueryxc
. Two steps are needed to keep pairs that never met in the result. Finally, compute and round the "correlation" smartly. See:Updated task
If you don’t need pairs that never met, and pairing with self, either, this could be your query:
db<>fiddle here
I added the total occurrence count of the base animal as
base_count
.Most notably, I dropped the additional CTE
ct
, and get thebase_count
from the first CTE with a window function. That’s about the same cost by itself, but we then don’t need another join in the outer query, which should be cheaper overall.You can still use that if you include pairs with self. Check the fiddle.
Oh, and we don’t need
COALESCE
any more.