Given a directed acyclic graph shaped like this:
"Branch A" "Branch B" branch-a-head-id branch-b-head-id --- | | ^ id-Z id-C | | | | id-Y id-B | | | "Branch A" id-X id-A and | | "Branch B" branch-a-first-id branch-b-first-id | | | | / | / | / v common-ancestor-id --- | ^ id-2 | | "Common Segment" id-1 | | v initial-id ---
And modeled in a PostgreSQL table structure like:
```
select * from graphtable;

         id         |       parent
--------------------+--------------------
 initial-id         | (NULL)
 id-1               | initial-id
 id-2               | id-1
 common-ancestor-id | id-2
 branch-a-first-id  | common-ancestor-id
 id-X               | branch-a-first-id
 id-Y               | id-X
 id-Z               | id-Y
 branch-a-head-id   | id-Z
 branch-b-first-id  | common-ancestor-id
 id-A               | branch-b-first-id
 id-B               | id-A
 id-C               | id-B
 branch-b-head-id   | id-C
```
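For reference, a minimal table definition consistent with this output (the exact DDL may differ; `text` columns and a primary key on `id` are assumed here):

```sql
CREATE TABLE graphtable (
    id     text PRIMARY KEY,                -- node id
    parent text REFERENCES graphtable (id)  -- NULL only for the initial node
);
```

The index backing the primary key is what lets each recursive step in the queries below be a single-row lookup.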
I need a query to find the common ancestor between "Branch A" and "Branch B". Here is a query that works but is inefficient:
```sql
WITH RECURSIVE linked_list(id, parent) AS (
    SELECT id, parent
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
```
Which for the above structure returns:
```
                 ids                 |       parent
-------------------------------------+--------------------
 branch-a-first-id,branch-b-first-id | common-ancestor-id
```
This is inefficient because the recursion always continues until it reaches a row with a NULL parent. That means that if "Branch A" and "Branch B" are both small (e.g. 100 rows each) but the "Common Segment" is large (millions of rows), it will iterate over millions of rows instead of hundreds.
Here’s a benchmark with 100 nodes in "Branch A", 100 nodes in "Branch B", and a variable number of "Common Segment" nodes:
| # Common Segment nodes | Query time (seconds) |
|---|---|
| 16 | 0.011 |
| 32 | 0.008 |
| 64 | 0.011 |
| 128 | 0.014 |
| 256 | 0.026 |
| 512 | 0.058 |
| 1024 | 0.146 |
| 2048 | 0.44 |
| 4096 | 0.028 |
| 8192 | 0.049 |
| 16384 | 0.104 |
| 32768 | 0.209 |
| 65536 | 0.423 |
| 131072 | 0.859 |
| 262144 | 1.781 |
| 524288 | 3.914 |
The question: how can I make this query efficient? It appears to be O(C), where C is the number of Common Segment nodes. I hope to achieve O(A+B), where A and B are the numbers of Branch A and Branch B nodes. In other words, complexity should be independent of C.
In my examples above, A and B have the same number of nodes; that will not necessarily be the case. I’m not certain at this time what the actual range of segment lengths may be, but my estimates are:
- A and B: 10s to 1000s
- Common: 1000s to hundred-thousands
I’m specifically interested in Postgres. Thanks!
3 Answers
I see four possible solutions:
- Original – the recursive CTE from the question
- Limited Depth – limiting query depth within the CTE
- Cycle Detection – preventing cycles with a path column
- PL/pgSQL – a database function (see Laurenz Albe's answer)
Below are the four queries.
Original
This is a re-posting of the original query from the question to make it easier to see the changes in the following two queries.
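```sql
WITH RECURSIVE linked_list(id, parent) AS (
    SELECT id, parent
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
```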
Limiting query depth within the CTE
See line comments for changes from Original.
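The exact query used in the benchmarks is not reproduced here; the following sketch shows the idea, assuming a generous upper bound on branch length (10000 here) can be hard-coded:

```sql
WITH RECURSIVE linked_list(id, parent, depth) AS (
    SELECT id, parent, 1                     -- NEW: track recursion depth
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent, ll.depth + 1      -- NEW: increment depth
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
    WHERE ll.depth < 10000                   -- NEW: stop past the assumed maximum branch length
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
```

The recursion now stops at a fixed depth instead of running all the way to the root, so the work is bounded by the depth limit rather than by the size of the Common Segment.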
Preventing cycles
See line comments for changes from Original.
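Again, the exact query is not reproduced here; a sketch of this variant, carrying the already-visited nodes of each leg in an array:

```sql
WITH RECURSIVE linked_list(id, parent, path) AS (
    SELECT id, parent, ARRAY[id]             -- NEW: start the path with the head node
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent, ll.path || g.id   -- NEW: record every visited node
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
    WHERE g.id <> ALL (ll.path)              -- NEW: never revisit a node on this leg
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
```

(PostgreSQL 14 and later also offer a built-in `CYCLE ... SET ... USING` clause for recursive CTEs that maintains such a path column automatically.)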
PL/pgSQL
See Laurenz Albe's answer. In my testing I ran:
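(The exact statement is not preserved here; with a PL/pgSQL function like the one sketched further down, named `common_ancestor` purely for illustration, it would be along the lines of:)

```sql
SELECT common_ancestor('branch-a-head-id', 'branch-b-head-id');
```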
Comparison of queries with 100 nodes in A and B and variable Common Segment nodes
The table below shows a series of test results
Results:
- Cycle Detection is not an improvement: it adds a `path` column holding every id in the current row's ancestry, which (a) dramatically increases memory requirements and (b) requires that each row do a search within `path` (which is itself an O(C) operation).
- In this test, PL/pgSQL and Limited Depth are both big improvements, with PL/pgSQL being about an order of magnitude faster.
Comparison of queries with 10000 nodes in A and B and variable Common Segment nodes
This is the same test as above but with 10000 nodes in A and B each instead of 100. Original and Cycle Detection weren't tested.
Results: Both are still constant with respect to C, but now Limited Depth is at least an order of magnitude faster than PL/pgSQL.
Conclusion
Even though PL/pgSQL is quite slow compared to SQL, I still recommend writing a database function. The algorithm is procedural, so that is a natural choice. And that way it is easy to avoid fetching all the ancestors of either argument:
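The function body is not included above; the following is a minimal sketch of such a function (the name `common_ancestor` and the array bookkeeping are assumptions, not the original code). It climbs from both heads one step at a time, alternating legs, and returns as soon as one leg reaches a node the other leg has already visited:

```sql
CREATE OR REPLACE FUNCTION common_ancestor(_a text, _b text)
   RETURNS text
   LANGUAGE plpgsql STABLE AS
$$
DECLARE
   _seen_a text[] := ARRAY[_a];   -- nodes visited starting from _a
   _seen_b text[] := ARRAY[_b];   -- nodes visited starting from _b
   _cur_a  text   := _a;
   _cur_b  text   := _b;
BEGIN
   IF _a = _b THEN
      RETURN _a;
   END IF;

   LOOP
      -- advance leg A one step, unless it has already reached the root
      IF _cur_a IS NOT NULL THEN
         SELECT parent INTO _cur_a FROM graphtable WHERE id = _cur_a;
         IF _cur_a = ANY (_seen_b) THEN
            RETURN _cur_a;         -- first node reached from both sides
         END IF;
         _seen_a := _seen_a || _cur_a;
      END IF;

      -- advance leg B one step
      IF _cur_b IS NOT NULL THEN
         SELECT parent INTO _cur_b FROM graphtable WHERE id = _cur_b;
         IF _cur_b = ANY (_seen_a) THEN
            RETURN _cur_b;
         END IF;
         _seen_b := _seen_b || _cur_b;
      END IF;

      -- both legs hit the root without meeting: no common ancestor
      IF _cur_a IS NULL AND _cur_b IS NULL THEN
         RETURN NULL;
      END IF;
   END LOOP;
END
$$;
```

Each iteration does two single-row lookups, so the cost grows with the branch lengths and is independent of the size of the Common Segment. (The `= ANY(array)` checks are linear scans of the arrays, which is fine for branches in the thousands of nodes.)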
Here is another solution with PL/pgSQL.
Only makes sense for deep roots, i.e. many nodes up to the common ancestor. The point is to avoid executing a single-row `SELECT` and an array concatenation for every single element. Instead, collect an (adjustable) number of nodes and concatenate them once, while the search continues. With a shortcut for the other leg after one leg ends.
Hard to say where the additional overhead starts to pay off. Test with different data sets and different batch sizes (`_batch_size`). I set the default at 100. It is increased dynamically with every iteration in the loop to cover both short and long paths.
Auxiliary function to dig up a given number of steps:
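The original helper is not reproduced above; one possible shape for it (the name `f_step_up`, the SQL-language implementation, and the explicit `_steps` parameter are assumptions):

```sql
-- Return up to _steps ancestors of _node, nearest first.
-- Stops early when it reaches a node contained in _other (that node is then the
-- last element) or when it reaches the root (then the last element is NULL).
CREATE OR REPLACE FUNCTION f_step_up(_node text, _other text[], _steps int)
   RETURNS text[]
   LANGUAGE sql STABLE AS
$func$
   WITH RECURSIVE walk(id, depth) AS (
      SELECT g.parent, 1
      FROM   graphtable g
      WHERE  g.id = _node
   UNION ALL
      SELECT g.parent, w.depth + 1
      FROM   walk w
      JOIN   graphtable g ON g.id = w.id
      WHERE  w.depth < _steps
      AND    w.id <> ALL (_other)          -- stop once the other leg has been reached
   )
   SELECT array_agg(id ORDER BY depth) FROM walk
$func$;
```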
It takes the node to be advanced and the array of nodes from the other side. The last element of the result is `null` if the root node has been reached. The function stops at the first match, so (only) the last element can be a match.
Main function:
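The original body is not shown above; a sketch that matches the description (the name `f_common_ancestor`, the doubling of `_batch_size` per round, and the `array_remove()` call are choices made for this sketch):

```sql
CREATE OR REPLACE FUNCTION f_common_ancestor(_id_a text, _id_b text,
                                             _batch_size int DEFAULT 100)
   RETURNS text
   LANGUAGE plpgsql STABLE AS
$func$
DECLARE
   _path_a text[] := ARRAY[_id_a];   -- ancestry collected for leg A, newest last
   _path_b text[] := ARRAY[_id_b];   -- ancestry collected for leg B
   _batch  text[];
   _last   text;
BEGIN
   IF _id_a = _id_b THEN
      RETURN _id_a;
   END IF;

   LOOP
      -- advance leg A by one batch, unless it already reached the root (last element NULL)
      IF _path_a[cardinality(_path_a)] IS NOT NULL THEN
         _batch := f_step_up(_path_a[cardinality(_path_a)],
                             array_remove(_path_b, NULL), _batch_size);
         IF _batch IS NULL THEN
            RAISE EXCEPTION 'node % not found', _path_a[cardinality(_path_a)];
         END IF;
         _path_a := _path_a || _batch;
         _last   := _batch[cardinality(_batch)];
         IF _last = ANY (_path_b) THEN
            RETURN _last;              -- leg A ran into leg B: common ancestor found
         END IF;
      END IF;

      -- advance leg B the same way, against everything leg A has collected so far
      IF _path_b[cardinality(_path_b)] IS NOT NULL THEN
         _batch := f_step_up(_path_b[cardinality(_path_b)],
                             array_remove(_path_a, NULL), _batch_size);
         IF _batch IS NULL THEN
            RAISE EXCEPTION 'node % not found', _path_b[cardinality(_path_b)];
         END IF;
         _path_b := _path_b || _batch;
         _last   := _batch[cardinality(_batch)];
         IF _last = ANY (_path_a) THEN
            RETURN _last;              -- leg B ran into leg A
         END IF;
      END IF;

      -- both legs reached the root without meeting: no common ancestor
      IF _path_a[cardinality(_path_a)] IS NULL
         AND _path_b[cardinality(_path_b)] IS NULL THEN
         RETURN NULL;
      END IF;

      _batch_size := _batch_size * 2;  -- grow the batch each round to cover long paths
   END LOOP;
END
$func$;
```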
Call (with optional custom `_batch_size`):
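(Exact call not preserved; with the functions as sketched it would be, for example:)

```sql
SELECT f_common_ancestor('branch-a-head-id', 'branch-b-head-id');        -- default batch size (100)
SELECT f_common_ancestor('branch-a-head-id', 'branch-b-head-id', 500);   -- custom _batch_size
```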