Given a directed acyclic graph shaped like this:
"Branch A" "Branch B" branch-a-head-id branch-b-head-id --- | | ^ id-Z id-C | | | | id-Y id-B | | | "Branch A" id-X id-A and | | "Branch B" branch-a-first-id branch-b-first-id | | | | / | / | / v common-ancestor-id --- | ^ id-2 | | "Common Segment" id-1 | | v initial-id ---
And modeled in a PostgreSQL table structure like:
```
select * from graphtable;

         id         |       parent
--------------------+--------------------
 initial-id         | (NULL)
 id-1               | initial-id
 id-2               | id-1
 common-ancestor-id | id-2
 branch-a-first-id  | common-ancestor-id
 id-X               | branch-a-first-id
 id-Y               | id-X
 id-Z               | id-Y
 branch-a-head-id   | id-Z
 branch-b-first-id  | common-ancestor-id
 id-A               | branch-b-first-id
 id-B               | id-A
 id-C               | id-B
 branch-b-head-id   | id-C
```
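For reference, a minimal table definition consistent with this output (the exact DDL may differ; `text` columns and a primary key on `id` are assumed here):

```sql
CREATE TABLE graphtable (
    id     text PRIMARY KEY,                -- node id
    parent text REFERENCES graphtable (id)  -- NULL only for the initial node
);
```

The index backing the primary key is what lets each recursive step in the queries below be a single-row lookup.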
I need a query to find the common ancestor between "Branch A" and "Branch B". Here is a query that works but is inefficient:
```sql
WITH RECURSIVE linked_list(id, parent) AS (
    SELECT id, parent
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
```
Which for the above structure returns:
```
                 ids                 |       parent
-------------------------------------+--------------------
 branch-a-first-id,branch-b-first-id | common-ancestor-id
```
This is inefficient because the recursion always continues until it reaches a row with a NULL parent. That means that if "Branch A" and "Branch B" are both small (e.g. 100 rows each) but the "Common Segment" is large (millions of rows), it will iterate over millions of rows instead of hundreds.
Here’s a benchmark with 100 nodes in "Branch A", 100 nodes in "Branch B", and a variable number of "Common Segment" nodes:
| # Common Segment nodes | Query time (seconds) |
|---|---|
| 16 | 0.011 |
| 32 | 0.008 |
| 64 | 0.011 |
| 128 | 0.014 |
| 256 | 0.026 |
| 512 | 0.058 |
| 1024 | 0.146 |
| 2048 | 0.44 |
| 4096 | 0.028 |
| 8192 | 0.049 |
| 16384 | 0.104 |
| 32768 | 0.209 |
| 65536 | 0.423 |
| 131072 | 0.859 |
| 262144 | 1.781 |
| 524288 | 3.914 |
The question: how can I make this query efficient? It appears to be O(C), where C is the number of Common Segment nodes. I hope to achieve O(A+B), where A and B are the numbers of Branch A and Branch B nodes. In other words, complexity should be independent of C.
In my examples above, A and B have the same number of nodes; that will not necessarily be the case. I’m not certain at this time what the actual range of segment lengths may be, but my estimates are:
- A and B: 10s to 1000s
- Common: 1000s to hundred-thousands
I’m specifically interested in Postgres. Thanks!
3 Answers
I see four possible solutions:
- Original – the recursive CTE from the question
- Limited Depth – limiting query depth within the CTE
- Cycle Detection – preventing cycles with a path column
- PL/pgSQL – a database function (see Laurenz Albe's answer)
Below are the four queries.
Original
This is a re-posting of the original query from the question to make it easier to see the changes in the following two queries.
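```sql
WITH RECURSIVE linked_list(id, parent) AS (
    SELECT id, parent
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
```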
Limiting query depth within the CTE
See line comments for changes from Original.
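The exact query used in the benchmarks is not reproduced here; the following sketch shows the idea, assuming a generous upper bound on branch length (10000 here) can be hard-coded:

```sql
WITH RECURSIVE linked_list(id, parent, depth) AS (
    SELECT id, parent, 1                     -- NEW: track recursion depth
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent, ll.depth + 1      -- NEW: increment depth
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
    WHERE ll.depth < 10000                   -- NEW: stop past the assumed maximum branch length
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
```

The recursion now stops at a fixed depth instead of running all the way to the root, so the work is bounded by the depth limit rather than by the size of the Common Segment.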
Preventing cycles
See line comments for changes from Original.
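Again, the exact query is not reproduced here; a sketch of this variant, carrying the already-visited nodes of each leg in an array:

```sql
WITH RECURSIVE linked_list(id, parent, path) AS (
    SELECT id, parent, ARRAY[id]             -- NEW: start the path with the head node
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent, ll.path || g.id   -- NEW: record every visited node
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
    WHERE g.id <> ALL (ll.path)              -- NEW: never revisit a node on this leg
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
```

(PostgreSQL 14 and later also offer a built-in `CYCLE ... SET ... USING` clause for recursive CTEs that maintains such a path column automatically.)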
PL/pgSQL
See Laurenz Albe's answer. In my testing I ran:
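(The exact statement is not preserved here; with a PL/pgSQL function like the one sketched further down, named `common_ancestor` purely for illustration, it would be along the lines of:)

```sql
SELECT common_ancestor('branch-a-head-id', 'branch-b-head-id');
```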
Comparison of queries with 100 nodes in A and B and variable Common Segment nodes
The table below shows a series of test results
Results:
- Cycle Detection is not an improvement: it adds a `path` column holding every id in the current row's ancestry, which (a) dramatically increases memory requirements and (b) requires that each row do a search within `path` (which is itself an O(C) operation).
- In this test, PL/pgSQL and Limited Depth are both big improvements, with PL/pgSQL being about an order of magnitude faster.
Comparison of queries with 10000 nodes in A and B and variable Common Segment nodes
This is the same test as above but with 10000 nodes in A and B each instead of 100. Original and Cycle Detection weren't tested.
Results: Both are still constant with respect to C, but now Limited Depth is at least an order of magnitude faster than PL/pgSQL.
Conclusion
Even though PL/pgSQL is quite slow compared to SQL, I still recommend writing a database function. The algorithm is procedural, so that is a natural choice. And that way it is easy to avoid fetching all the ancestors of either argument:
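The function body is not included above; the following is a minimal sketch of such a function (the name `common_ancestor` and the array bookkeeping are assumptions, not the original code). It climbs from both heads one step at a time, alternating legs, and returns as soon as one leg reaches a node the other leg has already visited:

```sql
CREATE OR REPLACE FUNCTION common_ancestor(_a text, _b text)
   RETURNS text
   LANGUAGE plpgsql STABLE AS
$$
DECLARE
   _seen_a text[] := ARRAY[_a];   -- nodes visited starting from _a
   _seen_b text[] := ARRAY[_b];   -- nodes visited starting from _b
   _cur_a  text   := _a;
   _cur_b  text   := _b;
BEGIN
   IF _a = _b THEN
      RETURN _a;
   END IF;

   LOOP
      -- advance leg A one step, unless it has already reached the root
      IF _cur_a IS NOT NULL THEN
         SELECT parent INTO _cur_a FROM graphtable WHERE id = _cur_a;
         IF _cur_a = ANY (_seen_b) THEN
            RETURN _cur_a;         -- first node reached from both sides
         END IF;
         _seen_a := _seen_a || _cur_a;
      END IF;

      -- advance leg B one step
      IF _cur_b IS NOT NULL THEN
         SELECT parent INTO _cur_b FROM graphtable WHERE id = _cur_b;
         IF _cur_b = ANY (_seen_a) THEN
            RETURN _cur_b;
         END IF;
         _seen_b := _seen_b || _cur_b;
      END IF;

      -- both legs hit the root without meeting: no common ancestor
      IF _cur_a IS NULL AND _cur_b IS NULL THEN
         RETURN NULL;
      END IF;
   END LOOP;
END
$$;
```

Each iteration does two single-row lookups, so the cost grows with the branch lengths and is independent of the size of the Common Segment. (The `= ANY(array)` checks are linear scans of the arrays, which is fine for branches in the thousands of nodes.)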
Here is another solution with PL/pgSQL.
Only makes sense for deep roots, i.e. many nodes up to the common ancestor. The point is to avoid executing a single-row `SELECT` and an array concatenation for every single element. Instead, collect an (adjustable) number of nodes and concatenate them once, while the search continues. With a shortcut for the other leg after one leg ends.
Hard to say where the additional overhead starts to pay off. Test with different data sets and different batch sizes (`_batch_size`). I set the default at 100. It is increased dynamically with every iteration in the loop to cover both short and long paths.
Auxiliary function to dig up a given number of steps:
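The original helper is not reproduced above; one possible shape for it (the name `f_step_up`, the SQL-language implementation, and the explicit `_steps` parameter are assumptions):

```sql
-- Return up to _steps ancestors of _node, nearest first.
-- Stops early when it reaches a node contained in _other (that node is then the
-- last element) or when it reaches the root (then the last element is NULL).
CREATE OR REPLACE FUNCTION f_step_up(_node text, _other text[], _steps int)
   RETURNS text[]
   LANGUAGE sql STABLE AS
$func$
   WITH RECURSIVE walk(id, depth) AS (
      SELECT g.parent, 1
      FROM   graphtable g
      WHERE  g.id = _node
   UNION ALL
      SELECT g.parent, w.depth + 1
      FROM   walk w
      JOIN   graphtable g ON g.id = w.id
      WHERE  w.depth < _steps
      AND    w.id <> ALL (_other)          -- stop once the other leg has been reached
   )
   SELECT array_agg(id ORDER BY depth) FROM walk
$func$;
```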
It takes the node to be advanced and the array of nodes from the other side. The last element of the result is `null` if the root node has been reached. The function stops at the first match, so (only) the last element can be a match.
Main function:
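The original body is not shown above; a sketch that matches the description (the name `f_common_ancestor`, the doubling of `_batch_size` per round, and the `array_remove()` call are choices made for this sketch):

```sql
CREATE OR REPLACE FUNCTION f_common_ancestor(_id_a text, _id_b text,
                                             _batch_size int DEFAULT 100)
   RETURNS text
   LANGUAGE plpgsql STABLE AS
$func$
DECLARE
   _path_a text[] := ARRAY[_id_a];   -- ancestry collected for leg A, newest last
   _path_b text[] := ARRAY[_id_b];   -- ancestry collected for leg B
   _batch  text[];
   _last   text;
BEGIN
   IF _id_a = _id_b THEN
      RETURN _id_a;
   END IF;

   LOOP
      -- advance leg A by one batch, unless it already reached the root (last element NULL)
      IF _path_a[cardinality(_path_a)] IS NOT NULL THEN
         _batch := f_step_up(_path_a[cardinality(_path_a)],
                             array_remove(_path_b, NULL), _batch_size);
         IF _batch IS NULL THEN
            RAISE EXCEPTION 'node % not found', _path_a[cardinality(_path_a)];
         END IF;
         _path_a := _path_a || _batch;
         _last   := _batch[cardinality(_batch)];
         IF _last = ANY (_path_b) THEN
            RETURN _last;              -- leg A ran into leg B: common ancestor found
         END IF;
      END IF;

      -- advance leg B the same way, against everything leg A has collected so far
      IF _path_b[cardinality(_path_b)] IS NOT NULL THEN
         _batch := f_step_up(_path_b[cardinality(_path_b)],
                             array_remove(_path_a, NULL), _batch_size);
         IF _batch IS NULL THEN
            RAISE EXCEPTION 'node % not found', _path_b[cardinality(_path_b)];
         END IF;
         _path_b := _path_b || _batch;
         _last   := _batch[cardinality(_batch)];
         IF _last = ANY (_path_a) THEN
            RETURN _last;              -- leg B ran into leg A
         END IF;
      END IF;

      -- both legs reached the root without meeting: no common ancestor
      IF _path_a[cardinality(_path_a)] IS NULL
         AND _path_b[cardinality(_path_b)] IS NULL THEN
         RETURN NULL;
      END IF;

      _batch_size := _batch_size * 2;  -- grow the batch each round to cover long paths
   END LOOP;
END
$func$;
```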
Call (with optional custom `_batch_size`):
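(Exact call not preserved; with the functions as sketched it would be, for example:)

```sql
SELECT f_common_ancestor('branch-a-head-id', 'branch-b-head-id');        -- default batch size (100)
SELECT f_common_ancestor('branch-a-head-id', 'branch-b-head-id', 500);   -- custom _batch_size
```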