
Given a directed acyclic graph shaped like this:

  "Branch A"               "Branch B"

branch-a-head-id        branch-b-head-id           ---
       |                        |                   ^
     id-Z                     id-C                  |
       |                        |                   |
     id-Y                     id-B                  |
       |                        |               "Branch A"
     id-X                     id-A                 and
       |                        |               "Branch B"
branch-a-first-id       branch-b-first-id           |
       |                        |                   |
        \                      /                    |
         \                    /                     |
          \                  /                      v
           common-ancestor-id                      ---
                   |                                ^
                 id-2                               |
                   |                         "Common Segment"
                 id-1                               |
                   |                                v
               initial-id                          ---

And modeled in a PostgreSQL table structure like:

select * from graphtable;
   id              |     parent
-------------------+--------------
initial-id         | (NULL)
id-1               | initial-id
id-2               | id-1
common-ancestor-id | id-2
branch-a-first-id  | common-ancestor-id
id-X               | branch-a-first-id
id-Y               | id-X
id-Z               | id-Y
branch-a-head-id   | id-Z
branch-b-first-id  | common-ancestor-id
id-A               | branch-b-first-id
id-B               | id-A
id-C               | id-B
branch-b-head-id   | id-C
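
(For reference, a table definition consistent with this output might look like the following. This is an assumption – the actual DDL isn't shown here – but the important part is that id is indexed, which the recursive join below relies on.)

-- Assumed DDL matching the output above (text ids).
-- The primary key on id gives the recursive join (ll.parent = g.id)
-- an index lookup per step instead of a sequential scan.
CREATE TABLE graphtable (
    id     text PRIMARY KEY,
    parent text REFERENCES graphtable (id)
);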

I need a query to find the common ancestor between "Branch A" and "Branch B". Here is a query that works but is inefficient:

WITH RECURSIVE linked_list(id, parent) AS (
    SELECT id, parent
    FROM graphtable
    WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
  UNION ALL
    SELECT g.id, g.parent
    FROM graphtable g
    INNER JOIN linked_list ll ON ll.parent = g.id
)
SELECT string_agg(id, ',') AS ids, parent
FROM linked_list
GROUP BY parent HAVING COUNT(DISTINCT id) > 1;

Which for the above structure returns:

                 ids                 |       parent       
-------------------------------------+--------------------
 branch-a-first-id,branch-b-first-id | common-ancestor-id

This is inefficient because the recursive CTE always walks all the way down to the root (the row with a NULL parent), even after both branches have already converged. That means that if "Branch A" and "Branch B" are both small (e.g. 100 rows each) but the "Common Segment" is large (millions of rows), it will iterate over millions of rows instead of hundreds.

Here’s a benchmark with 100 nodes in "Branch A" and 100 nodes in "Branch B" and variable number of "Common Segment" nodes:

# Common Segment nodes | Query time (seconds)
-----------------------+----------------------
                    16 | 0.011
                    32 | 0.008
                    64 | 0.011
                   128 | 0.014
                   256 | 0.026
                   512 | 0.058
                  1024 | 0.146
                  2048 | 0.44
                  4096 | 0.028
                  8192 | 0.049
                 16384 | 0.104
                 32768 | 0.209
                 65536 | 0.423
                131072 | 0.859
                262144 | 1.781
                524288 | 3.914
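
(The exact script used to generate the benchmark data isn't shown; a sketch of one way to build such a structure with generate_series, using hypothetical ids, is:)

-- Sketch only, with hypothetical ids: a 1000-node common segment
-- topped by two 100-node branches.
INSERT INTO graphtable (id, parent)
SELECT 'common-' || i,
       CASE WHEN i = 1 THEN NULL ELSE 'common-' || (i - 1) END
FROM generate_series(1, 1000) AS i;   -- 'common-1000' is the common ancestor

INSERT INTO graphtable (id, parent)
SELECT 'a-' || i,
       CASE WHEN i = 1 THEN 'common-1000' ELSE 'a-' || (i - 1) END
FROM generate_series(1, 100) AS i;    -- Branch A, head 'a-100'

INSERT INTO graphtable (id, parent)
SELECT 'b-' || i,
       CASE WHEN i = 1 THEN 'common-1000' ELSE 'b-' || (i - 1) END
FROM generate_series(1, 100) AS i;    -- Branch B, head 'b-100'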

The question: how can I make this query efficient? It appears to be O(C), where C is the number of Common Segment nodes. I hope to achieve O(A+B), where A and B are the numbers of Branch A and Branch B nodes. In other words, the complexity should be independent of C.

My examples above both have A and B with the same number of nodes. That is not necessarily the case. I’m not certain at this time what the actual range of segment lengths may be, but my estimates are:

  • A and B: 10s to 1000s
  • Common: 1000s to hundreds of thousands

I’m specifically interested in Postgres. Thanks!

3 Answers


  1. Chosen as BEST ANSWER

    I see four possible solutions:

    1. The original query from the question
    2. Limiting query depth within the CTE
    3. Preventing cycles within the CTE
    4. Using PL/pgSQL (per suggestion in Laurenz Albe's answer)

    Below are the four queries.

    Original

    This is a re-posting of the original query from the question to make it easier to see the changes in the following two queries.

    WITH RECURSIVE linked_list(id, parent) AS (
        SELECT id, parent
        FROM graphtable
        WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
      UNION ALL
        SELECT g.id, g.parent
        FROM graphtable g
        INNER JOIN linked_list ll ON ll.parent = g.id
    )
    SELECT string_agg(id, ',') AS ids, parent
    FROM linked_list
    GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
    

    Limiting query depth within the CTE

    See line comments for changes from Original.

    WITH RECURSIVE linked_list(id, parent, depth) AS (    -- line changed
        SELECT id, parent, 0                              -- line changed
        FROM graphtable
        WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
      UNION ALL
        SELECT g.id, g.parent, ll.depth + 1
        FROM graphtable g
        INNER JOIN linked_list ll ON ll.parent = g.id
        WHERE depth < 10000                               -- line added
    )
    SELECT string_agg(id, ',') AS ids, parent
    FROM linked_list
    GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
    

    Preventing cycles

    See line comments for changes from Original. Note that the CYCLE clause requires PostgreSQL 14 or later.

    WITH RECURSIVE linked_list(id, parent) AS (
        SELECT id, parent
        FROM graphtable
        WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
      UNION ALL
        SELECT g.id, g.parent
        FROM graphtable g
        INNER JOIN linked_list ll ON ll.parent = g.id
    )
    CYCLE id SET is_cycle USING path                       -- line added
    SELECT string_agg(id, ',') AS ids, parent
    FROM linked_list
    GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
    

    PL/pgSQL

    See Laurenz Albe's answer. In my testing I ran:

    SELECT first_common_ancestor('branch-a-head-id', 'branch-b-head-id') AS parent
    

    Comparison of queries with 100 nodes in A and B and variable Common Segment nodes

    The table below shows a series of test results:

    # Common Nodes | Original query (sec) | Limited depth (sec) | Prevent cycles (sec) | PL/pgSQL (sec)
    ---------------+----------------------+---------------------+----------------------+---------------
                16 | 0.023                | 0.014               | 0.02                 | 0.012
                32 | 0.023                | 0.009               | 0.014                | 0.007
                64 | 0.022                | 0.012               | 0.018                | 0.008
               128 | 0.024                | 0.016               | 0.028                | 0.009
               256 | 0.027                | 0.028               | 0.064                | 0.008
               512 | 0.030                | 0.055               | 0.185                | 0.008
              1024 | 0.033                | 0.142               | 0.596                | 0.007
              2048 | 0.047                | 0.016               | 1.922                | 0.008
              4096 | 0.063                | 0.027               | 8.369                | 0.008
              8192 | 0.098                | 0.056               | 16.499               | 0.008
             16384 | 0.205                | 0.061               | 104.991              | 0.010
             32768 | 0.361                | 0.069               | Too slow to test     | 0.008
             65536 | 0.681                | 0.063               | Too slow to test     | 0.007
            131072 | 1.353                | 0.064               | Too slow to test     | 0.009
            262144 | 2.934                | 0.064               | Too slow to test     | 0.009
            524288 | 5.954                | 0.099               | Too slow to test     | 0.008
           1048576 | 12.83                | 0.075               | Too slow to test     | 0.010
           2097152 | 30.904               | 0.079               | Too slow to test     | 0.010

    Results:

    • Original = O(C) as stated in the question
    • Limited depth = O(1) for this test holding A and B constant, likely O(A + B)
    • Preventing cycles = Blows up fast, probably O(C^2) or worse. This is likely because cycle detection adds a path column holding every id in the current row's ancestry (see the illustration below these results), which (a) dramatically increases memory requirements and (b) requires each row to search within path, which is itself an O(C) operation
    • PL/pgSQL = O(1) for this test holding A and B constant, likely O(A + B)

    In this test, PL/pgSQL and Limited Depth are both big improvements, with PL/pgSQL being about an order of magnitude faster.
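
    For illustration, the bookkeeping the CYCLE clause adds can be seen by selecting its generated columns directly: every row carries an is_cycle flag plus a path array holding its entire ancestry (a single branch is walked here for brevity):

    WITH RECURSIVE linked_list(id, parent) AS (
        SELECT id, parent
        FROM graphtable
        WHERE id = 'branch-a-head-id'
      UNION ALL
        SELECT g.id, g.parent
        FROM graphtable g
        INNER JOIN linked_list ll ON ll.parent = g.id
    )
    CYCLE id SET is_cycle USING path
    SELECT id, is_cycle, path
    FROM linked_list;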

    Comparison of queries with 10000 nodes in A and B and variable Common Segment nodes

    This is the same test as above but with 10000 nodes in A and B each instead of 100. Original and Cycle Detection weren't tested.

    # Common Nodes | Limit Depth (sec) | PL/pgSQL (sec)
    ---------------+-------------------+---------------
                16 | 0.085             | 1.552
                32 | 0.076             | 1.612
                64 | 0.085             | 1.538
               128 | 0.078             | 1.396
               256 | 0.080             | 1.399
               512 | 0.080             | 1.392
              1024 | 0.080             | 1.394
              2048 | 0.100             | 1.480
              4096 | 0.084             | 1.406
              8192 | 0.078             | 1.423
             16384 | 0.082             | 1.417
             32768 | 0.093             | 1.412
             65536 | 0.073             | 1.416
            131072 | 0.073             | 1.421
            262144 | 0.075             | 1.393
            524288 | 0.081             | 1.401
           1048576 | 0.084             | 1.473
           2097152 | 0.098             | 1.465

    Results: Both still constant with respect to C, but now Limited Depth is at least an order of magnitude faster than PL/pgSQL.

    Conclusion

    • If you know that A and B are in the 100s range, PL/pgSQL is faster
    • If you know that A and B are in the 10s of thousands range, Limited Depth is faster
    • If unsure, Limited Depth is probably safer since PL/pgSQL query duration seems to increase faster. This assumes you can set a reasonable depth limit, since if your depth limit is less than the length of A or B then you'll never find the common ancestor.

  2. Even though PL/pgSQL is quite slow compared to SQL, I still recommend writing a database function. The algorithm is procedural, so that is a natural choice. And that way it is easy to avoid fetching all the ancestors of either argument:

    CREATE OR REPLACE FUNCTION first_common_ancestor(a text, b text) RETURNS text
       LANGUAGE plpgsql AS
    $$DECLARE
       iter integer := 1;
       a_ancestors text[] := ARRAY[a];
       b_ancestors text[] := ARRAY[b];
       a_next text;
       b_next text;
       a_done boolean := FALSE;
       b_done boolean := FALSE;
    BEGIN
       /* loop until both chains are exhausted */
       WHILE NOT a_done OR NOT b_done LOOP
          /* test if we have found a common ancestor */
          IF ARRAY[a_ancestors[iter]] && b_ancestors THEN
             RETURN a_ancestors[iter];
          END IF;
          IF ARRAY[b_ancestors[iter]] && a_ancestors THEN
             RETURN b_ancestors[iter];
          END IF;
    
          /* get the next ancestors */
          SELECT parent INTO a_next
          FROM graphtable
          WHERE id = a_ancestors[iter];
    
          IF FOUND THEN
             a_ancestors := a_ancestors || a_next;
          ELSE
             a_done := TRUE;
          END IF;
    
          SELECT parent INTO b_next
          FROM graphtable
          WHERE id = b_ancestors[iter];
    
          IF FOUND THEN
             b_ancestors := b_ancestors || b_next;
          ELSE
             b_done := TRUE;
          END IF;
    
          iter := iter + 1;
       END LOOP;
    
       /* no common ancestor */
       RETURN NULL;
    END;$$;
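
    For example, against the sample data from the question (essentially the call used in the benchmarks above):

    SELECT first_common_ancestor('branch-a-head-id', 'branch-b-head-id');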
    
  3. Here is another solution with PL/pgSQL.

    This only makes sense for long paths – many nodes on the way to the common ancestor. The point is to avoid executing a single-row SELECT and an array concatenation for every single element. Instead, collect an (adjustable) number of nodes and concatenate them once per batch while the search continues – with a shortcut to finish the other leg after one leg ends.

    Hard to say where the additional overhead starts to pay off. Test with different data sets and different batch sizes (_batch_size). I set the default at 100. It’s increased dynamically with every iteration in the loop to cover both short and long paths.

    Auxiliary function to dig up a given number of steps:

    CREATE OR REPLACE FUNCTION f_roots(_a text, _b_roots text[], _batch_size int)
      RETURNS text[]
      LANGUAGE sql STABLE STRICT
    BEGIN ATOMIC
    WITH RECURSIVE roots AS (
       SELECT g.parent AS node
       FROM   graphtable g
       WHERE  g.id = _a
    
       UNION ALL
       SELECT (SELECT g.parent FROM graphtable g WHERE g.id = r.node)
       FROM   roots r   
       WHERE  r.node <> ALL(_b_roots)
       )
    SELECT ARRAY(TABLE roots LIMIT _batch_size);
    END;
    

    It takes the node to be advanced and the array of nodes from the other side. The last element of the result is null if the root node has been reached. The function stops at the first match, so (only) the last element can be a match.
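
    An illustrative call against the sample data from the question:

    SELECT f_roots('branch-a-head-id', ARRAY['branch-b-head-id'], 100);

    Here no node from _b_roots is ever hit, so the result is the chain of ancestors of branch-a-head-id down to the root, followed by a trailing null element.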

    Main function:

    CREATE OR REPLACE FUNCTION f_common_root(_a text, _b text, _batch_size int = 100)
      RETURNS text
      LANGUAGE plpgsql STABLE AS
    $func$
    DECLARE
       a_roots  text[] := ARRAY[_a];
       b_roots  text[] := ARRAY[_b];
       a_step   text[];
       b_step   text[];
       last_a   text := _a;
       last_b   text := _b;
    BEGIN
       -- corner case
       IF _a = _b THEN
          RETURN _a;
       END IF;
    
       LOOP
          -- get the next _batch_size nodes for a
          IF last_a IS NOT NULL THEN
             a_step := f_roots(last_a, b_roots, _batch_size);
             last_a := a_step[array_upper(a_step, 1)];
    
             IF last_a IS NULL THEN
                -- trim last null!
                a_roots := a_roots || a_step[1:array_upper(a_step, 1)-1];
             ELSIF last_a = ANY(b_roots) THEN
                RETURN last_a;
             ELSE
                a_roots := a_roots || a_step;
             END IF;
          END IF;
    
          -- shortcut to finish other side (no more array concatenation needed)
          IF last_a IS NULL THEN
             LOOP
                SELECT INTO last_b  g.parent
                FROM   graphtable g
                WHERE  g.id = last_b;
                
                IF NOT FOUND THEN
                   RETURN null;
                ELSIF last_b = ANY(a_roots) THEN
                   RETURN last_b;
                END IF;
             END LOOP;
          END IF;
    
    
          -- EXACTLY symmetrical for b
          IF last_b IS NOT NULL THEN
             b_step := f_roots(last_b, a_roots, _batch_size);
             last_b := b_step[array_upper(b_step, 1)];
    
             IF last_b IS NULL THEN
                b_roots := b_roots || b_step[1:array_upper(b_step, 1)-1];
             ELSIF last_b = ANY(a_roots) THEN
                RETURN last_b;
             ELSE
                b_roots := b_roots || b_step;
             END IF;
          END IF;      
          
          IF last_b IS NULL THEN
             LOOP
                SELECT INTO last_a  g.parent
                FROM   graphtable g
                WHERE  g.id = last_a;
                
                IF NOT FOUND THEN
                   RETURN null;
                ELSIF last_a = ANY(b_roots) THEN
                   RETURN last_a;
                END IF;
             END LOOP;
          END IF;
    
          -- grow batch_size dynamically
          _batch_size := (_batch_size * 1.5)::int;
       END LOOP;   
    END
    $func$;
    

    Call (with optional custom _batch_size):

    SELECT f_common_root('branch-a-head-id', 'branch-b-head-id');
    
    SELECT f_common_root('branch-a-head-id', 'branch-b-head-id', 500);
    

