
I have the following Neptune query

g.V().hasLabel('User')
  .has('user_id', 1004)
  .repeat(both('USES_UPI','USES_ACCOUNT','USES_HARDWARE_ID','USES_GAID','HAS_COOKIES').simplePath().dedup())
  .times(3)
  .hasLabel('Gaid')
  .dedup()
  .count()

and it’s taking too much time.

I tried profiling the query:

Original Traversal
==================
[GraphStep(vertex,[]), HasStep([~label.eq(User), user_id.eq(159017810)]), RepeatStep([VertexStep(BOTH,[USES_UPI, USES_ACCOUNT, USES_HARDWARE_ID, USES_GAID, HAS_COOKIES],vertex), PathFilterStep(simple,null,null), DedupGlobalStep(null,null), RepeatEndStep],until(loops(3)),emit(false)), HasStep([~label.eq(Gaid)]), DedupGlobalStep(null,null), CountGlobalStep]

Optimized Traversal
===================
Neptune steps:
[
    NeptuneGraphQueryStep(Vertex) {
        JoinGroupNode {
            PatternNode[(?1, <user_id>, ?9, ?) . project distinct ?1 . ContainsFilter(?9 in (159017810^^<INT>, 159017810^^<LONG>, 1.59017808E8^^<FLOAT>, 1.5901781E8^^<DOUBLE>)) .], {estimatedCardinality=1, expectedTotalOutput=1, indexTime=0, joinTime=0, numSearches=1, actualTotalOutput=1}
            PatternNode[(?1, <~label>, ?2=<User>, <~>) . project ask .], {estimatedCardinality=1327714, expectedTotalOutput=679, actualTotalOutput=379683, indexTime=0, joinTime=0, numSearches=1}
            RepeatNode {
                Repeat {
                    JoinGroupNode {
                        UnionNode {
                            PatternNode[(?3, ?6, ?4, ?7) . project ?3,?4 . IsEdgeIdFilter(?7) . ContainsFilter(?6 in (<USES_UPI>, <USES_ACCOUNT>, <USES_HARDWARE_ID>, <USES_GAID>, <HAS_COOKIES>)) .], {cacheJoin=true, estimatedCardinality=1923966, indexTime=354, joinTime=26212, numSearches=379683}
                            PatternNode[(?4, ?6, ?3, ?7) . project ?3,?4 . IsEdgeIdFilter(?7) . ContainsFilter(?6 in (<USES_UPI>, <USES_ACCOUNT>, <USES_HARDWARE_ID>, <USES_GAID>, <HAS_COOKIES>)) .], {cacheJoin=true, estimatedCardinality=1923966, indexTime=356, joinTime=19633, numSearches=379683}
                        }, annotations={estimatedCardinality=3847932}
                        SimplePathFilter(?1, ?4)) .
                    }
                }
                LoopsCondition {
                    LoopsFilter(?3,eq(3))
                }
            }, annotations={emitFirst=false, untilFirst=false, repeatMode=BFS, dedup=true}
        }, annotations={path=[Vertex(?1):GraphStep, Repeat[̶V̶e̶r̶t̶e̶x̶(̶?̶3̶)̶:̶G̶r̶a̶p̶h̶S̶t̶e̶p̶, Vertex(?4):VertexStep, ̶V̶e̶r̶t̶e̶x̶(̶?̶8̶)̶:̶V̶e̶r̶t̶e̶x̶S̶t̶e̶p̶]], joinStats=true, optimizationTime=2, maxVarId=10, executionTime=107826}
    },
    NeptuneTraverserConverterStep
]
+ not converted into Neptune steps: NeptuneHasStep([~label.eq(Gaid)]),
Neptune steps:
[
    NeptuneMemoryTrackerStep
]
+ not converted into Neptune steps: DedupGlobalStep(null,null),CountGlobalStep,

WARNING: >> [NeptuneHasStep([~label.eq(Gaid)]), DedupGlobalStep(null,null)] << (or one of the children for each step) is not supported natively yet



Runtime (ms)
============
Query Execution: 107828.341

Traversal Metrics
=================
Step                                                               Count  Traversers       Time (ms)    % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneGraphQueryStep(Vertex)                                     743022      743022       52892.767    49.06
NeptuneTraverserConverterStep                                     743022      743022        8142.297     7.55
NeptuneHasStep([~label.eq(Gaid)])                                   4138        4138       46695.267    43.32
DedupGlobalStep(null,null)                                          4138        4138          48.927     0.05
CountGlobalStep                                                        1           1          22.215     0.02
                                            >TOTAL                     -           -      107801.475        -

Repeat Metrics
==============
Iteration  Visited   Output    Until     Emit     Next
------------------------------------------------------
        0        1        0        0        0        1
        1        3        0        0        0        3
        2   379679        0        0        0   379679
        3   743022   743022   743022        0        0
------------------------------------------------------
           1122705   743022   743022        0   379683

Predicates
==========
# of predicates: 26

Results
=======
Count: 1


Index Operations
================
Query execution:
    # of statement index ops: 1,502,390
    # of unique statement index ops: 1,502,390
    Duplication ratio: 1.0
    # of terms materialized: 162

Right now the query takes on the order of 100 seconds; can it be optimized somehow?

2 Answers


  1. Chosen as BEST ANSWER

    Turns out that we had super nodes that were accidentally inserted into the DB.

    What I did was:

    1: Filtered nodes against a configurable deny-list and skipped the insert if their IDs matched.

    2: Added an additional clause to the query to filter out super-node IDs (see the sketch after this list).

    3: As we receive concurrent requests, we made our Spring API async via CompletableFuture and gave it a dedicated thread pool.

    4: Instead of using the Apache TinkerPop driver, we opted for the Amazon Neptune REST API - https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-rest.html

    Doing this reduced the time from 1 minute to 80 ms.
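
    For point 2, a minimal sketch of one way to write that clause - a hasId(without(...)) filter inside the repeat so super nodes are dropped before they fan out. The IDs in superNodeIds are hypothetical placeholders for the configured deny-list, not our actual values:

    // hypothetical, externally configured list of known super-node vertex IDs
    superNodeIds = ['device_123', 'cookie_456']

    g.V().hasLabel('User').
        has('user_id', 1004).
        repeat(both('USES_UPI', 'USES_ACCOUNT', 'USES_HARDWARE_ID', 'USES_GAID', 'HAS_COOKIES').
            hasId(without(superNodeIds)).  // drop known super nodes before they fan out
            simplePath().
            dedup()).
        times(3).
        hasLabel('Gaid').
        dedup().
        count()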


  2. Do all users have a unique ID? If so, you could likely speed up the first part of the query by storing the user ID as the actual vertex ID, instead of doing both a label check AND a property lookup. The fact that you're using a numeric as the ID also adds latency, since Neptune applies type promotion as part of query execution (note the ContainsFilter in your profile expanding 159017810 into INT, LONG, FLOAT, and DOUBLE variants). Just use something like user_1004 as the ID for the user vertex in this case. Then you can do:

    g.V('user_1004').
        repeat(
            both(
                'USES_UPI',
                'USES_ACCOUNT',
                'USES_HARDWARE_ID',
                'USES_GAID',
                'HAS_COOKIES').
            simplePath().
            dedup()).
      times(3).
      hasLabel('Gaid').
      dedup().
      count()
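
    If you adopt that scheme, the write side could look something like the sketch below. Neptune accepts user-supplied string vertex IDs via property(id, ...); the label and key format here just mirror the question:

    g.addV('User').
        property(id, 'user_1004').    // id is T.id; the vertex ID itself becomes the lookup key
        property('user_id', 1004)     // optionally keep the numeric property for other queries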
    

    It also appears that there is a large amount of fan-out in the graph after the 2nd hop. If you have super nodes in the graph, you may need to derive a method to detect and mitigate them. In most of these ID-graph patterns, super nodes provide little value, so placing an incrementing counter property on nodes to denote their fan-out makes it easier to avoid traversing past a super node. Then, inside the repeat(), you can add something like has('associated_ids', lt(10)) to filter out any nodes that have more downstream connections than you care about.
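
    Putting that together with the rest of the query, it would look something like this (associated_ids and the threshold of 10 are illustrative; note that has() also drops vertices that lack the property, so the counter must be maintained on every node):

    g.V('user_1004').
        repeat(both('USES_UPI', 'USES_ACCOUNT', 'USES_HARDWARE_ID', 'USES_GAID', 'HAS_COOKIES').
            has('associated_ids', lt(10)).  // skip high fan-out (super) nodes
            simplePath().
            dedup()).
        times(3).
        hasLabel('Gaid').
        dedup().
        count()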
