
I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).

Say I have a CSV with the following columns:

  • date_order_created
  • customer_no
  • order_no
  • zip_code
  • item_id
  • item_short_description

  • The requirements for the Gremlin load format are that the data is split into an edge file and a vertex file.
  • The edge file must have the following columns: id, label, from, and to.
  • The vertex file must have: id and label columns.

My questions:

  • Which columns need to be renamed to id, label, from, and to? Or should I add new columns?
  • Do I only need one vertex file or multiple?

2 Answers


  1. You can have one or more of each CSV file (nodes, edges), but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split each file up and load it in parallel.

    As to the column headers, let’s say you had a node (vertex) file of the form:

    ~id,~label,name,breed,age:Int
    dog-1,Dog,Toby,Retriever,11
    dog-2,Dog,Scamp,Spaniel,12
    

    The edge file (for dogs that are friends) might look like this:

    ~id,~label,~from,~to
    e-1,FRIENDS_WITH,dog-1,dog-2
    

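    Once both files are loaded, those ~from and ~to values become a traversable edge. As a minimal sketch (assuming the two sample files above were bulk loaded), finding Toby's friends looks like:

    g.V('dog-1').            // look the vertex up by its ~id
      out('FRIENDS_WITH').   // follow outgoing FRIENDS_WITH edges
      values('name')         // => Scamp
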
    In Amazon Neptune, any user-provided string can be used as a node or edge ID, so long as it is unique. So in your example, if customer_no is guaranteed to be unique, rather than storing it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a primary key in a relational database.

    So in summary, you always need to provide the required fields ~id and ~label (plus ~from and ~to for edges). Once the data is loaded, those are accessed using dedicated Gremlin steps such as hasLabel and hasId. Columns with names from your domain, like order_no, will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123').
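
    To make that distinction concrete, here is a short sketch of those steps (the order_no value is only a placeholder):

    g.V().hasId('dog-1')               // matches the ~id column
    g.V().hasLabel('Dog')              // matches the ~label column
    g.V().has('order_no', 'ABC-123')   // matches a domain column loaded as a property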

  2. To follow on from Kelvin’s response and provide some further detail around data modeling…

    Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.

    By way of example, I notice that your dataset has information related to customers, orders, and items. It also has some relevant attributes related to each. Knowing nothing about your use case, I may derive a "naive" model that looks like:

    [Diagram: a naive model with Customer, Order, and Item vertices, connected Customer → Order → Item, each carrying its related attributes]

    What you have in your original dataset appears similar to what you might see in a relational database as a join table. This is a table that contains multiple foreign keys (the id/no fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between them.

    To validate that we have the correct model, we then want to look at the model and see if we can answer the relevant questions we would want to ask of this data. For example, if we wanted to know all items purchased by a customer, we could trace a finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
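
    That finger-tracing translates directly into a traversal. A sketch, assuming the vertex IDs and edge labels defined in the load files shown below:

    // all items purchased by a given customer
    g.V('customer001').
      out('has_ordered').   // Customer -> Order
      out('contains').      // Order -> Item
      values('item_short_description')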

    After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:

    For your vertices:

    ~id,~label,zip_code,date_order_created,item_short_description
    customer001,Customer,90210,,
    order001,Order,,2023-01-10,
    item001,Item,,,"A small, non-descript black box"
    

    Note that I'm reusing the customer, order, and item numbers/ids as the ~id for their related vertices. This is always good practice, as you can then easily look up a customer, order, or item by that ID. Also note that the CSV becomes a sparse two-dimensional array of related entities and their properties. I'm only providing the properties related to each type of vertex. By leaving the others blank, those properties will not be created.
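
    As a quick sanity check after loading, inspecting one of these vertices (a sketch, using the IDs above) shows that only the supplied columns became properties:

    g.V('order001').elementMap()
    // => {id=order001, label=Order, date_order_created=2023-01-10}
    // zip_code and item_short_description were left blank, so Order has neither property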

    For your edges, you then need to materialize the relationships between each entity, based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we need to create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices plus the label of the relationship when possible. For the ~from and ~to fields, we provide the vertex the relationship comes from and the vertex it points to, respectively:

    ~id,~label,~from,~to
    customer001-has_ordered-order001,has_ordered,customer001,order001
    order001-contains-item001,contains,order001,item001
    
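    One payoff of those composite edge IDs is that a specific relationship can be looked up directly later (a sketch using the IDs above):

    g.E('customer001-has_ordered-order001')   // fetch a single edge by its ~id
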

    I hope that adds some further color and reasoning around how to get from your source data into the format that Kelvin shows above.
