I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
- The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
- The edge file must have the following columns: id, label, from, and to.
- The vertex file must have id and label columns.
- I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
- It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
- And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
- Which columns need to be renamed to id, label, from, and to? Or should I add new columns?
- Do I only need one vertex file or multiple?
2 Answers
You can have one or more of each CSV file (nodes, edges) but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split the file up and load it in a parallel fashion.
As to the column headers, let’s say you had a node (vertex) file of the form:
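For instance, a small file of dogs might look like this (the headers follow the Neptune load format; the values are illustrative placeholders, not the original example):

```csv
~id,~label,name:String,age:Int
dog-1,Dog,Rex,9
dog-2,Dog,Fido,3
```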
The edge file (for dogs that are friends) might look like this:
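```csv
~id,~label,~from,~to
edge-1,friends_with,dog-1,dog-2
```

Here ~from and ~to hold the ~id values of the two dog vertices, and friends_with is just an illustrative label name.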
In Amazon Neptune, so long as they are unique, any user-provided string can be used as a node or edge ID. So in your example, if customer_no is guaranteed to be unique, rather than store it as a property called customer_no, you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a Primary Key in a relational database.

So in summary, you always need to provide the required fields like ~id and ~label. They are accessed differently, using Gremlin steps such as hasLabel and hasId, once the data is loaded. Columns with names from your domain, like order_no, will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123').
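As a rough illustration, once the data is loaded (all IDs and values below are made up):

```groovy
// Look up a vertex directly by its ~id (similar to a primary-key lookup)
g.V().hasId('dog-1')

// Find all vertices with a given ~label
g.V().hasLabel('Dog')

// Filter on a domain property that came from one of your CSV columns
g.V().has('order_no', 'ABC-123')
```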
To follow on from Kelvin’s response and provide some further detail around data modeling…
Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.
By way of example, I notice that your dataset has information related to customers, orders, and items. It also has some relevant attributes related to each. Knowing nothing about your use case, I may derive a "naive" model that looks like:
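Something along these lines, where the vertex labels and relationship names are my assumptions about the use case:

```
(Customer {customer_no, zip_code})
   --[placed]--> (Order {order_no, date_order_created})
      --[contains]--> (Item {item_id, item_short_description})
```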
What you have with your original dataset appears similar to what you might see in a relational database as a Join Table. This is a table that contains multiple foreign keys (the ids/no’s fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between each.
To validate that we have the correct model, we then want to look at the model and see if we can answer relevant questions that we would want to ask of this data. By example, if we wanted to know all items purchased by a customer, we could trace our finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
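As a sketch, the "all items purchased by a customer" question would later translate into a traversal like the following (the customer ID is a placeholder; the edge labels come from the hypothetical model above):

```groovy
// Customer -> Order -> Item, mirroring the finger-trace across the model
g.V('C-1001').out('placed').out('contains').values('item_short_description')
```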
After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:
For your vertices:
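A sketch of that vertex file, using placeholder values and the hypothetical labels from the model above:

```csv
~id,~label,zip_code:String,date_order_created:Date,item_short_description:String
C-1001,Customer,90210,,
O-2001,Order,,2021-06-01,
I-3001,Item,,,Blue t-shirt
```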
Note that I’m reusing the no’s/ids for the customer, item, and order as the ID for their related vertices. This is always good practice as you can then easily lookup a customer, order, or item by that ID. Also note that the CSV becomes a sparse 2-dimensional array of related entities and their properties. I’m only providing the properties related to each type of vertex. By leaving the others blank, they will not be created.
For your edges, you then need to materialize the relationships between each entity based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we can create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices and the label of the relationship when possible. For the ~from and ~to fields, we are including the vertices from which the relationship is deriving and what it is applying to, respectively:
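A sketch of that edge file, continuing the same placeholder IDs (the ~id values here concatenate the two vertex IDs with the label, per the convention above):

```csv
~id,~label,~from,~to
C-1001-placed-O-2001,placed,C-1001,O-2001
O-2001-contains-I-3001,contains,O-2001,I-3001
```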
I hope that adds some further color and reasoning around how to get from your source data and into the format that Kelvin shows above.