A Practical Introduction To Graph Data Applications

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering?

If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help.

Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflow, so try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.

For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today.

Your host is Tobias Macey. And today, I'm interviewing Dr. Denise Gosnell and Dr. Matthias Broecheler about the recently published The Practitioner's Guide to Graph Data. So, Denise, can you start by introducing yourself?

Yeah. Thank you so much. It's great to be here. As you mentioned, my name is Denise Gosnell. I run the data ops organization here at DataStax and had the honor to cowrite this book with Matthias. So thank you so much for having us here.

And, Matthias, how about yourself?

Hello. Yeah. Thanks for having us here. My name is Matthias. I also work at DataStax, and my current role is in product strategy. So I work with the product team to identify strategic areas of investment and make sure that our products meet the market and customer needs.

And, Denise, do you remember how you first got involved in the area of data management?

I do. And there are potentially two different main areas. The first was a complete accident for getting involved in graph data, and the second was starting to build with graph technologies.

I got involved in graph theory and graph data as a part of my master's degree. It was purely an accident because I was locked out of other courses and I didn't know what graph theory was. But during that semester, studying with Dr. Teresa Haynes, I found a passion for working with connected data that has led me to getting to meet and talk with you today.

As a part of one of my jobs as a data scientist and data engineer, I was building the world's health graph and mapping out all of the healthcare transactions within the United States. We were using open source technologies back in 2014 for that, which led me to using Titan and meeting Matthias as part of my collaboration with building on Titan. So those two events are how I got involved in data management and specifically with graph technologies.

And, Matthias, do you remember how you first got involved with data management?

It actually goes a long, long time back. I read a biography of Goethe, who is a German poet and also a polymath. And one of the things that just stuck with me was that, at least according to the author, Goethe was the last human on earth who had all the world's knowledge in his library and was able to comprehend everything there was to know at his time.

And that just really fascinated me, to think about having the ability to tap into anything that is knowable within arm's reach, and it kind of never let go of me. So, obviously, part of trying to make that happen was looking into data and information management, and I got involved in semantic web research. That was my first foray into graph data, semantically structured data, RDF, the linked semantic web, and all these initiatives.

I did a lot of research there and got really interested in how you make this technology scale so you can truly represent all the world's knowledge, which in the last couple hundred years has obviously expanded quite a bit. And then I got interested in doing my PhD research at the University of Maryland in graph database systems, and in particular scalable graph database systems, the result of which culminated in the open source project Titan that Denise just talked about. I started a company around that called Aurelius with my cofounder, Marko Rodriguez, and we ultimately got acquired by DataStax, which is where I am now, to bring graph technology to a wider market. So it's been a long journey over the last 20-some years, from this vague idea of wouldn't it be fun if we could somehow capture all the world's knowledge, all the world's information, all the world's data in a format that is representative of how humans think about data, to making the technology happen.

And moving into the topic at hand today, that is the book that you both cowrote, The Practitioner's Guide to Graph Data. So I'm wondering if you can just describe a bit about what your goals were in that book and your motivation for writing it.

Yeah. That's a great question. I think there are maybe two motivations that I would like to call out. And, Matthias, you might have another perspective as well. But those two motivations were to communicate common patterns, and then also to bring together both worlds that I had been observing within the graph space.

So just a little bit on each of those. As I had started building with graph technology and then, after the acquisition of Aurelius by DataStax, came over here to the team to be a part of building some of the world's largest graph applications, we started discovering that there's a common set of patterns that companies and people were using for deploying and using graph technology in production applications. And what we wanted to do was to essentially tell stories about what those patterns were in a tangible way to teach them to others. Just like you've seen relational databases converge on common approaches to problems, we were starting to see that in the graph space as well. So that was really one of the technical cornerstones of what we wanted to write the book about.

And then second, when you were hearing Matthias and me introduce ourselves, you can kind of see that we have two different approaches, or two perspectives maybe, to working with graph data. And I thought that it would be an incredible partnership to bring together the data-oriented approach and the long history that Matthias has from building these systems within data management solutions and technologies, to teach the world about how this works end to end. So those are the two primary cornerstones that I know Matthias and I have spent a lot of time talking about on why we wanted to write this book together. But, Matthias, I'm curious if you have anything else that you'd like to add or another perspective on that.

Nah, I think you captured it beautifully. I think there's a wide need right now in terms of learning more about these technologies. And one of the things that Denise has done in many years of her career is work very closely with customers on these problems. So she has a very practical approach to working with graph data, which is one of the reasons we picked that title for the book. And combining that with a foundational understanding of graph technology, we go into the details of the terminology and spend a good amount of time in the book talking about graph thinking. The way that you approach a graph problem is fundamentally different from how you approach the tabular, structured problems that we are most familiar with. So there's a huge need to explain how to use graph technology, not just from a nuts and bolts perspective, but also from zooming out and understanding what problems you should be looking for before you apply graph technologies to them.

And I thought it was interesting, in looking through the book a little bit, that you called out the fact that the past two decades have, in large part, been the era of NoSQL, with trying to scale horizontally to address the projects and applications that are using growing volumes of data and needing to process those scalably, and that the previous two decades were dedicated to relational databases as the overall space of digital data management grew and matured. And you're saying that the current decade that's starting now is going to be the realm of graph data and this growth in graph-oriented thinking. I'm wondering what you see as the driving force behind that growing popularity and the need for graph technologies, and how that's going to shape the future trends in technology.

I think the big trend that we see, just going back to the last era of data management, the scaling out: we saw just so much data being captured, and companies and individuals trying to catch up with that data, being able to put it somewhere and put it to use. And now that we have kind of wrapped our heads around how we manage this much data located all over the world, I think the next frontier in data management is how do we get the maximum amount of value out of this data. Right? I think of it as, if you imagine you open a fire hose, the first problem you have is, what do we do with all the water? Where do we put the water? And then the next problem is, how do we put this water to good use? And graph technology really zooms in on the latter problem of understanding how the relationships in the data make the data more valuable. They amplify value in the data. The more you can connect data, the more valuable it becomes.

And that, I think, is a fundamental insight that more and more companies and individuals are arriving at. We're currently just at the very frontier of that. But I believe that over the next decade or so, we'll see much more of that, where people sit down and say, okay, we have this data now. How can we get the maximum amount of value out of it to drive better value propositions, to build better products, to have better customer experiences, whatever it is that your organization is optimizing for? By being able to look at the data as interconnected pieces that enrich each other, you are in a position to build better experiences. And you can already see some companies doing that and outcompeting other companies. Through that pressure, we will see more and more technology teams being led to the graph table. And that, I think, is the reason for that shift towards graph technologies, which obviously does not mean that any of the older technologies are being invalidated. Right? The rise of NoSQL did not diminish SQL or the relational model. It enriched it. And we similarly think that graph is the next layer on top of the data management cake, being able to put all of this data together and make it more valuable.

And on that point too, there's the interesting question of the way that graph information is manifested, where there are plugins for relational systems or data modeling techniques for embedding a graph into a relational model using things like RDF triples or associative data structures, but then there are also database technologies that are specifically built for storing and analyzing graph structures. I'm wondering what your thoughts are on the overall trade-offs of when to use which implementation, and your thoughts on the use of graph databases as the primary data store versus something complementary to a different data store that acts as the canonical source of truth.

Yeah. That's a deep and very multifaceted question. So I think I wanna start with the first piece, when you were talking about when to use graph technology, if I heard you correctly. Is that right?

Yes.

Okay. Yeah. People ask us this all the time. This is probably one of the most popular questions that Matthias and I have the opportunity to coach and talk to teams about. And it really starts at the beginning. We like to coach people through thinking about whether or not the problem they're describing needs graph data. And what we really mean by that is whether or not the relationships, or the links, or the connections within or across your data help provide more insight or more understanding for the problem you're trying to solve.

If you're thinking about those questions with your problem and you're screaming yes, absolutely, that's a telltale sign that you do need graph technology.

But more often than not, when we have those conversations, people are thinking, well, I really don't know. I don't know if my problem needs connected data.

And usually, we will take the problem that we're working with and try to break it down a little bit further into a smaller question, and really begin digging in a little further, where you're trying to find out whether, at any point, you need connected data across your systems to answer this question. Because once you do get a true yes, that's really when you're moving into the realm of graph thinking or needing graph technology.

And from there, potentially oversimplifying, Matthias and I commonly see and talk about there being two approaches: whether you're really looking at needing a graph database, or you're really talking about graph algorithms.

So I'm gonna talk about the latter first. When you know that you need relationships in your data, the next step that you really want to do with it is to put it into a graph structure and analyze it. Maybe you need to understand the betweenness centrality across your whole graph, or you wanna understand PageRank for your graph. When you're thinking about these global analytics, or this global understanding of the topology of your data, you're really thinking about graph algorithms and wanting to apply the very famous and common set of approaches for studying the connections and the structure and the shape of your graph data.
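To make the idea of a global graph algorithm concrete, here is a minimal sketch of PageRank by power iteration in plain Python. It is an illustration only, not something taken from the book or from any graph product: the `pagerank` function, the damping factor of 0.85, the fixed iteration count, and the tiny `follows` edge list are all assumptions made up for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start with rank spread evenly across all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share each round.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node with no outgoing edges spreads its rank over everyone.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical "follows" edges, invented for illustration.
follows = [("ana", "cy"), ("bo", "cy"), ("cy", "ana"), ("dee", "cy")]
scores = pagerank(follows)
top = max(scores, key=scores.get)
print(top)  # the vertex with the highest score
```

Note that every iteration touches every edge in the graph, which is why this kind of computation tends to run as a batch or research-style job rather than inside a user-facing query.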

And when you're working with graph algorithms, you probably just need some type of report, or you need to be working more traditionally within the data science or machine learning workflow. More R&D style, right? We're just trying to understand this new structure and how it applies to your problem.

Once you've iterated on that and you have found that valuable answer that you maybe need to serve in a production application, that's when storing and querying specific relationships or links between entities becomes very valuable. And that's when you'll wanna use a graph database.

And how we think about this very commonly, just to give you a very specific example, let's go with LinkedIn. When you start with LinkedIn and you hit that search bar, you're probably typing someone's name. And the first problem from a tech perspective that you're experiencing is search and doing character completion.

And that type of search is just a semantic way to match what you're typing to a list of names, and that's not a graph problem. That's a search and a character edit distance problem that you probably solve with something like Elasticsearch or some type of tech like that.

On the other hand, though, once you look at your search results, I guarantee you that they're probably listed in order by connection. And you can imagine on LinkedIn, you see those badges of first connection, second connection, third connection. You can kind of imagine those results.

That is exactly explaining and helping you understand the professional network according to how you fit in, according to your connection to that person. And serving that badge is a beautiful example of how you use graph algorithms, and storing and querying their results in a real-time production application, together at the same time to serve one feature that we all use all the time on LinkedIn.

Because doing global pathfinding to determine the total number of connections between you and anybody else you could search for on LinkedIn, that's not something you're gonna do in real time at query time when you're searching for somebody.

That's gonna be done as a graph algorithm, and they're gonna store those results in a database to query when you search for somebody. So when you think through, end to end, how you search for a person on LinkedIn and how you create that personal story about how you know them through mutual connections or how closely connected you are, that entire journey steps through whether or not your problem needs graph data, whether or not you're using graph algorithms, and then how to maybe store and query a graph-specific feature in the end application. So that's how I like to think about that question. It's a great one.
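The split described here, an offline graph algorithm whose results are stored and then looked up at query time, can be sketched in a few lines of Python. This is a hedged illustration and not how LinkedIn actually implements its badges: the `network` data, the `connection_degrees` and `badge` functions, and the in-memory `badge_store` dictionary standing in for a database are all hypothetical.

```python
from collections import deque


def connection_degrees(adjacency, start, max_degree=3):
    """Breadth-first search returning {person: hops from start}, up to max_degree."""
    degrees = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if degrees[person] == max_degree:
            continue  # do not expand beyond the furthest badge we serve
        for neighbor in adjacency.get(person, ()):
            if neighbor not in degrees:
                degrees[neighbor] = degrees[person] + 1
                queue.append(neighbor)
    del degrees[start]
    return degrees


# Hypothetical undirected professional network, stored as adjacency lists.
network = {
    "you": ["ana", "bo"],
    "ana": ["you", "cy"],
    "bo": ["you"],
    "cy": ["ana", "dee"],
    "dee": ["cy", "eli"],
    "eli": ["dee"],
}

# Batch step: run the pathfinding ahead of time and store the results.
badge_store = {person: connection_degrees(network, person) for person in network}


def badge(viewer, other):
    """Query-time step: a cheap lookup, with no traversal involved."""
    degree = badge_store[viewer].get(other)
    return {1: "1st", 2: "2nd", 3: "3rd"}.get(degree, "out of network")


print(badge("you", "cy"))   # 2nd
print(badge("you", "eli"))  # out of network
```

The design choice is the one from the conversation: the expensive traversal happens once in the batch step, and the search path only reads a stored answer.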

And in understanding the fundamental principles of the graph technologies, the algorithms that are useful in that context, and the different technologies that are necessary for powering those use cases, what are the core concepts that data teams and data engineers need to be familiar with and understand to employ those technologies and those algorithms effectively?

That's a great question. It's another one of those where I could talk for, like, 30 minutes straight because there's quite a lot of nuance to it depending on the use case you're targeting. I think the initial phase is the most critical one, where you really dive into the problem that you're trying to solve, the particular use case you're looking at, the particular application, whatever it is that you're building, and really break apart what part of this problem is genuinely a graph problem, meaning a problem where you want to analyze relationships, where you want to use the connectedness of the data to infer something or to enrich the use case that you're trying to solve. And once you have zoomed in on what those components are, you then break it down, like Denise was talking about, into the analytics side, where you focus more on global optimization, global types of graph problems, versus real-time graph problems like pathfinding. Denise gave the example of PageRank for a global one.

But at the same time, and this is where it gets a little more complicated and nuanced, you need to keep in mind what the overall problem is that you're trying to solve, just to keep that in context. What I mean by that is that oftentimes you look at graphs in, for instance, a streaming context. So you have individual events coming in that over time build a ginormous graph that you're trying to analyze, an example being cybersecurity-type use cases. Then you wanna make sure that the overall technology pipeline that you're building is attuned to the streaming nature of your particular use case. And you wanna find an underlying technology that can handle those kinds of high-velocity streaming use cases and build a graph over time that you can analyze in real time to then do threat analysis, for instance.

There's a lot of nuance in that. There are very many technologies out there right now. And one of the really exciting things about the graph space is that it's unsettled right now. We're still in the experimentation phase, where we're trying to figure out which products work well for which use cases. So if you're building a knowledge graph, for instance, you're likely making very different technology choices than if you're trying to build a cybersecurity solution that looks at networks of data and packets being sent back and forth, which is very different again from doing supply chain optimization and supply chain analysis to have a better commerce infrastructure.

So in those different use cases, the real key is to understand which part is uniquely the graph problem and how that is embedded in the larger problem, to make sure you make system choices and technology choices that are harmonious within the overall architecture that you're building. And there are different products out there. For instance, at DataStax, we have a multi-model graph database, which allows you to store data in multiple forms, one of which is graph. That is really useful for, for instance, event streaming use cases or other use cases where you also need to look at the data in tabular form or retrieve it differently. Then there are standalone graph technologies that really focus on just the graph component of a problem, which can be a great solution for, let's say, a knowledge graph. So it is a bit nuanced, unfortunately. And as we learn more about the space and more about how people solve problems, I think we'll get a clearer and clearer picture of how to solve them.

But coming back to it, the most fundamental thing that you need to do is, at the very beginning, to take the time to really break down the problem and understand which part is genuinely graph and which part maybe does not need graph technologies.

Because that's where we see a lot of people get through that phase a little too quickly, and maybe then look at parts of the problem that may not require graph technologies, or miss parts of the problem that should actually be solved with graph technologies. And then everything else beyond that point becomes a lot harder to do, because it doesn't quite line up with the technology choices you're making in the subsequent steps.

There was a similar evolution that I saw when we were going through some of the NoSQL phase of things, where people were just starting to throw NoSQL at everything because they said, oh, I need to be able to scale, even if they were just starting off with a single server, without understanding what the trade-offs are in terms of the data model and how to think about the problem in a document versus a relational structure.

And so it sounds like there's a similar pain point happening with graph technologies, where people see graph as this solution for everything. They say, oh, we'll just put everything in the graph, not knowing that, well, at some point, we want to be able to query this in a relational model instead of being relationship oriented, or we wanna be able to collect everything as a single document view.

And so given that, what are some of the data modeling principles that are useful as you are going through that process of defining the domain that you're working in and figuring out where the graph fits, what the actual components of that graph are going to be, and when you might want to push some of those concepts into a different type of technology, maybe a relational or a document-oriented structure?

Yeah. I think that's a really great question. So when you think about some of the common data modeling principles for graph-structured data, towards the end of your question there, you're starting to highlight the way I see the world as a data practitioner and whether or not I choose graph technology. But then there are the actual data modeling principles once you're in graph technology. So as just a general practitioner who likes to build applications, I think in terms of shapes of data.

I'm always looking at the problem trying to dissect whether it really is just a table, or whether we're talking about a JSON or document shape, just really understanding the shapes of information that we're sharing across our APIs in our app. And from that perspective, that's how I began to educate my teams and the people we work with on how to make smart choices about what components are in your technology stack, just by understanding the shape of the data.

And the important feature that you hear us talk about, and that really comes from what we're hearing Matthias say, is that the most important signal for needing graph technology and graph data is when relationships or links across your data are the primary data modeling elements of your problem.

Whenever relationships are more important, that's when you need to be modeling with graph technology. And from there, to highlight some of the common data modeling decisions, kind of really far into the weeds, you can think about whether, between any two pieces of information, you want there to be exactly one relationship or more than one relationship.

That's a really, really common item to think about. We talk about this in the context of banking customers.

A common data modeling decision is to say whether a person can only have one relationship with their bank account, or if they can have multiple relationships with their bank account. To think about this: you're a person and you own a bank account, so that's one way to have a relationship between a person and another entity, namely an account. But you probably can have many roles for how you own that account. You can be the primary owner, you can be the administrator, you can have different types of permissions.

And in that respect, when you're thinking about wanting to have multiple roles on your account, you're really thinking about multiple relationships between that person and that account. The difference between those two designs is probably one of the first and most fundamental types of data modeling decisions that you're going to come to when you are working with graph data. And Matthias mentioned a few times just a moment ago working with event streams and time series data. If the role example between people and accounts seems a little abstract, you can also think about different sensors that send communication over time to other sensors.

Those are gonna be different vertices, or different endpoints, in a graph that communicate with other endpoints over time. This could be cell phones texting and calling other cell phones, as another example. And you're gonna have multiple relationships between any two individual endpoints. You need a way to model that in your database, to model time on that edge, because the communication between those two pieces is the most important piece of that problem. Knowing how to use different properties in your graph data to model the quantity of relationships between two pieces of information is extremely important, and probably one of the first data modeling items that we run into when we are working with customers and their graph data models. I'm curious, Matthias, if you have another common data modeling piece of advice that you would like to mention beyond just the multiplicity of an edge between two vertices.
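The edge multiplicity decision described above can be sketched with a small in-memory structure. This is only an illustration of the modeling choice, not the API of any graph database: the `MultiGraph` class, its `single` flag, and the identifiers such as `person:alice` and `phone:A` are hypothetical names invented for this example.

```python
from collections import defaultdict


class MultiGraph:
    """Toy graph that allows one or many edges between the same two vertices."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add_edge(self, source, target, label, single=False, **properties):
        edges = self._edges[(source, target)]
        if single:
            # Single multiplicity: keep at most one edge with this label.
            edges[:] = [edge for edge in edges if edge["label"] != label]
        edges.append({"label": label, **properties})

    def edges_between(self, source, target):
        return list(self._edges.get((source, target), []))


graph = MultiGraph()

# Multiple relationships: one "owns" edge per role on the account.
graph.add_edge("person:alice", "account:123", "owns", role="primary")
graph.add_edge("person:alice", "account:123", "owns", role="administrator")

# Time on the edge: each call between two phones is its own edge.
graph.add_edge("phone:A", "phone:B", "called", timestamp="2020-05-01T09:00:00")
graph.add_edge("phone:A", "phone:B", "called", timestamp="2020-05-01T17:30:00")

# Exactly one relationship: a second write replaces the first.
graph.add_edge("person:bob", "account:456", "owns", single=True, role="primary")
graph.add_edge("person:bob", "account:456", "owns", single=True, role="administrator")

roles = [edge["role"] for edge in graph.edges_between("person:alice", "account:123")]
print(roles)  # ['primary', 'administrator']
```

With multiple edges, a property such as `role` or `timestamp` is what distinguishes one relationship from another, which is the point made about using properties to model the quantity of relationships.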

I would say, I mean, we can obviously go much more into the details, but I think that would be harder to follow verbally. We spend a good amount of time in the book to really nail down what some of the things are that you need to know on a use-case-by-use-case basis that really inform graph data modeling. Going back to the point that Denise made about relationships being a central component, I think that's really, really critical, because that's very different from other types of data modeling and other ways of looking at the world, where you think much more in terms of entity-to-table mapping. There, entities are the central component, and relationships are somewhat secondary. They're used to structure your domain, but not really meant to be used for enriching the data as you query it, as you build use cases around it. And so I think the biggest shift in thinking when it comes to graph data modeling and schema design is how do you design the edges to make sense of your domain. In the book, we try to use some analogies in terms of language, something that came about in the semantic web, where you look at how you would talk about your domain of interest in nouns and verbs and then structure those as vertex labels and edge labels respectively.

And so there are a number of techniques and rough-cut approximations that help you get started in that area. But I would say that really is one thing that I personally had to learn, and I've seen everybody who has been using graph technology learn over time: to really embrace the notion of edges, of connectivity, and to lean on that heavily in solving problems. And that needs to be reflected in the schema design.

Getting there is initially a nontrivial problem because we just don't have a lot of practice in it. Most people who have gone through a computer science degree don't talk much about it. And so that's really something you need to learn in the beginning to feel comfortable in that space.

Mhmm.

Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company.

Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt.

And in terms of the actual data types and data structures that you're working with on these edges and vertices, are there any commonalities between the different data storage engines as far as what the structure looks like, whether you are storing a JSON document, or a set of tuples as you would in a relational database, or key-values for different elements? And is there generally some sort of variance in terms of the type of information that you can store on the edges versus the vertices?

Yeah. I would say that in the graph world, there is some convergence happening around the property graph data model as

sort of the underlying

principle behind the approach to schema and graph data structure. There are still nuances and some variations in what that means specifically, and vendors

sort of differ in in how they interpret that. But the general gist of it, like, if you look across the various implementations, you can say that there's basically

2 approaches

to graph data. 1

based in the semantic web with RDF and and SPARQL,

and the other in the property graph model, and then using a query language suitable to the property graph model like Gremlin, Cypher, or some some other variants that are currently

in discussion and in design.
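To make the two approaches a little more concrete, here is a rough sketch of the same fact held in a property-graph shape and then flattened into subject-predicate-object triples in the style of RDF. The identifiers and property names are invented for illustration, and the way edge properties are hung off a separate edge identifier is only one simplified encoding assumed here, not how any particular RDF store does it.

```python
# Property graph shape: vertices and edges both carry key/value properties.
vertices = {
    "p1": {"label": "Person", "name": "Ada"},
    "m1": {"label": "Movie", "title": "Example Film"},
}
edges = [
    {"out": "p1", "label": "rated", "in": "m1", "stars": 5},
]


def to_triples(vertices, edges):
    """Flatten the property graph into (subject, predicate, object) triples."""
    triples = []
    for vertex_id, properties in vertices.items():
        for key, value in properties.items():
            triples.append((vertex_id, key, value))
    for index, edge in enumerate(edges):
        triples.append((edge["out"], edge["label"], edge["in"]))
        # A plain triple has no slot for edge properties, so this sketch
        # gives each edge its own identifier and attaches properties to it.
        edge_id = f"e{index}"
        for key, value in edge.items():
            if key not in ("out", "label", "in"):
                triples.append((edge_id, key, value))
    return triples


triples = to_triples(vertices, edges)
for triple in triples:
    print(triple)
```

The point of the comparison is only that the property graph keeps properties on the edge itself, while the triple form needs an extra convention to say the same thing.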

And then going into

the types of challenges

that data engineering teams and other people working with the technologies and the applications might be facing,

what are some of those challenges that you see when people try to bring those graph applications into production

and start to try and deal with some of the scaling pressures that they might have as they bring more people on board and bring it to more different applications?

Yeah, that's a great question. When you were asking that, I was thinking about, I would say, the 3 most common challenges that we've worked through. They all deal with teaching people about the new processing paradigms that come with querying deeply connected data instead of just querying directly from tables.

We talk about those

kind of in 3 terms. We talk about them from the perspective of branching factor,

supernodes,

and then also,

just from the perspective

of more analytic style queries.

And when we mention

branching factor,

we're talking about the total number of edges that you process from a vertex as you walk deeper through your graph. So, just to make the math easy, if every vertex has 2 edges connected to it and you keep walking out, that's going to be exponential growth in the number of pieces of data that you're processing in your traversal.

So the first thing that teams very commonly run into is the problem with branching factor and the overall processing time for any individual query. They're always left thinking, well, why is this taking so long? By the very nature of working with connected data and being able to very expressively explore it with a graph query language, you end up asking for much more data from your application than you originally thought. And the first place to look is typically branching factor.
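The arithmetic behind that is easy to check. The sketch below assumes the simplest possible case, where every vertex fans out to the same number of edges and no vertex is visited twice; real graphs are messier, and the function name is just an illustrative choice.

```python
def edges_touched(branching_factor: int, depth: int) -> int:
    """Total edges walked over `depth` hops when every vertex fans out equally."""
    return sum(branching_factor ** hop for hop in range(1, depth + 1))


# With only 2 edges per vertex, the work still doubles with every extra hop.
for depth in (1, 5, 10, 20):
    print(depth, edges_touched(2, depth))
```

Even at a branching factor of 2, a 20-hop walk touches about two million edges, which is why a query that looks small can take far longer than expected.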

We commonly advise teams to only query, say, the most recent edge, if that makes sense in their application. That's a very common technique that people implement by putting time on edges and only going for the most recent elements, just as a specific example. But that's the first place that they run into issues.

The second place is with the idea of supernodes. Supernodes have constraints both in memory and on disk, and you're going to hit the constraints in memory a lot faster than you're going to hit your constraints on disk. A supernode here is just that vertex in your graph where almost everything points to it. Most natural graphs have a small number of vertices that are highly connected, where the number of edges pointing in or out of that vertex is disproportionately larger than for most other vertices, which have only a few connections. So, you know, that vertex in your graph that has everything pointing to it is called a supernode. And I think the Twitter example is when

they were setting up counters in Cassandra,

they discovered that Ashton Kutcher actually was the supernode of Twitter, because they were trying to implement counters to track the first person to get to 1,000,000 followers, and that was the profile that was having issues with implementing that in Cassandra. So when you think about someone who has as many followers as a celebrity on Twitter, that's what a supernode is. And so when you're walking through those types of vertices in your graph traversal, it's just going to require a lot of memory to process millions of edges as soon as you run into a highly connected vertex.

So, you know, typically, we have some recommendations on what to do there, like only walking through a subselection of the edges or, you know, only walking into supernodes, not out of them. And there's a ton of examples that we have in our book on how to get around them.

So we'd love to have, you know, conversations about that offline.
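As a rough illustration of two of those ideas, putting time on edges and only walking a subselection of a supernode's edges, here is a small in-memory sketch. The edge data, the degree threshold, and the function names are all made up for the example; a graph database would do this with indexes rather than Python lists.

```python
import heapq
from collections import defaultdict

# Each edge is (out_vertex, in_vertex, timestamp). One vertex is deliberately
# given far more incoming edges than the rest.
edges = [(f"fan{i}", "celebrity", i) for i in range(1000)]
edges += [("a", "b", 1), ("b", "c", 2)]

in_edges = defaultdict(list)
for source, target, timestamp in edges:
    in_edges[target].append((timestamp, source))


def find_supernodes(in_edges, threshold):
    """Vertices whose incoming edge count exceeds an assumed threshold."""
    return [vertex for vertex, incoming in in_edges.items() if len(incoming) > threshold]


def most_recent_in_edges(in_edges, vertex, limit):
    """Walk only the `limit` newest incoming edges instead of all of them."""
    return heapq.nlargest(limit, in_edges[vertex])


print(find_supernodes(in_edges, threshold=100))
print(most_recent_in_edges(in_edges, "celebrity", limit=3))
```

Capping the walk at a handful of recent edges keeps memory use flat no matter how many followers the highly connected vertex has.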

And the third item that I was alluding to is the balance of running a deep algorithm on request in your application versus storing that analytic to be retrieved. A good example here is how companies are likely serving up that recommendation window. We talk about this in the context of movies, because collaborative filtering and doing recommendations for the next movie to watch is a really, really popular feature to put into your application.

And to walk from you through the movies you've watched, to every person who's watched those movies, and then to every other movie to recommend, that's hitting supernodes and branching factor along the way. So very commonly, teams will set up processes to run in parallel offline and learn the recommendations, but store them with what we call shortcut edges.

So that you are only walking maybe in a star graph in your actual production application

to minimize the SLA for getting that contextualized recommendation.
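A toy version of that offline job might look like the following. It is a sketch under simple assumptions: the `watched` data is invented, the scoring is a plain co-watch count rather than any particular collaborative filtering algorithm, and the resulting dictionary stands in for shortcut edges that a production system would write back into the graph.

```python
from collections import Counter, defaultdict

# Hypothetical viewing history: user -> set of movies watched.
watched = {
    "ann": {"m1", "m2"},
    "bob": {"m1", "m2", "m3"},
    "cat": {"m2", "m3", "m4"},
}


def build_shortcut_edges(watched, top_k=2):
    """Offline pass: user -> movie -> other users -> their movies, kept as top-k shortcuts."""
    watchers = defaultdict(set)
    for user, movies in watched.items():
        for movie in movies:
            watchers[movie].add(user)

    shortcuts = {}
    for user, movies in watched.items():
        scores = Counter()
        for movie in movies:
            for other in watchers[movie] - {user}:
                for candidate in watched[other] - movies:
                    scores[candidate] += 1
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        shortcuts[user] = [movie for movie, _ in ranked[:top_k]]
    return shortcuts


# The online request is then a single hop over the precomputed edges.
shortcuts = build_shortcut_edges(watched)
print(shortcuts["ann"])
```

The expensive three-hop walk happens once offline, and the application only reads a short list per user at request time.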

So those 3, branching factor, supernodes, and then applying shortcut edges, are our 3 recommendations for where people commonly get tripped up when they're first writing graph queries and using graph databases.

So the things that Denise mentioned are the very practical things that, as a graph practitioner, you need to be aware of and make sure you design around, or at least understand how they will impact your implementation.

Then there's also, and this is where my head went in the beginning when you asked the question, an element for people who are relatively new to graph technology of a sort of personal hype cycle of graph that we've seen a lot of people go through. Initially you're new to graph and you're learning it. And then as you learn more about graph, you get to the point where it's like, wow, it seems like all problems are graph-structured problems, and it's a really powerful technology that should be applied to everything, much like you mentioned earlier. And it becomes a problem in larger teams, where some team members may be very enthusiastic about graphs and other ones are not so much. And if you aren't careful to really break down the problem into the pieces that are genuinely graph problems versus the ones that should be solved with technology that we've been using for decades and decades and understand very well, then that can cause frustration and make the move to production much harder.

You know, connectedness of the data. And you need to be aware of those elements as you move into production from building an initial prototype. And so, going back to the recommendation: really make sure that the piece that you're tackling with graph technology is genuinely a graph problem. That makes everything else after that, in particular moving to production, so much easier, because you can immediately show the value that graph technology is bringing to solving the problem rather than getting lost in an endless debate of, you know, whether this could have been solved with 50 million other technologies that have been used in production for much longer and are much more well known amongst your team or amongst the company or organization that you work in. And so that's 1 element that we see,

quite a bit where,

the the initial enthusiasm

might lead you down a path where you overcommit a little too much to graph technology.

Or, likewise, people who are new to graph technology will undercommit, and, you know, they will be much more hesitant to adopt the technology. And so, really, working with the team to understand the happy medium path, trying things out, building initial prototypes, taking those to production, seeing how they prove out, and then moving on incrementally from there is what we've seen work best with teams on their path from, hey, we have a graph-structured problem, and we think graph technology can really help us accelerate our time to market and our value delivery and value extraction from that data, to moving that into production and successfully operating it against real workloads and against realistic scaling concerns.

So general recommendation is take it easy, identify the piece that you wanna solve first, move incrementally,

and then be aware of all the things that Denise mentioned. And and specifically, you know, look into

how the use case you're looking at has been implemented before, because there's lots of examples out there. We try to go through 1 new example for each section of the book so we can cover a broad array of use cases. So look into those and see which ones are similar to the 1 that you're tackling, so you can see where the pitfalls are and where you can learn from past endeavors.

And then as far as the specifics of the database technologies,

I know that there are

differences in terms of the actual underlying data structures, but also the query syntax that they support, where Cypher from Neo4j has been a contender for a while, and they recently created the GQL standard, which is largely influenced by that language.

And then there was the Gremlin syntax, which has been an open standard that a lot of open source projects have been orienting around. I'm wondering what your thoughts are on the

different capabilities

that the different syntaxes have and your

hopes for the sort of coalescing

of syntax

to help with

making the overall graph database space easier to adopt and easier to deploy and interface with with multiple different tools?

Yeah. That's that's an excellent question. And 1 of the fun things about Graph is that it is still fairly new. Like, all the products are very new. The query languages that are being proposed are very new. And so part of the, you know, the fun of being in the ecosystem is that you can kinda see

all these different ideas emerging, and you can see how they merge together and form new ideas. The GQL initiative that you mentioned is sort of a merging of Cypher with some SQL-like superimposed graph syntax that Oracle developed, and you can see how those ideas are coming together.

And Gremlin is taking a very different approach, looking at graph problems from a traversal-centric perspective rather than the set-oriented, declarative nature that SQL brings to the table. And so that, I think, is also a really, really interesting idea, instead of trying to anchor the query language in SQL, which arguably a lot of people are already familiar with, but which does anchor it in a sort of set and relational world, where by relation I mean the mathematical sense of a relation as a tuple, which is not the most perfect abstraction for a graph in many ways. In some ways, it works really well, but in others, not so much. And so you can see there being some benefits to Gremlin, which uses a traversal-centric syntax that allows you to map your walk on the graph in a syntax that is very intuitive and path-like. It has sort of a functional vibe to it. And so those are all different ideas, and I'm really curious to see how they emerge and evolve over time.
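To give a feel for what a traversal-centric, path-like style means, here is a toy fluent traversal in plain Python. To be clear, this is not the Gremlin API: the `Traversal` class, its `out`, `dedup`, and `to_list` steps, and the tiny movie graph are hypothetical stand-ins that only mimic the shape of chaining one step after another.

```python
class Traversal:
    """A toy, hypothetical traversal object for illustration; not Gremlin."""

    def __init__(self, graph, current):
        self.graph = graph
        self.current = list(current)

    def out(self, label):
        """Step from every current vertex along outgoing edges with this label."""
        reached = [
            target
            for vertex in self.current
            for edge_label, target in self.graph.get(vertex, [])
            if edge_label == label
        ]
        return Traversal(self.graph, reached)

    def dedup(self):
        """Drop repeated vertices while keeping first-seen order."""
        return Traversal(self.graph, dict.fromkeys(self.current))

    def to_list(self):
        return list(self.current)


# Adjacency list: vertex -> [(edge_label, target_vertex)].
graph = {
    "ann": [("watched", "m1")],
    "bob": [("watched", "m1"), ("watched", "m2")],
    "m1": [("watched_by", "ann"), ("watched_by", "bob")],
    "m2": [("watched_by", "bob")],
}

# Read left to right as a walk: my movies, their viewers, those viewers' movies.
walk = (
    Traversal(graph, ["ann"])
    .out("watched")
    .out("watched_by")
    .out("watched")
    .dedup()
    .to_list()
)
print(walk)
```

Each chained step is one hop of the walk, which is the intuition behind calling the style path-like and somewhat functional.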

I think the standardization initiatives that are under way are trying to give the industry and adopters some sense of security that their knowledge investment is paying off over time. And so I think that's very important. At the same time, I think it's also important to keep an open mind as graph technologies emerge and evolve, to make sure that we come up with the language that can really harness the power of that

in a way that is easy to use for people. Right? Like, that's really the the critical spot you need to hit with a query language is if you look at SQL and and some of the earlier proposals that ultimately led to SQL, it was really about how can we harness the power

of the relational algebra

in a way that makes it easy for people to express so that the database system can do the bulk of the work. And so we need to arrive at something similar for graph, and I I personally don't think we're quite there yet. I don't think we have nailed this part,

where we're still kind of putting together historical pieces with some of the stuff that we know. But it is still too hard at this stage, for instance, to build custom network flow algorithms in a declarative manner, or to do both simple and complex path finding type queries. All of that still requires a ton of syntactic gymnastics that make it somewhat awkward. So my opinion is that 10 years from now, the world is going to look very, very different in that realm. And I'm really excited to see all the innovation that is happening and excited to contribute to it and keep the conversation going, because I think that is really going to be the crux of it. How do we get people to be able to harness this tremendous power that is in a graph? Right? Like, when you think of millions of interconnected pieces of data, and I want to ask a simple question, like, how is person x related to person y through the various types of paths that could exist in a social network, for instance? How can you give someone a query language that allows them to succinctly express their sense of what makes the path valuable, so that the database system can then do the heavy lifting of efficient path finding against optimized internal data structures that make this an efficient query to run without having to manually tune internal data structures? Right? And we're not at that point yet. And we need to get to that point before graph technology can be widely adopted by people who do not and should not need to care about internal data structure tuning and can leave that up to the database level. So that's kind of the bar that I'm holding, and I don't see that we have met that bar yet. So I look at everything that's going on as steps towards that goal. But I think that there's still a lot more to come in this arena, and I think that's going to be incredibly exciting.
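The simplest form of the "how is person x related to person y" question is an unweighted shortest path, which can be sketched with a breadth-first search. This is only a baseline under simple assumptions: the adjacency dictionary is invented, every edge counts the same, and there is no notion yet of what makes one path more valuable than another, which is the hard part described above.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = vertex
            if neighbor == goal:
                # Walk the parent links back to the start, then reverse.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical social network as an undirected adjacency list.
adjacency = {
    "x": ["a", "c"],
    "a": ["x", "b"],
    "b": ["a", "y"],
    "c": ["x"],
    "y": ["b"],
}

print(shortest_path(adjacency, "x", "y"))
```

Everything beyond this, such as weighting edge types or ranking several candidate paths, is where hand-written code or awkward query syntax currently tends to creep in.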

And then looking beyond the database layer, what are some of the other components of the overall data stack and data life cycle that either currently do or will eventually need to be able to

work with and understand graphs natively?

Yeah.

Great question. Matthias, you you might have more on that 1, but I'll I can start with, I can start with what I think.

So for for tooling to kinda complete the ecosystem,

I see, again, more from the practitioner and data science, you know, data engineering perspective, I see 2

tools that really are needed. You know, first, we we really need a uniform

layer for graph visualization.

So that,

you have an easy way to manage

and understand the structure of the data, but then seamlessly also communicate

visually

about the schema of your database. Because at the end of the day, your graph database schema essentially functions as your map as you're attempting to process information in your graph database. So having a unified system that visualizes both your data and your schema is something that I really hope to see emerge, because it's a critical piece in the stack as new technologists are picking up graphs.

And the second piece I see, and this is definitely coming more from my perspective as a data scientist and a data engineer, is having a notebook-like environment that is more a part of the global ecosystem for sharing notebooks, more like Jupyter or Zeppelin.

Having something that is more online so that technologists can share their graph work, you know, in this storytelling

or or developer driven notebook experience

is going to be really critical for helping to share and communicate

what almost everyone finds to be very fun and very exciting when they they run their first traversals or they they experience how to start to use graph data.

That notebook-like experience gives them a communication point. I believe that both of those tools are really missing uniformly across the entire graph ecosystem, and we need them so that we have more seamless enablement and communication about fun problems to solve with graph technology.

Yeah. I I totally second Denise's point on graph visualization. There's there's a lot we need to do on the usability side of graph technology. I I spoke to the query language earlier, which I think is 1 really critical component there because that's really the interface at the end of the day. But we need to support that with tooling that allows you to understand

what you're actually querying and how that query

gets interpreted and run. And that's

much harder to think about in the graph world. Right? Like, in the relational world,

you can you know, you have the table as the abstraction. And that that's a really nice abstraction because it's nicely 2 dimensional, and you can put it on a screen. And most of us are very familiar with navigating tables, columns, and rows. And we can kind of reason about that. And we can, you know, reason about how they get the rows get filtered down if you put a where clause on some columns.

Those are fairly intuitive things to reason about. Now with graph, even if you have a simple graph with, like, 1,000 vertices and 10,000 edges, that already gets very overwhelming. And if you just put it on a screen, it looks like a spaghetti ball that is hard to penetrate. And so there's more work needed in terms of giving users

enough visibility into the graph and the connectedness without overwhelming them

that we haven't quite figured out yet. There's a lot of really cool work happening there. And, particularly, we we have technology now that can visualize massive graphs.

But there is sort of an inherent limitation to what our visual cortex can absorb and what we can reasonably make sense of. And we need to do a better job of really filtering it down and giving sort of the the dynamic nature of the graph and the the multidimensionality

the right level of visual representation so people can make intuitive sense of it, which is really critical. On the other hand, I think another component is that graphs, in many cases, are emergent data structures. And what I mean by that is if I were to, let's say, send a text message to Denise. Right? That's that's an interesting data point. Right? Like, Matthias sent a message to Denise on, you know, this day at this time,

and the message content was this long. Where it gets much more interesting is when you take all the messages together and you arrive at the global communication network. Now you're talking about a massively interesting dataset that that emerges from these individual events or individual data points. And and so we need better infrastructure

technology to really allow people to harness

all these events and all this data that comes in and then build an emergent graph over time without having to do quite so much heavy lifting up front. Like, right now, it is very laborious

to plug the different streams and static and dynamic data sources together and and build a graph from it. There's, you know, parts of it are entity resolution related where you need to say, well, this, you know, this phone number is the same as this phone number or, like, those are the same individual and things like that. And and all of that right now is still very manual, very laborious.
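A stripped-down picture of an emergent graph looks like this: individual message events come in, identifiers are resolved to people, and edges accumulate. The event records, the `aliases` table, and the phone number and email addresses are all invented for the sketch, and a lookup table is a stand-in for real entity resolution, which is far harder than this.

```python
from collections import Counter

# Hypothetical raw events; the same person shows up under different identifiers.
events = [
    {"from": "+1-555-0100", "to": "denise@example.com", "length": 42},
    {"from": "matthias@example.com", "to": "denise@example.com", "length": 17},
    {"from": "denise@example.com", "to": "+1-555-0100", "length": 8},
]

# A hand-written alias table standing in for an entity resolution step.
aliases = {
    "+1-555-0100": "matthias",
    "matthias@example.com": "matthias",
    "denise@example.com": "denise",
}


def resolve(identifier):
    """Map a raw identifier to a person, falling back to the identifier itself."""
    return aliases.get(identifier, identifier)


def build_graph(events):
    """Fold individual events into weighted (sender, receiver) edges."""
    edge_weights = Counter()
    for event in events:
        edge_weights[(resolve(event["from"]), resolve(event["to"]))] += 1
    return edge_weights


message_graph = build_graph(events)
print(dict(message_graph))
```

No single event is a graph, but the accumulated edges are, and most of the manual effort in practice goes into the part that the alias table glosses over.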

And it delays graph projects quite significantly and oftentimes puts kind of a nail in their coffin, because it takes you so long to get to that emergent graph that by the time you get there, people have already given up on the project. And since there are so many common elements there, I think we can do a better job of building an ecosystem of tools around that to really help you build emergent graphs. And then on the downstream side, to make it easier to store such graphs, to export them, and to put them into data warehouses and all these other analytic tools that people are already familiar with. So, basically, making projections out of the graph that can serve other downstream systems that we're familiar with and comfortable with, so that, again, there's less of that plumbing that we need to do to fit graph technology into an existing data

infrastructure, which for most of us exists. Right? Like, there is you know, like, the executives use a certain suite of BI tools to file the daily reports. Right? So we need to make sure that the data goes there some way. Otherwise, it won't exist from their perspective.

And that, again, needs to be something that becomes more automated and and

not a headache. It just is 1 of the things, and you have a tool that does that simply. So, basically, on the ingest side: getting data in, building an emergent graph, then having the right tools, visualizations, notebooks, and other types of user interfaces that we have yet to build for this relatively new graph data structure. I mean, not new in the sense of having been invented recently, but new in the sense of becoming popular. We do need new usability tools

to work with it. And then on the downstream side, make it easy to project out of the graph to serve other systems and other use cases that rely on that data.
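One simple kind of projection is a flat table that a warehouse or BI tool can load directly. The sketch below, with made-up edges and column names, turns a small graph into one row per vertex with its in and out degree and serializes it as CSV using only the standard library.

```python
import csv
import io
from collections import defaultdict

# Hypothetical edges as (source, label, target).
edges = [
    ("ann", "follows", "bob"),
    ("cat", "follows", "bob"),
    ("bob", "follows", "ann"),
]


def degree_rows(edges):
    """Project the graph to one row per vertex: (vertex, out_degree, in_degree)."""
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for source, _label, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    vertices = sorted(set(out_degree) | set(in_degree))
    return [(vertex, out_degree[vertex], in_degree[vertex]) for vertex in vertices]


def to_csv(header, rows):
    """Serialize rows to CSV text that a downstream tool could ingest."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


table = to_csv(("vertex", "out_degree", "in_degree"), degree_rows(edges))
print(table)
```

The graph stays the system of record for connectivity, while the tabular projection is what the existing reporting tools actually see.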

And as you look at the current trends in the overall industry of

the technologies and the use cases for graph databases

and applications built around graph structures, what are some of the overall trends and developments

or just general improvements that you're seeing that you're most excited about, and that you're keeping a close eye on? Good question. On on all the elements that we talked about previously, I'm really, really keenly interested to see how the query language conversation discussion is going to evolve and,

and how we are able to

better connect with and pull users into the ecosystem.

So for instance, you know, we talked about Gremlin. We talked about Cypher. We talked about GQL. But then kind of out of left field came GraphQL, and it became an incredibly popular

sort of API protocol in the JavaScript ecosystem. And it has some, I mean, it is called GraphQL because it looks at

API consumption from a graph data perspective. And and it has become wildly popular,

much more popular in terms of if you look at raw numbers of developers using it

than any graph technology that's on the market today. And so it's, you know, like, I expect those things to keep happening, and they will all kind of influence each other to to to evolve towards this goal, this ultimate goal of a really user friendly,

but powerful

graph query language.

And so that's 1 area that I'm really keenly interested in. And it's really interesting to see how even areas that you think are completely unrelated to some extent, right, like the whole problem of how a mobile device consumes data from a back end service, which is where GraphQL came from at Facebook,

how that now is influencing

the graph technology landscape

quite heavily. And and I expect that will continue to happen as graph structured data will become more and more valuable and more and more important. On the other hand, there's we talked about graph visualization. There's a lot of really good work happening there on both using GPU accelerated graph visualization so you can visualize larger graphs, but also being able to cluster the data in certain ways so you can kind of zoom in and out of a graph like you can with Google Maps.

Like imagine you could only see the world at the level of street detail. It would be very overwhelming. And so we can zoom in and out. And something similar, we should be able to do with graph data. But we haven't quite figured out yet how to semantically group things in the right way. So there's a lot of really good stuff happening there using modularity optimization.
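The zooming idea can be sketched as collapsing each cluster into a single super-vertex and scoring the grouping with modularity. In this sketch the community assignment is simply given by hand, where an optimizer would normally search for it, and the graph is a made-up pair of triangles joined by one edge.

```python
from collections import Counter

# Undirected edges of a small hypothetical graph: two triangles and a bridge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]

# Assumed grouping; a modularity optimizer would search for this assignment.
community_of = {"a": "left", "b": "left", "c": "left",
                "d": "right", "e": "right", "f": "right"}


def zoom_out(edges, community_of):
    """Collapse each community to one vertex and count the edges between them."""
    summary = Counter()
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu != cv:
            summary[tuple(sorted((cu, cv)))] += 1
    return summary


def modularity(edges, community_of):
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    internal = Counter()
    degree_sum = Counter()
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


print(dict(zoom_out(edges, community_of)))
print(round(modularity(edges, community_of), 4))
```

The zoomed-out view shows two groups and one connection between them, and a higher modularity score suggests the grouping reflects real structure rather than an arbitrary cut.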

And then in terms of data structures and actual query engines,

1 of the fundamental realities is that our computer architecture, the Von Neumann architecture around which all of our computers are built, ignoring quantum computers for a second here, has an essentially linear memory. Right? All of our memory is addressed linearly, which is great for, you know, contiguous blocks of data. And this is why it's so fast to read, let's say, a movie off your disk, because you can read it off sequentially, and that is an incredibly fast operation.

Jumping from 1 page on your disk to another page is very slow. And that tends to be the case with graph data, which is higher dimensional and, in fact, can be infinitely dimensional. So 1 of the really interesting questions is, how do you take graph data, which can be infinitely dimensional, and map it onto 1 dimensional memory in such a way that you minimize jumping or seeking across your disk or your SSD and pulling various pages into memory, and rather have contiguous blocks that are co-accessed together? That is still very much an open problem. There's lots of really interesting research in graph partitioning, and there are interesting practitioner elements to it. So how can you, as a practitioner,

influence the schema in such a way that you get better locality on disk? How can we build better index structures so we can index graph data in a way that that allows for faster retrieval of highly complex operations like path finding, for instance? So all of those areas are still very much work in progress, and I'm really keenly interested in seeing how they evolve and trying to keep tabs on the research that's happening, you know, the technologies that are being built. And then stuff that comes out of left field, like GraphQL

did to some extent and tends to influence the space a lot.
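One well-known way to lay a graph out in linear memory is the compressed sparse row (CSR) format, sketched below. It is offered as a general illustration of the mapping problem, not as what any particular graph database does; vertices are assumed to be numbered from 0, and the example edges are invented.

```python
def to_csr(num_vertices, edges):
    """Pack directed edges into two flat arrays: per-vertex offsets and targets."""
    offsets = [0] * (num_vertices + 1)
    for source, _target in edges:
        offsets[source + 1] += 1
    # Prefix sum: offsets[v] becomes the start of v's slice in `targets`.
    for vertex in range(num_vertices):
        offsets[vertex + 1] += offsets[vertex]

    targets = [0] * len(edges)
    cursor = offsets[:-1]  # next free slot per vertex (slicing makes a copy)
    for source, target in edges:
        targets[cursor[source]] = target
        cursor[source] += 1
    return offsets, targets


def neighbors(offsets, targets, vertex):
    """A vertex's neighbors are one contiguous slice, read sequentially."""
    return targets[offsets[vertex]:offsets[vertex + 1]]


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
offsets, targets = to_csr(4, edges)
print(offsets, targets)
print(neighbors(offsets, targets, 2))
```

Reading one vertex's neighbors is sequential, but hopping to the next vertex on a walk can still land anywhere in the array, which is the locality problem that partitioning and indexing research tries to reduce.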

Yeah. Yeah. And and, Matthias, I when I was listening to the question and hearing your first answer, I could not be more aligned, with your your general first 2 recommendations

right off the riff.

Number 1, watching what happens with graph query languages, I completely agree with you on that.

Number 2, though, with graph visualization, I would kind of go in a different direction with it. I think there's a need for a better developer experience from a visual perspective for understanding your schema, but then also for more localized traversals instead of global structures. The nuance and the difference between how you and I see it is really just whether you're trying to visualize very large segments of your graph, or you're just trying to get confirmation about locally where you are. And I think the latter of those is where I have seen the industry pulling us more, towards developing a better tool for the developer experience. That's just the initial perspective.

And then lastly, as I was listening to your comments about how we better represent graph structures on disk and the different paging requirements, that was just reminding me of the emergence of neuromorphic computing that's happening

where people are starting to, you know, essentially create chips that are trained structures of neural networks, essentially solving that problem for 1 very specific application of machine learning and deep learning, namely neural nets. But I think that the advancements there in neuromorphic computing

and how they are creating

a hardware for that acceleration could have a very interesting influence on,

how we approach storage of graph structures in the future. So interesting space to watch.

For anybody who wants to dig deeper into this problem domain or learn more about the overall space and applications

of graph data and graph algorithms, what are some of the resources that you recommend?

That's yeah. Great question. I'm always gonna come from the perspective of get your hands on it and try something. You know, I I love to

keep writing code or working on things until it it's not working or you break you break something. So from that perspective,

I have 2 recommendations.

The first of which is if you're just getting started, there are a ton of datasets,

on Kaggle,

where you can get your hands on graph structured data and see how people have solved

graph problems like link prediction.
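For anyone who has not met link prediction before, a very small baseline is to score pairs of unconnected vertices by how much their neighborhoods overlap. The sketch below uses the Jaccard coefficient on a made-up graph; it is one classic heuristic chosen for illustration, not the approach used in any particular Kaggle solution.

```python
from itertools import combinations

# Hypothetical undirected graph as vertex -> set of neighbors.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}


def jaccard(adjacency, u, v):
    """Shared neighbors divided by all neighbors of either vertex."""
    union = adjacency[u] | adjacency[v]
    return len(adjacency[u] & adjacency[v]) / len(union) if union else 0.0


def rank_candidates(adjacency):
    """Score every pair that is not yet connected, highest score first."""
    scored = [
        ((u, v), jaccard(adjacency, u, v))
        for u, v in combinations(sorted(adjacency), 2)
        if v not in adjacency[u]
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


ranked = rank_candidates(adjacency)
for pair, score in ranked:
    print(pair, round(score, 3))
```

The pair with the most shared neighbors comes out on top, which is the intuition most link prediction methods start from before adding richer features.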

So I recommend going to Kaggle for that. If you want to go a little bit deeper, SNAP, the Stanford Network Analysis Project, has both a synthetic graph generator and a whole suite of graph datasets, some of which we use in our book. And then if you do want to get into using graph databases, alongside our book we have an entire technology stack for playing with data and these graph traversals for solving the most common problems. There's a really, really cool getting-started developer experience that we have them deployed in. It's called DataStax Desktop, and it just kind of gets everything ready for you so you can get started with exactly where we have everything set up in the book. So if you want a vendor-specific example, I would go to DataStax Desktop for exploring that getting-started experience,

with the examples that we have in our book. Are there any other aspects of graph technologies

or applications

of graph data that we didn't discuss that you'd like to cover before we close out the show?

I think we talked about most of the most of the important elements. I'll I'll I'll bring it back to

really

I think

it is a really fascinating way to look at the world

through the lens of graph structured data. And so for anybody who is listening and who finds this whole area

somewhat interesting, I definitely second what Denise said about try it out, play around with it. And I would also recommend folks sort of play with this notion of graph thinking and try to look at the problems that they're currently solving from a, hey, how would I solve this with graph? What would it look like if I looked at this problem from the angle of a graph problem and applied graph thinking to it? Not necessarily because you're then going to go and do exactly that, but just to train your brain a little bit to think in those ways. And you'll see how it really opens up your aperture for problem solving in a really fun and interesting way. I think the technologies are important, but I also think that those technologies are going to be very different 10 years from now, and they'll hopefully have evolved tremendously by then. But the 1 thing that I think is going to be the same is that, in order to use them effectively, you still need to be able to think in terms of graphs and apply graph thinking. And so learning that, I think, is a skill that will be useful for years, if not decades, to come. And I encourage anybody

who is interested in this to to try this out and and play around with this.

Alright. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you each add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Denise, if you wanna go first.

Yeah. That's a great question to close with, and I'm sure you've picked up on the theme throughout this podcast. I think that the single biggest gap is an easy-to-use graph query language. I think both Matthias and I are really looking forward to seeing the innovation coming from that space and on that topic. And, to the points that we've made earlier, I think we're gonna see innovation primarily in terms of ease of use, but then also in trying to bridge the gap between relational algebra thinking, which was primarily SQL and relational databases, and why people love graph data structures, which is mainly from the graph theory perspective. So I'm really looking forward to seeing where we start closing the gap in graph query languages.

Yeah. 100% agree. It's all about usability. The technology is powerful. Graph thinking allows you to solve problems in a novel way. The big gap is making it easy for people to do so in a way that does not require you to have years and years of practical experience, but rather lets you get up and running quickly and solve problems in a matter of hours, very effectively. And we're not there yet. And I think understandably so, because it's still very new technology and very early days. But that will be the big item for us to figure out: how do we make this easy to use?

Alright. Well, thank you both very much for taking the time today to join me and share your experience of working in this problem domain. As we've said, it's definitely something that is increasingly relevant and necessary for the types of data problems that we're being faced with. So I appreciate all the time and effort that you've put into your work in the field and your work on this book. So thank you again for that, and I hope you enjoy the rest of your day.

Yeah. Thank you so much for having us. This was a real honor and pleasure to get to be on this. Thank you. Yeah. Thank you so much. Have a good day. Bye.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
