Summary
The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph; however, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition, he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metadata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.
- Python has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience
as a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40!
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database.
Interview
- Introduction
- How did you get involved in the area of data management?
- What is DGraph and what motivated you to build it?
- Graph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph oriented storage systems?
- The graph space has become crowded in recent years. How does DGraph compare to the current set of offerings?
- What are some of the common uses of graph storage systems?
- What are some potential uses that are often overlooked?
- There are a few ways that graph structures and properties can be implemented, including the ability to store data in the edges connecting nodes and the structures that can be contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose?
- How does the query interface and data storage in DGraph differ from other options?
- What are your opinions on the graph query languages that have been adopted by other storage systems, such as Gremlin, Cypher, and GSQL?
- How is DGraph architected and how has that architecture evolved from when it first started?
- How do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write?
- In your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition?
- What are the limitations of DGraph in terms of scalability or usability?
- Where does it fall along the axes of the CAP theorem?
- For someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like?
- What have been the most challenging aspects of building and growing the DGraph project and community?
- What are some of the most interesting or unexpected uses of DGraph that you are aware of?
- When is DGraph the wrong choice?
- What are your plans for the future of DGraph?
Contact Info
- @manishrjain on Twitter
- manishrjain on GitHub
- Blog
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- DGraph
- Badger
- Google Knowledge Graph
- Graph Theory
- Graph Database
- SQL
- Relational Database
- NoSQL
- OLTP (On-Line Transaction Processing)
- Neo4J
- PostgreSQL
- MySQL
- BigTable
- Recommendation System
- Fraud Detection
- Customer 360
- Usenet Express
- IPFS
- Gremlin
- Cypher
- GSQL
- GraphQL
- MetaWeb
- RAFT
- Spanner
- HBase
- Elasticsearch
- Kubernetes
- TLS (Transport Layer Security)
- Jepsen Tests
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software, then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metadata management, and flexible hosting.
Stop by their booth at JupyterCon in New York City on August 22nd through 24th to say hi and tell them that the Data Engineering Podcast sent you. After that, keep an eye on the AWS marketplace for a prepackaged version of Quilt for teams to deploy into your own environment and stop fighting with your data. And Python has quickly become 1 of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you're just starting out. Luckily, there's an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to Code with Python is the perfect way to get started working with the language.
Ana's experience as a teacher of Python really shines through as you get hands on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step by step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40. And go to dataengineeringpodcast.com
[00:02:09] Unknown:
website to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey. And today, I'm interviewing Manish Jain about Dgraph, a low latency, high throughput, native and distributed graph database.
[00:02:23] Unknown:
So, Manish, could you start by introducing yourself? Hi, Tobias. I'm Manish. Before I started Dgraph, I used to work at Google. I was there for 6 and a half years working in the web search and knowledge graph infrastructure group. In fact, that's where I started my journey with graph systems, and I started Dgraph in 2015, and
[00:02:43] Unknown:
now we are here. And do you remember how you first got involved in the area of data management?
[00:02:48] Unknown:
Yeah. I think it was in 2010. We had just wrapped up a big project at Google, a real time indexing system, and launched that. After that, I started looking into what I should do next. Google had acquired a startup in San Francisco which brought the knowledge graph to Google, and I started working with them to see how we could use the knowledge graph in web search. As part of that, I started getting my feet wet with graph systems in general, and that's where the whole thing started. Can you start by describing a bit about what Dgraph is and what your motivation was when you first started building it? Yeah. So Dgraph is a distributed graph database which is meant to scale out. It's meant to provide all the properties that you would expect from modern database systems. We consider it to be a new kind of graph database.
And the reason I started building Dgraph was that I had already built a similar system at Google. It was not a database, but more of a graph serving system. It was meant to be low latency and high throughput because it was going to be put behind web search, so a good chunk of web search traffic was going to hit the system. When I started comparing that, about 2 years after I left Google, to what's out there in the graph world, I realized that the best graph databases out there were only running on a single server, didn't perform very well, and didn't scale well. It was probably something that should be addressed. There's a clear need in the market, but the existing solutions were just not good enough. And the entire concept around graph databases and graph theory and the associated algorithms have been part of computing for several decades now,
[00:04:36] Unknown:
but graph databases as a storage engine were fairly nascent until recent years, which have seen a bit of an explosion in their use as a mechanism for people to incorporate into their data layer. So I'm curious what has changed in recent years that allows for that proliferation of graph oriented storage systems and
[00:05:00] Unknown:
the interest of people using it within their systems? Yeah. What has changed is that modern data sets tend to be increasingly sparse and highly interconnected. The connections between data sets are exploding. To be able to make sense of these data sets, you need to be able to do arbitrary depth joins and connection traversals, edge traversals. That's something that traditional relational databases are not very good at handling or performing. So we see, obviously, SQL databases being used in companies, and then we see that they use some level of NoSQL, but there's a clear need where they have certain use cases which can only be solved, or can be solved a lot more easily, with a graph database or graph system, and that's when they go and reach out in the market to see what's out there. So almost all the medium to big companies are using some sort of graph system internally.
[00:05:57] Unknown:
And because of the current interest in graph systems, there are a fairly large number of systems that are targeting that use case in various capacities. So I'm curious how you view Dgraph fitting into that overall landscape and how it compares to the offerings that are available in that space. Yeah. So I think some of the ones that I've seen have been sort of focused on analytics
[00:06:24] Unknown:
use cases, where you have systems which can compute PageRank over the complete dataset in memory, etcetera. So with Dgraph, instead of focusing on analytics, we actually wanted it to be a database that people can build their applications upon. And so we went straight after the OLTP use case, the real time transactional use cases. And there are a few others; obviously, the elephant in the room, or in the industry, would be Neo4j, the most popular graph database out there. But it just has not been able to capture the market the way, let's say, Postgres or MySQL have done for the SQL landscape, because of various issues, like, I would say, performance and data corruption, etcetera. If you go to Stack Overflow, you'll see some of these things.
Also, I felt like we don't need to limit ourselves to these kinds of design choices when clearly the world has moved on. Google released the Bigtable paper back in 2006 or 2007, and then in, I think, 2011 or 2012, there was the release of the Spanner paper. So there's a race to build the next geographically distributed SQL database, but we are stuck with single server graph databases which don't really perform very well. So clearly, we can do better. The way we have designed Dgraph, we consider it to be the most advanced graph database in the market. And the idea is that we will make those design decisions which are hard, which means we want to be able to scale it horizontally. It has to be consistently replicated. It has to be transactional. All those things which make for a good database
[00:08:08] Unknown:
of today. And 1 of the points of confusion that comes up occasionally when people are first getting introduced to graph databases and graph storage is the potential use cases for it. So can you discuss some of the common uses of Dgraph and other graph databases that you are aware of, and some of the use cases that are often overlooked but would be a good workload for a graph system?
[00:08:39] Unknown:
Yeah. So I think 1 of the use cases that people certainly think about for graph databases, I would say, is recommendations. You know, when you have a rich dataset and you want to be able to recommend, for example, what users should be purchasing if you are building an ecommerce website. That's probably 1 of the most common use cases of a graph database. The next most common, I would say, is fraud detection, real time fraud detection. In fact, we have seen a bunch of fintech, and even non-fintech but highly technical, companies use Dgraph for fraud detection.
There's others like, you know, in ad tech, you can place ads based upon users' activities and interests and the social graph. We have seen customer 360, where they unify vertical silos of data into 1 big connected graph so that they can query across these different, independent data sets and be able to surface interesting information about how their customers have been interacting with them. 1 of the interesting things about graphs, actually, to this point, is that graph databases do not necessarily have these boundaries of tables. Right? So you actually have the entire dataset in 1 database, and you can query from any data point to any other data point without necessarily losing performance.
And that's 1 of the interesting things that we have seen with customer 360, for example. Now, obviously, there's the artificial intelligence use case, where you can interpret user queries and understand them better with a knowledge graph, and that's in fact what I had started with back in 2010. I would say the most commonly overlooked use case would be to just build a social website, you know, with posts, comments on the posts, likes on comments, comments on comments, likes on those. This is a repeated pattern. You can see it in Facebook. You can see it in Quora, which is a question answering website. You can see it in Stack Overflow. It's a similar sort of workload, and I think you can cut down a lot of code by using a graph database to serve this content, because you can recursively retrieve the posts, the comments connected to those posts, and then the likes on those comments, and so on and so forth, with a graph database easily, and cut down your code base by probably half.
So that's 1 of the things that I would like to see more of, but I think it's often overlooked.
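To make that concrete, here is a minimal Go sketch of the post/comment/like pattern described above, modeled as one recursive traversal over a small in-memory graph. The Node and Thread names are purely illustrative and are not Dgraph APIs; the point is that a single recursive walk replaces the per-table join code a relational layout would need.

```go
// A minimal sketch of the "posts, comments on posts, comments on comments,
// likes on comments" pattern, modeled as one recursive traversal.
package main

import (
	"fmt"
	"strings"
)

// Node is a generic graph node: a post or a comment, each with likes and replies.
type Node struct {
	Text    string
	Likes   int
	Replies []*Node
}

// Thread walks the graph recursively, collecting the whole discussion thread.
func Thread(n *Node, depth int) string {
	var b strings.Builder
	b.WriteString(fmt.Sprintf("%s%s (%d likes)\n", strings.Repeat("  ", depth), n.Text, n.Likes))
	for _, r := range n.Replies {
		b.WriteString(Thread(r, depth+1))
	}
	return b.String()
}

func main() {
	post := &Node{Text: "Graph databases in production", Likes: 12,
		Replies: []*Node{
			{Text: "How does it scale?", Likes: 3,
				Replies: []*Node{{Text: "Shard by predicate.", Likes: 5}}},
		}}
	fmt.Print(Thread(post, 0))
}
```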
[00:11:01] Unknown:
And when I've been looking at different graph systems, there are variations in the way that the data is represented, whether nodes can have their own attributes, and things like that. So I'm wondering where Dgraph falls along that continuum in terms of how the data is modeled and stored. So Dgraph actually has a very interesting approach to this. Note that we were going for arbitrary depth joins, and at the same time, we wanted a distributed system.
[00:11:38] Unknown:
I don't know of any other system which is trying to do arbitrary depth joins while being distributed, so we actually had a pretty interesting design challenge there. The thing is, you can divide up the data, and the data is actually nodes and edges, right? So you can divide the data by either nodes or edges. Most designs that I've seen divide up the data by nodes, because they spread out evenly across different machines, different servers. The problem with that is that if you need to do joins, each join would require you to do a broadcast to all the servers in the cluster, and that significantly affects your 95th percentile latency.
So what we do instead is divide the data by edges instead of nodes. We are then able to do a join with a single network call to another server. And if you have, let's say, an n-depth join, then you can just do n network calls and your query will be completely executed. This gives you a lot lower latency, and a lot more consistent latency, even if you increase the number of machines in the cluster. So that was a deliberate design decision that we made up front. It's obviously a lot more complicated to build, but it reduces the number of network calls in the system and gives us a reliable latency number, which is important for real time transactional workloads.
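The sharding idea he describes can be illustrated with a toy Go sketch, assuming a hypothetical in-memory cluster where each shard owns all the edges for one predicate. Nothing here reflects Dgraph's internal APIs; it only shows why an n-depth join becomes n single-shard lookups rather than n broadcasts.

```go
// Toy sketch: data split by edge (predicate), so one join step touches exactly
// one shard instead of every server in the cluster.
package main

import "fmt"

// Shard owns every edge for one predicate, keyed by source node ID.
type Shard map[uint64][]uint64

// hop resolves one join level: for all source nodes, follow one predicate.
// In a distributed setup this would be one network call to one server.
func hop(cluster map[string]Shard, predicate string, src []uint64) []uint64 {
	shard := cluster[predicate]
	var out []uint64
	for _, id := range src {
		out = append(out, shard[id]...)
	}
	return out
}

func main() {
	cluster := map[string]Shard{
		"follows": {1: {2, 3}},
		"posted":  {2: {10}, 3: {11, 12}},
	}
	// "Posts of the people user 1 follows": an n-depth join is n hops,
	// i.e. n calls, regardless of how many servers hold the data.
	followed := hop(cluster, "follows", []uint64{1})
	posts := hop(cluster, "posted", followed)
	fmt.Println(posts) // [10 11 12]
}
```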
[00:13:07] Unknown:
And for people who are moving from a relational system who might be familiar with things like data normalization and the various normal forms when they're moving to a graph system, is any of that knowledge applicable, or do they need to completely rethink the way that they structure the data that they're storing?
[00:13:25] Unknown:
Oh, actually, interestingly, they would just need to simplify the whole model. It'll be a lot closer to how they would visualize it in their mind. So, you know, taking the whole Facebook post kind of example, you have a post pointing to a comment, which is a comment on the post, and then you have another comment on that comment. If you think in terms of a graph, then you have a post, a comment coming out of that, another comment coming out of that comment, and a like on that comment, and so on and so forth. So in terms of a mind map, it's maybe 3 or 4 bubbles connected to each other. But if you look at how you would represent this in a relational database, you'll have to have different tables, and a comment on a comment would point back to the same table via a foreign key, while a comment on a post would be a different kind of pointer back.
It basically requires a lot more thinking to represent this in relational databases than it does in graph databases. So if anything, they would just need to unlearn some of that complexity, go with the most natural mind map that they would think of, and represent that in the graph. And my understanding is that at the sort of foundational layer of Dgraph, you're using a key value store,
[00:14:49] Unknown:
and I'm wondering if that is reflected at the user interface and query level when somebody's storing the data, whether they are only able to store key value attributes, or whether it supports more of a rich document structure at the nodes or on the edges?
[00:15:07] Unknown:
Yeah. It's not reflected in the end user experience. Internally, we convert the data that we have, all these nodes and edges, etcetera, and represent it in a very interesting structure. For example, we assign a unique integer to each node. Then if you need to store a list of your friends, we actually store it in a single structure, what we call a posting list, with your friend IDs as sorted integers. And that's really useful because if you need to find common friends between you and somebody else, we get 2 posting lists and we can just do an intersection of 2 sorted integer lists, which is extremely efficient. So we store these posting lists internally in the key value database that we wrote in house.
And so we don't really expose the actual key value nature of this to the end user.
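As a rough illustration of the posting-list idea, here is a small Go sketch that stores friend lists as sorted integer IDs and finds common friends with a linear merge. The layout is illustrative only and is not Badger's actual storage format.

```go
// Minimal sketch: friend lists as sorted integer IDs, so "common friends"
// is a linear merge of two sorted slices.
package main

import "fmt"

// intersectSorted returns the IDs present in both sorted posting lists.
func intersectSorted(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	mine := []uint64{3, 7, 19, 42, 57}
	yours := []uint64{7, 8, 42, 99}
	fmt.Println(intersectSorted(mine, yours)) // [7 42]
}
```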
[00:16:05] Unknown:
And I was reading through some of the blog posts and documentation for the Badger storage engine that you wrote, and I'm curious if that's being used within other projects or if it's currently only leveraged by Dgraph. Oh, it's definitely being used. In fact, UsenetExpress,
[00:16:23] Unknown:
which is a Usenet hosting service, I think they are running dozens of petabytes of data on Badger. We have IPFS, the InterPlanetary File System, supporting Badger as a datastore, and there's actually a list of users on our website. And, yeah, in fact, there are probably a lot of Badger users. I gave a talk at GopherCon China, and I was explicitly asked to talk about Badger, because there's a lot of interest
[00:16:55] Unknown:
in Badger in China. And going back to Dgraph, there are a number of different syntaxes that are used by some of the different database engines for queries, things like Gremlin and Cypher and GSQL. So I'm wondering if you can discuss a bit about the query interface that you settled on for Dgraph, and your opinions on some of the other languages that have been adopted by those other systems?
[00:17:24] Unknown:
So yeah, I have a bit of an opinion about those. When we were starting with Dgraph, we had the same question: hey, do we want to support Gremlin? Do we want to support Cypher? What should be our language of choice? Gremlin is the most popular and the most widely used graph query language. However, if you look at the actual semantics of how the queries run, it's a bit too simplistic, in that you are traversing the graph and telling it what to return. You kind of lose the relationship information of how you go from a to b. You don't really get a subgraph of results back, you just get a list of results back, and it's actually slow to execute. So we ruled that out pretty early on. We looked at Cypher, and we had a similar conversation around that. It again returns a list of results as opposed to a subgraph. And you can always go from a subgraph to a list and normalize it, but you cannot go back from a list to a subgraph, because you lose that relationship information.
To give you an example, let's say you are looking for movies where Tom Hanks acted and movies directed by Steven Spielberg, etcetera. If you just return the list of movies, you lose which movies were acted in by Tom Hanks, and you actually want to maintain that information. So we wanted to support a query language which would maintain these relationships and return a subgraph back. And we came across GraphQL, and it looked like it was the right fit. I talked to some other people back at Google as well, from MetaWeb, and it was close to MetaWeb's own query language.
And we embarked upon using GraphQL as our language of choice. However, we were also aware that this is not a full fledged query language for a database, so we ended up modifying it to serve our database's needs and simplifying it a bit more, adding filtering and I think a few other things, and removing some of the complexity that we thought was unnecessary. So we ended up with a modified version of GraphQL. In fact, 1 of the most asked for features in Dgraph is to also support native GraphQL.
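The subgraph-versus-list distinction can be shown with a small Go sketch. The JSON below is only a hypothetical GraphQL-like response shape, not Dgraph's exact wire format, but it shows how the nesting itself preserves which movies belong to which relationship, information a flattened list of movie names would lose.

```go
// Sketch: a nested (subgraph-shaped) response keeps the acted_in vs directed
// distinction; a flat list of movie names could not tell those edges apart.
package main

import (
	"encoding/json"
	"fmt"
)

type Movie struct {
	Name string `json:"name"`
}

type Person struct {
	Name     string  `json:"name"`
	ActedIn  []Movie `json:"acted_in,omitempty"`
	Directed []Movie `json:"directed,omitempty"`
}

func main() {
	response := []byte(`[
	  {"name": "Tom Hanks", "acted_in": [{"name": "Saving Private Ryan"}]},
	  {"name": "Steven Spielberg", "directed": [{"name": "Saving Private Ryan"}]}
	]`)

	var people []Person
	if err := json.Unmarshal(response, &people); err != nil {
		panic(err)
	}
	for _, p := range people {
		fmt.Printf("%s: acted in %d, directed %d\n", p.Name, len(p.ActedIn), len(p.Directed))
	}
}
```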
Also, 1 of the most interesting things for us is when people play with our version of GraphQL and find it to be simpler to use. So that's how we ended up with it. So going a bit deeper, can you discuss
[00:20:12] Unknown:
the way that Dgraph itself is architected and how that has evolved over the course of time that you've been working on it? So Dgraph's architecture actually has been sort of continuously
[00:20:24] Unknown:
evolving, though it has, I think, largely stabilized now. We started with a single server architecture in version 0.1. I think in version 0.2 or 0.3, we went over to having multiple servers so that you could actually shard your data across servers. In 0.7, we introduced Raft based replication for these shards, so that they would be consistently replicated. And we weren't planning on transactions; we never thought we'd actually build transactions into Dgraph for a while.
At some point, Gustavo Niemeyer, a big personality in the Go community, created a very convincing issue in Dgraph saying that he'd been playing with MongoDB for a while and it was really hard because MongoDB did not support transactions. I think they do now, but at the time they did not. And he made us promise that we would support transactions. That sort of started the whole quest for us to see what it would take to support transactions in such a distributed system. We then introduced transactions into Badger, and then we actually tried out 2 different approaches to building transactions in Dgraph, which required a lot of architectural changes, and then we settled upon the 1 that we currently have. So, yeah, it's a pretty unique system with a pretty unique architecture.
Also, because we have promised people that this is going to be fast, and that's what people really like about Dgraph, even when we introduced transactions we had to make sure that these transactions would perform. So we made some choices where we chose not to go with Spanner style transactions, but more of a Percolator / HBase style, but then still modified it quite extensively.
[00:22:25] Unknown:
And in terms of the schema definition, Dgraph supports a flexible schema, or schema on read, approach. So I'm wondering how you balance that flexibility with the additional complexity required at the application layer to support schema on
[00:22:43] Unknown:
read versus schema on write? So, yeah, I think flexible schema is absolutely important for developer iteration, and we wanted to make sure that our users actually have that flexibility. So Dgraph actually supports schema both on write and on read. What happens is, if you don't specify a schema upfront, which is completely alright, Dgraph will try to interpret your data and figure out whether it's a string or an int or a float or a date, etcetera. Then we set that schema for the predicate, and all future data that comes in is checked against that. In case you change the data type, from int to float or int to string, then at read time we look at the type that is current, look at the type that was stored, and try to do a conversion if that's possible and return back the result, so the end user does not have to do their own conversion. It obviously comes with a certain potential CPU cost, but I think it saves a lot more for the end user.
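A toy Go version of that schema-on-read conversion might look like the sketch below. The supported types and conversion rules here are illustrative assumptions, not Dgraph's actual logic; the point is simply that the stored type and the current schema type can differ, and the read path coerces when a conversion path exists.

```go
// Toy schema-on-read conversion: coerce a stored value to the type the
// current schema expects, or fail if no conversion path exists.
package main

import (
	"fmt"
	"strconv"
)

func convert(stored interface{}, want string) (interface{}, error) {
	switch want {
	case "float":
		switch v := stored.(type) {
		case int:
			return float64(v), nil
		case float64:
			return v, nil
		}
	case "string":
		switch v := stored.(type) {
		case int:
			return strconv.Itoa(v), nil
		case string:
			return v, nil
		}
	}
	return nil, fmt.Errorf("no conversion from %T to %s", stored, want)
}

func main() {
	// Data was written when the predicate was an int; the schema later changed to float.
	v, err := convert(42, "float")
	fmt.Println(v, err) // 42 <nil>
}
```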
[00:23:49] Unknown:
So in terms of how somebody might talk about types, you could refer to the schema of Dgraph as being dynamically but strongly typed, similar to Python, where once you assign the type of something, it will maintain that, but you also have the option of assigning it dynamically. I guess in that case, it's more akin to how you would interact with types in Go, where it will infer the type at assignment time and then enforce it for subsequent interactions. And then you also have the option of explicitly defining the types and the structure of the data, so that any subsequent writes to that structure will enforce those data types and those
[00:24:37] Unknown:
sort of data shapes. Is that accurate? That's accurate. I think it's probably closer to Python in that sense, where you could have the type inferred, but you can still go and change it after having used it, and it would be fine with that. As opposed to Go, where once you set it, you cannot change it anymore. But in Dgraph, you can definitely go back and change it. And even if you have indices upon that, it would delete the old indices and recalculate the indices based upon the new type. So for somebody who's developing on top of Dgraph, their workflow might look like they just sort of start putting data and objects into
[00:25:17] Unknown:
the database, and then as they iterate, it will adapt to their schema. And then once they have sort of cemented the architecture and structure of their application, they can export that schema definition and maybe propagate that to a production system, so that the schema is defined ahead of time once you move that into the full operational
[00:25:40] Unknown:
phase. Yeah. I think that sounds like absolutely the right thing to do, you know, because as you're iterating on your application, you learn more and more things. You figure out where you need more indices, what kind of data structure you want. And you just go in and modify that, and Dgraph will take care of that for you. Yeah. So that sounds like a good middle ground between MongoDB, where you
[00:26:02] Unknown:
just put data in and hope that eventually you'll be able to read it out and that everything will be there, or something that's more rigorous, such as PostgreSQL,
[00:26:10] Unknown:
where you're required to define the schema ahead of time. Right. We do our best effort, actually. We have conversion logic so that we can convert from what was stored to what the current schema is, and we'll do our best to do the conversion. It's only if we really cannot find a path to conversion that we would return an error. And in your documentation,
[00:26:31] Unknown:
and as we discussed a little bit earlier, you contend that Dgraph is a viable replacement for a relational storage system and can be used as the primary storage mechanism for somebody's application or data environment. So I'm curious what potential switching costs, in terms of technical and cognitive overhead, exist for somebody who is looking to make that transition?
[00:26:56] Unknown:
Yeah. That's actually hard to calculate. Honestly, it depends upon how deep they are into it, because the database sits right at the bottom of the stack, and everything is kind of built on top of it. It also depends upon whether the team is largely engineers or there are analysts or non engineers on the team using the database. It's always a higher cost for non engineers to learn a new database or a new technology. I think the right way to think about it is to make the right choice for your next project, the 1 that you want to build. If you have many interconnected tables, would they be represented more easily in a graph database? I've seen people with relational databases with so many different tables which are highly connected. I would say if you have more than 10, that might be a good fit for graphs. And do you need to query across them? Do you need to run joins across them, etcetera? So if you have interconnected datasets, then by switching to Dgraph you could save a lot of back end code and developer time. So I think the right question should be, moving forward, what's the best database to use, as opposed to ripping out what you already have. And when I was reading through the documentation,
[00:28:13] Unknown:
I saw that there are people who are also using Dgraph as their primary storage for search workloads as well. That's right. There are companies,
[00:28:25] Unknown:
startups, built solely on Dgraph as their only database. And we're also hearing from a bunch of Elasticsearch users who need join functionality; they are switching from Elasticsearch to Dgraph.
[00:28:42] Unknown:
And in terms of operationalizing Dgraph and putting it into production, what are the limitations that people should be aware of in terms of the scalability or the overall usability
[00:28:55] Unknown:
of Dgraph? So I think in terms of scalability, there are no such limits. The design is such that it's pretty scalable, and we have had people running 3 node, 5 node, 10 node, 30 node clusters of Dgraph. As such, there is no limitation. I mean, Raft itself imposes some limitations on the size of the group. You typically want a replication factor of 3, or a maximum of 5; they recommend not going beyond that. But in terms of the sharding of data, you could actually have, let's say, a 100 node cluster easily.
In terms of usability, you know, we are still developing the tools which help you manage your cluster. We actually have the APIs, but we don't have any good visual tools which let you easily see and monitor what's happening in your cluster, what's going on, etcetera. So that is still a bit more of a challenge from the ops side. I would say not too much, because we still have logs and we still have the metrics and whatnot. But, obviously, it is not as evolved in terms of tooling as relational databases.
[00:30:08] Unknown:
And in terms of the CAP theorem, where does it fall along the various axes?
[00:30:14] Unknown:
So it's CP. If we have 3 replicated servers and 2 of them go down, then you won't be able to read or write. You have to have at least a majority of them up.
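That majority requirement is simple arithmetic, which the following Go snippet illustrates; it is nothing Dgraph-specific, just the reason replication groups are typically sized at 3 or 5.

```go
// A group stays available only while a majority of replicas is up, so each
// group size has a fixed failure budget.
package main

import "fmt"

func tolerable(replicas int) int {
	// A majority of n is n/2+1, so n - (n/2+1) failures are survivable.
	return replicas - (replicas/2 + 1)
}

func main() {
	for _, n := range []int{1, 3, 5, 7} {
		fmt.Printf("%d replicas tolerate %d failure(s)\n", n, tolerable(n))
	}
}
```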
[00:30:32] Unknown:
And for somebody who is interested in building on top of Dgraph and deploying it to their production environment, what is actually involved in getting a cluster deployed and operationalized, and what kind
[00:30:47] Unknown:
of overhead should they be planning for in terms of their server architecture? Actually, it's not too hard to run a Dgraph cluster, because it does not rely upon any other technology. That was 1 of the design decisions that we had to make upfront, particularly when we were doing transactions. We had seen a paper, I think it was related to HBase, where they were using Zookeeper for providing transactions in a replicated system, and we always stayed away from having to use a third party system, to make it easier for us. So when you run Dgraph, you just run 1 process, and if you need to scale it out, you just run multiple processes. A lot of people use the Kubernetes configuration that we have on our site, or even Docker Compose. At least, we do testing with Docker Compose, and a lot of people that we know are using that to run their instances.
Some of the big companies who actually have ops teams just run the servers directly on, let's say, Amazon instances instead of running it in Docker. And, yeah, Dgraph basically just takes care of things like membership handling and whatnot. So all you need is 1 server to be up and running, and you can point any new server to that, and it just automatically connects and figures out what it should be serving. Data moves freely between servers if needed; it tries to make sure that each group in the cluster has approximately the same amount of data, so there's good usage of RAM. Yeah, I think in terms of operations, it's not that hard. And does it have any capacity for user management and authentication and security aspects? Yeah. So that's where we still need to do more work. We are planning to build access control lists in Dgraph, I think probably early next year. Currently, what we have is TLS connections that you can have between the client and the server.
But the communication between the servers is still unencrypted, and the idea is that they'll be behind your own network that you can then tightly secure. But obviously, more work needs to be done in terms of, for example, encryption at rest, encrypted communication between servers, and access control lists to make sure that you don't have users doing things they are not supposed to do, etcetera.
[00:33:20] Unknown:
And in the process of building Dgraph and growing the community around it, what have you found to be some of the most interesting or challenging aspects of that process?
[00:33:33] Unknown:
I think building a new database is just very challenging work to begin with. Despite my time at Google, where I was working on pretty complicated problems, I felt like we were up against a mountain of work, and our own understanding of things had to grow to be able to build this; it's practically a beast. With a database, you need to get to a certain level of functionality before it is of use to others. So there's this time where you're just building and building without any majorly visible changes or any sort of positive outcomes, which can kind of get to you. So it's not really for the faint hearted. And the community growth actually has been pretty steady, a slow and steady kind of thing. We started it by announcing the project and putting content out there, blog posts and stuff. We built demos; for example, we built a demo where we put the entire Stack Overflow on Dgraph and put that in open source. And then people came to us. We just helped everybody that we could, got to people who had issues with Dgraph or had feedback, and we tried to respond to that.
In fact, we told people that the best way to contribute to Dgraph was to file an issue. It was not to contribute code, but to actually just file an issue: tell us what doesn't work, what bugs or issues you faced, what you want to do but currently cannot do because of Dgraph's limitations. And we responded quickly to their questions, which actually helped demonstrate the team's sincerity to our community. And the community has kept growing to a point where now we actually have to have dedicated people handling it; we just don't have enough bandwidth otherwise to handle the community.
[00:35:22] Unknown:
And I understand that early on in the process of building Dgraph, you had it fairly restrictively licensed with, I believe, AGPL, and you realized that that was actually an inhibiting factor to people being able to use it within their companies, and so you ended up relicensing it. So I don't know if you want to talk a bit about that overall experience.
[00:35:46] Unknown:
For sure. So actually, we started with the Apache license, and at some point, we looked at MongoDB and we thought, hey, we want to build a company like them, and they are sort of a guide for every other, I would say, database startup out there. So we went with AGPL, but then we started realizing, I think Google's open source team was pretty vehemently against AGPL and banned AGPL from Google, which probably caused other big companies to also have equal bans within them. And so, without naming any names, we started talking to some of these companies; I mean, they reached out to us. They told us that some of the developers in those companies couldn't even bring it into their code base because of the license.
So we decided to figure out the best way for us to still have the Apache license, but also be able to build a business around this company. And that's when we ended up switching over to Apache modified with the Commons Clause restriction. The good thing about this restriction is that if you just want to use Dgraph and build a commercial product on top of it, you're not affected. It only restricts you from selling the graph database itself. And most of our users plan to use Dgraph to sell something, but not sell Dgraph itself. So, you know, this is a very good license, and we have actually only heard positive things about it since our change. And, actually, since then, we have onboarded a bunch of really big companies who are now using Dgraph in production.
[00:37:30] Unknown:
And are there any particularly interesting or unexpected uses of Dgraph that you've become aware of? I think there was 1 interesting 1 where
[00:37:39] Unknown:
this company had something like 51 different data silos, with data coming in from different streams. So they actually had 51 different database instances which could not talk to each other, and they wanted to be able to query across them. So they are using Dgraph to stream in data from many of these silos into a single Dgraph database cluster, and they are now building an app on top of this information, which they can query across. Just the number, like, 50 database instances, was pretty crazy to me, so I found it to be pretty interesting.
[00:38:22] Unknown:
And when is Dgraph the wrong choice when somebody is considering building a new platform?
[00:38:28] Unknown:
You know, these days, a lot of people want to have a time series database, and a time series database by definition is a very flat dataset, which just has a lot of data points but not a lot of connections. So for those use cases, I know people are still using Dgraph, but I would say that's not what it is designed for. It does support datetime, it does support indexing on that, but it's not really designed to be a time series database. And as you continue work on Dgraph, both the project and the business, what do you have planned for the future? So I think our current focus is still on the actual Dgraph project. We want to make sure that it is working for its users, that our users are happy. As I said, we have been stabilizing the database; we actually have a Jepsen test report coming up by the end of this month, which actually helped us fix a lot of issues. And we have a road map for this year with feature requests that people have been asking us about for a while now that we will deliver upon.
So I think that's where the focus is: still on making sure that, for the people who choose Dgraph, it's a success for them. And are there any other aspects of Dgraph or graph databases
[00:39:50] Unknown:
in general that we didn't discuss yet, which you think we should cover before we close out the show? There was a question, I think, around
[00:39:57] Unknown:
some of the sentiment around graph databases that keeps coming up, where people feel that, you know, graph databases don't work or they don't scale, etcetera. Actually, I always have a very philosophical argument against that, so I'll share that with you. Just take languages as an example, right? We have many experiments in different computer languages. We have C and C++, which are really old, Java, but then lately we have seen Python and Ruby and now Go. There's a lot of experimentation with languages. New languages are designed to help with developer iteration, and that comes even at the cost of machine utilization and user experience.
But when it comes to databases, many developers tend to take a single-minded view that SQL is king. However, SQL databases can slow down developer iteration when compared to their modern alternatives, easily doubling the code base size compared to graph databases. I think those development costs need to be considered and rationalized. And that's the biggest gap I think there is in terms of understanding. It's not in terms of a tool or technology, but a culture of sort of worshiping old databases over new ones.
Almost, I think, at the cost of engineering. That sort of slightly
[00:41:24] Unknown:
irks me or upsets me. I think that needs to change. Well, for anybody who wants to follow the work that you're up to and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, you answered this a little bit just now, but could you share your perspective
[00:41:43] Unknown:
on the biggest gap in the tooling or technology that's available for data management today? I think we have a lot of tools already built, and new tools are being built almost on a daily basis. I think the biggest gap would have to be in terms of what developers are willing to try. As I said, companies are willing to try new languages, and I think more companies should be willing to try new databases to achieve maximum developer
[00:42:18] Unknown:
productivity. Alright. Well, thank you very much for taking the time to talk about your work on Dgraph and for the wonderful job you've done on it so far. So thank you for that, and I hope you enjoy the rest of your evening. Thank you. Thanks for having me.
Introduction to Manish Jain and Dgraph
What is Dgraph?
The Rise of Graph Databases
Dgraph in the Graph Database Landscape
Common and Overlooked Use Cases for Graph Databases
Data Modeling in Dgraph
Key-Value Store and Badger Storage Engine
Query Languages and GraphQL
Dgraph Architecture and Evolution
Schema Flexibility in Dgraph
Switching Costs and Transitioning to Dgraph
Dgraph for Search Workloads
Operationalizing Dgraph
Challenges in Building Dgraph and Community Growth
Licensing and Adoption
Interesting Use Cases of Dgraph
When Not to Use Dgraph
Future Plans for Dgraph
Philosophical Views on Graph Databases
Biggest Gap in Data Management Tooling