Summary
As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem first hand while working on the Seshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
- You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
- Your host is Tobias Macey and today I’m interviewing Gavin Mendel-Gleason about TerminusDB, an open source model driven graph database for knowledge graph representation
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what TerminusDB is and what motivated you to build it?
- What are the use cases that TerminusDB and TerminusHub are designed for?
- There are a number of different reasons and methods for versioning data, such as the work being done with Datomic, LakeFS, DVC, etc. Where does TerminusDB fit in relation to those and other data versioning systems that are available today?
- Can you describe how TerminusDB is implemented?
- How has the design changed or evolved since you first began working on it?
- What was the decision process and design considerations that led you to choose Prolog as the implementation language?
- One of the challenges that have faced other knowledge engines built around RDF is that of scale and performance. How are you addressing those difficulties in TerminusDB?
- What are the scaling factors and limitations for TerminusDB? (e.g. volumes of data, clustering, etc.)
- How does the use of RDF triples and JSON-LD impact the audience for TerminusDB?
- How much overhead is incurred by maintaining a long history of changes for a database?
- How do you handle garbage collection/compaction of versions?
- How does the availability of branching and merging strategies change the approach that data teams take when working on a project?
- What are the edge cases in merging and conflict resolution, and what tools does TerminusDB/TerminusHub provide for working through those situations?
- What are some useful strategies that teams should be aware of for working effectively with collaborative datasets in TerminusDB?
- Another interesting element of the TerminusDB platform is the query language. What did you use as inspiration for designing it and how much of a learning curve is involved?
- What are some of the most interesting, innovative, or unexpected ways that you have seen TerminusDB used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building TerminusDB and TerminusHub?
- When is TerminusDB the wrong choice?
- What do you have planned for the future of the project?
Contact Info
- @GavinMGleason on Twitter
- GavinMendelGleason on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- TerminusDB
- TerminusHub
- Cheminformatics
- Type Theory
- Graph Database
- Trinity College Dublin
- Seshat Databank: analytics over civilizations in history
- PostgreSQL
- DGraph
- Grakn
- Neo4J
- Datomic
- LakeFS
- DVC
- Dolt
- Persistent Succinct Data Structure
- Currying
- Prolog
- WOQL: TerminusDB query language
- RDF
- JSON-LD
- Semantic Web
- Property Graph
- Hypergraph
- Super Node
- Bloom Filters
- Data Curation
- CRDT == Conflict-Free Replicated Data Types
- SPARQL
- Datalog
- AST == Abstract Syntax Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That's dataengineeringpodcast.com/talkpython, and don't forget to thank them for supporting the show.
Your host is Tobias Macey. And today, I'm interviewing Gavin Mendel-Gleason about TerminusDB, an open source model driven graph database for knowledge graph representation. So, Gavin, can you start by introducing yourself?
[00:01:48] Unknown:
Yeah. Sure. I'm Gavin Mendel-Gleason. I'm the CTO of TerminusDB, and I'm a computer scientist, but I identify as a data engineer.
[00:01:56] Unknown:
Do you remember how you first got involved in the area of data management?
[00:02:00] Unknown:
Yeah. So I've been doing data management for a very long time, for about 23 years, and my first job was actually looking at managing records of bed and breakfasts in Alaska. That's how I first got involved in it. And from there, I went on to do stuff with cheminformatics, large cheminformatics databases. And then about 15 years ago, I was involved in a graph database startup before graph databases were even really very well known. Since then, I've done a lot of stuff in text indexing, search engines, machine learning on large text corpora. I took a brief hiatus for a PhD in type theory, but then immediately went back to the data management problem after my PhD.
[00:02:44] Unknown:
And so all of that has led you to starting the TerminusDB project. I'm wondering if you can give a bit of an overview about what it is and what motivated you to build
[00:02:54] Unknown:
it. Great. Yeah. So, TerminusDB is a collaborative graph database with a rich schema modeling layer. So the motivation really came out of a project called Seshat, which I was working on at Trinity College Dublin. That project is a very ambitious project to store data about all civilizations in human history. They wanted to do analytics over information about those civilizations, and that information includes things like the population carrying capacity, number of important buildings, the kinds of rituals they had, whether they had human sacrifice, just a huge range of different data points.
And they had some very specific project requirements that actually turned out to be very generally useful in data curation and data management. So they needed to be able to collaborate between researchers all over the world, so they needed some ability to send data around easily. The data they were using is extremely complicated. Everything had to have confidence. It had to have geotemporal scoping. You had to know, you know, the range of times that something started at to the range of times that it may have ended at. So even the endpoints of the temporal scoping were uncertain in terms of time, and they had a lot of different kinds of data points that range from the numeric through to much more sort of subjective, and then various kinds of choice data points, like whether or not your major war animal was a camel or a horse, and these sorts of things were really quite important to them. In addition to that, the data was often very sparse. So you have these very elaborate schemata that have lots of different kinds of things that you might say about a civilization, but then you may not know many of them, or points might be uncoded because you haven't coded them yet. You just haven't entered data about it. They might be unknown, or you might have specific known quantities within certain confidence ranges. So it was a very elaborate data project.
And in addition to that, they needed staging and versioning of their documents because they would often have data entered by nonprofessionals or students, and then they had to have it reviewed by experts. And then it would go into some other staging area where you'd collect the data together. You'd do some sort of analysis with that data, and then finally, you'd prepare it for publishing. And the data needed to be easily publishable to the public on a regular basis in a format that was potentially machine readable, and they had to have it in a place where it could be embargoed for publication.
So you had to tag the data with specific sort of releases, which is very similar to what you might do with software. On top of all that, the data needed to be extractable via query in formats that were amenable to analytics and computer modeling. And the researchers were specifically very interested in predictive analytics on the past and even the present and future based on the models that they had designed that have to do with sociology and the dynamics of civilizations. So it's a really cool project, but it had some huge data management problems that were quite difficult to solve. So with those requirements, myself and the CEO of TerminusDB, Kevin Feeney, went looking around for pieces of a solution in a tool chain to meet them.
And we sort of cobbled some things together, but nothing we had was great. And really, it sort of pressed us into trying to build up a system that could actually do all of these things. And we began on this crazy task of actually writing a database ourselves on that basis.
[00:06:53] Unknown:
Yeah. And writing a database is definitely not an undertaking to embark on lightly. So I'm curious what you saw as being the major lacking components in the overall space of data versioning tools and graph engines, particularly at the time that you were getting started with the project?
[00:07:11] Unknown:
Yeah. So, I mean, one of the things was we wanted to be able to have this version control in the graph. There wasn't really anything at the time that could do that very well. We played around with a number of the sort of open source tools. We played with Jena. We had various experimentation, actually using Postgres at some point, and all of it felt very forced and difficult to work with and not particularly scalable. And after the Seshat project, we had another project on the same European funding proposal that involved ingesting all sorts of economic data about the Polish economy since the fall of the Warsaw Pact: which companies existed, who the directors were, shareholdings of those companies, that kind of thing. And the solutions that we had sort of patched together just couldn't scale up to ingesting that whole thing. And so that really pressed us to find a scalable mechanism of graph versioning that we could stick very large knowledge graphs into.
[00:08:19] Unknown:
Graph engines currently are definitely seeing a bit of a renaissance where there are a number of different projects that aim to provide support for that either natively or as an add on capability or as part of a multimodel ability within the graph engine. So thinking of things like DGraph and ArangoDB and Neo4j. And there are also a number of projects available for building these knowledge graphs and knowledge engines. I'm thinking in particular about things like Grakn. I'm wondering how you see TerminusDB sitting within that landscape. And if you were to start the project over again, do you think that it would still be worth building this engine as a dedicated project, or do you think that it would make more sense to build on top of some of the existing technologies?
[00:09:06] Unknown:
Yeah. I mean, that's a very good question. So, I mean, I think our core competence comes from the fact that we started with versioning as an important thing. We knew we needed the collaboration features, but it turns out versioning is the basis on which to do appropriate collaboration features, and Git really got this right. So the Git approach, where you have, you know, a delta encoding of all the changes, allows you to do all of this push, pull, merge, rebase, and all of these sorts of things are facilitated by that versioning technique.
And there are things like DoltHub, which I think is the closest to TerminusDB, but it's really a relational database. In terms of the graph databases, there's nothing that really does that collaboration well. And now we're in a place where, I didn't think it would take this long, frankly. Writing a database is hard work. It takes a long time. But I think that where we are now is much better than what we could have done in trying to build on top of something else. And there's a couple of reasons for that. Some of them have to do with the technical approaches that we took in actually implementing the database itself.
[00:10:13] Unknown:
My last question was about the positioning of TerminusDB in the context of graph engines. But as you said, one of the primary considerations of it is data versioning. And there are a number of other versioning tools out there for handling versioning of databases: LakeFS for being able to branch and merge and version datasets within object storage, DVC, or Data Version Control, for managing the versioning and collaboration capabilities on machine learning projects. I'm wondering if you can give a bit of context as to where TerminusDB fits in relation to some of those other data versioning systems and the major use cases that TerminusDB is uniquely well suited for.
[00:11:00] Unknown:
We're similar to a lot of those in the sense that we have versioning. So, I mean, DVC is really data versioning. We're a database that has versions, that's focused on the collaboration aspect. So I think that's where we really vary the most. We would see push, pull, clone, and merge as the sorts of things that really differentiate us from competitors. But again, Dolt and DoltHub, they have that same kind of idea, but it really is for relational databases rather than graph. Like Datomic, we have full time travel ability and can do branching. So in that sense, we're similar to Datomic and LakeFS, but I think it's really that idea of, okay, I'm working on this dataset for now. I create a branch. I do some experiments, and then I want to merge that back into a branch that I then share with somebody else who's going to do further experiments on it. And that ability to move things around easily is really, I think, where we find a strength in TerminusDB.
[00:12:01] Unknown:
And so digging more into the actual project, can you give a bit of context as to how TerminusDB is implemented and some of the ways that the design and objectives of the project have changed or evolved since you first began working on it?
[00:12:15] Unknown:
We are an in-memory database, and that has been a conscious choice from the very beginning. So I worked on a database previously, a graph database, about 15 or 16 years ago now. Paging is very difficult, and doing paging effectively is very difficult. And so if you can avoid it, there are lots of advantages. So I really spent a lot of time trying to think about how we could have big datasets and still avoid that very steep memory hierarchy. It is so steep that you really fall off a cliff if you start paging to disk, and you have to be very careful about how you page to disk when you're doing it that way. So instead, we opted for another thesis, and the central thesis is essentially that memory is growing relatively quickly. It's relatively cheap. You can get servers with lots of RAM in them now.
And many databases are small. I think there was some survey by the MySQL folks where they said that, like, 90% of databases were less than 2 gigabytes. So, you know, those 90% should obviously be in an in-memory database if you can do it, because it just simplifies so much of the engineering, and it really makes performance nice. But in order to do that, in order to be able to get really big knowledge graphs, you have to be very careful about the way that you engineer the data structures. So from the very beginning, we were looking at ways of really getting compact data structures in memory. And so we've settled on something that's called a persistent, succinct data structure.
It's persistent in the sense that all the versions of the data structure are available even after an update. So you can go back to any point in time. You can have branching trees of these things that facilitates all the sort of version control type aspects, but it also simplifies concurrency and update, which is a big advantage. And the other aspect is the succinct data structure. And so succinct data structures are specifically data structures that approach the information theoretic minimum size that you can have for a data structure while still enabling fast query at some specific computational complexity class.
And so we have used these succinct data structures to get very compact representations of the graph, and that allows us to have extremely large graphs in relatively small space. So for instance, DBpedia fits in under a gig, and so you can have a very large knowledge graph and stick it into memory without too much trouble. And we can see going further on this than we have to date. I think there's a lot of cool things that can happen in making these data structures even more succinct over time.
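As a rough illustration of the layered, persistent approach described here, the following is a minimal Python sketch, not TerminusDB's actual Rust implementation, and it ignores the succinct bit-level encoding entirely: each commit is an immutable delta layer that points at its parent, so every historical version and every branch stays queryable.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

@dataclass(frozen=True)
class Layer:
    """One immutable commit layer: a delta over its parent layer."""
    additions: FrozenSet[Triple]
    removals: FrozenSet[Triple]
    parent: Optional["Layer"] = None

    def contains(self, triple: Triple) -> bool:
        """A triple is present if the nearest layer mentioning it added it."""
        layer: Optional[Layer] = self
        while layer is not None:
            if triple in layer.additions:
                return True
            if triple in layer.removals:
                return False
            layer = layer.parent
        return False

    def commit(self, additions=(), removals=()) -> "Layer":
        """Produce a new head layer; the old head stays valid for time travel."""
        return Layer(frozenset(additions), frozenset(removals), parent=self)

# Usage: old versions stay readable, and branches share structure with the base.
base = Layer(frozenset({("alice", "knows", "bob")}), frozenset())
v2 = base.commit(additions={("bob", "knows", "carol")})
branch = base.commit(additions={("alice", "knows", "dave")})   # branch off base
assert v2.contains(("bob", "knows", "carol"))
assert not base.contains(("bob", "knows", "carol"))            # history unchanged
assert branch.contains(("alice", "knows", "dave")) and not v2.contains(("alice", "knows", "dave"))
```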
[00:15:05] Unknown:
And another interesting aspect of the implementation of TerminusDB is the fact that it was written in Prolog. And I'm curious what the decision process and design constraints were that led you to choose that as the language for actually building this and how that has affected the level of community contributions given that this is an open source project?
[00:15:26] Unknown:
The core is implemented in Rust and Prolog. So all of the low level data structure bit manipulation stuff is in Rust because it's quite a good language for low level memory layout concerns. Prolog is absolutely terrible at that kind of thing, so it's not what you'd want to use it for. However, Prolog is an extremely good query language. And really, I started actually implementing Terminus in Java. And as I was doing it, it just became apparent that I was writing a poor man's Prolog virtual machine. And a Warren abstract machine written by myself, you know, is not gonna be as good as some of the Prolog implementations that have had decades of programming effort go into them.
So Prolog has really gotten currying down pat, and they're very, very good at writing extremely efficient backtracking implementations. So I figured, well, I'll do some experiments, and sure enough, they were much faster than a naive implementation in something like Java. So we went back and we implemented most of our database in Prolog. Currently, the way that it works is that our query language, which is called WOQL, is actually compiled into Prolog, and then that Prolog is compiled to an abstract machine. And that turns out to be an extremely effective way of getting very fast query performance. All of the low level operations and low level backtracking on the data structures themselves are still done in Rust, but the high level implementation of the language is in Prolog.
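For a sense of the kind of work Prolog is doing here, this is a naive Python sketch of backtracking evaluation of conjunctive triple patterns with unification. It is purely illustrative and bears no relation to how WOQL is actually compiled to Prolog inside TerminusDB; variables are marked here with a "v:" prefix for the sake of the example.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

def is_var(term: str) -> bool:
    """Treat terms prefixed with 'v:' as query variables."""
    return term.startswith("v:")

def unify(pattern: Triple, fact: Triple, binding: Binding) -> Optional[Binding]:
    """Try to extend `binding` so that `pattern` matches `fact`; None on failure."""
    out = dict(binding)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if p in out and out[p] != f:
                return None          # variable already bound to a different value
            out[p] = f
        elif p != f:
            return None              # constant mismatch
    return out

def solve(patterns: List[Triple], facts: List[Triple],
          binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Depth-first, backtracking evaluation of a conjunction of triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    head, rest = patterns[0], patterns[1:]
    for fact in facts:
        extended = unify(head, fact, binding)
        if extended is not None:
            yield from solve(rest, facts, extended)   # backtrack when exhausted

facts = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
# "Who do the people that alice knows, know?"
query = [("alice", "knows", "v:X"), ("v:X", "knows", "v:Y")]
print(list(solve(query, facts)))   # [{'v:X': 'bob', 'v:Y': 'carol'}]
```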
[00:17:02] Unknown:
Going back to the concept of scalability, you mentioned that a large majority of datasets are actually under this 2 gigabyte threshold. And I'm curious what the scaling capabilities are for TerminusDB, particularly because I know that it can be very complex to shard a graph across multiple instances. And so wondering what the viable sizes and scales of usage are for TerminusDB, both in terms of volume of data, but also in terms of the scalability of concurrent usage?
[00:17:38] Unknown:
The way that we have implemented it, we have very few locks that are necessary anywhere in the system. The persistent data structures facilitate a non-locking approach to concurrency. This means that we can actually have multiple instances of TerminusDB running on the same backing store. So the backing store will keep information about all of the databases that exist in their current state, and we can load them into memory on different versions of TerminusDB, and they can advance the persistent data structure in lockstep with each other, all running on the same backing store.
So for reads, it's very easy to have multiple readers. For writes, you have write contention around trying to get your transaction in first, so there's going to be some scalability issues around that. In terms of sharding, we just decided not to try to shard graphs. Graph sharding is very difficult. The way that we have attempted to do it ourselves is largely on a design basis. So if you can horizontally partition it, then do so at the schema level and create a separate database, and then you can have it sharded in that way. And right now, memory is fairly cheap. So you can go to Google and you can get very big instances with a terabyte of memory, and that means you should be able to scale up to extremely large databases. So some of the datasets, like the one on the Polish economy, had a large number of edges; I think we had hundreds of millions of edges anyhow. So it should be very possible to scale up to over a billion edges without falling over on hardware that you could rent from Google or AWS. And, you know, we have ideas about how to shrink things even further, and I think that's a more productive way to go about most data use cases. There are some use cases, right, like where you have a Google-type situation where you're trying to spider the entire Internet, and then you probably wouldn't want to try to stick that all into one knowledge graph that fits into memory; that's not gonna work for you. But even when people think they have big data, it turns out that you'd be able to comfortably fit it into memory on a big enough machine.
[00:19:55] Unknown:
In terms of the modeling and interaction with the database, I know that the core data object is the RDF triple, which has become popular because of the work being done for the semantic web, and that the data structures that are returned and represented are in JSON-LD. And I'm wondering if you could just talk through some of the ways that that influences the data modeling considerations and the overall interaction for people using TerminusDB?
[00:20:30] Unknown:
RDF has some really great ideas in it, like the idea of using URIs to represent points, and we really like that idea. That makes it so you can move data around. You can merge databases in a more sensible sort of way because you don't have just arbitrary identifiers like node 1 or something like that. It forces you to think a little bit more clearly about how you're going to publish your datasets as well. So we really like that. The other thing we really liked about RDF is the simple modeling of everything as triples, because it just simplifies the design criteria for the database itself. However, RDF was really relegated mostly to academia.
There are a few niche areas in which industry has used RDF, but it hasn't really gotten broad popularity ever. And I think some of that is due to specific design choices that were encouraged by the academics around RDF and were not necessarily a good idea. So we think it's a really good basis for a database, but we haven't found much interest in TerminusDB because of its use of RDF. And we're not particularly, you know, going around shouting about how RDF is the most important part of it. We do see that URIs-as-data-points thing as quite important, but maybe not some of the other things that are associated with RDF.
Now there are implications to using an RDF database as opposed to a property graph database or as opposed to something like Grakn, which is a hypergraph database, and that is that the modeling has to be in terms of triples. So there's a slightly different approach to modeling that you have to take, but you can represent any of these sorts of things. So, like, a hypergraph edge is actually not very hard to model in RDF. You just have a specific class that you consider your relationship, and it has arrows off of it, more than one arrow off of it, and that sort of represents, you know, a hyperedge itself. So it's possible to do that. Property graphs, likewise, are not particularly difficult to model. It's just another class that has some data type properties off of it and then an edge coming in and an edge going out, and so it's sort of like a slightly fat edge modeled as a class. So you can do the same sorts of tricks that you do in property graphs or hypergraphs in RDF as long as you have tools to help you model it. So we've spent a lot of time trying to make TerminusDB's tools very good for modeling and to enable you to do that. And so in the latest TerminusDB 4.0 release, we have a visual modeling tool that helps you to design these things, and then there's various different views you can have of the schema and the data facilitated by that modeling.
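To make that reification pattern concrete, here is a small illustrative sketch; all of the class, property, and instance names are invented for the example rather than taken from any real TerminusDB schema.

```python
# A hypothetical relationship with more than two participants, reified as a
# node (an instance of a relationship class) with one triple per arrow.
hyperedge_triples = [
    ("rel:battle_001", "rdf:type",       "scm:BattleParticipation"),
    ("rel:battle_001", "scm:polity",     "doc:Polity_Rome"),
    ("rel:battle_001", "scm:adversary",  "doc:Polity_Carthage"),
    ("rel:battle_001", "scm:war_animal", "doc:Animal_Horse"),
]

# A property-graph style "fat edge" (an edge carrying its own attributes),
# modeled the same way: a class with data properties plus an in- and out-edge.
property_edge_triples = [
    ("rel:knows_42", "rdf:type",   "scm:Knows"),
    ("rel:knows_42", "scm:from",   "doc:Alice"),
    ("rel:knows_42", "scm:to",     "doc:Bob"),
    ("rel:knows_42", "scm:since",  "2018"),
    ("rel:knows_42", "scm:weight", "0.9"),
]
```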
In addition, we have, like, the ability to represent fragments of the graph as documents in JSON-LD, and that really allows you to go back and forth between a sort of graph view of the world and a document view of the world. And this is really useful for data curation. So when you're trying to edit, like, you know, a document that's about a patient record or something like that, you kind of want to display it as a record, and so you can have all of the data fields, etcetera, in there. You enter them in. In the back end, it's all actually a graph, but it's a fragment of the graph that's wrapped up into this JSON-LD object.
You can also communicate these back and forth. You can update it using a document view of the world. You can extract using a document view of the world, and you can query using a graph view of the world on top of those documents. So you can, like, say, okay, what was the patient record? You know, what was their age, etcetera? You can do those searches in the graph. And so JSON-LD and its ability to model fragments of a graph really make it nice, because you can go back and forward. It makes it a sort of multi-model database.
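As a hedged sketch of the document/graph duality described here, the following flattens a JSON-LD-style document fragment into triples. The "@id" and "@type" keys are standard JSON-LD; every other field name is invented for illustration and is not TerminusDB's actual document API.

```python
# A hypothetical patient-record document fragment.
patient_doc = {
    "@id": "doc:Patient_17",
    "@type": "scm:Patient",
    "scm:name": "Jane Doe",
    "scm:age": 54,
    "scm:attending": {"@id": "doc:Doctor_3"},
}

def doc_to_triples(doc):
    """Flatten one document fragment into (subject, predicate, object) triples."""
    subject = doc["@id"]
    triples = [(subject, "rdf:type", doc["@type"])]
    for key, value in doc.items():
        if key in ("@id", "@type"):
            continue
        # Nested objects with an "@id" become edges to other documents.
        obj = value["@id"] if isinstance(value, dict) else value
        triples.append((subject, key, obj))
    return triples

print(doc_to_triples(patient_doc))
# The same data can now be edited as a record or queried as graph edges.
```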
[00:24:29] Unknown:
And another aspect of RDF is that I know that a lot of the other engines that have used RDF and things like SPARQL for being able to store and query data have had challenges being able to scale the queries and the datasets. And you mentioned that being in memory has helped with that. And I know another concern with graph data models is the concept of supernodes. And so I'm wondering if you can talk a bit more about some of the ways that TerminusDB has worked to overcome some of those limitations and some of the data modeling considerations that people should be aware of as they are constructing and populating these graphs and building up these knowledge graphs?
[00:25:14] Unknown:
Yeah. So I mentioned that one of our first forays was in trying to use Postgres as the back end for our database, and we found that it didn't really scale up in terms of performance using an RDF triple based approach because essentially the link following was just too slow. The succinct data structure allows us to query in any mode. It's indexed, so it's very fast, especially in the subject-predicate direction, but it's also in log n time in the object to subject direction. So we have it indexed in such a way that you can basically do any mode of query with a very compact representation.
I guess there were a lot of approaches where people had double indexes on every single element. You ended up with really big databases, and it didn't scale very well or perform very well. We found that this compact representation scales very well and is much more performant than most of the RDF-type databases, especially when you get up to large datasets. So I think that we've done really well there. I'm quite happy with the performance. We have other things that are necessary for improving performance. So the persistent data structures degrade in performance over time. And so we have something called a delta roll up, which helps us to improve the query performance by essentially forming a snapshot that lets you see, at a particular commit, what the state of the database would be if you flattened all of the changes that were made on the persistent data structure onto a single plane.
That also improves performance when transaction chains become long. This delta roll up is not out in 4.0. We're gonna get it out for 4.1. We already have it implemented. We're just working on making sure that it's correct and that there are no bugs in it before we launch it. But this should help a lot with the performance, and we have some heuristics for automatically putting in these delta roll ups without causing memory problems for people. So that's one of the things that we've done in order to improve performance. There are other things as well. So, like, if you can compact things even further or you can get locality of reference, then you can improve performance quite a lot by fitting into cache lines.
So we have some ideas about how to do that as well by doing sort of reordering of the way that our datasets are stored so that we consistently query information into a sort of compacted space. And this can be done without too much difficulty because we have a sort of pointer swizzling trick that allows us to reorder these without too much undue cost in the queries. We ran into this quite a bit actually when we were doing long search chains in the Polish economy data. So this is actually a very serious concern. If you hit a super node, suddenly the size of potential solutions expands radically.
Now we were still able to go much deeper than we were on Neo4j. So we tried it against Neo4j, and they were falling over on the same dataset at about 5 or 6 hops, and we were going up to 11 hops. And this is partly due to the fact that, you know, everything is resident in memory and compact, and that allowed us to do a lot of that. And the fact that we were doing a depth first search, so we didn't keep resident in memory very much extra bookkeeping information, which allowed us to do it even using a very naive approach where we were able to make it through supernodes. However, we had some ideas about how to improve supernode performance using bloom filters that we prototyped but have not actually released in the database yet. As we found, most of the people who are using TerminusDB are really using it for data curation and the sort of data management approach stuff. We thought initially we'd get more pressure from people who want to use it for sophisticated graph search, and that just hasn't been the case as yet. So we're going where our users are. We wanna make it usable for our users. So if we start getting those kinds of queries pressing on us later down the line, we'll pay a little bit more attention to it. But at the moment, we don't have any plans in our road map to improve performance on those particular types of problems.
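A minimal sketch of the delta roll up idea mentioned above, assuming a simplified layer representation (sets of added and removed triples); the real implementation works on succinct structures rather than Python sets.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

def roll_up(layers: List[dict]) -> dict:
    """Flatten a chain of delta layers (oldest first) into one snapshot layer.

    Each layer is a dict with 'additions' and 'removals' sets of triples; the
    resulting snapshot answers reads in one step instead of walking the chain.
    """
    visible: Set[Triple] = set()
    for layer in layers:
        visible -= layer["removals"]
        visible |= layer["additions"]
    return {"additions": visible, "removals": set()}

history = [
    {"additions": {("a", "knows", "b")}, "removals": set()},
    {"additions": {("b", "knows", "c")}, "removals": set()},
    {"additions": set(), "removals": {("a", "knows", "b")}},
]
print(roll_up(history))
# {'additions': {('b', 'knows', 'c')}, 'removals': set()}
```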
[00:29:40] Unknown:
You mentioned the snapshotting capabilities for being able to collapse all the versions into a single reference point for the sake of query performance. I'm wondering what the other aspects of overhead are that are incurred by maintaining the history of changes for various objects within the database and some of the strategies that you have for being able to potentially handle life cycling of those versions or things like garbage collection and compaction over time?
[00:30:07] Unknown:
Yeah. Those are very good questions. So hilariously enough, we don't have a working garbage collector at the moment. It's on our road map, and we intend to start working on it, I think, either this month or next month, depending on how other things go. But we know that we need it. The surprising thing is that we haven't needed it yet, even though we're working with really big datasets in production on large data systems. For instance, we have a large retailer that's using our system. Because our layers are so compact, because we use this succinct representation, we actually don't take up very much memory.
So far, we've been able to just limp along without actually dealing with the fact that we're generating lots of garbage. And I found this kind of interesting because I've heard some of the other version control databases have had much more difficulty with this, but I think this is really an advantage of succinct data structures again, that you really wanna keep things as compacted as possible in the first place. We're going to have a garbage collector for dealing with this. There's a number of things that have to be done. So the first one's gonna be relatively simple, one that doesn't need to know very much. But then there's other things like the commit graph, which keeps a graph of all of the commit histories, because the commit histories are not necessarily linear. They can branch out and, you know, they could be trees. They could even merge back, so things get relatively complicated.
But you can also have tips that have no branch head or tag on them, similar to in Git, and so you need to do some kind of pruning. So we're also gonna have to do pruning at some point on the commit graph, and how we wanna do that, whether we make it a feature where people have to call a pruning operation or something, we haven't really decided yet. That's one of the things that we're going to be working on in the future, adding those sorts of features. I think it's a really effective way of doing it. It allows you to basically ignore the other versions unless you specifically search for them. And then you do have to load them for those specific searches, but you can have strategies for purging that from memory after they're no longer being used.
And that should be able to deal with the performance pretty well, I think, as of yet. So when the version histories get very long, we're gonna need some sort of solution. At the moment, we can do squashes. A squash is like a delta roll up, except it actually squashes all the commits into a single commit layer and forgets the history. That can be a solution if you actually don't need your history anymore. But if you need all of your history going all the way back, there's the combination of delta roll ups and then, like, you know, just not loading: if you don't query it, it won't get loaded into memory, so it's just on disk. And I think we can start thinking about archiving the older layers back to some kind of cold storage later on. But we're gonna have to look at that more going forward.
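A hedged sketch of what reachability-based pruning of a commit graph could look like, analogous to git gc: keep everything reachable from a branch head or tag, and treat the rest as candidates for collection. The representation is invented for illustration; it is not TerminusDB's planned garbage collector.

```python
from typing import Dict, List, Set

def reachable_commits(heads: List[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Walk parent links from every branch head or tag and mark what we reach."""
    keep: Set[str] = set()
    stack = list(heads)
    while stack:
        commit = stack.pop()
        if commit in keep:
            continue
        keep.add(commit)
        stack.extend(parents.get(commit, []))
    return keep

# parents maps commit id -> parent commit ids (a merge would have two parents).
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c2"]}  # c4 is an orphaned tip
heads = ["c3"]                          # only one branch head survives
keep = reachable_commits(heads, parents)
garbage = set(parents) - keep           # layers owned only by these could be collected
print(sorted(keep), sorted(garbage))    # ['c1', 'c2', 'c3'] ['c4']
```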
[00:33:03] Unknown:
You invest so much in your data infrastructure, you simply can't afford to settle for unreliable data. Fortunately, there's hope. In the same way that New Relic, Datadog, and other application performance management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo's end to end data observability platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end to end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data.
Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 people will receive a free limited edition Monte Carlo hat. And digging more into the actual collaboration capabilities and branching and merging and the versioning aspects of TerminusDB, I'm wondering if you can talk through some of the ways that those capabilities impact the approach that data teams take when working on a given project, some of the capabilities that it adds, and some of the challenges that it might introduce for how to effectively manage those strategies. And are the best practices for Git directly applicable to TerminusDB, or are there different aspects to it?
[00:34:41] Unknown:
It's honestly a game changer. Like, we find ourselves working with curation of data and fixing data all the time, and we're a distributed team at TerminusDB. We work all over the place, and we can just send these layers to each other. Somebody can create an experiment. They can find a bug or something like that, and they can send it to me by, you know, just push and pull, and, you know, it's really incredible. It's really magical. And I think that when people start using it, they're gonna be amazed at how well it works. The succinct data structure, again, facilitates the communication of this data because it's relatively small deltas that you send. You don't send the whole database all the time, but even when you do, when you do a clone, even very large databases, you know, fit into a relatively small amount of space. So that also facilitates this data sharing aspect.
I don't think any of this stuff existed out of the box previously. It's gonna take a while before people really get used to it. Now in terms of merging, we use rebase quite a lot. We don't have a full three-way merge at the moment. That's also on our road map, so we wanna be able to do that. In terms of the conflict resolution, that is an extremely complicated problem. So for the most part, with a lot of the examples that we've worked on, and we're eating our own dog food, so we use it in large deployments already, we haven't had too many problems with it. But when you do have a problem, our tools for fixing it are relatively light. They're not very helpful for the user. You have to be a bit sophisticated. So we have fix up queries that allow you to alter the final state before the commit, and that allows you to bring it back into compliance with the schema before the commit takes place, allowing the commit to go through.
That can be a little bit daunting if you have a complicated situation and you're a novice user. And that's really an area we want to explore more for various types of solutions. So automatic merge strategies, fix up strategies, help screens that give you some information about the commits, and libraries of merge tolerant data structures like CRDTs that can help people avoid these types of problems. So, I mean, with code management, a lot of the merge conflict resolution is completely manual, and you just have to go in and edit the two areas between the commits to make it so that it goes through and then figure out that semantically it's correct so that your code merges.
So we're kind of in that stage at the moment, but we have more information than Git has about the data. So we have a schema. So we should be able to give you a lot more information about strategies that you might take. And I think also that those sort of merge tolerant data structures are a place where this can be resolved automatically in a lot of cases if special attention is taken towards creating the right types of schema. But like I said, we have tools for it right now, but they're very light, and there's a lot of room in this space. But I think this is gonna be an active area of sort of experimentation, and it will be very interesting.
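As a rough sketch of the kind of conflict detection being discussed, here is a three-way comparison of two branches' deltas against a common ancestor. Treating each subject-predicate slot as single-valued is a simplifying assumption made only for the example; it is not how TerminusDB's schema-aware tooling actually works.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

def delta(base: Set[Triple], branch: Set[Triple]):
    """Changes a branch made relative to the common ancestor: (added, removed)."""
    return branch - base, base - branch

def conflicts(base: Set[Triple], ours: Set[Triple], theirs: Set[Triple]):
    """Flag (subject, predicate) slots that the two branches changed differently.

    Single-valued slots are an illustrative assumption; real schemas may allow
    multiple values per predicate.
    """
    ours_add, ours_rem = delta(base, ours)
    theirs_add, theirs_rem = delta(base, theirs)
    as_slots = lambda changes: {(s, p): o for s, p, o in changes}
    ours_new, theirs_new = as_slots(ours_add), as_slots(theirs_add)
    # Both sides wrote the same slot with different objects.
    clashes = {slot for slot in ours_new.keys() & theirs_new.keys()
               if ours_new[slot] != theirs_new[slot]}
    # One side updated a slot that the other side deleted outright.
    clashes |= {(s, p) for s, p, _ in ours_rem} & set(theirs_new)
    clashes |= {(s, p) for s, p, _ in theirs_rem} & set(ours_new)
    return clashes

base = {("rome", "population", "40000")}
ours = {("rome", "population", "45000")}          # we revise the estimate
theirs = {("rome", "population", "50000")}        # they revise it differently
print(conflicts(base, ours, theirs))              # {('rome', 'population')}
```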
[00:37:59] Unknown:
Now digging into the actual query language and the interaction with the database, you mentioned that you have the WOQL query language built specifically for TerminusDB. I'm wondering what the reasoning was for creating a new syntax for working with this engine and what you drew on for inspiration as to how to structure that language.
[00:38:22] Unknown:
We looked at using SPARQL for a while, but SPARQL has a number of features that I found to be less than ideal. It has a syntax that's somewhat similar to SQL but lost a lot of the beauty of SQL. SQL is really beautiful because it's both declarative and composable in its nature. It's very easy to compose fragments of SQL together. Everything is a table, has a nice simple interface, and that was really appealing to me. I really like SQL, and I wanted something to work with graph query that had that same kind of feel, where everything felt very unified in its view of the world and where composability was really taken as central.
So our query language is inspired by Datalog, and the main difficulty, I guess, in learning Datalog type things is the idea of unification. But once you've wrapped your head around this idea that variables get assigned by matching and that once they have taken on a value, they don't change it during a given solution, it's a very flexible and powerful approach to query. The composability of WOQL as compared to SPARQL is just a lot greater. So you can mix and match. We have all kinds of manipulation that we do in the front end. So, like, our console is very largely written in WOQL itself; getting all of the data assets about Terminus, all the metadata, all that stuff, actually uses WOQL.
And it's just a lot more composable than you would get even from SQL, because SQL tends to be, you know, a string based language, whereas we communicate via an AST represented in JSON-LD. And so you get a very native sort of JavaScript or Python way of manipulating an AST data structure. It feels kind of like an SDK, feels a little bit like a query language, but really, you're just building up an AST that you're gonna send to the query endpoint. That makes composability just really incredibly good. So, like, we've had some pressure to implement a SPARQL endpoint for TerminusDB, and it is possible.
You could go ahead and do that. There's even a library in Prolog, so it wouldn't be too hard to add it. But for the most part, the use that most users make of SPARQL tends to be fairly unsophisticated, so it wouldn't take somebody that long to rewrite such a thing in WOQL, and I don't see it as a big impediment. I know it's kind of weird that we chose our own language, but we looked around at a lot of other graph query languages and just weren't very impressed with what they were. And so we decided we may as well write our own. It's still very early in the history of the graph database, so who's gonna win is still very much an open question.
And my personal opinion is that whoever wins, it's gonna be a Datalog. It's not gonna be these other weird languages that people are writing. It's gonna be something based very closely on Datalog, I think. And the fact that you have a custom language for the database and the fact that it is open source, I'm wondering
[00:41:41] Unknown:
how you have seen the overall growth of the community and what your strategy is for the overall longevity of the project and the sustainability of it.
[00:41:54] Unknown:
The fact that we wrote a lot of the query language in Prolog definitely has lowered the community interest in that fragment. So if you look at the Python SDK and the JavaScript SDK, especially the Python SDK, we've had a lot of community contributions. A lot of people have come in to help work on that stuff. And by contrast, on the server end of things, there have really only been a couple. The community around Rust is really burgeoning and growing, but the underlying data structures for a database are not easy. Like, it's not exactly the easiest place to jump into when you're trying to start on a project. So we did have a really good contributor in the Rust back end, named Sean Leather, and he did some incredible work with us.
But other than that, there hasn't been a lot of community development on that end of things. But in terms of the longevity of the project, you know, I think we're going to, moving forward, try to get more community involvement. Our community is really getting quite large now in terms of the people who are interested in using it. As I said, it's more on the front end that people tend to be contributing, and I'd like to see more growth in some of the back end stuff. But that'll come a little bit further down the line.
[00:43:13] Unknown:
And in terms of the business model for the company that you've built out around this project, it seems to be largely oriented around the TerminusHub platform that you're using as the access point for collaboration and for being able to publish and work on public datasets. And I'm curious if you can talk through some of the add on capabilities that you've built into that and some of the overall goals that you have both for the project and for the ways that it is used.
[00:43:41] Unknown:
Yeah. So, I mean, one of our goals is really to have this database where the best version of the database is the community version of the database. There isn't sort of just an open core model; the one you get is the best one that's there. So you can do all your sophisticated queries. You can have all of the things that you want in your database, and we make our money through Hub. So Hub is like GitHub. We wanna be like GitHub except for data. So if you want to collaborate, then the easiest way to do the collaboration is via our hub.
You can do it by setting up your own origins. You can do that right now with the command line tool. You can set an origin and communicate between different TerminusDB installations without ever going through Hub. That's already possible, but it's most convenient if you just use the online server. The online server we will continually develop in a direction similar to the way that GitHub has gone, where they have lots of different tools for exploration of published datasets. So you get a nice README page, you know, when you land on a repo. That kind of thing is the way that we wanna go forward with it, make the publishing much more seamless for people so that if you're trying to publish a public dataset, then you can easily do it. And then maybe even going forward, have, like, special pages that allow you to put up a view of the database that you have so you can actually extract meaningful information like maps or whatever and demonstrate your data in that way for public datasets.
So that's how we imagine going forward that we're going to have this sophisticated database. I think, like, if you want a sophisticated database that gets widespread developer adoption, it really has to be open source these days. I think it's really the only way, and so we're really quite committed to that. And I do think that the convenience of having a central hub is enough to really make a viable business model.
[00:45:44] Unknown:
One of the things that you mentioned there, I think, is very interesting: the ability to have multiple origins that you're working with for being able to federate the use of the datasets, similar to Git where you might fork a project and add some contributions and then try and submit a merge request back upstream. And I'm just wondering if you can talk through some of the most interesting or innovative or unexpected ways that you've seen TerminusDB being used.
[00:46:09] Unknown:
That's a really powerful one because there's a lot of things that you can do with that that aren't immediately obvious. The simplest thing that you can do, which is actually really powerful, is that you can take backups very easily just by pulling the latest version from some server. And you can federate these things. So you can have one that, you know, is connected to Hub. It feeds information from Hub so people can be working on the model or whatever. They push it to the production database. The production database has a connection to a backup, and the backup just pulls whatever the current commit is on the production database, and you have all of that backed up. One of the coolest things that I've seen in terms of using the sort of commit structure was a biochemist who was doing some complicated experiments, and they have all of this stuff moving around in the laboratory.
And so they have a schema model that models all of these different kinds of, like, assay trays, etcetera, and the different experiments that are being done, the merger of various different things into new things. And instead of representing it all in one flat knowledge graph, they were representing it as a series of commits where each operation was actually creating a new commit. And then because you can query any commit by looking at the commit graph in TerminusDB, you can actually look and see, like, a temporal view of a changing knowledge graph. And I thought that was not an intended outcome of the way that we structured things, but it was a very clever way of using what's possible with the commit graph. So it's kind of interesting that, like, DoltHub, for instance, has a commit graph as well to structure all their commits, but they're a relational database.
For us, the commit graph is exactly just a TerminusDB graph. It's a graph that you create with WOQL and you can manipulate with WOQL. So, you know, there's a lot of sort of meta level manipulation that you can do with WOQL because of that.
[00:48:12] Unknown:
In terms of your experience of building the project and growing the team around it and building the business to help support it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:48:24] Unknown:
Getting the data structures right is really tricky. Being compact and fast and durable, you know, meeting ACID requirements, and making it all perform well takes a lot of attention to detail and a lot of experimentation. I feel like there's constant work that can be done on this. We're increasingly confident in our underlying structure and its performance and stability, but we've learned a lot on the way. And we have ideas about how to make that better, but it just takes a really long time to get these things at the data layer correct.
So that's something I kinda knew going in, but it always surprises me how long these things take in practice. The other thing is just that merges are tricky. Merges are a tricky problem for fundamental reasons. There's a lot of power in the concept of merges, though. So there's all kinds of things that you can do on a persistent data structure that are kind of cool. So for instance, if I come in and I'm committing a merge or I'm committing a change, an update on a database, right, and then the database head moves out from under me, when I come to commit and the head has moved forward, then I have to restart my commit on top of the one that's already gotten there. So this is, I guess, optimistic concurrency with an MVCC type approach.
However, there are ways, if you have a schema, where you can also look at the commutativity between the two layers and see if they commute. If they commute, then you can actually just stick it on the head without rerunning the transaction. That's a merge quality. It has to do with maintaining the constraints on the database simultaneously. So there's a lot of stuff that can happen here. We're not currently doing that trick with the commutation, but it is possible. And a lot of things are possible in this sort of merge and reordering of commits that I think are going to be very interesting. In terms of merges on big datasets as well: when you get a very large dataset, you have a merge, and you have some kind of conflict because of the merge. Maybe somebody added something like a cardinality constraint that is then violated after the merge, and there are, like, a hundred million things that are no longer correct.
And then you have to have some way of viewing that in a way that would enable you to fix it. And so I think there's a lot of challenging work that has to go in here, and we have some ideas about how to do it. So for instance, we can have error reporting that is actually also written to a graph, a specific error reporting graph, and then you could do some of your fix ups by querying the error reports and then making those into queries that fix up the original. We currently have a way to have intermediate layers that don't meet the schema constraints.
So there's also possibilities of sort of CI/CD type situations where you have layers that are actually in inconsistent states and you allow them to be that way, and then you can do some kind of fix up on that inconsistent state to bring it into a consistent state later. And maybe it doesn't move the branch head, and you just have a tip that's sitting around, or I'm not sure. There's various ways we could deal with that. But I think there's a load of possibilities here, and I think as we go forward, that's gonna be a really exciting area of development and design.
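A minimal sketch of the commutation check described above, under the simplifying assumption that two deltas commute when they touch disjoint sets of triples; as noted, real commutation would also have to account for schema constraints such as cardinality.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

def footprint(additions: Set[Triple], removals: Set[Triple]) -> Set[Triple]:
    """Everything a delta touches."""
    return additions | removals

def commutes(delta_a, delta_b) -> bool:
    """A crude commutation test: the deltas touch disjoint triples.

    Real commutation also has to respect schema constraints (e.g. cardinality),
    which this sketch deliberately ignores.
    """
    a_add, a_rem = delta_a
    b_add, b_rem = delta_b
    return footprint(a_add, a_rem).isdisjoint(footprint(b_add, b_rem))

# The head advanced with delta_head while our transaction built delta_mine.
delta_head = ({("bob", "knows", "carol")}, set())
delta_mine = ({("alice", "knows", "dave")}, set())

if commutes(delta_head, delta_mine):
    print("apply delta_mine directly on the new head")   # no rerun needed
else:
    print("rebase: rerun the transaction on top of the new head")
```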
[00:51:45] Unknown:
And for people who are looking for some of the data versioning capabilities or the graph capabilities that are in Terminus DB, what are the cases where it's the wrong choice and they might be better served by another system?
[00:51:58] Unknown:
If you don't have a lot of concerns about the queryability of the data, then you're probably gonna have an easier time with something else like DVC or something like that. If you want a live database, then we're probably gonna be a better choice, but a more complicated one. If you have loads of write transactions, if you have constant updates like a stream of logging or something like that, then we should not be your main transaction processing database, because we will fall over. You could use us as an OLAP-type server for that scenario, but you'd need to batch the updates into reasonably sized chunks.
And reasonable being, I don't know, you could have a hundred a day or something like that, but tens of thousands of writes a day? TerminusDB just would not do very well with those types of scenarios.
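The practical upshot is to buffer small updates and commit them in large chunks rather than one transaction per event. A minimal sketch of that batching pattern, with a hypothetical client interface:

```python
import time


class BatchedWriter:
    """Buffer incoming updates and commit them in bulk, so a versioned store
    that is expensive per transaction sees a handful of large commits instead
    of a stream of tiny ones. The `client` interface is hypothetical."""

    def __init__(self, client, max_batch=10_000, max_age_seconds=3600):
        self.client = client
        self.max_batch = max_batch
        self.max_age = max_age_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def write(self, update):
        self.buffer.append(update)
        too_big = len(self.buffer) >= self.max_batch
        too_old = time.monotonic() - self.last_flush >= self.max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            # One commit (and therefore one new layer) for the whole chunk.
            self.client.apply_batch(self.buffer)
            self.client.commit(message=f"batched {len(self.buffer)} updates")
            self.buffer = []
        self.last_flush = time.monotonic()
```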
[00:52:48] Unknown:
And so as you look to the future of the project, what are some of the things that you have planned for the near to medium term, both for TerminusDB and the TerminusHub business that you've built around it?
[00:53:00] Unknown:
In the TerminusDB core, one of the things we really want is content-addressable hashing. It's on our roadmap for the near term, and this is gonna allow signed and encrypted layers, which I think is gonna be an important addition in terms of functionality. We have some ideas about even more succinct data structures to improve scale-up and ease of collaboration; we've been exploring some alternative succinct data structures for graphs, and I think that's gonna be quite cool. We also want to make TerminusHub more browsable. Right now it's just a stub: it works really well with your TerminusDB, but it's not very accessible from the web. We want to make it way more web accessible so that people can see which datasets you've put up and give people an idea of what's being used inside of TerminusDB for public datasets.
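On the content-addressable hashing point: the general idea is to name a layer by a cryptographic hash of its contents, so identical layers deduplicate naturally and a layer identifier can be signed. A rough illustration of the idea, not TerminusDB's actual layer format:

```python
import hashlib
import json


def layer_id(additions, deletions, parent_id):
    """Content-addressable layer identifier: hash a canonical serialization of
    the layer's added and deleted triples plus its parent's id. Identical
    content always yields the same id, which is what makes deduplication and
    signing possible. The serialization here is illustrative only."""
    canonical = json.dumps(
        {"parent": parent_id, "add": sorted(additions), "del": sorted(deletions)},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Two writers producing exactly the same change on the same parent get the same id.
example = layer_id(additions=[("alice", "knows", "bob")], deletions=[], parent_id="root")
```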
And then I think we're gonna see more attention to the user interface for merge strategies going forward. We're constantly improving the document editing interface, the curation facilities, and the model building facilities for the database. I think those will keep getting better, and we'll have lots of new stuff to see coming out in each version.
[00:54:15] Unknown:
Are there any other aspects of the TerminusDB project or the overall space of knowledge engines and graph databases and data versioning that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:28] Unknown:
I think the space of distributed data collaboration is gonna be absolutely huge. GitHub changed the world of software engineering, and this kind of approach hasn't come to data engineering yet, but I think it should. These CI/CD approaches, where you have continuous integration and continuous deployment and a place to have staged commits before going to production, these sorts of things are really important for data. I'm actually kind of surprised they haven't gotten there yet, but I think they will come, and I think TerminusDB is gonna be one of the solutions that fills that space. But it's gonna be a very big space. If you think about the number of people who write code versus the number of people who are curating data, the data curation aspect is just a lot bigger. You have, you know, hundreds of thousands of Excel users, for instance.
It's just a much larger space than software engineering. This space is gonna be absolutely massive, and I think there's data collaboration, CI/CD for data, data meshes. There's a lot of tooling that needs to go in to fill this absolutely massive chasm, and I think this is a place to look in the future.
[00:55:43] Unknown:
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing or contribute to Terminus DB, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:01] Unknown:
I think it's this distributed data collaboration aspect. That's what we went after, and it wasn't just because we were grasping at something. We had a problem that we needed to solve, and we just couldn't find anything that was filling that space in a way that we thought was appropriate. So if you have a problem with data curation or data management, with a complicated schema, or you just have lots of people who are all editing the same data, then I think you should give TerminusDB a try and see if it fills your needs.
[00:56:32] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with TerminusDB. As you said, collaboration around data, and being able to version it and have the safety to determine when to publish and what you might need to roll back, is a very important area. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day.
[00:56:54] Unknown:
Great. Thank you very much. Thanks for having me.
[00:57:02] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Gavin Mendel-Gleason and TerminusDB
The Genesis of TerminusDB
Challenges in Data Versioning and Graph Engines
Comparing TerminusDB with Other Data Versioning Tools
Implementation and Scalability of TerminusDB
Data Modeling and Interaction with TerminusDB
Collaboration and Versioning in TerminusDB
WOQL: The Query Language of TerminusDB
Community Growth and Business Model
Innovative Uses and Future Plans for TerminusDB
Challenges and Lessons Learned
Biggest Gaps in Data Management Tooling