Summary
As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem first hand while working on the Seshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
- You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
- Your host is Tobias Macey and today I’m interviewing Gavin Mendel-Gleason about TerminusDB, an open source model driven graph database for knowledge graph representation
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what TerminusDB is and what motivated you to build it?
- What are the use cases that TerminusDB and TerminusHub are designed for?
- There are a number of different reasons and methods for versioning data, such as the work being done with Datomic, LakeFS, DVC, etc. Where does TerminusDB fit in relation to those and other data versioning systems that are available today?
- Can you describe how TerminusDB is implemented?
- How has the design changed or evolved since you first began working on it?
- What was the decision process and design considerations that led you to choose Prolog as the implementation language?
- One of the challenges that have faced other knowledge engines built around RDF is that of scale and performance. How are you addressing those difficulties in TerminusDB?
- What are the scaling factors and limitations for TerminusDB? (e.g. volumes of data, clustering, etc.)
- How does the use of RDF triples and JSON-LD impact the audience for TerminusDB?
- How much overhead is incurred by maintaining a long history of changes for a database?
- How do you handle garbage collection/compaction of versions?
- How does the availability of branching and merging strategies change the approach that data teams take when working on a project?
- What are the edge cases in merging and conflict resolution, and what tools does TerminusDB/TerminusHub provide for working through those situations?
- What are some useful strategies that teams should be aware of for working effectively with collaborative datasets in TerminusDB?
- Another interesting element of the TerminusDB platform is the query language. What did you use as inspiration for designing it and how much of a learning curve is involved?
- What are some of the most interesting, innovative, or unexpected ways that you have seen TerminusDB used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building TerminusDB and TerminusHub?
- When is TerminusDB the wrong choice?
- What do you have planned for the future of the project?
Contact Info
- @GavinMGleason on Twitter
- GavinMendelGleason on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- TerminusDB
- TerminusHub
- Cheminformatics
- Type Theory
- Graph Database
- Trinity College Dublin
- Seshat Databank: analytics over civilizations in history
- PostgreSQL
- DGraph
- Grakn
- Neo4J
- Datomic
- LakeFS
- DVC
- Dolt
- Persistent Succinct Data Structure
- Currying
- Prolog
- WOQL: TerminusDB query language
- RDF
- JSON-LD
- Semantic Web
- Property Graph
- Hypergraph
- Super Node
- Bloom Filters
- Data Curation
- CRDT == Conflict-Free Replicated Data Types
- SPARQL
- Datalog
- AST == Abstract Syntax Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That's dataengineeringpodcast.com/talkpython, and don't forget to thank them for supporting the show.
Your host is Tobias Macey. And today, I'm interviewing Gavin Mendel-Gleason about TerminusDB, an open source model driven graph database for knowledge graph representation. So, Gavin, can you start by introducing yourself?
[00:01:48] Unknown:
Yeah. Sure. I'm Gavin Mendel-Gleason. I'm the CTO of TerminusDB, and I'm a computer scientist, but I identify as a data engineer.
[00:01:56] Unknown:
Do you remember how you first got involved in the area of data management?
[00:02:00] Unknown:
Yeah. So I've been doing data management for a very long time, for about 23 years, and my first job was actually looking at managing records of bed and breakfasts in Alaska. That's how I first got involved in it. And from there, I went on to do stuff with cheminformatics, large cheminformatics databases. And then about 15 years ago, I was involved in a graph database startup before graph databases were even really very well known. Since then, I've done a lot of stuff in text indexing, search engines, machine learning on large text corpora. I took a brief hiatus for a PhD in type theory, but then immediately went back to the data management problem after my PhD.
[00:02:44] Unknown:
And so all of that has led you to starting the TerminusDB project. I'm wondering if you can give a bit of an overview about what it is and what motivated you to build
[00:02:54] Unknown:
it. Great. Yeah. So, TerminusDB is a collaborative graph database with a rich schema modeling layer. So the motivation really came out of a project called Seshat, which I was working on at Trinity College Dublin. That project is a very ambitious project to store data about all civilizations in human history. They wanted to do analytics over information about those civilizations, and that information includes things like the population carrying capacity, number of important buildings, the kinds of rituals they had, whether they had human sacrifice, just a huge range of different data points.
And they had some very specific project requirements that actually turned out to be very generally useful in data curation and data management. So they needed to be able to collaborate between researchers all over the world, so they needed some ability to send data around easily. The data they were using is extremely complicated. Everything had to have confidence. It had to have geotemporal scoping. You had to know, you know, the range of times that something started at to the range of times that it may have ended at. So even the endpoints of the temporal scoping were uncertain in terms of time, and they had a lot of different kinds of data points that range from the numeric through to much more sort of subjective, and then various kinds of choice data points, like whether or not your major war animal was a camel or a horse, and these sorts of things were really quite important to them. In addition to that, the data was often very sparse. So you have these very elaborate schemata that have lots of different kinds of things that you might say about a civilization, but then you may not know many of them, or points might be uncoded because you haven't coded them yet. You just haven't entered data about it. They might be unknown, or you might have specific known quantities within certain confidence ranges. So it was a very elaborate data project.
And in addition to that, they needed staging and versioning of their documents because they would often have data entered by nonprofessionals or students, and then they had to have it reviewed by experts. And then it would go into some other staging area where you'd collect the data together. You'd do some sort of analysis with that data, and then finally, you'd prepare it for publishing. And the data needed to be easily publishable to the public on a regular basis in a format that was potentially machine readable, and they had to have it in a place where it could be embargoed for publication.
So you had to tag the data with specific sort of releases, which is very similar to what you might do with software. On top of all that, the data needed to be extractable via query in formats that were amenable to analytics and computer modeling. And the researchers were specifically very interested in predictive analytics on the past and even the present and future based on the models that they had designed that have to do with sociology and the dynamics of civilizations. So it's a really cool project, but it had some huge data management problems that were quite difficult to solve. So with those requirements, myself and the CEO of TerminusDB, Kevin Feeney, went looking around for pieces of a solution in a tool chain to meet them.
And we sort of cobbled some things together, but nothing we had was great. And really, it sort of pressed us into trying to build up a system that could actually do all of these things. And we began on this crazy task of actually writing a database ourselves on that basis.
[00:06:53] Unknown:
Yeah. And writing a database is definitely not an undertaking to embark on lightly. So I'm curious what you saw as being the major lacking components in the overall space of data versioning tools and graph engines, particularly at the time that you were getting started with the project?
[00:07:11] Unknown:
Yeah. So, I mean, one of the things was we wanted to be able to have this version control in the graph. There wasn't really anything at the time that could do that very well. We played around with a number of the sort of open source tools. We played with Jena. We had various experimentation, actually using Postgres at some point, and all of it felt very forced and difficult to work with and not particularly scalable. And after the Seshat project, we had another project on the same European funding proposal that involved ingesting all sorts of economic data about the Polish economy since the fall of the Warsaw Pact: which companies existed, who the directors were, shareholdings of those companies, that kind of thing. And the solutions that we had sort of patched together just couldn't scale up to ingesting that whole thing. And so that really pressed us to find a scalable mechanism of graph versioning that we could stick very large knowledge graphs into.
[00:08:19] Unknown:
Graph engines currently are definitely seeing a bit of a renaissance where there are a number of different projects that aim to provide support for that either natively or as an add on capability or as part of a multimodel ability within the graph engine. So thinking of things like DGraph and ArangoDB and Neo4j. And there are also a number of projects available for building these knowledge graphs and knowledge engines. I'm thinking in particular about things like Grakn. I'm wondering how you see TerminusDB sitting within that landscape. And if you were to start the project over again, do you think that it would still be worth building this engine as a dedicated project, or do you think that it would make more sense to build on top of some of the existing technologies?
[00:09:06] Unknown:
Yeah. I mean, that's a very good question. So, I mean, I think our core competence comes from the fact that we started with versioning as an important thing. We knew we needed the collaboration features, but it turns out versioning is the basis on which to do appropriate collaboration features, and Git really got this right. So the Git approach, where you have, you know, a delta encoding of all the changes, allows you to do all of this push, pull, merge, rebase, and all of these sorts of things are facilitated by that versioning technique.
And there are things like DoltHub, which I think is the closest to TerminusDB, but it's really a relational database. In terms of the graph databases, there's nothing that really does that collaboration well. And now we're in a place where, I didn't think it would take this long, frankly. Writing a database is hard work. It takes a long time. But I think that where we are now is much better than what we could have done in trying to build on top of something else. And there's a couple of reasons for that. Some of them have to do with the technical approaches that we took in actually implementing the database itself.
[00:10:13] Unknown:
My last question was about the positioning of TerminusDB in the context of graph engines. But as you said, one of the primary considerations of it is data versioning. And there are a number of other versioning tools out there for handling versioning of databases: LakeFS for being able to branch and merge and version datasets within object storage, DVC, or Data Version Control, for managing the versioning and collaboration capabilities on machine learning projects. I'm wondering if you can give a bit of context as to where TerminusDB fits in relation to some of those other data versioning systems and the major use cases that TerminusDB is uniquely well suited for.
[00:11:00] Unknown:
We're similar to a lot of those in the sense that we have versioning. So, I mean, DVC is really data versioning. We're a database that has versions, that's focused on the collaboration aspect. So I think that's where we really vary the most. We would see push, pull, clone, and merge as the sorts of things that really differentiate us from competitors. But again, Dolt and DoltHub, they have that same kind of idea, but it really is for relational databases rather than graph. Like Datomic, we have full time travel ability and can do branching. So in that sense, we're similar to Datomic and LakeFS, but I think it's really that idea of, okay, I'm working on this dataset for now. I create a branch. I do some experiments, and then I want to merge that back into a branch that I then share with somebody else who's going to do further experiments on it. And that ability to move things around easily is really, I think, where we find a strength in TerminusDB.
[00:12:01] Unknown:
And so digging more into the actual project, can you give a bit of context as to how TerminusDB is implemented and some of the ways that the design and objectives of the project have changed or evolved since you first began working on it?
[00:12:15] Unknown:
We are an in-memory database, and that has been a conscious choice from the very beginning. So I worked on a database previously, a graph database, about 15 or 16 years ago now. Paging is very difficult, and doing paging effectively is very difficult. And so if you can avoid it, there are lots of advantages. So I really spent a lot of time trying to think about how we could have big datasets and still avoid that very steep memory hierarchy. It is so steep that you really fall off a cliff if you start paging to disk, and you have to be very careful about how you page to disk when you're doing it that way. So instead, we opted for another thesis, and the central thesis is essentially that memory is growing relatively quickly. It's relatively cheap. You can get servers with lots of RAM in them now.
And many databases are small. I think there was some survey by the MySQL folks where they said that, like, 90% of databases were less than 2 gigabytes. So, you know, those 90% should obviously be in an in-memory database if you can do it, because it just simplifies so much of the engineering, and it really makes performance nice. But in order to do that, in order to be able to get really big knowledge graphs, you have to be very careful about the way that you engineer the data structures. So from the very beginning, we were looking at ways of really getting compact data structures in memory. And so we've settled on something that's called a persistent, succinct data structure.
It's persistent in the sense that all the versions of the data structure are available even after an update. So you can go back to any point in time. You can have branching trees of these things that facilitates all the sort of version control type aspects, but it also simplifies concurrency and update, which is a big advantage. And the other aspect is the succinct data structure. And so succinct data structures are specifically data structures that approach the information theoretic minimum size that you can have for a data structure while still enabling fast query at some specific computational complexity class.
And so we have used these succinct data structures to get very compact representations of the graph, and that allows us to have extremely large graphs in relatively small space. So for instance, DBpedia fits in under a gig, and so you can have a very large knowledge graph and stick it into memory without too much trouble. And we can see going further on this than we have to date. I think there's a lot of cool things that can happen in making these data structures even more succinct over time.
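As a rough illustration of the layered, persistent approach described here, the following is a minimal Python sketch, not TerminusDB's actual Rust implementation, and it ignores the succinct bit-level encoding entirely: each commit is an immutable delta layer that points at its parent, so every historical version and every branch stays queryable.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

@dataclass(frozen=True)
class Layer:
    """One immutable commit layer: a delta over its parent layer."""
    additions: FrozenSet[Triple]
    removals: FrozenSet[Triple]
    parent: Optional["Layer"] = None

    def contains(self, triple: Triple) -> bool:
        """A triple is present if the nearest layer mentioning it added it."""
        layer: Optional[Layer] = self
        while layer is not None:
            if triple in layer.additions:
                return True
            if triple in layer.removals:
                return False
            layer = layer.parent
        return False

    def commit(self, additions=(), removals=()) -> "Layer":
        """Produce a new head layer; the old head stays valid for time travel."""
        return Layer(frozenset(additions), frozenset(removals), parent=self)

# Usage: old versions stay readable, and branches share structure with the base.
base = Layer(frozenset({("alice", "knows", "bob")}), frozenset())
v2 = base.commit(additions={("bob", "knows", "carol")})
branch = base.commit(additions={("alice", "knows", "dave")})   # branch off base
assert v2.contains(("bob", "knows", "carol"))
assert not base.contains(("bob", "knows", "carol"))            # history unchanged
assert branch.contains(("alice", "knows", "dave")) and not v2.contains(("alice", "knows", "dave"))
```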
[00:15:05] Unknown:
And another interesting aspect of the implementation of TerminusDB is the fact that it was written in Prolog. And I'm curious what the decision process and design constraints were that led you to choose that as the language for actually building this and how that has affected the level of community contributions given that this is an open source project?
[00:15:26] Unknown:
The core is implemented in Rust and Prolog. So all of the low level data structure bit manipulation stuff is in Rust because it's quite a good language for low level memory layout concerns. Prolog is absolutely terrible at that kind of thing, so it's not what you'd want to use it for. However, Prolog is an extremely good query language. And really, I started actually implementing Terminus in Java. And as I was doing it, it just became apparent that I was writing a poor man's Prolog virtual machine. And a Warren abstract machine written by myself, you know, is not gonna be as good as some of the Prolog implementations that have had decades of programming effort go into them.
So Prolog has really gotten currying down pat, and they're very, very good at writing extremely efficient backtracking implementations. So I figured, well, I'll do some experiments, and sure enough, they were much faster than a naive implementation in something like Java. So we went back and we implemented most of our database in Prolog. Currently, the way that it works is that our query language, which is called WOQL, is actually compiled into Prolog, and then that Prolog is compiled to an abstract machine. And that turns out to be an extremely effective way of getting very fast query performance. All of the low level operations and low level backtracking on the data structures themselves are still done in Rust, but the high level implementation of the language is in Prolog.
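For a sense of the kind of work Prolog is doing here, this is a naive Python sketch of backtracking evaluation of conjunctive triple patterns with unification. It is purely illustrative and bears no relation to how WOQL is actually compiled to Prolog inside TerminusDB; variables are marked here with a "v:" prefix for the sake of the example.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

def is_var(term: str) -> bool:
    """Treat terms prefixed with 'v:' as query variables."""
    return term.startswith("v:")

def unify(pattern: Triple, fact: Triple, binding: Binding) -> Optional[Binding]:
    """Try to extend `binding` so that `pattern` matches `fact`; None on failure."""
    out = dict(binding)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if p in out and out[p] != f:
                return None          # variable already bound to a different value
            out[p] = f
        elif p != f:
            return None              # constant mismatch
    return out

def solve(patterns: List[Triple], facts: List[Triple],
          binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Depth-first, backtracking evaluation of a conjunction of triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    head, rest = patterns[0], patterns[1:]
    for fact in facts:
        extended = unify(head, fact, binding)
        if extended is not None:
            yield from solve(rest, facts, extended)   # backtrack when exhausted

facts = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
# "Who do the people that alice knows, know?"
query = [("alice", "knows", "v:X"), ("v:X", "knows", "v:Y")]
print(list(solve(query, facts)))   # [{'v:X': 'bob', 'v:Y': 'carol'}]
```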
[00:17:02] Unknown:
Going back to the concept of scalability, you mentioned that a large majority of datasets are actually under this 2 gigabyte threshold. And I'm curious what the scaling capabilities are for TerminusDB, particularly because I know that it can be very complex to shard a graph across multiple instances. And so wondering what the viable sizes and scales of usage are for TerminusDB, both in terms of volume of data, but also in terms of the scalability of concurrent usage?
[00:17:38] Unknown:
The way that we have implemented it, we have very few locks that are necessary anywhere in the system. The persistent data structures facilitate a non-locking approach to concurrency. This means that we can actually have multiple instances of TerminusDB running on the same backing store. So the backing store will keep information about all of the databases that exist in their current state, and we can load them into memory on different versions of TerminusDB, and they can advance the persistent data structure in lockstep with each other, all running on the same backing store.
So for reads, it's very easy to have multiple readers. For writes, you have write contention around trying to get your transaction in first, so there's going to be some scalability issues around that. In terms of sharding, we just decided not to try to shard graphs. Graph sharding is very difficult. The way that we have attempted to do it ourselves is largely on a design basis. So if you can horizontally partition it, then do so at the schema level and create a separate database, and then you can have it sharded in that way. And right now, memory is fairly cheap. So you can go to Google and you can get very big instances with a terabyte of memory, and that means you should be able to scale up to extremely large databases. So some of the datasets, like the one on the Polish economy, had a large number of edges; I think we had hundreds of millions of edges anyhow. So it should be very possible to scale up to over a billion edges without falling over on hardware that you could rent from Google or AWS. And, you know, we have ideas about how to shrink things even further, and I think that's a more productive way to go about most data use cases. There are some use cases, right, like where you have a Google-type situation where you're trying to spider the entire Internet, and then you probably wouldn't want to try to stick that all into one knowledge graph that fits into memory; that's not gonna work for you. But even when people think they have big data, it turns out that you'd be able to comfortably fit it into memory on a big enough machine.
[00:19:55] Unknown:
In terms of the modeling and interaction with the database, I know that the core data object is the RDF triple, which has become popular because of the work being done for the semantic web, and that the data structures that are returned and represented are in JSON-LD. And I'm wondering if you could just talk through some of the ways that that influences the data modeling considerations and the overall interaction for people using TerminusDB?
[00:20:30] Unknown:
RDF has some really great ideas in it, like the idea of using URIs to represent points, and we really like that idea. That makes it so you can move data around. You can merge databases in a more sensible sort of way because you don't have just arbitrary identifiers like node 1 or something like that. It forces you to think a little bit more clearly about how you're going to publish your datasets as well. So we really like that. The other thing we really liked about RDF is the simple modeling of everything as triples, because it just simplifies the design criteria for the database itself. However, RDF was really relegated mostly to academia.
There are a few niche areas in which industry has used RDF, but it hasn't really gotten broad popularity ever. And I think some of that is due to specific design choices that were encouraged by the academics around RDF and were not necessarily a good idea. So we think it's a really good basis for a database, but we haven't found much interest in TerminusDB because of its use of RDF. And we're not particularly, you know, going around shouting about how RDF is the most important part of it. We do see that URIs-as-data-points thing as quite important, but maybe not some of the other things that are associated with RDF.
Now there are implications to using an RDF database as opposed to a property graph database or as opposed to something like Grakn, which is a hypergraph database, and that is that the modeling has to be in terms of triples. So there's a slightly different approach to modeling that you have to take, but you can represent any of these sorts of things. So, like, a hypergraph edge is actually not very hard to model in RDF. You just have a specific class that you consider your relationship, and it has arrows off of it, more than one arrow off of it, and that sort of represents, you know, a hyperedge itself. So it's possible to do that. Property graphs, likewise, are not particularly difficult to model. It's just another class that has some data type properties off of it and then an edge coming in and an edge going out, and so it's sort of like a slightly fat edge modeled as a class. So you can do the same sorts of tricks that you do in property graphs or hypergraphs in RDF as long as you have tools to help you model it. So we've spent a lot of time trying to make TerminusDB's tools very good for modeling and to enable you to do that. And so in the latest TerminusDB 4.0 release, we have a visual modeling tool that helps you to design these things, and then there's various different views you can have of the schema and the data facilitated by that modeling.
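To make that reification pattern concrete, here is a small illustrative sketch; all of the class, property, and instance names are invented for the example rather than taken from any real TerminusDB schema.

```python
# A hypothetical relationship with more than two participants, reified as a
# node (an instance of a relationship class) with one triple per arrow.
hyperedge_triples = [
    ("rel:battle_001", "rdf:type",       "scm:BattleParticipation"),
    ("rel:battle_001", "scm:polity",     "doc:Polity_Rome"),
    ("rel:battle_001", "scm:adversary",  "doc:Polity_Carthage"),
    ("rel:battle_001", "scm:war_animal", "doc:Animal_Horse"),
]

# A property-graph style "fat edge" (an edge carrying its own attributes),
# modeled the same way: a class with data properties plus an in- and out-edge.
property_edge_triples = [
    ("rel:knows_42", "rdf:type",   "scm:Knows"),
    ("rel:knows_42", "scm:from",   "doc:Alice"),
    ("rel:knows_42", "scm:to",     "doc:Bob"),
    ("rel:knows_42", "scm:since",  "2018"),
    ("rel:knows_42", "scm:weight", "0.9"),
]
```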
In addition, we have, like, the ability to represent fragments of the graph as documents in JSON-LD, and that really allows you to go back and forth between a sort of graph view of the world and a document view of the world. And this is really useful for data curation. So when you're trying to edit, like, you know, a document that's about a patient record or something like that, you kind of want to display it as a record, and so you can have all of the data fields, etcetera, in there. You enter them in. In the back end, it's all actually a graph, but it's a fragment of the graph that's wrapped up into this JSON-LD object.
You can also communicate these back and forth. You can update it using a document view of the world. You can extract using a document view of the world, and you can query using a graph view of the world on top of those documents. So you can, like, say, okay, what was the patient record? You know, what was their age, etcetera? You can do those searches in the graph. And so JSON-LD and its ability to model fragments of a graph really make it nice, because you can go back and forward. It makes it a sort of multi-model database.
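As a hedged sketch of the document/graph duality described here, the following flattens a JSON-LD-style document fragment into triples. The "@id" and "@type" keys are standard JSON-LD; every other field name is invented for illustration and is not TerminusDB's actual document API.

```python
# A hypothetical patient-record document fragment.
patient_doc = {
    "@id": "doc:Patient_17",
    "@type": "scm:Patient",
    "scm:name": "Jane Doe",
    "scm:age": 54,
    "scm:attending": {"@id": "doc:Doctor_3"},
}

def doc_to_triples(doc):
    """Flatten one document fragment into (subject, predicate, object) triples."""
    subject = doc["@id"]
    triples = [(subject, "rdf:type", doc["@type"])]
    for key, value in doc.items():
        if key in ("@id", "@type"):
            continue
        # Nested objects with an "@id" become edges to other documents.
        obj = value["@id"] if isinstance(value, dict) else value
        triples.append((subject, key, obj))
    return triples

print(doc_to_triples(patient_doc))
# The same data can now be edited as a record or queried as graph edges.
```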
[00:24:29] Unknown:
And another aspect of RDF is that I know that a lot of the other engines that have used RDF and things like SPARQL for being able to store and query data have had challenges being able to scale the queries and the datasets. And you mentioned that being in memory has helped with that. And I know another concern with graph data models is the concept of supernodes. And so I'm wondering if you can talk a bit more about some of the ways that TerminusDB has worked to overcome some of those limitations and some of the data modeling considerations that people should be aware of as they are constructing and populating these graphs and building up these knowledge graphs?
[00:25:14] Unknown:
Yeah. So I mentioned that one of our first forays was in trying to use Postgres as the back end for our database, and we found that it didn't really scale up in terms of performance using an RDF triple based approach because essentially the link following was just too slow. The succinct data structure allows us to query in any mode. It's indexed, so it's very fast, especially in the subject-predicate direction, but it's also in log n time in the object to subject direction. So we have it indexed in such a way that you can basically do any mode of query with a very compact representation.
I guess there were a lot of approaches where people had double indexes on every single element. You ended up with really big databases, and it didn't scale very well or perform very well. We found that this compact representation scales very well and is much more performant than most of the RDF-type databases, especially when you get up to large datasets. So I think that we've done really well there. I'm quite happy with the performance. We have other things that are necessary for improving performance. So the persistent data structures degrade in performance over time. And so we have something called a delta roll up, which helps us to improve the query performance by essentially forming a snapshot that lets you see, at a particular commit, what the state of the database would be if you flattened all of the changes that were made on the persistent data structure onto a single plane.
That also improves performance when transaction chains become long. This delta roll up is not out in 4.0. We're gonna get it out for 4.1. We already have it implemented. We're just working on making sure that it's correct and that there are no bugs in it before we launch it. But this should help a lot with the performance, and we have some heuristics for automatically putting in these delta roll ups without causing memory problems for people. So that's one of the things that we've done in order to improve performance. There are other things as well. So, like, if you can compact things even further or you can get locality of reference, then you can improve performance quite a lot by fitting into cache lines.
So we have some ideas about how to do that as well by doing sort of reordering of the way that our datasets are stored so that we consistently query information into a sort of compacted space. And this can be done without too much difficulty because we have a sort of pointer swizzling trick that allows us to reorder these without too much undue cost in the queries. We ran into this quite a bit actually when we were doing long search chains in the Polish economy data. So this is actually a very serious concern. If you hit a super node, suddenly the size of potential solutions expands radically.
Now we were still able to go much deeper than we were on Neo4j. So we tried it against Neo4j, and they were falling over on the same dataset at about 5 or 6 hops, and we were going up to 11 hops. And this is partly due to the fact that, you know, everything is resident in memory and compact, and that allowed us to do a lot of that. And the fact that we were doing a depth first search, so we didn't keep resident in memory very much extra bookkeeping information, which allowed us to do it even using a very naive approach where we were able to make it through supernodes. However, we had some ideas about how to improve supernode performance using bloom filters that we prototyped but have not actually released in the database yet. As we found, most of the people who are using TerminusDB are really using it for data curation and the sort of data management approach stuff. We thought initially we'd get more pressure from people who want to use it for sophisticated graph search, and that just hasn't been the case as yet. So we're going where our users are. We wanna make it usable for our users. So if we start getting those kinds of queries pressing on us later down the line, we'll pay a little bit more attention to it. But at the moment, we don't have any plans in our road map to improve performance on those particular types of problems.
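A minimal sketch of the delta roll up idea mentioned above, assuming a simplified layer representation (sets of added and removed triples); the real implementation works on succinct structures rather than Python sets.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

def roll_up(layers: List[dict]) -> dict:
    """Flatten a chain of delta layers (oldest first) into one snapshot layer.

    Each layer is a dict with 'additions' and 'removals' sets of triples; the
    resulting snapshot answers reads in one step instead of walking the chain.
    """
    visible: Set[Triple] = set()
    for layer in layers:
        visible -= layer["removals"]
        visible |= layer["additions"]
    return {"additions": visible, "removals": set()}

history = [
    {"additions": {("a", "knows", "b")}, "removals": set()},
    {"additions": {("b", "knows", "c")}, "removals": set()},
    {"additions": set(), "removals": {("a", "knows", "b")}},
]
print(roll_up(history))
# {'additions': {('b', 'knows', 'c')}, 'removals': set()}
```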
[00:29:40] Unknown:
You mentioned the snapshotting capabilities for being able to collapse all the versions into a single reference point for the sake of query performance. I'm wondering what the other aspects of overhead are that are incurred by maintaining the history of changes for various objects within the database and some of the strategies that you have for being able to potentially handle life cycling of those versions or things like garbage collection and compaction over time?
[00:30:07] Unknown:
Yeah. Those are very good questions. So hilariously enough, we don't have a working garbage collector at the moment. It's on our road map, and we intend to start working on it, I think, either this month or next month, depending on how other things go. But we know that we need it. The surprising thing is that we haven't needed it yet, even though we're working with really big datasets in production on large data systems. For instance, we have a large retailer that's using our system. Because our layers are so compact, because we use this succinct representation, we actually don't take up very much memory.
So far, we've been able to just limp along without actually dealing with the fact that we're generating lots of garbage. And I found this kind of interesting because I've heard some of the other version control databases have had much more difficulty with this, but I think this is really an advantage of succinct data structures again, that you really wanna keep things as compacted as possible in the first place. We're going to have a garbage collector for dealing with this. There's a number of things that have to be done. So the first one's gonna be relatively simple, one that doesn't need to know very much. But then there's other things like the commit graph, which keeps a graph of all of the commit histories, because the commit histories are not necessarily linear. They can branch out and, you know, they could be trees. They could even merge back, so things get relatively complicated.
But you can also have tips that have no branch head or tag on them, similar to in Git, and so you need to do some kind of pruning. So we're also gonna have to do pruning at some point on the commit graph, and how we wanna do that, whether we make it a feature where people have to call a pruning operation or something, we haven't really decided yet. That's one of the things that we're going to be working on in the future, adding those sorts of features. I think it's a really effective way of doing it. It allows you to basically ignore the other versions unless you specifically search for them. And then you do have to load them for those specific searches, but you can have strategies for purging that from memory after they're no longer being used.
And that should be able to deal with the performance pretty well, I think, as of yet. So when the version histories get very long, we're gonna need some sort of solution. At the moment, we can do squashes. A squash is like a delta roll up, except it actually squashes all the commits into a single commit layer and forgets the history. That can be a solution if you actually don't need your history anymore. But if you need all of your history going all the way back, there's the combination of delta roll ups and then, like, you know, just not loading: if you don't query it, it won't get loaded into memory, so it's just on disk. And I think we can start thinking about archiving the older layers back to some kind of cold storage later on. But we're gonna have to look at that more going forward.
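A hedged sketch of what reachability-based pruning of a commit graph could look like, analogous to git gc: keep everything reachable from a branch head or tag, and treat the rest as candidates for collection. The representation is invented for illustration; it is not TerminusDB's planned garbage collector.

```python
from typing import Dict, List, Set

def reachable_commits(heads: List[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Walk parent links from every branch head or tag and mark what we reach."""
    keep: Set[str] = set()
    stack = list(heads)
    while stack:
        commit = stack.pop()
        if commit in keep:
            continue
        keep.add(commit)
        stack.extend(parents.get(commit, []))
    return keep

# parents maps commit id -> parent commit ids (a merge would have two parents).
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c2"]}  # c4 is an orphaned tip
heads = ["c3"]                          # only one branch head survives
keep = reachable_commits(heads, parents)
garbage = set(parents) - keep           # layers owned only by these could be collected
print(sorted(keep), sorted(garbage))    # ['c1', 'c2', 'c3'] ['c4']
```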
[00:33:03] Unknown:
You invest so much in your data infrastructure, you simply can't afford to settle for unreliable data. Fortunately, there's hope. In the same way that New Relic, Datadog, and other application performance management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo's end to end data observability platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end to end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data.
Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 people will receive a free limited edition Monte Carlo hat. And digging more into the actual collaboration capabilities and branching and merging and the versioning aspects of TerminusDB, I'm wondering if you can talk through some of the ways that those capabilities impact the approach that data teams take when working on a given project, some of the capabilities that it adds, and some of the challenges that it might introduce for how to effectively manage those strategies. And are the best practices for Git directly applicable to TerminusDB, or are there different aspects to it?
[00:34:41] Unknown:
It's honestly a game changer. Like, we find ourselves working with curation of data and fixing data all the time, and we're a distributed team at TerminusDB. We work all over the place, and we can just send these layers to each other. Somebody can create an experiment. They can find a bug or something like that, and they can send it to me by, you know, just push and pull, and, you know, it's really incredible. It's really magical. And I think that when people start using it, they're gonna be amazed at how well it works. The succinct data structure, again, facilitates the communication of this data because it's relatively small deltas that you send. You don't send the whole database all the time, but even when you do, when you do a clone, even very large databases, you know, fit into a relatively small amount of space. So that also facilitates this data sharing aspect.
I don't think any of this stuff existed out of the box previously. It's gonna take a while before people really get used to it. Now in terms of merging, we use rebase quite a lot. We don't have a full three-way merge at the moment. That's also on our road map, so we wanna be able to do that. In terms of the conflict resolution, that is an extremely complicated problem. So for the most part, with a lot of the examples that we've worked on, and we're eating our own dog food, so we use it in large deployments already, we haven't had too many problems with it. But when you do have a problem, our tools for fixing it are relatively light. They're not very helpful for the user. You have to be a bit sophisticated. So we have fix up queries that allow you to alter the final state before the commit, and that allows you to bring it back into compliance with the schema before the commit takes place, allowing the commit to go through.
That can be a little bit daunting if you have a complicated situation and you're a novice user. And that's really an area we want to explore more for various types of solutions. So automatic merge strategies, fix up strategies, help screens that give you some information about the commits, and libraries of merge tolerant data structures like CRDTs that can help people avoid these types of problems. So, I mean, with code management, a lot of the merge conflict resolution is completely manual, and you just have to go in and edit the two areas between the commits to make it so that it goes through and then figure out that semantically it's correct so that your code merges.
So we're kind of in that stage at the moment, but we have more information than Git has about the data. So we have a schema. So we should be able to give you a lot more information about strategies that you might take. And I think also that those sort of merge tolerant data structures are a place where this can be resolved automatically in a lot of cases if special attention is taken towards creating the right types of schema. But like I said, we have tools for it right now, but they're very light, and there's a lot of room in this space. But I think this is gonna be an active area of sort of experimentation, and it will be very interesting.
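As a rough sketch of the kind of conflict detection being discussed, here is a three-way comparison of two branches' deltas against a common ancestor. Treating each subject-predicate slot as single-valued is a simplifying assumption made only for the example; it is not how TerminusDB's schema-aware tooling actually works.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

def delta(base: Set[Triple], branch: Set[Triple]):
    """Changes a branch made relative to the common ancestor: (added, removed)."""
    return branch - base, base - branch

def conflicts(base: Set[Triple], ours: Set[Triple], theirs: Set[Triple]):
    """Flag (subject, predicate) slots that the two branches changed differently.

    Single-valued slots are an illustrative assumption; real schemas may allow
    multiple values per predicate.
    """
    ours_add, ours_rem = delta(base, ours)
    theirs_add, theirs_rem = delta(base, theirs)
    as_slots = lambda changes: {(s, p): o for s, p, o in changes}
    ours_new, theirs_new = as_slots(ours_add), as_slots(theirs_add)
    # Both sides wrote the same slot with different objects.
    clashes = {slot for slot in ours_new.keys() & theirs_new.keys()
               if ours_new[slot] != theirs_new[slot]}
    # One side updated a slot that the other side deleted outright.
    clashes |= {(s, p) for s, p, _ in ours_rem} & set(theirs_new)
    clashes |= {(s, p) for s, p, _ in theirs_rem} & set(ours_new)
    return clashes

base = {("rome", "population", "40000")}
ours = {("rome", "population", "45000")}          # we revise the estimate
theirs = {("rome", "population", "50000")}        # they revise it differently
print(conflicts(base, ours, theirs))              # {('rome', 'population')}
```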
[00:37:59] Unknown:
Now digging into the actual query language and the interaction with the database, you mentioned that you have the WOQL query language built specifically for TerminusDB. I'm wondering what the reasoning was for creating a new syntax for working with this engine and what you drew on for inspiration as to how to structure that language.
[00:38:22] Unknown:
We looked at using SPARQL for a while, but SPARQL has a number of features that I found to be less than ideal. It has a syntax that's somewhat similar to SQL but lost a lot of the beauty of SQL. SQL is really beautiful because it's both declarative and composable in its nature. It's very easy to compose fragments of SQL together. Everything is a table, has a nice simple interface, and that was really appealing to me. I really like SQL, and I wanted something to work with graph query that had that same kind of feel, where everything felt very unified in its view of the world and where composability was really taken as central.
So our query language is inspired by Datalog, and the main difficulty, I guess, in learning Datalog type things is the idea of unification. But once you've wrapped your head around this idea that variables get assigned by matching and that once they have taken on a value, they don't change it during a given solution, it's a very flexible and powerful approach to query. The composability of WOQL as compared to SPARQL is just a lot greater. So you can mix and match. We have all kinds of manipulation that we do in the front end. So, like, our console is very largely written in WOQL itself; getting all of the data assets about Terminus, all the metadata, all that stuff, actually uses WOQL.
And it's just a lot more composable than you would get even from SQL, because SQL tends to be, you know, a string based language, whereas we communicate via an AST represented in JSON-LD. And so you get a very native sort of JavaScript or Python way of manipulating an AST data structure. It feels kind of like an SDK, feels a little bit like a query language, but really, you're just building up an AST that you're gonna send to the query endpoint. That makes composability just really incredibly good. So, like, we've had some pressure to implement a SPARQL endpoint for TerminusDB, and it is possible.
You could go ahead and do that. There's even a library in Prolog, so it wouldn't be too hard to add it. But for the most part, the use that most users make of SPARQL tends to be fairly unsophisticated, so it wouldn't take somebody that long to rewrite such a thing in WOQL, and I don't see it as a big impediment. I know it's kind of weird that we chose our own language, but we looked around at a lot of other graph query languages and just weren't very impressed with what they were. And so we decided we may as well write our own. It's still very early in the history of the graph database, so who's gonna win is still very much an open question.
And my personal opinion is that whoever wins, it's gonna be a Datalog. It's not gonna be these other weird languages that people are writing. It's gonna be something based very closely on Datalog, I think. And the fact that you have a custom language for the database and the fact that it is open source, I'm wondering
[00:41:41] Unknown:
how you have seen the overall growth of the community and what your strategy is for the overall longevity of the project and the sustainability of it.
[00:41:54] Unknown:
The fact that we wrote a lot of the query language in Prolog definitely has lowered the community interest in that fragment. So if you look at the Python SDK and the JavaScript SDK, especially the Python SDK, we've had a lot of community contributions. A lot of people have come in to help work on that stuff. And by contrast, on the server end of things, there have really only been a couple. The community around Rust is really burgeoning and growing, but the underlying data structures for a database are not easy. Like, it's not exactly the easiest place to jump into when you're trying to start on a project. So we did have a really good contributor in the Rust back end, named Sean Leather, and he did some incredible work with us.
But other than that, there hasn't been a lot of community development on that end of things. But in terms of the longevity of the project, you know, I think we're going to, moving forward, try to get more community involvement. Our community is really getting quite large now in terms of the people who are interested in using it. As I said, it's more on the front end that people tend to be contributing, and I'd like to see more growth in some of the back end stuff. But that'll come a little bit further down the line.
[00:43:13] Unknown:
And in terms of the business model for the company that you've built out around this project, it seems to be largely oriented around the TerminusHub platform that you're using as the access point for collaboration and for being able to publish and work on public datasets. And I'm curious if you can talk through some of the add on capabilities that you've built into that and some of the overall goals that you have both for the project and for the ways that it is used.
[00:43:41] Unknown:
Yeah. So, I mean, one of our goals is really to have this database where the best version of the database is the community version of the database. There isn't sort of just an open core model; the one you get is the best one that's there. So you can do all your sophisticated queries. You can have all of the things that you want in your database, and we make our money through Hub. So Hub is like GitHub. We wanna be like GitHub except for data. So if you want to collaborate, then the easiest way to do the collaboration is via our hub.
You can do it by setting up your own origins. You can do that right now with the command line tool. You can set an origin and communicate between different TerminusDB installations without ever going through Hub. That's already possible, but it's most convenient if you just use the online server. The online server we will continually develop in a direction similar to the way that GitHub has gone, where they have lots of different tools for exploration of published datasets. So you get a nice README page, you know, when you land on a repo. That kind of thing is the way that we wanna go forward with it, make the publishing much more seamless for people so that if you're trying to publish a public dataset, then you can easily do it. And then maybe even going forward, have, like, special pages that allow you to put up a view of the database that you have so you can actually extract meaningful information like maps or whatever and demonstrate your data in that way for public datasets.
So that's how we imagine going forward that we're going to have this sophisticated database. I think, like, if you want a sophisticated database that gets widespread developer adoption, it really has to be open source these days. I think it's really the only way, and so we're really quite committed to that. And I do think that the convenience of having a central hub is enough to really make a viable business model.
[00:45:44] Unknown:
One of the things that you mentioned there, I think, is very interesting: the ability to have multiple origins that you're working with for being able to federate the use of the datasets, similar to Git where you might fork a project and add some contributions and then try and submit a merge request back upstream. And I'm just wondering if you can talk through some of the most interesting or innovative or unexpected ways that you've seen TerminusDB being used.
[00:46:09] Unknown:
That's a really powerful one because there's a lot of things that you can do with that that aren't immediately obvious. The simplest thing that you can do, which is actually really powerful, is that you can take backups very easily just by pulling the latest version from some server. And you can federate these things. So you can have one that, you know, is connected to Hub. It feeds information from Hub so people can be working on the model or whatever. They push it to the production database. The production database has a connection to a backup, and the backup just pulls whatever the current commit is on the production database, and you have all of that backed up. One of the coolest things that I've seen in terms of using the sort of commit structure was a biochemist who was doing some complicated experiments, and they have all of this stuff moving around in the laboratory.
And so they have a schema model that models all of these different kinds of, like, assay trays, etcetera, and the different experiments that are being done, the merger of various different things into new things. And instead of representing it all in one flat knowledge graph, they were representing it as a series of commits where each operation was actually creating a new commit. And then because you can query any commit by looking at the commit graph in TerminusDB, you can actually look and see, like, a temporal view of a changing knowledge graph. And I thought that was not an intended outcome of the way that we structured things, but it was a very clever way of using what's possible with the commit graph. So it's kind of interesting that, like, DoltHub, for instance, has a commit graph as well to structure all their commits, but they're a relational database.
For us, the commit graph is exactly just a TerminusDB graph. It's a graph that you create with WOQL and you can manipulate with WOQL. So, you know, there's a lot of sort of meta level manipulation that you can do with WOQL because of that.
[00:48:12] Unknown:
In terms of your experience of building the project and growing the team around it and building the business to help support it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:48:24] Unknown:
Getting the data structures right is really tricky. Being compact and fast and durable, you know, meeting ACID requirements, and making it all perform well takes a lot of attention to detail and a lot of experimentation. I feel like there's constant work that can be done on this. We're increasingly confident in our underlying structure and its performance and stability, but we've learned a lot on the way. And we have ideas about how to make that better, but it just takes a really long time to get these things at the data layer correct.
So that's something I kinda knew going in, but it always surprises me how long these things take in practice. The other thing is just that merges are tricky. Merges are a tricky problem for fundamental reasons. There's a lot of power in the concept of merges, though. So there's all kinds of things that you can do on a persistent data structure that are kind of cool. So for instance, if I come in and I'm committing a merge or I'm committing a change, an update on a database, right, and then the database head moves out from under me, when I come to commit and the head has moved forward, then I have to restart my commit on top of the one that's already gotten there. So this is, I guess, optimistic concurrency with an MVCC type approach.
However, there are ways, if you have a schema, where you can also look at the commutativity between the two layers and see if they commute. If they commute, then you can actually just stick it on the head without rerunning the transaction. That's a merge quality. It has to do with maintaining the constraints on the database simultaneously. So there's a lot of stuff that can happen here. We're not currently doing that trick with the commutation, but it is possible. And a lot of things are possible in this sort of merge and reordering of commits that I think are going to be very interesting. In terms of merges on big datasets as well: when you get a very large dataset, you have a merge, and you have some kind of conflict because of the merge. Maybe somebody added something like a cardinality constraint that is then violated after the merge, and there are, like, a hundred million things that are no longer correct.
And then you have to have some way of viewing that in a way that would enable you to fix it. And so I think there's a lot of challenging work that has to go in here, and we have some ideas about how to do it. So for instance, we can have error reporting that is actually also written to a graph, a specific error reporting graph, and then you could do some of your fix ups by querying the error reports and then making those into queries that fix up the original. We currently have a way to have intermediate layers that don't meet the schema constraints.
So there's also possibilities of sort of CI/CD type situations where you have layers that are actually in inconsistent states and you allow them to be that way, and then you can do some kind of fix up on that inconsistent state to bring it into a consistent state later. And maybe it doesn't move the branch head, and you just have a tip that's sitting around, or I'm not sure. There's various ways we could deal with that. But I think there's a load of possibilities here, and I think as we go forward, that's gonna be a really exciting area of development and design.
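A minimal sketch of the commutation check described above, under the simplifying assumption that two deltas commute when they touch disjoint sets of triples; as noted, real commutation would also have to account for schema constraints such as cardinality.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

def footprint(additions: Set[Triple], removals: Set[Triple]) -> Set[Triple]:
    """Everything a delta touches."""
    return additions | removals

def commutes(delta_a, delta_b) -> bool:
    """A crude commutation test: the deltas touch disjoint triples.

    Real commutation also has to respect schema constraints (e.g. cardinality),
    which this sketch deliberately ignores.
    """
    a_add, a_rem = delta_a
    b_add, b_rem = delta_b
    return footprint(a_add, a_rem).isdisjoint(footprint(b_add, b_rem))

# The head advanced with delta_head while our transaction built delta_mine.
delta_head = ({("bob", "knows", "carol")}, set())
delta_mine = ({("alice", "knows", "dave")}, set())

if commutes(delta_head, delta_mine):
    print("apply delta_mine directly on the new head")   # no rerun needed
else:
    print("rebase: rerun the transaction on top of the new head")
```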
[00:51:45] Unknown:
And for people who are looking for some of the data versioning capabilities or the graph capabilities that are in Terminus DB, what are the cases where it's the wrong choice and they might be better served by another system?
[00:51:58] Unknown:
If you don't have a lot of concerns about the queryability of the data, then you're probably gonna have an easier time with something else like DVC or something like that. If you want a live database, then we're probably gonna be a better choice, but a more complicated one. If you have loads of write transactions, if you have constant updates like a stream of logging or something like that, then we should not be your main transaction processing database, because we will fall over. You could use us as an OLAP-type server for that scenario, but you'd need to batch the updates into reasonably sized chunks.
And reasonable being, I don't know, you could have a hundred a day or something like that, but tens of thousands of writes a day? TerminusDB just would not do very well with those types of scenarios.
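The practical upshot is to buffer small updates and commit them in large chunks rather than one transaction per event. A minimal sketch of that batching pattern, with a hypothetical client interface:

```python
import time


class BatchedWriter:
    """Buffer incoming updates and commit them in bulk, so a versioned store
    that is expensive per transaction sees a handful of large commits instead
    of a stream of tiny ones. The `client` interface is hypothetical."""

    def __init__(self, client, max_batch=10_000, max_age_seconds=3600):
        self.client = client
        self.max_batch = max_batch
        self.max_age = max_age_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def write(self, update):
        self.buffer.append(update)
        too_big = len(self.buffer) >= self.max_batch
        too_old = time.monotonic() - self.last_flush >= self.max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            # One commit (and therefore one new layer) for the whole chunk.
            self.client.apply_batch(self.buffer)
            self.client.commit(message=f"batched {len(self.buffer)} updates")
            self.buffer = []
        self.last_flush = time.monotonic()
```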
[00:52:48] Unknown:
And so as you look to the future of the project, what are some of the things that you have planned for the near to medium term, both for TerminusDB and the TerminusHub business that you've built around it?
[00:53:00] Unknown:
In the TerminusDB core, one of the things we really want is content-addressable hashing. It's on our roadmap for the near term, and this is gonna allow signed and encrypted layers, which I think is gonna be an important addition in terms of functionality. We have some ideas about even more succinct data structures to improve scale-up and ease of collaboration; we've been exploring some alternative succinct data structures for graphs, and I think that's gonna be quite cool. We also want to make TerminusHub more browsable. Right now it's just a stub: it works really well with your TerminusDB, but it's not very accessible from the web. We want to make it way more web accessible so that people can see which datasets you've put up and give people an idea of what's being used inside of TerminusDB for public datasets.
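On the content-addressable hashing point: the general idea is to name a layer by a cryptographic hash of its contents, so identical layers deduplicate naturally and a layer identifier can be signed. A rough illustration of the idea, not TerminusDB's actual layer format:

```python
import hashlib
import json


def layer_id(additions, deletions, parent_id):
    """Content-addressable layer identifier: hash a canonical serialization of
    the layer's added and deleted triples plus its parent's id. Identical
    content always yields the same id, which is what makes deduplication and
    signing possible. The serialization here is illustrative only."""
    canonical = json.dumps(
        {"parent": parent_id, "add": sorted(additions), "del": sorted(deletions)},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Two writers producing exactly the same change on the same parent get the same id.
example = layer_id(additions=[("alice", "knows", "bob")], deletions=[], parent_id="root")
```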
And then I think we're gonna see more attention to the user interface for merge strategies going forward. We're constantly improving the document editing interface, the curation facilities, and the model building facilities for the database. I think those will keep getting better, and we'll have lots of new stuff to see coming out in each version.
[00:54:15] Unknown:
Are there any other aspects of the TerminusDB project or the overall space of knowledge engines and graph databases and data versioning that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:28] Unknown:
I think the space of distributed data collaboration is gonna be absolutely huge. GitHub changed the world of software engineering, and this kind of approach hasn't come to data engineering yet, but I think it should. These CI/CD approaches, where you have continuous integration and continuous deployment and a place to have staged commits before going to production, these sorts of things are really important for data. I'm actually kind of surprised they haven't gotten there yet, but I think they will come, and I think TerminusDB is gonna be one of the solutions that fills that space. But it's gonna be a very big space. If you think about the number of people who write code versus the number of people who are curating data, the data curation aspect is just a lot bigger. You have, you know, hundreds of thousands of Excel users, for instance.
It's just a much larger space than software engineering. This space is gonna be absolutely massive, and I think there's data collaboration, CI/CD for data, data meshes. There's a lot of tooling that needs to go in to fill this absolutely massive chasm, and I think this is a place to look in the future.
[00:55:43] Unknown:
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing or contribute to Terminus DB, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:01] Unknown:
I think it's this distributed data collaboration aspect. That's what we went after, and it wasn't just because we were grasping at something. We had a problem that we needed to solve, and we just couldn't find anything that was filling that space in a way that we thought was appropriate. So if you have a problem with data curation or data management, with a complicated schema, or you just have lots of people who are all editing the same data, then I think you should give TerminusDB a try and see if it fills your needs.
[00:56:32] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with TerminusDB. As you said, collaboration around data, and being able to version it and have the safety to determine when to publish and what you might need to roll back, is a very important area. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day.
[00:56:54] Unknown:
Great. Thank you very much. Thanks for having me.
[00:57:02] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Gavin Mendel-Gleason and TerminusDB
The Genesis of TerminusDB
Challenges in Data Versioning and Graph Engines
Comparing TerminusDB with Other Data Versioning Tools
Implementation and Scalability of TerminusDB
Data Modeling and Interaction with TerminusDB
Collaboration and Versioning in TerminusDB
WOQL: The Query Language of TerminusDB
Community Growth and Business Model
Innovative Uses and Future Plans for TerminusDB
Challenges and Lessons Learned
Biggest Gaps in Data Management Tooling