Summary
Databases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer-friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world's first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
- Your host is Tobias Macey and today I’m interviewing Stephen Goldberg about HarperDB, a developer-friendly distributed database engine designed to scale across edge and cloud environments
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what HarperDB is and the story behind it?
- There has been an explosion of database engines over the past 5 – 10 years, with each entrant offering specific capabilities. What are the use cases that HarperDB is focused on addressing?
- What are the issues that you experienced with existing database engines that led to the creation of HarperDB?
- In what ways does HarperDB address those issues?
- What are some of the ways that the focus on developers has influenced the interfaces and features of HarperDB?
- What is your view on the role of the database in the near to medium future?
- Can you describe how HarperDB is implemented?
- How have the design and goals changed from when you first started working on it?
- One of the common difficulties in document oriented databases is being able to conduct performant joins. What are the considerations that users need to be aware of as they are designing their data models?
- What are some examples of deployment topologies that HarperDB can support given the pub/sub replication model?
- What are some of the data modeling/database design strategies that users of HarperDB should know in order to take full advantage of its capabilities?
- With the dynamic schema capabilities allowing developers to add attributes and mutate the table structure at any point, what are the options for schema enforcement? (e.g. add an integer attribute and another record tries to write a string to that attribute location)
- What are the most interesting, innovative, or unexpected ways that you have seen HarperDB used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on HarperDB?
- When is HarperDB the wrong choice?
- What do you have planned for the future of HarperDB?
Contact Info
- @sgoldberg on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- HarperDB
- @harperdbio on Twitter
- Mulesoft
- Zapier
- LMDB
- SocketIO
- SocketCluster
- MongoDB
- CouchDB
- PostgreSQL
- VoltDB
- Heroku
- SAP/Hana
- NodeJS
- DynamoDB
- CockroachDB
- Fastify
- HTAP == Hybrid Transactional Analytical Processing
- Splunk
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Your host is Tobias Macey. And today, I'm interviewing Stephen Goldberg about HarperDB, a distributed database engine designed to scale across edge and cloud environments. So, Stephen, can you start by introducing yourself?
[00:01:53] Unknown:
For sure. First of all, Tobias, thanks for having me here. I'm really excited to be here. I'm Stephen Goldberg. I'm one of the co-founders and I'm the CEO of HarperDB. My background is in enterprise data management, enterprise architecture, and large scale integrations. I started my career at Red Hat. I've worked at a number of different startups and with a number of different companies.
[00:02:17] Unknown:
And do you remember how you first got involved with working in data?
[00:02:20] Unknown:
Well, I started programming when I was about 13. My uncle ran a startup in the Bay Area in the early nineties, and I started working for him there and then just kind of working for software companies. I think I got started in data and integration through implementing CRM systems, which often have a lot of integrations into other systems like ERP systems, billing systems, subscription management, inventory. And so I really started to run into my first set of challenges around data integration and data management through doing CRM consulting and implementation.
[00:03:04] Unknown:
Now as you mentioned, you helped to found and you're running the HarperDB company, which is also the business that is building and managing the HarperDB software. I'm wondering if you can describe a bit about what that is and some of the story behind how it came about and what it is about this problem space that made you want to dedicate your time and energy to it. As I mentioned, like, I started my career doing, you know, large scale business system implementation and integration.
[00:03:33] Unknown:
I actually met Kyle, my cofounder, who's our CTO, over 10 years ago. He was at a customer that I was doing implementation for, and we were doing some integration of salesforce.com into some different systems. And I had been doing that at a lot of companies like Nissan and in a lot of different spaces. We had a lot of challenges there. About a year later, I started my own consulting company called Cloud Roots. Kyle was my first hire. And at the time, Kyle and I thought that, really, we were interested in building a middleware technology to do sort of dynamic integration of different systems. This was before MuleSoft or Zapier existed, but we were kind of thinking of something like that because we saw that a lot of companies were experiencing the same challenges with integration.
But we were trying to run a consulting company and build a product, and that's really, really hard. And so we kind of got acquired into a different company where we were the engineering team and we rebuilt all of their software and their back end. And they were focused on large scale social media and analytics around sports and entertainment. So kind of like monitoring every single tweet about the World Cup or the Super Bowl or a Beyoncé concert. And so that was, you know, millions of rows of data a second, and we ended up building out this insanely crazy data infrastructure to manage all of that. And it was really hard and complex to manage. We spent most of our time just trying to keep the databases from crashing.
We were spending a ton of money on the cloud, and we weren't really getting to develop anything. And we also felt like databases were not very developer friendly. They were extremely focused on DBAs and infrastructure folks. And, you know, if you're a developer, the first thing you need to do is get your schema together, get your database set up so that you can go build your app. But the databases out there were not easy to work with as a developer. And so we got very frustrated with kind of what was available, and what we wanted to do was build a database that was the easiest database in the world to use, but that could scale to meet massive data challenges.
And we came up with the idea one night while we were on a business trip. We were hanging out in an Airbnb in Palo Alto, and we were just goofing around. And then we figured, hey, someone else will solve this problem. Someone else will build a database that has the scale of NoSQL but with the analytic capability of SQL, and it'll be super developer friendly. And, you know, we're not database guys. We're not smart enough to do this. Someone who went to Stanford and who has a PhD in data management and computer science will do that. And so we just kind of forgot about it, but it stuck in the back of our minds for almost 2 years, and we eventually just decided to do it. And we took a leap of faith 5 years ago today. Today is actually the 5th anniversary of HarperDB, and here we are.
[00:06:28] Unknown:
I think we've lived up to what we were trying to do, but it has not been an easy road to get here. That's pretty funny that it's 5 years to the day that we happen to be recording this interview, so I'm grateful to be able to spend that time with you. And as you mentioned, you don't have the background in sort of database systems and the theoretical underpinnings that go into it. And so I'm curious how you've approached the process of being able to design and build the database engine that has these fairly extensive requirements and capabilities and being able to make that developer friendly and just some of the overall process of understanding how to approach such a large and thorny problem?
[00:07:09] Unknown:
I think that we, you know, went into this problem without really understanding it. I think, ultimately, that is a lot of the reason why we are successful and why we've brought something unique to market: sort of, Kyle and I were too stupid to know what we were doing. And so as a result, like, we kinda felt like, hey, you know, an iPhone is a very complex device, but it internalizes that complexity and it exposes simplicity to the end user. Databases should be able to do the same thing. And so we sort of live with this mantra of keep it simple, stupid.
You know, we wanted the interface to be simple. Like, we have a REST API with one endpoint that just accepts a post body, and you change the JSON of the post body. We tried to make it so that it was as simple as possible. But to do that, to gain performance, to gain scalability, to gain consistency, we had to educate ourselves a lot. And I'll be honest, at a certain point, I realized I was well outside my technical depth. But Kyle really embraced that challenge, and he's basically, over the last 5 years, given himself a PhD in database systems, and he's probably one of the most educated people in the world about them now. But he spent the last 5 years, you know, sort of trying on different things, moving fast, seeing what fits, moving on to the next thing. And we made a ton of mistakes along the way. Our first version of the product was written on the file system, and a lot of people told us that was insane. And we did it anyway, and it didn't work. We've now rewritten the product so it's on top of LMDB, which is the Lightning Memory-Mapped Database, a key value store built by Howard Chu.
And, you know, that was a huge learning. We wrote the first version of the product using Socket.IO as our clustering mechanism, then we had to move to SocketCluster, and now we're moving to something else that's even better, which we'll talk about in the future once it's publicly available. But, like, we've made some mistakes, and those mistakes taught us a lot. And we've sort of focused on, while we're doing all these very complex things, making it simple for the end developer so they can just write their code. They don't care about the database. They're not geeking out about how cool all the internal complexity is. They just wanna throw some Python or Node.js or whatever code around the REST API and focus on the thing they do care about, which is their application.
And we try to make it so that they can do that without really even having to worry that much about the database or how it works. Yeah. It's always pretty remarkable what you can achieve when you don't know you're not supposed to be able to do it. Yeah. I don't know that I would do this again, but our stupidity was probably the biggest key to our success. But now I have a lot of gray hair in both my beard and head from it, but that's okay.
[00:10:05] Unknown:
And over the past 5 to 10 years, there's been a pretty remarkable explosion in the availability of different database engines with different areas of focus and different technological underpinnings, and they all have their own particular niche that they're trying to address. And I'm wondering if you can just give the framing of what HarperDB is designed to do well. You mentioned the developer friendliness, but from a sort of database engine storage management perspective, what do you think are the unique capabilities that HarperDB brings that will edge out a MongoDB or a CouchDB or a Postgres?
[00:10:42] Unknown:
So one of the reasons we started the company is that we were kind of frustrated by the notion that you need 5 to 7 different databases to be the infrastructure for an application. When I started programming, you know, you had Oracle, you had MySQL, you built your app on that, and you figured the rest out. You didn't spend millions of dollars on databases and managing them and integrating them and having your data be out of sync. And so we kind of built HarperDB to be the new workhorse like MySQL was back in the day, in that it's not the best at everything. Like, you know, you take a product like VoltDB, which is an in-memory product, and you definitely can do reads at scale, like, faster than HarperDB. But if you're trying to do reads and writes, Volt's gonna crash at a certain level.
And so we kind of said, hey, let's build a workhorse that's solid as a rock, that's never gonna crash, and then you can do everything you wanna do. It may not be the best, but it'll work for almost every workload. That said, while that was the goal, and it still is true, and for developers and for building applications, that's why I think it's an awesome fit. Where we found that HarperDB really does have the most competitive advantage, though, is, like, at an enterprise level: the best fit is for low latency distributed applications. So if you think about something like a gaming use case where you've got end users all around the world, it's really easy to distribute your APIs, to distribute your application. You know, containers make that super easy.
But all those APIs in your application still ultimately call back to a centralized database. And physics makes that a problem because you can only get from Tokyo to, you know, Ohio at the speed of light, which for a lot of applications doesn't matter, but for other applications, it does. And that adds a lot of latency. And so by distributing HarperDB all over the world, having it be super fast at a node level, it's really great for those distributed use cases where latency does matter, where you wanna, you know, reduce costs. So that could be gaming, streaming media, you know, other use cases like that are a really great fit.
[00:12:50] Unknown:
In terms of the focus on being accessible and pleasant for developers as the target end user, how has that influenced the ways that you've designed the interfaces and feature capabilities of HarperDB?
[00:13:07] Unknown:
We are developers, and we built a database for developers. Like I said, we don't have PhDs in data science or anything like that. And so we really thought about it. We're like, if we wanted to use this, what would it look like? And so at our last company, we had a lot of APIs in our product, hundreds of APIs, and it was super hard to maintain. It was super complex to, like, find what you're looking for. And so that goes back to what I kinda already mentioned. HarperDB, if you're running it locally, is localhost, you know, colon 9925, and it's always that. And you always hit that with your post body. And then if you wanna do, like, a NoSQL search, you just put that as the operation in your JSON body. If you wanna do a SQL search, you put that in the operation.
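As a concrete sketch of that single-endpoint pattern, the two calls below differ only in their JSON body. This is an illustrative Python example, not official client code: the schema and table names are made up, and the exact operation names should be checked against HarperDB's current API documentation.

```python
import json

# HarperDB exposes one HTTP endpoint; the "operation" field in the
# POST body selects what to do. (Local default port per the episode.)
HARPERDB_URL = "http://localhost:9925"

def nosql_search(schema, table, attribute, value):
    # A NoSQL-style lookup expressed as a JSON operation body.
    return {
        "operation": "search_by_value",
        "schema": schema,
        "table": table,
        "search_attribute": attribute,
        "search_value": value,
        "get_attributes": ["*"],
    }

def sql_search(query):
    # The same endpoint runs SQL -- only the "operation" changes.
    return {"operation": "sql", "sql": query}

body_nosql = nosql_search("dev", "dog", "name", "Harper")
body_sql = sql_search("SELECT * FROM dev.dog WHERE name = 'Harper'")

# Either body would be sent the same way, e.g.:
#   requests.post(HARPERDB_URL, json=body_sql, auth=(user, password))
print(json.dumps(body_sql))
```

Client libraries and tools like Postman are essentially conveniences for composing these same post bodies.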
So it makes coding super simple. We've also partnered with companies like Postman where you end up getting all these awesome libraries that you can use. And we've also started a bounty program where we pay developers to build add-ons to HarperDB, like different SDKs, different applications. And so we've just really always focused on our end customer being a developer, whereas a lot of other databases focus on their end customer as a DBA, which is great for that DBA, but it's not ideal when you go to write code on it. And so we've also really focused on this idea of collapsing the stack.
And so as a developer, like, the fewer tools I have to use behind my application, the better and the easier my life is. And so that's why we rolled out things like custom functions where you can write your own application code right in HarperDB, right next to your data, and manage and build an entire application right on top of HarperDB without anything else. There was a product back in the day called Heroku, which I really liked and did a lot of that. And then they were bought by salesforce.com, which kinda made them less ideal for most applications outside their ecosystem.
But a lot of that concept was in their product. And SAP HANA honestly tried to do something similar, but really more focused on their ecosystem. We wanted to build sort of a development platform and database all in one that you could build your entire application on, something more generally available for everyone outside of any ecosystem, and that's kind of what we've done to make developers' lives easier. Yeah. There's definitely been a
[00:15:30] Unknown:
fairly cyclical view on the role of the database in the overall stack of software and sort of application delivery, where for a while it was the 3 tier architecture where you had your load balancer, your web application, and your database. And then, you know, there have been approaches of pushing all the logic into the database, having that be the actual runtime for your application as well. I'm wondering what your thoughts are on the role of the database over the next 2, 5, 10 years, and how your thoughts on that have manifested in the way that you've approached the design and functionality of HarperDB.
[00:16:10] Unknown:
Yeah. I think that goes back to your other question, which was, you know, now that there are all these databases with all these different niches, I think it was kinda laziness on the part of a lot of companies to say, hey, we're gonna make this niche thing, and the developer then has to stitch together 5 to 7 databases and 5 to 7 different middleware tools and 6 to 7 different other technologies. That is unfair. No one wants to do that. And I think that was a trend for a while, but I think that, like, top to bottom, you know, whether you're an entry level developer or a CIO or CTO, I think that's become frustrating. I think that people, you know, are moving more towards managed services and APIs.
You know, they want infrastructure as a service, infrastructure as code. And I think that people want things to work together. I think they want it to be more seamless. And I think the trend you'll start to see is, like, some of these very niche offerings that, yes, maybe they're really good at 1 thing, but that is a whole expense and team and complexity and resources. Do you really need that? Is it justified? I think people are getting smart about that and starting to realize that they would, you know, like to have tools that are interoperable and that, like, can be used ubiquitously.
And I think that you'll shift more towards products being successful where they fit that pattern. And that's why we've added custom functions. It's why we have SQL and NoSQL. It's why we have JDBC drivers while also having WebSockets. It's why we're trying to accommodate everything that the very small startup would want all the way to the extremely large enterprise, so that you can have all of that in one place. And with our custom functions, you really can code anything. You could do machine learning. You can serve a full website. You can integrate into a third party system. You can manage sub processes. You can build your own APIs.
So I know that doesn't work for every use case, but I think that there's a lot that it solves for. And, as a developer, that's what I want. And I think at least some of the market is like me.
[00:18:19] Unknown:
One of the things that I often run into as an engineer when I'm starting to play in some of these different ecosystems is particularly when you have a kind of collision of concerns that isn't as widely adopted in industry. So as an example, when I was going through your documentation, one of the options that you have in your API is being able to actually write a SQL query as part of an API request to the database engine. And so if I'm trying to develop that as an end user in my IDE, I might get some support for being able to highlight the SQL syntax, you know, do some linting to see where it's wrong. But then if I try to embed that into a JSON structure, then I'm trying to sort of collapse too many concerns into one, and the tooling doesn't always support that well. I'm curious what your thinking is in terms of being able to effectively kind of collapse those concerns into a single experience, provide a good experience to the end user being able to have all of the different tooling and ecosystem capabilities that they're used to, and wrap that all into a single product that is sort of easy for people to pick up and use, but doesn't force them to maybe switch the tools that they're used to developing with?
[00:19:36] Unknown:
I think that is a really good point, and I'll give you an example. So when we first started the company, first couple months, like, Kyle and I were designing what searching would look like in NoSQL. And we started working on adding multiple conditions and multiple operators, and we started to end up with this JSON object that was a mile long. And I just looked at Kyle, and our whiteboard was covered. We looked like the guy from A Beautiful Mind and, like, we were just crazy. And I looked at him and said, you know, there's a really good way to do complex searching that's been around for about 40 years. It's called SQL.
And trying to do this is insane. Like, you know, asking a developer to understand the syntax that we're building here is just crazy town. And so we have always kind of adopted the attitude that when things get too complicated, stop. And that might mean that we're gonna make some trade-off from a performance perspective or from a feature perspective, but there are other products. You know, Oracle has 40, 50 years of crazy features in it. And if you wanna go do some sort of really complicated SQL query that has a trigger and, you know, some sort of cascading delete afterwards and with a stored procedure, go for it. But that's, like, not what we're trying to achieve. And so part of it is just staying true to what we built. But sometimes we make mistakes. Right? Like, we've made mistakes like you mentioned in the SQL piece. And so that's why if you go into HarperDB Studio, you can see we do have some of that built in in the UI. And then you can also use database management tools like the MySQL Workbench and things like that on top of HarperDB if you use ODBC or JDBC drivers, because we don't want you to have to learn new stuff. Like, we don't want you to have to learn some HarperDB-specific SQL.
That's why we tailored to the ANSI standards. We don't want you to have to learn, you know, new things. We are trying to make your life as easy as possible while also knowing that when you have a billion rows a second being written, HarperDB will still work. Because a lot of times it's what's easy and developer friendly on one side and what's enterprise grade on the other, and that's not fair. And so we're trying to balance as much of both as possible, but it's hard. And we get in fights about it. Jackson's our head of product, and Jackson, Kyle, and I will get in arguments. And it's normally Jackson and I arguing with Kyle, because Jackson's tagline is, like, easy button and mine is keep it simple, stupid.
And, like, Kyle's very focused on performance and scale, and we fight it out and then hug it out and come up with a good solution is kinda our answer. I definitely appreciate the
[00:22:16] Unknown:
availability of the studio solution as a way to unify that experience and have a kind of first class interface into the engine for people who do want to lean heavily into HarperDB as a platform opportunity.
[00:22:30] Unknown:
And I have to say that that was all Jackson. Kyle and I, like, never thought of that. We didn't think it was important. Kyle and I, our background is in integration and, you know, data management. And so UIs were, like, command line and REST API. And so Jackson has brought that experience to the table, and we're very lucky to have him.
[00:22:52] Unknown:
Today's episode is sponsored by Prophecy.io, the low code data engineering platform for the cloud. Prophecy provides an easy to use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: git, tests, and continuous deployment with a simple to use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control, then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. Digging into HarperDB itself, can you talk through some of the implementation and some of the ways that the design and goals of the engine have changed or evolved from when you first started working on it? The goal of, like, read, write performance,
[00:24:00] Unknown:
simultaneous read, write has always been there and something we really excel at and continue to get better at. But, like, touting our replication capability was not something we ever really thought about. Like, we didn't start out to think, hey, distributed is an area that we're gonna excel at. To be honest, because we wrote the product in Node.js, we just kinda fell into that, because it's a web framework and the web is very good at horizontally scaling, and Node is extremely good at horizontal scale. And things like Slack were built on Socket.IO. And when you think about it, that is an extremely distributed system.
And so we inherited that off of sort of the shoulders of people like Slack, and that then became a major selling point for us. On the flip side, we really thought things like JDBC and ODBC drivers would be super important, and no one cares. Like, they use it to test stuff out and, like, maybe they'll connect it to Tableau, but, like, realistically, people just don't care that much about them. Like, we could get rid of them tomorrow. I think, like, 5 people would complain of the, you know, 25,000-some people using HarperDB. But Kyle and I were so obsessed, thinking that was gonna be so important.
So it's interesting. But then you'll have a customer who's really important, who does care, and so we're glad we have them. But we definitely didn't do enough market research on what people would care about in the beginning.
[00:25:29] Unknown:
As to the data model, I know that it is designed to be very flexible and dynamic in terms of the schema definitions. And I'm wondering if you can talk to some of the ways that you've approached the actual underpinnings to be able to support things like joins given your investment in SQL as an interface to the database? Because I know that for document oriented databases, it can become very difficult to actually do performant joins across different document collections and just some of the edge cases and engineering challenges that you've run into as you've been developing this database?
[00:26:09] Unknown:
Yeah. I mean, that was ultimately the problem we solved from day one, and so we've really stayed true to that. The reason was we were using DynamoDB at our last company, and we had millions and millions of rows in DynamoDB, and it was great for scaling up for writes and doing simple searching. But then, you know, trying to join across things is hard. We looked at Hadoop and Hive and things like that, and super slow. We tested a bunch of other stuff, and so ultimately the solution that we ended up having in our last company was we captured everything in Dynamo, then we moved it over to an in-memory SQL database to do all the analytics. And that was just super annoying to us. And it got out of sync, because we were doing all this stuff real time for live television.
And so if it's out of sync for a minute, you know, we couldn't go live on broadcast, and trying to sync two databases with a billion rows of data within a minute is hard. And so that was sort of the impetus for creating HarperDB. So our storage algorithm is a document store, but it is different than most other document stores. At a high level, the way it works is it is not an unstructured database, because as you insert data, we dynamically look at that data and we index all top-level attributes on write. So then as a result, combining a column from table A versus table B, querying those two columns is about the same performance as querying two columns from just table A, because of the way we store data. And the goal of that was so that you don't have to know what your schema is gonna look like ahead of time. Whatever analytics you wanna do on it, HarperDB has indexed the data in the smartest way possible so that that's possible. And so everything is indexed, which creates some challenges around storage. Right? Like, your storage is a little higher than it would be with some other stuff. I think it's about 20%.
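As a rough illustration of the "index every top-level attribute on write" idea he describes (this is a toy sketch, not HarperDB's actual storage engine), you can picture something like this:

```javascript
// Toy auto-indexing document store: every top-level attribute of every
// inserted document gets an equality index at write time, so any column
// is queryable without pre-declaring a schema.
class AutoIndexedStore {
  constructor() {
    this.rows = new Map();    // id -> document
    this.indexes = new Map(); // attribute -> (serialized value -> Set of ids)
  }

  insert(id, doc) {
    this.rows.set(id, doc);
    for (const [attr, value] of Object.entries(doc)) {
      if (!this.indexes.has(attr)) this.indexes.set(attr, new Map());
      const index = this.indexes.get(attr);
      const key = JSON.stringify(value);
      if (!index.has(key)) index.set(key, new Set());
      index.get(key).add(id);
    }
  }

  // Equality lookup on ANY attribute is an index hit, never a full scan.
  find(attr, value) {
    const index = this.indexes.get(attr);
    const ids = index ? index.get(JSON.stringify(value)) : undefined;
    return ids ? [...ids].map((id) => this.rows.get(id)) : [];
  }
}

const store = new AutoIndexedStore();
store.insert(1, { name: "dog", breed: "husky" });
store.insert(2, { name: "cat", color: "black" }); // new attribute, indexed on the fly
console.log(store.find("breed", "husky").length); // 1
```

The trade-off he mentions falls directly out of this design: every attribute carries its own index structure, which is why the storage footprint runs higher than in engines that index only declared columns.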
You know, my cousin is a developer at a very, very large social media company, and he's a lead Android developer there, and he once asked one of the teams. He said, hey, I need to be able to query on date of birth, because we need this content to be 18-plus. And they said, oh, we didn't index that column, figure out a different way, because it'll take us a month to do that. And as a developer, you don't want that answer. You want, just give me the column I wanna query on. And so that is kind of why we designed the storage engine the way we did. And it has its trade-offs, but it makes developers' lives a lot easier. You know, you can never predict what you're gonna need till it's too late. You have several billion rows of data in there, and you wanna be able to query it. One of the complexities
[00:28:40] Unknown:
that can often occur when you do have this dynamic schema is that one time you have an application that's writing in this structure and one of the fields is an integer, and then somebody makes a code change, and now it's writing out a string value. And I'm curious what types of enforcement you have for being able to say, okay, nope, this was an integer. You can't write a string there anymore. You're gonna have to actually pay some attention to the data modeling. Or is it just write whatever you want, and we'll figure it out later? Because there are definitely pros and cons to both approaches.
[00:29:12] Unknown:
We have some intelligence. We do not force anything right now. And so you can do whatever you want. You can put a string and an integer in the same column. We do have some intelligence around the indexing, and we'll look at what the majority of that data is and sort of store it based on what we think that data is. But ultimately, you can still do whatever you want. And so we felt like that was a decision that allowed you to still have the flexibility of NoSQL but with better performance than you get from NoSQL. We do have things built into, like, the ODBC and JDBC drivers where it'll look at the columns and it'll take its best guess, so that when you pull it into Tableau, it's like, this column's an integer, this column is a date, this column is lat/long.
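The majority-vote type guess he describes can be sketched in a few lines (a hypothetical illustration of the idea, not HarperDB's actual inference code):

```javascript
// Guess a column's type from whichever type tag the majority of its
// values carry, tolerating a few stray values of the wrong type.
function inferColumnType(values) {
  const counts = {};
  for (const v of values) {
    const t =
      typeof v === "number"
        ? Number.isInteger(v) ? "integer" : "float"
        : typeof v;
    counts[t] = (counts[t] || 0) + 1;
  }
  // Return the most frequent type tag.
  return Object.entries(counts).sort((a, b) => b[1] - a[1])[0][0];
}

console.log(inferColumnType([1, 2, 3, "oops", 4])); // "integer"
```

A driver doing this can present a mostly-integer column to Tableau as an integer even though one row slipped through as a string, which is exactly the loose-but-helpful behavior described above.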
But we don't enforce anything right now. We do plan to, in the future when we have some more resources and time, allow that as an optional feature where you can turn on schema enforcement. But the nice thing about HarperDB is you can describe the schema. And so unlike MongoDB where, you know, you'll have a million rows that have this column, but in the same collection, you might have a million rows that don't, once that column's created, it's created. And it's there for all the objects, and it'll be null if you don't put anything in. And whether or not you have it, you know it's there. You can describe the schema, see the schema, see what it looks like. So you have some better management around that. But, yeah, we're kind of trying to balance that, and it's a hard thing to balance, because as soon as you turn on strict schema enforcement, that's gonna create a whole other set of problems.
And so, you know, we also, like, strongly believe in keeping HarperDB as stateless as possible. And so we're also very wary of ever putting background processes in, because background processes are often what cause databases to crash. And so, like, there's a lot of things that go into that thought process. And it's not so much that we don't wanna do that as that there are so many trade-offs in a database when you make one decision that you really have to carefully think about everything. And then also, databases are a hard thing to update and maintain. And so we can't just roll out features willy-nilly, because that can be extremely disruptive. So things have to be really thought out before we do them.
This exact problem you're talking about is one of the things we've been talking about for years and carefully planning how we'll roll that out.
[00:31:25] Unknown:
Another interesting architectural aspect of HarperDB is, as you mentioned, it is a distributed database engine. And I know that the replication method is using a pub/sub model where you can subscribe to different table updates and decide when to replicate that. And I'm curious what types of deployment topologies and unique use cases that enables, and just some of the ways that that has manifested in the overall sort of product design of HarperDB?
[00:31:58] Unknown:
Yeah. That gives a lot of optionality, and with a lot of optionality comes a lot of trade-offs. And so there are not infinite, but there are a lot of different topologies in which you can deploy. So you can do a hub and spoke. You can do a circle. You can do, you know, like, many, many different ways. You can have, like, a multi-tiered hub and spoke. And so it very much depends on the use case. In IoT, we've seen a lot of the hub and spoke be sort of like a successful model. If you think about it, we've got things writing very high volumes of data on the edge, but maybe as you get closer to the core, that data doesn't matter that much. And so you wanna kind of buffer, decide, and sort of have a multi-tiered architecture in how your data moves, you know, from the edge into the core.
Then you've got more like gaming media where all of the data matters all the time. And so, like, some smarter version of a circle makes more sense, or a fully distributed, like, peer-to-peer. It is extremely use case dependent, but you also have to think about what that means. Right? Because, like, it is an exponential problem, because as you add nodes and those nodes talk to each other, if you have a hundred nodes, that's a hundred connections you could potentially have. And then what does that do to your network, and how much information is moving back and forth? And so that is why we are actually rolling out, and I can't unfortunately talk about it too much, but we're rolling out a new clustering topology which solves for a lot of that. It makes things also more consistent across that, because what we've realized is that while that optionality is great, and we're gonna keep that, like, we need a more standard methodology to keep people safe, honestly, because sometimes choices can be a problem. And so it is a really interesting problem. And I'll be honest, Kyle is much smarter about that than I am and can talk your head off about it for four and a half hours.
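The connection-count blowup he's gesturing at is easy to make concrete. As a back-of-envelope sketch: a fully connected mesh of n nodes needs n(n-1)/2 links, while a hub-and-spoke needs only n-1:

```javascript
// Number of pairwise connections by topology, for n nodes.
const fullMesh = (n) => (n * (n - 1)) / 2; // every node pairs with every other
const hubAndSpoke = (n) => n - 1;          // one hub, n-1 spokes

console.log(fullMesh(100));    // 4950 connections
console.log(hubAndSpoke(100)); // 99 connections
```

That gap between roughly quadratic and linear growth is why topology choice, network capacity, and per-node compute all interact the way he describes.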
And it's a very geeky, cool problem, but it gets pretty complicated between network and also, like, available compute on a per-node level and, you know, open connections and things like that. It is one of the two most complex engineering problems in HarperDB, besides the storage algorithm. Yeah. And particularly when you start dealing
[00:34:19] Unknown:
with replicated data structures and figuring out what are the transaction boundaries. Do I want to actually get into the space of doing distributed transactions? Where I know that with HarperDB, you have opted for last write wins. And so whichever record ends up being replicated to a given node, whichever one was the most recent one is going to be the winner in that sort of write competition.
[00:34:44] Unknown:
Yeah. And we did that on purpose, because for the use cases that we're working with, that is probably the best way to do it. But that doesn't work for other use cases. And Cockroach, it's weird because we're sort of competitive with Cockroach, but we'll often recommend, if this doesn't work for you, you should look at CockroachDB, because Cockroach does a really good job of solving the other side of the problem for us, and they're a much better fit for sort of like a fintech use case where that really matters. And, obviously, we're gonna be much faster than Cockroach in a distributed fashion, because we did the other side of the problem, and so we focused on speed and lower latency for the end user, but the guarantee of consistency is gonna be lower, whereas Cockroach is focused on that guarantee.
And so, like, you have to decide what you want. And if you're looking for that guarantee, Cockroach is a better choice than HarperDB for that.
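The last-write-wins resolution being discussed reduces to a very small rule. A minimal sketch, assuming each replicated record carries an update timestamp (field name `updatedTime` is illustrative, not HarperDB's actual internal field):

```javascript
// Last-write-wins conflict resolution: when replicas disagree on a
// record, the version with the newest timestamp survives.
function lastWriteWins(versions) {
  return versions.reduce((winner, v) =>
    v.updatedTime > winner.updatedTime ? v : winner
  );
}

const merged = lastWriteWins([
  { id: 1, score: 10, updatedTime: 1000 }, // replicated from node A
  { id: 1, score: 42, updatedTime: 1005 }, // written later on node B
]);
console.log(merged.score); // 42
```

The trade-off is exactly as stated: the merge is fast and needs no coordination between nodes, but the losing write from node A is silently discarded, which is why a strongly consistent engine is the better fit when every write must be durable.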
[00:35:38] Unknown:
In the context of the deployment topologies, you referenced IoT, and I'm curious what are the available deployment targets for HarperDB, and how has that influenced the way that you think about the actual packaging and deployment of the database engine to be able to fit on some of these smaller or lower-powered devices?
[00:36:01] Unknown:
So we spend a lot of time working in IoT. HarperDB can be deployed on anything with a Unix-based operating system. So your Mac, a Raspberry Pi Zero, you know, a huge bare-metal machine, and kind of anything in between. And so we spent a lot of time in the beginning focused on, it has to be able to do everything on a Raspberry Pi that it can do on a Cray supercomputer. To be honest, we spent too much time worrying about that, because of the way that IoT has moved. We're partnered with Verizon now, and we're deployed on all of their MEC locations throughout the United States. And so, like, we're doing projects with them where you have smart devices over 5G talking to HarperDB on, you know, Wavelength locations spread across the United States.
And that latency from that device to HarperDB on that MEC is typically 1 to 2 milliseconds, maybe 5 or 10 in the worst case. And we've realized that for IoT, that's a way better pattern: just, you know, having more edge data centers closer to the end user makes a lot more sense. Definitely putting some application on the device, but the more you put on the device, you create a lot of risk, and also then you have, like, problems with upgrading devices. You know, we're looking at one customer where if they would upgrade their devices, it would cost them a hundred million dollars. And so moving that compute as close to that device as possible, and putting as much logic in it, HarperDB really solves that problem without creating these risks around putting it directly on the device. That said, if you wanna run it, there are use cases, more specifically in areas where you have poor connectivity, and that will remain true for a long time. Those could be utilities, industrial use cases, military use cases where you can install HarperDB on a mobile command center running, you know, on the battlefield. And so we still solve for that. We still have a lot of capability, because HarperDB works totally offline, doesn't require the Internet, and can run on those devices. And there can be value in that, but that is not sort of our primary focus, but still something we're able to accommodate.
[00:38:13] Unknown:
There are definitely a lot of interesting sort of subtopics we could dig into, such as the kind of versioning and updates of the function definitions, some of the ways that you think about the kind of design and capabilities of those embedded functions, the sort of upgrade process of managing HarperDB deployments. But another interesting element is the fact that you do have this open source database engine and this commercial entity behind it. I'm curious what the sort of governance and sustainability approaches are for being able to manage the boundaries between that open source and commercial capabilities.
[00:38:51] Unknown:
So HarperDB itself currently is not open source. So the core of HarperDB is a freemium model. You know, I'm a former Red Hatter. I'm a huge fan of open source, and we've tried to open source as much of the technology as we possibly can and, you know, everything surrounding it. One of the things when we launched the company we looked at, and actually we looked at Cockroach quite a bit, was that launching an open source database was hard. No one wanted to pay for a database, and I would love HarperDB to be open source. I think, honestly, it would help us quite a bit. We'd have a lot more people contributing to it, and it'd be in the hands of more people. You know, it's written in Node.js, and it's deployed via npm, which are two of the most popular things in the world. So I think that would go over well.
But we also need to be a sustainable company and grow and get paid. And so that has been just a real balancing act for us. We constantly think about open sourcing HarperDB completely, and hopefully we get to a point where we can do that, but we're not quite there yet. And I don't know that we ever will be, but we would love to, I guess, is the honest answer.
[00:39:58] Unknown:
In your experience of building the HarperDB technology and the business around it and working with your customers, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:40:10] Unknown:
So many. So one time, and I mean, this is very old, but I remember a guy in a rural area in India built an application on HarperDB where he could do legal document sharing. And so basically he would pass the database around on these devices so that they, without Internet, have this ability to sort of track for consistency and manage that and do version control, but completely offline. And I thought that was really interesting. I think gaming companies have pushed us into some really interesting areas around doing things like fighting bots at scale on HarperDB, using peer-to-peer capability. We've seen projects with the US military where they've done, like, facial recognition stuff inside of HarperDB, which I thought was fascinating.
People are constantly just building really cool stuff on it. That's one of the fun things about being a database: the community builds a crazy variety of different things. It's honestly, for Kyle and me, probably the most fun thing. Every time we see somebody build something new and wild on HarperDB, it's like, hey, we built that thing under that, and now they're building this. And it's, like, the most affirming and most fun part of running HarperDB.
[00:41:27] Unknown:
In terms of the features and capabilities of HarperDB, it's a fairly extensive project. I'm curious, what are some of the capabilities that you think are either overlooked or misunderstood or underutilized that you wanna highlight?
[00:41:43] Unknown:
I think this sounds stupid, but it is true. I think, really, for me, it's two things. One, when we talk to a lot of companies, they believe it's gonna be this tremendous migration process to a new DB, because they're used to sort of JDBC drivers, you know, ORMs, and all of that stuff. And Harper, because it's a single endpoint, you can literally copy the code examples in any language. Any developer that's ever built an app on HarperDB will tell you it's easier than building on anything else in the world, and it's because of that single endpoint. I don't think people really understand what that means until they use it. And so that's something I think that people need to get their heads around, but when they do, it is a huge value add.
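To make the single-endpoint idea concrete: every operation, NoSQL or SQL, is one JSON POST to the same URL. This sketch follows the general shape of HarperDB's operations API, but the URL, schema, and table names here are made-up examples, not a verified configuration:

```javascript
// One endpoint for everything: the operation is named in the JSON body,
// not in the URL path.
const HDB_URL = "http://localhost:9925"; // hypothetical local instance

const insertOp = {
  operation: "insert",
  schema: "dev",
  table: "dog",
  records: [{ id: 1, name: "Harper", breed: "mutt" }],
};

const sqlOp = {
  operation: "sql",
  sql: "SELECT * FROM dev.dog WHERE breed = 'mutt'",
};

// Sending either one is the same call shape, e.g.:
// await fetch(HDB_URL, { method: "POST", body: JSON.stringify(insertOp), ... });
console.log(insertOp.operation, sqlOp.operation); // insert sql
```

Because the transport is just HTTP plus JSON, "copy the code examples in any language" amounts to swapping in that language's HTTP client; there is no driver to install.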
And then I think how extensible custom functions make HarperDB. It's a Fastify server essentially living inside of HarperDB. And what that means you can then do, I think people, they're starting to see it now and are blown away by it, but I think that that's gonna take some time to catch on. I'm excited to watch that catch on. And I saw a tweet from a guy yesterday, and he was like, I built my whole app on HarperDB. I did not believe that I could do that, but then I did. And that was very cool to see. So we're excited about that. In your experience of running the business and building the technology and working with the community around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? I think one thing we learned was that the developer community, we love them and they drive a tremendous amount of innovation and value for us, but they are not our, they don't pay us money.
And so trying to target both the developer community as well as large enterprises has been, you know, a matter of tailoring our message and our documentation and our feature set to accommodate both those folks at the same time. That was not something we expected, and we had to learn a lot about it. I think that's probably been number one for me.
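The custom-functions idea mentioned above amounts to registering HTTP routes on a server that lives next to the data. This is a hedged sketch of that pattern: the `server.route` call mimics Fastify's route options, and `hdbCore` plus the stub objects are stand-ins invented for illustration, not the real HarperDB API:

```javascript
// A hypothetical custom function: a route that answers from local data.
const routes = async (server, { hdbCore }) => {
  server.route({
    method: "GET",
    url: "/dogs/:breed",
    handler: async (request) =>
      hdbCore.query("SELECT * FROM dev.dog WHERE breed = ?", [
        request.params.breed, // parameterized to avoid injection
      ]),
  });
};

// Stub the server and data core so the sketch runs stand-alone:
const registered = [];
const stubServer = { route: (r) => registered.push(r) };
const stubCore = { query: async (sql, params) => [{ sql, params }] };
routes(stubServer, { hdbCore: stubCore });
console.log(registered[0].url); // "/dogs/:breed"
```

The appeal is that application logic runs in-process with the database, so the "whole app on HarperDB" tweet he mentions becomes plausible: routes, queries, and data all live behind one deployment.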
[00:43:42] Unknown:
We've touched on this a little bit with some specific examples, but broadly speaking, for people who are interested in being able to build on a flexible and scalable database, what are the cases where HarperDB is the wrong choice and they might be better suited with a different engine? I think fintech is definitely one.
[00:44:01] Unknown:
I think if you're primarily building something like a data warehouse, and you have low volumes of writes and you're just doing massive reads on huge volumes of data, you're really not looking for, like, an HTAP database where it's both an operational database as well as an analytical database. HarperDB is probably not a great fit for that. There are better things to solve that problem. Yeah. Those are kind of the two major ones. I think, like, if you're looking at Splunk and comparing HarperDB to that, Splunk is definitely gonna have, like, more features for analytics than we will. If you're really focused on solving one problem, and the benefit of being able to do everything in one place isn't there for you, you don't care about that, then it might not be the best fit. But if you wanna solve one problem and be able to do most other things, then HarperDB is a really solid fit.
[00:44:54] Unknown:
In terms of the near to medium term future of the project and the business, what are some of the things that you have planned or any projects that you're particularly excited to dig into?
[00:45:05] Unknown:
As I mentioned a few times, we're rolling out the new version of the replication engine, which I'm super excited about. That's gonna be in our upcoming release. We've spent a lot of time doing hardening. I think we're gonna continue to enhance custom functions. I want to start rolling out, like, a library of prebuilt stuff for the community on custom functions that are sort of ready to go, so that's something I'm excited about. We've had some great sort of live streams and web streams recently that I think were pretty exciting. We have several more coming there, and we're announcing some really big partnerships that I think are gonna make HarperDB even easier to use and give developers more choice about where they deploy their applications.
So we're very excited about that as well. And those are from all the way from the 5G edge to sort of hyperscaler cloud and everywhere in between, and so I'm pretty excited about those as well.
[00:46:00] Unknown:
Are there any other aspects of the work that you're doing at HarperDB or the database market or the use cases of building your application logic in the database engine that we didn't discuss yet that you'd like to cover before we close out the show? I did think of something just now, sort of relevant to that. It's kind of an overlooked feature, but that also
[00:46:21] Unknown:
ties into this question. I think the other thing that people don't realize, and that I even forget, is HarperDB is decoupled from its storage, and it treats containers as a first-class citizen. That is very unique to HarperDB in the market. And so from a deployment perspective, when you wanna deploy your app, you can deploy HarperDB on Kubernetes and attach it to storage, and it starts in a few milliseconds; detach it and deploy it somewhere else and reattach it to storage, and it's up and running in a few milliseconds. So as the world moves to infrastructure as code and containers, you know, become the norm, I think HarperDB is probably the most container-friendly database in the world. And I think that that gets often overlooked, and I think will become, like, a very interesting part of our story in the near future.
[00:47:10] Unknown:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:26] Unknown:
That's a great question. Storage is the answer, and it's an easy answer. You know, compute is very flexible now, but storage is not. And storage needs to be disrupted. And, honestly, if I wasn't doing this, I think I'd do that, but I didn't know that before this. So I'm just learning that, like, it's very frustrating, the available storage options in the market. And unless you're inside of AWS or GCP or, like, you know, Linode, it's very hard to have flexible storage options outside of the hyperscaler clouds.
I think that makes data management really hard. And right now, it's only affecting, like, the very large players in the space who have huge volumes of data. But as everyone starts to have huge volumes of data, I think that's gonna be a problem that needs to be solved. Highly recommend someone start a startup and disrupt that space.
[00:48:17] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at HarperDB and your overall perspective on the database market. It's definitely a very interesting project and a great product. Definitely excited to see the capabilities that you're offering to the community. So I appreciate all the time and energy that you and your team have put into that, and I hope you enjoy the rest of your day. Thank you so much, and I really appreciate it. And thank you for having me here. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Stephen Goldberg: Introduction and Background
Founding and Vision of Harper DB
Challenges and Learnings in Building a Database
Unique Capabilities of Harper DB
Developer-Friendly Design and Features
Database Trends and Future Outlook
Implementation and Evolution of Harper DB
Replication and Deployment Topologies
Open Source and Commercial Strategy
Use Cases and Fit for Harper DB
Future Plans and Exciting Projects