Summary
Databases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer-friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world's first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
- Your host is Tobias Macey and today I’m interviewing Stephen Goldberg about HarperDB, a developer-friendly distributed database engine designed to scale across edge and cloud environments
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what HarperDB is and the story behind it?
- There has been an explosion of database engines over the past 5 – 10 years, with each entrant offering specific capabilities. What are the use cases that HarperDB is focused on addressing?
- What are the issues that you experienced with existing database engines that led to the creation of HarperDB?
- In what ways does HarperDB address those issues?
- What are some of the ways that the focus on developers has influenced the interfaces and features of HarperDB?
- What is your view on the role of the database in the near to medium future?
- Can you describe how HarperDB is implemented?
- How have the design and goals changed from when you first started working on it?
- One of the common difficulties in document oriented databases is being able to conduct performant joins. What are the considerations that users need to be aware of as they are designing their data models?
- What are some examples of deployment topologies that HarperDB can support given the pub/sub replication model?
- What are some of the data modeling/database design strategies that users of HarperDB should know in order to take full advantage of its capabilities?
- With the dynamic schema capabilities allowing developers to add attributes and mutate the table structure at any point, what are the options for schema enforcement? (e.g. add an integer attribute and another record tries to write a string to that attribute location)
- What are the most interesting, innovative, or unexpected ways that you have seen HarperDB used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on HarperDB?
- When is HarperDB the wrong choice?
- What do you have planned for the future of HarperDB?
Contact Info
- @sgoldberg on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- HarperDB
- @harperdbio on Twitter
- Mulesoft
- Zapier
- LMDB
- SocketIO
- SocketCluster
- MongoDB
- CouchDB
- PostgreSQL
- VoltDB
- Heroku
- SAP/Hana
- NodeJS
- DynamoDB
- CockroachDB
- Fastify
- HTAP == Hybrid Transactional Analytical Processing
- Splunk
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Your host is Tobias Macey. And today, I'm interviewing Stephen Goldberg about HarperDB, a distributed database engine designed to scale across edge and cloud environments. So, Stephen, can you start by introducing yourself?
[00:01:53] Unknown:
For sure. First of all, Tobias, thanks for having me here. I'm really excited to be here. I'm Stephen Goldberg. I'm one of the co-founders and I'm the CEO of HarperDB. My background is in enterprise data management, enterprise architecture, and large scale integrations. I started my career at Red Hat. I've worked at a number of different startups and with a number of different companies.
[00:02:17] Unknown:
And do you remember how you first got involved with working in data?
[00:02:20] Unknown:
Well, I started programming when I was about 13. My uncle ran a startup in the Bay Area in the early nineties, and I started working for him there and then just kind of working for software companies. I think I got started in data and integration through implementing CRM systems, which often have a lot of integrations into other systems like ERP systems, billing systems, subscription management, inventory. And so I really started to run into my first set of challenges around data integration and data management through doing CRM consulting and implementation.
[00:03:04] Unknown:
Now as you mentioned, you helped to found and you're running the HarperDB company, which is also the business that is building and managing the HarperDB software. I'm wondering if you can describe a bit about what that is and some of the story behind how it came about and what it is about this problem space that made you want to dedicate your time and energy to it. As I mentioned, like, I started my career doing, you know, large scale business system implementation and integration.
[00:03:33] Unknown:
I actually met Kyle, my cofounder, who's our CTO, over 10 years ago. He was at a customer that I was doing implementation for, and we were doing some integration of salesforce.com into some different systems. And I had been doing that at a lot of companies like Nissan and in a lot of different spaces. We had a lot of challenges there. About a year later, I started my own consulting company called Cloud Roots. Kyle was my first hire. And at the time, Kyle and I thought that, really, we were interested in building a middleware technology to do sort of dynamic integration of different systems. This was before MuleSoft or Zapier existed, but we were kind of thinking of something like that because we saw that a lot of companies were experiencing the same challenges with integration.
But we were trying to run a consulting company and build a product, and that's really, really hard. And so we kind of got acquired into a different company where we were the engineering team and we rebuilt all of their software and their back end. And they were focused on large scale social media and analytics around sports and entertainment. So kind of like monitoring every single tweet about the World Cup or the Super Bowl or a Beyoncé concert. And so that was, you know, millions of rows of data a second, and we ended up building out this insanely crazy data infrastructure to manage all of that. And it was really hard and complex to manage. We spent most of our time just trying to keep the databases from crashing.
We were spending a ton of money on the cloud, and we weren't really getting to develop anything. And we also felt like databases were not very developer friendly. They were extremely focused on DBAs and infrastructure folks. And, you know, if you're a developer, the first thing you need to do is get your schema together, get your database set up so that you can go build your app. But the databases out there were not easy to work with as a developer. And so we got very frustrated with kind of what was available, and what we wanted to do was build a database that was the easiest database in the world to use, but that could scale to meet massive data challenges.
And we came up with the idea one night while we were on a business trip. We were hanging out in an Airbnb in Palo Alto, and we were just goofing around. And then we figured, hey, someone else will solve this problem. Someone else will build a database that has the scale of NoSQL but with the analytic capability of SQL, and it'll be super developer friendly. And, you know, we're not database guys. We're not smart enough to do this. Someone who went to Stanford and who has a PhD in data management and computer science will do that. And so we just kind of forgot about it, but it stuck in the back of our minds for almost 2 years, and we eventually just decided to do it. And we took a leap of faith 5 years ago today. Today is actually the 5th anniversary of HarperDB, and here we are.
[00:06:28] Unknown:
I think we've lived up to what we were trying to do, but it has not been an easy road to get here. That's pretty funny that it's 5 years to the day that we happen to be recording this interview, so I'm grateful to be able to spend that time with you. And as you mentioned, you don't have the background in sort of database systems and the theoretical underpinnings that go into it. And so I'm curious how you've approached the process of being able to design and build the database engine that has these fairly extensive requirements and capabilities and being able to make that developer friendly and just some of the overall process of understanding how to approach such a large and thorny problem?
[00:07:09] Unknown:
I think that we, you know, went into this problem without really understanding it. I think, ultimately, that is a lot of the reason why we are successful and why we've brought something unique to market: sort of, Kyle and I were too stupid to know what we were doing. And so as a result, like, we kinda felt like, hey, you know, an iPhone is a very complex device, but it internalizes that complexity and it exposes simplicity to the end user. Databases should be able to do the same thing. And so we sort of live with this mantra of keep it simple, stupid.
You know, we wanted the interface to be simple. Like, we have a REST API with one endpoint that just accepts a post body, and you change the JSON of the post body. We tried to make it so that it was as simple as possible. But to do that, to gain performance, to gain scalability, to gain consistency, we had to educate ourselves a lot. And I'll be honest, at a certain point, I realized I was well outside my technical depth. But Kyle really embraced that challenge, and he's basically, over the last 5 years, given himself a PhD in database systems, and he's probably one of the most educated people in the world about them now. But he spent the last 5 years, you know, sort of trying on different things, moving fast, seeing what fits, moving on to the next thing. And we made a ton of mistakes along the way. Our first version of the product was written on the file system, and a lot of people told us that was insane. And we did it anyway, and it didn't work. We've now rewritten the product so it's on top of LMDB, which is the Lightning Memory-Mapped Database, a key value store built by Howard Chu.
And, you know, that was a huge learning. We wrote the first version of the product using Socket.IO as our clustering mechanism, then we had to move to SocketCluster, and now we're moving to something else that's even better, which we'll talk about in the future once it's publicly available. But, like, we've made some mistakes, and those mistakes taught us a lot. And we've sort of focused on, while we're doing all these very complex things, making it simple for the end developer so they can just write their code. They don't care about the database. They're not geeking out about how cool all the internal complexity is. They just wanna throw some Python or Node.js or whatever code around the REST API and focus on the thing they do care about, which is their application.
And we try to make it so that they can do that without really even having to worry that much about the database or how it works. Yeah. It's always pretty remarkable what you can achieve when you don't know you're not supposed to be able to do it. Yeah. I don't know that I would do this again, but our stupidity was probably the biggest key to our success. But now I have a lot of gray hair in both my beard and head from it, but that's okay.
[00:10:05] Unknown:
And over the past 5 to 10 years, there's been a pretty remarkable explosion in the availability of different database engines with different areas of focus and different technological underpinnings, and they all have their own particular niche that they're trying to address. And I'm wondering if you can just give the framing of what HarperDB is designed to do well. You mentioned the developer friendliness, but from a sort of database engine storage management perspective, what do you think are the unique capabilities that HarperDB brings that will edge out a MongoDB or a CouchDB or a Postgres?
[00:10:42] Unknown:
So one of the reasons we started the company is that we were kind of frustrated by the notion that you need 5 to 7 different databases to be the infrastructure for an application. When I started programming, you know, you had Oracle, you had MySQL, you built your app on that, and you figured the rest out. You didn't spend millions of dollars on databases and managing them and integrating them and having your data be out of sync. And so we kind of built HarperDB to be the new workhorse like MySQL was back in the day, in that it's not the best at everything. Like, you know, you take a product like VoltDB, which is an in-memory product, and you definitely can do reads at scale, like, faster than HarperDB. But if you're trying to do reads and writes, Volt's gonna crash at a certain level.
And so we kind of said, hey, let's build a workhorse that's solid as a rock, that's never gonna crash, and then you can do everything you wanna do. It may not be the best, but it'll work for almost every workload. That said, while that was the goal, and it still is true, and for developers and for building applications, that's why I think it's an awesome fit. Where we found that HarperDB really does have the most competitive advantage, though, is, like, at an enterprise level: the best fit is for low latency distributed applications. So if you think about something like a gaming use case where you've got end users all around the world, it's really easy to distribute your APIs, to distribute your application. You know, containers make that super easy.
But all those APIs in your application still ultimately call back to a centralized database. And physics makes that a problem because you can only get from Tokyo to, you know, Ohio at the speed of light, which for a lot of applications doesn't matter, but for other applications, it does. And that adds a lot of latency. And so by distributing HarperDB all over the world, having it be super fast at a node level, it's really great for those distributed use cases where latency does matter, where you wanna, you know, reduce costs. So that could be gaming, streaming media, you know, other use cases like that are a really great fit.
[00:12:50] Unknown:
In terms of the focus on being accessible and pleasant for developers as the target end user, how has that influenced the ways that you've designed the interfaces and feature capabilities of HarperDB?
[00:13:07] Unknown:
We are developers, and we built a database for developers. Like I said, we don't have PhDs in data science or anything like that. And so we really thought about it. We're like, if we wanted to use this, what would it look like? And so at our last company, we had a lot of APIs in our product, hundreds of APIs, and it was super hard to maintain. It was super complex to, like, find what you're looking for. And so that goes back to what I kinda already mentioned. HarperDB, if you're running it locally, is localhost, you know, colon 9925, and it's always that. And you always hit that with your post body. And then if you wanna do, like, a NoSQL search, you just put that as the operation in your JSON body. If you wanna do a SQL search, you put that in the operation.
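As a concrete sketch of that single-endpoint pattern, the two calls below differ only in their JSON body. This is an illustrative Python example, not official client code: the schema and table names are made up, and the exact operation names should be checked against HarperDB's current API documentation.

```python
import json

# HarperDB exposes one HTTP endpoint; the "operation" field in the
# POST body selects what to do. (Local default port per the episode.)
HARPERDB_URL = "http://localhost:9925"

def nosql_search(schema, table, attribute, value):
    # A NoSQL-style lookup expressed as a JSON operation body.
    return {
        "operation": "search_by_value",
        "schema": schema,
        "table": table,
        "search_attribute": attribute,
        "search_value": value,
        "get_attributes": ["*"],
    }

def sql_search(query):
    # The same endpoint runs SQL -- only the "operation" changes.
    return {"operation": "sql", "sql": query}

body_nosql = nosql_search("dev", "dog", "name", "Harper")
body_sql = sql_search("SELECT * FROM dev.dog WHERE name = 'Harper'")

# Either body would be sent the same way, e.g.:
#   requests.post(HARPERDB_URL, json=body_sql, auth=(user, password))
print(json.dumps(body_sql))
```

Client libraries and tools like Postman are essentially conveniences for composing these same post bodies.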
So it makes coding super simple. We've also partnered with companies like Postman where you end up getting all these awesome libraries that you can use. And we've also started a bounty program where we pay developers to build add-ons to HarperDB, like different SDKs, different applications. And so we've just really always focused on our end customer being a developer, whereas a lot of other databases focus on their end customer as a DBA, which is great for that DBA, but it's not ideal when you go to write code on it. And so we've also really focused on this idea of collapsing the stack.
And so as a developer, like, the fewer tools I have to use behind my application, the better and the easier my life is. And so that's why we rolled out things like custom functions where you can write your own application code right in HarperDB, right next to your data, and manage and build an entire application right on top of HarperDB without anything else. There was a product back in the day called Heroku, which I really liked and did a lot of that. And then they were bought by salesforce.com, which kinda made them less ideal for most applications outside their ecosystem.
But a lot of that concept was in their product. And SAP HANA honestly tried to do something similar, but really more focused on their ecosystem. We wanted to build sort of a development platform and database all in one that you could build your entire application on, something more generally available for everyone outside of any ecosystem, and that's kind of what we've done to make developers' lives easier. Yeah. There's definitely been a
[00:15:30] Unknown:
fairly cyclical view on the role of the database in the overall stack of software and sort of application delivery, where for a while it was the 3 tier architecture where you had your load balancer, your web application, and your database. And then, you know, there have been approaches of pushing all the logic into the database, having that be the actual runtime for your application as well. I'm wondering what your thoughts are on the role of the database over the next 2, 5, 10 years, and how your thoughts on that have manifested in the way that you've approached the design and functionality of HarperDB.
[00:16:10] Unknown:
Yeah. I think that goes back to your other question, which was, you know, now that there are all these databases with all these different niches, I think it was kinda laziness on the part of a lot of companies to say, hey, we're gonna make this niche thing, and the developer then has to stitch together 5 to 7 databases and 5 to 7 different middleware tools and 6 to 7 different other technologies. That is unfair. No one wants to do that. And I think that was a trend for a while, but I think that, like, top to bottom, you know, whether you're an entry level developer or a CIO or CTO, I think that's become frustrating. I think that people, you know, are moving more towards managed services and APIs.
You know, they want infrastructure as a service, infrastructure as code. And I think that people want things to work together. I think they want it to be more seamless. And I think the trend you'll start to see is, like, some of these very niche offerings that, yes, maybe they're really good at 1 thing, but that is a whole expense and team and complexity and resources. Do you really need that? Is it justified? I think people are getting smart about that and starting to realize that they would, you know, like to have tools that are interoperable and that, like, can be used ubiquitously.
And I think that you'll shift more towards products being successful where they fit that pattern. And that's why we've added custom functions. It's why we have SQL and NoSQL. It's why we have JDBC drivers while also having WebSockets. It's why we're trying to accommodate everything that the very small startup would want all the way to the extremely large enterprise, so that you can have all of that in one place. And with our custom functions, you really can code anything. You could do machine learning. You can serve a full website. You can integrate into a third party system. You can manage sub processes. You can build your own APIs.
So I know that doesn't work for every use case, but I think that there's a lot that it solves for. And, as a developer, that's what I want. And I think at least some of the market is like me.
[00:18:19] Unknown:
One of the things that I often run into as an engineer when I'm starting to play in some of these different ecosystems is particularly when you have a kind of collision of concerns that isn't as widely adopted in industry. So as an example, when I was going through your documentation, one of the options that you have in your API is being able to actually write a SQL query as part of an API request to the database engine. And so if I'm trying to develop that as an end user in my IDE, I might get some support for being able to highlight the SQL syntax, you know, do some linting to see where it's wrong. But then if I try to embed that into a JSON structure, then I'm trying to sort of collapse too many concerns into one, and the tooling doesn't always support that well. I'm curious what your thinking is in terms of being able to effectively kind of collapse those concerns into a single experience, provide a good experience to the end user being able to have all of the different tooling and ecosystem capabilities that they're used to, and wrap that all into a single product that is sort of easy for people to pick up and use, but doesn't force them to maybe switch the tools that they're used to developing with?
[00:19:36] Unknown:
I think that is a really good point, and I'll give you an example. So when we first started the company, first couple months, like, Kyle and I were designing what searching would look like in NoSQL. And we started working on adding multiple conditions and multiple operators, and we started to end up with this JSON object that was a mile long. And I just looked at Kyle, and our whiteboard was covered. We looked like the guy from A Beautiful Mind and, like, we were just crazy. And I looked at him and said, you know, there's a really good way to do complex searching that's been around for about 40 years. It's called SQL.
And trying to do this is insane. Like, you know, asking a developer to understand the syntax that we're building here is just crazy town. And so we have always kind of adopted the attitude that when things get too complicated, stop. And that might mean that we're gonna make some trade-off from a performance perspective or from a feature perspective, but there are other products. You know, Oracle has 40, 50 years of crazy features in it. And if you wanna go do some sort of really complicated SQL query that has a trigger and, you know, some sort of cascading delete afterwards and with a stored procedure, go for it. But that's, like, not what we're trying to achieve. And so part of it is just staying true to what we built. But sometimes we make mistakes. Right? Like, we've made mistakes like you mentioned in the SQL piece. And so that's why if you go into HarperDB Studio, you can see we do have some of that built in in the UI. And then you can also use database management tools like the MySQL Workbench and things like that on top of HarperDB if you use ODBC or JDBC drivers, because we don't want you to have to learn new stuff. Like, we don't want you to have to learn some HarperDB-specific SQL.
That's why we tailored to the ANSI standards. We don't want you to have to learn, you know, new things. We are trying to make your life as easy as possible while also knowing that when you have a billion rows a second being written, HarperDB will still work. Because a lot of times it's what's easy and developer friendly on one side and what's enterprise grade on the other, and that's not fair. And so we're trying to balance as much of both as possible, but it's hard. And we get in fights about it. Jackson's our head of product, and Jackson, Kyle, and I will get in arguments. And it's normally Jackson and I arguing with Kyle, because Jackson's tagline is, like, easy button and mine is keep it simple, stupid.
And, like, Kyle's very focused on performance and scale, and we fight it out and then hug it out and come up with a good solution is kinda our answer. I definitely appreciate the
[00:22:16] Unknown:
availability of the studio solution as a way to unify that experience and have a kind of first class interface into the engine for people who do want to lean heavily into HarperDB as a platform opportunity.
[00:22:30] Unknown:
And I have to say that that was all Jackson. Kyle and I, like, never thought of that. We didn't think it was important. Kyle and I, our background is in integration and, you know, data management. And so UIs were, like, command line and REST API. And so Jackson has brought that experience to the table, and we're very lucky to have him.
[00:22:52] Unknown:
Today's episode is sponsored by Prophecy.io, the low code data engineering platform for the cloud. Prophecy provides an easy to use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: git, tests, and continuous deployment with a simple to use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control, then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. Digging into HarperDB itself, can you talk through some of the implementation and some of the ways that the design and goals of the engine have changed or evolved from when you first started working on it? The goal of, like, read, write performance,
[00:24:00] Unknown:
simultaneous read, write has always been there and something we really excel at and continue to get better at. But, like, touting our replication capability was not something we ever really thought about. Like, we didn't start out to think, hey, distributed is an area that we're gonna excel at. To be honest, because we wrote the product in Node.js, we just kinda fell into that, because it's a web framework and the web is very good at horizontally scaling, and Node is extremely good at horizontal scale. And things like Slack were built on Socket.IO. And when you think about it, that is an extremely distributed system.
And so we inherited that off of sort of the shoulders of people like Slack, and that then became a major selling point for us. On the flip side, we really thought things like JDBC and ODBC drivers would be super important, and no one cares. Like, they use it to test stuff out and, like, maybe they'll connect it to Tableau, but, like, realistically, people just don't care that much about them. Like, we could get rid of them tomorrow. I think, like, 5 people would complain of the, you know, 25,000-some people using HarperDB. But Kyle and I were so obsessed, thinking that was gonna be so important.
So it's interesting. But then you'll have a customer who's really important, who does care, and so we're glad we have them. But we definitely didn't do enough market research on what people would care about in the beginning.
[00:25:29] Unknown:
As to the data model, I know that it is designed to be very flexible and dynamic in terms of the schema definitions. And I'm wondering if you can talk to some of the ways that you've approached the actual underpinnings to be able to support things like joins given your investment in SQL as an interface to the database? Because I know that for document oriented databases, it can become very difficult to actually do performant joins across different document collections and just some of the edge cases and engineering challenges that you've run into as you've been developing this database?
[00:26:09] Unknown:
Yeah. I mean, that was ultimately the problem we solved from day one, and so we've really stayed true to that. The reason was we were using DynamoDB at our last company, and we had millions and millions of rows in DynamoDB, and it was great for scaling up for writes and doing simple searching. But then, you know, trying to join across things is hard. We looked at Hadoop and Hive and things like that, and super slow. We tested a bunch of other stuff, and so ultimately the solution that we ended up having in our last company was we captured everything in Dynamo, then we moved it over to an in-memory SQL database to do all the analytics. And that was just super annoying to us. And it got out of sync, because we were doing all this stuff real time for live television.
And so if it's out of sync for a minute, you know, we couldn't go live on broadcast, and trying to sync two databases with a billion rows of data within a minute is hard. And so that was sort of the impetus for creating HarperDB. So our storage algorithm is a document store, but it is different than most other document stores. At a high level, the way it works is it is not an unstructured database, because as you insert data, we dynamically look at that data and we index all top-level attributes on write. So then as a result, combining a column from table A versus table B, querying those two columns is about the same performance as querying two columns from just table A, because of the way we store data. And the goal of that was so that you don't have to know what your schema is gonna look like ahead of time. Whatever analytics you wanna do on it, HarperDB has indexed the data in the smartest way possible so that that's possible. And so everything is indexed, which creates some challenges around storage. Right? Like, your storage is a little higher than it would be with some other stuff. I think it's about 20%.
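As a rough illustration of the "index every top-level attribute on write" idea he describes (this is a toy sketch, not HarperDB's actual storage engine), you can picture something like this:

```javascript
// Toy auto-indexing document store: every top-level attribute of every
// inserted document gets an equality index at write time, so any column
// is queryable without pre-declaring a schema.
class AutoIndexedStore {
  constructor() {
    this.rows = new Map();    // id -> document
    this.indexes = new Map(); // attribute -> (serialized value -> Set of ids)
  }

  insert(id, doc) {
    this.rows.set(id, doc);
    for (const [attr, value] of Object.entries(doc)) {
      if (!this.indexes.has(attr)) this.indexes.set(attr, new Map());
      const index = this.indexes.get(attr);
      const key = JSON.stringify(value);
      if (!index.has(key)) index.set(key, new Set());
      index.get(key).add(id);
    }
  }

  // Equality lookup on ANY attribute is an index hit, never a full scan.
  find(attr, value) {
    const index = this.indexes.get(attr);
    const ids = index ? index.get(JSON.stringify(value)) : undefined;
    return ids ? [...ids].map((id) => this.rows.get(id)) : [];
  }
}

const store = new AutoIndexedStore();
store.insert(1, { name: "dog", breed: "husky" });
store.insert(2, { name: "cat", color: "black" }); // new attribute, indexed on the fly
console.log(store.find("breed", "husky").length); // 1
```

The trade-off he mentions falls directly out of this design: every attribute carries its own index structure, which is why the storage footprint runs higher than in engines that index only declared columns.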
You know, my cousin is a developer at a very, very large social media company, and he's a lead Android developer there, and he once asked one of the teams. He said, hey, I need to be able to query on date of birth, because we need this content to be 18-plus. And they said, oh, we didn't index that column, figure out a different way, because it'll take us a month to do that. And as a developer, you don't want that answer. You want, just give me the column I wanna query on. And so that is kind of why we designed the storage engine the way we did. And it has its trade-offs, but it makes developers' lives a lot easier. You know, you can never predict what you're gonna need till it's too late. You have several billion rows of data in there, and you wanna be able to query it. One of the complexities
[00:28:40] Unknown:
that can often occur when you do have this dynamic schema is that one time you have an application that's writing in this structure and one of the fields is an integer, and then somebody makes a code change, and now it's writing out a string value. And I'm curious what types of enforcement you have for being able to say, okay, nope, this was an integer. You can't write a string there anymore. You're gonna have to actually pay some attention to the data modeling. Or is it just write whatever you want, and we'll figure it out later? Because there are definitely pros and cons to both approaches.
[00:29:12] Unknown:
We have some intelligence. We do not force anything right now. And so you can do whatever you want. You can put a string and an integer in the same column. We do have some intelligence around the indexing, and we'll look at what the majority of that data is and sort of store it based on what we think that data is. But ultimately, you can still do whatever you want. And so we felt like that was a decision that allowed you to still have the flexibility of NoSQL but with better performance than you get from NoSQL. We do have things built into, like, the ODBC and JDBC drivers where it'll look at the columns and it'll take its best guess, so that when you pull it into Tableau, it's like, this column's an integer, this column is a date, this column is lat/long.
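The majority-vote type guess he describes can be sketched in a few lines (a hypothetical illustration of the idea, not HarperDB's actual inference code):

```javascript
// Guess a column's type from whichever type tag the majority of its
// values carry, tolerating a few stray values of the wrong type.
function inferColumnType(values) {
  const counts = {};
  for (const v of values) {
    const t =
      typeof v === "number"
        ? Number.isInteger(v) ? "integer" : "float"
        : typeof v;
    counts[t] = (counts[t] || 0) + 1;
  }
  // Return the most frequent type tag.
  return Object.entries(counts).sort((a, b) => b[1] - a[1])[0][0];
}

console.log(inferColumnType([1, 2, 3, "oops", 4])); // "integer"
```

A driver doing this can present a mostly-integer column to Tableau as an integer even though one row slipped through as a string, which is exactly the loose-but-helpful behavior described above.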
But we don't enforce anything right now. We do plan to, in the future when we have some more resources and time, allow that as an optional feature where you can turn on schema enforcement. But the nice thing about HarperDB is you can describe the schema. And so unlike MongoDB where, you know, you'll have a million rows that have this column, but in the same collection, you might have a million rows that don't, once that column's created, it's created. And it's there for all the objects, and it'll be null if you don't put anything in. And whether or not you have it, you know it's there. You can describe the schema, see the schema, see what it looks like. So you have some better management around that. But, yeah, we're kind of trying to balance that, and it's a hard thing to balance, because as soon as you turn on strict schema enforcement, that's gonna create a whole other set of problems.
And so, you know, we also, like, strongly believe in keeping HarperDB as stateless as possible. And so we're also very wary of ever putting background processes in, because background processes are often what cause databases to crash. And so, like, there's a lot of things that go into that thought process. And it's not so much that we don't wanna do that as that there are so many trade-offs in a database when you make one decision that you really have to carefully think about everything. And then also, databases are a hard thing to update and maintain. And so we can't just roll out features willy-nilly, because that can be extremely disruptive. So things have to be really thought out before we do them.
This exact problem you're talking about is one of the things we've been talking about for years and carefully planning how we'll roll that out.
[00:31:25] Unknown:
Another interesting architectural aspect of HarperDB is, as you mentioned, it is a distributed database engine. And I know that the replication method is using a pub/sub model where you can subscribe to different table updates and decide when to replicate that. And I'm curious what types of deployment topologies and unique use cases that enables, and just some of the ways that that has manifested in the overall sort of product design of HarperDB?
[00:31:58] Unknown:
Yeah. That gives a lot of optionality, and with a lot of optionality comes a lot of trade-offs. And so there are not infinite, but there are a lot of different topologies in which you can deploy. So you can do a hub and spoke. You can do a circle. You can do, you know, like, many, many different ways. You can have, like, a multi-tiered hub and spoke. And so it very much depends on the use case. In IoT, we've seen a lot of the hub and spoke be sort of like a successful model. If you think about it, we've got things writing very high volumes of data on the edge, but maybe as you get closer to the core, that data doesn't matter that much. And so you wanna kind of buffer, decide, and sort of have a multi-tiered architecture in how your data moves, you know, from the edge into the core.
Then you've got more like gaming media where all of the data matters all the time. And so, like, some smarter version of a circle makes more sense, or a fully distributed, like, peer-to-peer. It is extremely use case dependent, but you also have to think about what that means. Right? Because, like, it is an exponential problem, because as you add nodes and those nodes talk to each other, if you have a hundred nodes, that's a hundred connections you could potentially have. And then what does that do to your network, and how much information is moving back and forth? And so that is why we are actually rolling out, and I can't unfortunately talk about it too much, but we're rolling out a new clustering topology which solves for a lot of that. It makes things also more consistent across that, because what we've realized is that while that optionality is great, and we're gonna keep that, like, we need a more standard methodology to keep people safe, honestly, because sometimes choices can be a problem. And so it is a really interesting problem. And I'll be honest, Kyle is much smarter about that than I am and can talk your head off about it for four and a half hours.
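The connection-count blowup he's gesturing at is easy to make concrete. As a back-of-envelope sketch: a fully connected mesh of n nodes needs n(n-1)/2 links, while a hub-and-spoke needs only n-1:

```javascript
// Number of pairwise connections by topology, for n nodes.
const fullMesh = (n) => (n * (n - 1)) / 2; // every node pairs with every other
const hubAndSpoke = (n) => n - 1;          // one hub, n-1 spokes

console.log(fullMesh(100));    // 4950 connections
console.log(hubAndSpoke(100)); // 99 connections
```

That gap between roughly quadratic and linear growth is why topology choice, network capacity, and per-node compute all interact the way he describes.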
And it's a very geeky, cool problem, but it gets pretty complicated between network and also, like, available compute on a per-node level and, you know, open connections and things like that. It is one of the two most complex engineering problems in HarperDB, besides the storage algorithm. Yeah. And particularly when you start dealing
[00:34:19] Unknown:
with replicated data structures and figuring out what are the transaction boundaries. Do I want to actually get into the space of doing distributed transactions? Where I know that with HarperDB, you have opted for last write wins. And so whichever record ends up being replicated to a given node, whichever one was the most recent one is going to be the winner in that sort of write competition.
[00:34:44] Unknown:
Yeah. And we did that on purpose, because for the use cases that we're working with, that is probably the best way to do it. But that doesn't work for other use cases. And Cockroach, it's weird because we're sort of competitive with Cockroach, but we'll often recommend, if this doesn't work for you, you should look at CockroachDB, because Cockroach does a really good job of solving the other side of the problem for us, and they're a much better fit for sort of like a fintech use case where that really matters. And, obviously, we're gonna be much faster than Cockroach in a distributed fashion, because we did the other side of the problem, and so we focused on speed and lower latency for the end user, but the guarantee of consistency is gonna be lower, whereas Cockroach is focused on that guarantee.
And so, like, you have to decide what you want. And if you're looking for that guarantee, Cockroach is a better choice than HarperDB for that.
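The last-write-wins resolution being discussed reduces to a very small rule. A minimal sketch, assuming each replicated record carries an update timestamp (field name `updatedTime` is illustrative, not HarperDB's actual internal field):

```javascript
// Last-write-wins conflict resolution: when replicas disagree on a
// record, the version with the newest timestamp survives.
function lastWriteWins(versions) {
  return versions.reduce((winner, v) =>
    v.updatedTime > winner.updatedTime ? v : winner
  );
}

const merged = lastWriteWins([
  { id: 1, score: 10, updatedTime: 1000 }, // replicated from node A
  { id: 1, score: 42, updatedTime: 1005 }, // written later on node B
]);
console.log(merged.score); // 42
```

The trade-off is exactly as stated: the merge is fast and needs no coordination between nodes, but the losing write from node A is silently discarded, which is why a strongly consistent engine is the better fit when every write must be durable.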
[00:35:38] Unknown:
In the context of the deployment topologies, you referenced IoT, and I'm curious what are the available deployment targets for HarperDB, and how has that influenced the way that you think about the actual packaging and deployment of the database engine to be able to fit on some of these smaller or lower-powered devices?
[00:36:01] Unknown:
So we spend a lot of time working in IoT. HarperDB can be deployed on anything with a Unix-based operating system. So your Mac, a Raspberry Pi Zero, you know, a huge bare-metal machine, and kind of anything in between. And so we spent a lot of time in the beginning focused on, it has to be able to do everything on a Raspberry Pi that it can do on a Cray supercomputer. To be honest, we spent too much time worrying about that, because of the way that IoT has moved. We're partnered with Verizon now, and we're deployed on all of their MEC locations throughout the United States. And so, like, we're doing projects with them where you have smart devices over 5G talking to HarperDB on, you know, Wavelength locations spread across the United States.
And that latency from that device to HarperDB on that MEC is typically 1 to 2 milliseconds, maybe 5 or 10 in the worst case. And we've realized that for IoT, that's a way better pattern: just, you know, having more edge data centers closer to the end user makes a lot more sense. Definitely putting some application on the device, but the more you put on the device, you create a lot of risk, and also then you have, like, problems with upgrading devices. You know, we're looking at one customer where if they would upgrade their devices, it would cost them a hundred million dollars. And so moving that compute as close to that device as possible, and putting as much logic in it, HarperDB really solves that problem without creating these risks around putting it directly on the device. That said, if you wanna run it, there are use cases, more specifically in areas where you have poor connectivity, and that will remain true for a long time. Those could be utilities, industrial use cases, military use cases where you can install HarperDB on a mobile command center running, you know, on the battlefield. And so we still solve for that. We still have a lot of capability, because HarperDB works totally offline, doesn't require the Internet, and can run on those devices. And there can be value in that, but that is not sort of our primary focus, but still something we're able to accommodate.
[00:38:13] Unknown:
There are definitely a lot of interesting sort of subtopics we could dig into, such as the kind of versioning and updates of the function definitions, some of the ways that you think about the kind of design and capabilities of those embedded functions, the sort of upgrade process of managing HarperDB deployments. But another interesting element is the fact that you do have this open source database engine and this commercial entity behind it. I'm curious what the sort of governance and sustainability approaches are for being able to manage the boundaries between that open source and commercial capabilities.
[00:38:51] Unknown:
So HarperDB itself currently is not open source. So the core of HarperDB is a freemium model. You know, I'm a former Red Hatter. I'm a huge fan of open source, and we've tried to open source as much of the technology as we possibly can and, you know, everything surrounding it. One of the things when we launched the company we looked at, and actually we looked at Cockroach quite a bit, was that launching an open source database was hard. No one wanted to pay for a database, and I would love HarperDB to be open source. I think, honestly, it would help us quite a bit. We'd have a lot more people contributing to it, and it'd be in the hands of more people. You know, it's written in Node.js, and it's deployed via npm, which are two of the most popular things in the world. So I think that would go over well.
But we also need to be a sustainable company and grow and get paid. And so that has been just a real balancing act for us. We constantly think about open sourcing HarperDB completely, and hopefully we get to a point where we can do that, but we're not quite there yet. And I don't know that we ever will be, but we would love to, I guess, is the honest answer.
[00:39:58] Unknown:
In your experience of building the HarperDB technology and the business around it and working with your customers, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:40:10] Unknown:
So many. So one time, and I mean, this is very old, but I remember a guy in a rural area in India built an application on HarperDB where he could do legal document sharing. And so basically he would pass the database around on these devices so that they, without Internet, have this ability to sort of track for consistency and manage that and do version control, but completely offline. And I thought that was really interesting. I think gaming companies have pushed us into some really interesting areas around doing things like fighting bots at scale on HarperDB, using peer-to-peer capability. We've seen projects with the US military where they've done, like, facial recognition stuff inside of HarperDB, which I thought was fascinating.
People are constantly just building really cool stuff on it. That's one of the fun things about being a database: the community builds a crazy variety of different things. It's honestly, for Kyle and me, probably the most fun thing. Every time we see somebody build something new and wild on HarperDB, it's like, hey, we built that thing under that, and now they're building this. And it's, like, the most affirming and most fun part of running HarperDB.
[00:41:27] Unknown:
In terms of the features and capabilities of HarperDB, it's a fairly extensive project. I'm curious, what are some of the capabilities that you think are either overlooked or misunderstood or underutilized that you wanna highlight?
[00:41:43] Unknown:
I think this sounds stupid, but it is true. I think, really, for me, it's two things. One, when we talk to a lot of companies, they believe it's gonna be this tremendous migration process to a new DB, because they're used to sort of JDBC drivers, you know, ORMs, and all of that stuff. And Harper, because it's a single endpoint, you can literally copy the code examples in any language. Any developer that's ever built an app on HarperDB will tell you it's easier than building on anything else in the world, and it's because of that single endpoint. I don't think people really understand what that means until they use it. And so that's something I think that people need to get their heads around, but when they do, it is a huge value add.
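To make the single-endpoint idea concrete: every operation, NoSQL or SQL, is one JSON POST to the same URL. This sketch follows the general shape of HarperDB's operations API, but the URL, schema, and table names here are made-up examples, not a verified configuration:

```javascript
// One endpoint for everything: the operation is named in the JSON body,
// not in the URL path.
const HDB_URL = "http://localhost:9925"; // hypothetical local instance

const insertOp = {
  operation: "insert",
  schema: "dev",
  table: "dog",
  records: [{ id: 1, name: "Harper", breed: "mutt" }],
};

const sqlOp = {
  operation: "sql",
  sql: "SELECT * FROM dev.dog WHERE breed = 'mutt'",
};

// Sending either one is the same call shape, e.g.:
// await fetch(HDB_URL, { method: "POST", body: JSON.stringify(insertOp), ... });
console.log(insertOp.operation, sqlOp.operation); // insert sql
```

Because the transport is just HTTP plus JSON, "copy the code examples in any language" amounts to swapping in that language's HTTP client; there is no driver to install.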
And then I think how extensible custom functions make HarperDB. It's a Fastify server essentially living inside of HarperDB. And what that means you can then do, I think people, they're starting to see it now and are blown away by it, but I think that that's gonna take some time to catch on. I'm excited to watch that catch on. And I saw a tweet from a guy yesterday, and he was like, I built my whole app on HarperDB. I did not believe that I could do that, but then I did. And that was very cool to see. So we're excited about that. In your experience of running the business and building the technology and working with the community around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? I think one thing we learned was that the developer community, we love them and they drive a tremendous amount of innovation and value for us, but they are not our, they don't pay us money.
And so trying to target both the developer community as well as large enterprises has been, you know, a matter of tailoring our message and our documentation and our feature set to accommodate both those folks at the same time. That was not something we expected, and we had to learn a lot about it. I think that's probably been number one for me.
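The custom-functions idea mentioned above amounts to registering HTTP routes on a server that lives next to the data. This is a hedged sketch of that pattern: the `server.route` call mimics Fastify's route options, and `hdbCore` plus the stub objects are stand-ins invented for illustration, not the real HarperDB API:

```javascript
// A hypothetical custom function: a route that answers from local data.
const routes = async (server, { hdbCore }) => {
  server.route({
    method: "GET",
    url: "/dogs/:breed",
    handler: async (request) =>
      hdbCore.query("SELECT * FROM dev.dog WHERE breed = ?", [
        request.params.breed, // parameterized to avoid injection
      ]),
  });
};

// Stub the server and data core so the sketch runs stand-alone:
const registered = [];
const stubServer = { route: (r) => registered.push(r) };
const stubCore = { query: async (sql, params) => [{ sql, params }] };
routes(stubServer, { hdbCore: stubCore });
console.log(registered[0].url); // "/dogs/:breed"
```

The appeal is that application logic runs in-process with the database, so the "whole app on HarperDB" tweet he mentions becomes plausible: routes, queries, and data all live behind one deployment.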
[00:43:42] Unknown:
We've touched on this a little bit with some specific examples, but broadly speaking, for people who are interested in being able to build on a flexible and scalable database, what are the cases where HarperDB is the wrong choice and they might be better suited with a different engine? I think fintech is definitely one.
[00:44:01] Unknown:
I think if you're primarily building something like a data warehouse, and you have low volumes of writes and you're just doing massive reads on huge volumes of data, you're really not looking for, like, an HTAP database where it's both an operational database as well as an analytical database. HarperDB is probably not a great fit for that. There are better things to solve that problem. Yeah. Those are kind of the two major ones. I think, like, if you're looking at Splunk and comparing HarperDB to that, Splunk is definitely gonna have, like, more features for analytics than we will. If you're really focused on solving one problem, and the benefit of being able to do everything in one place isn't there for you, you don't care about that, then it might not be the best fit. But if you wanna solve one problem and be able to do most other things, then HarperDB is a really solid fit.
[00:44:54] Unknown:
In terms of the near to medium term future of the project and the business, what are some of the things that you have planned or any projects that you're particularly excited to dig into?
[00:45:05] Unknown:
As I mentioned a few times, we're rolling out the new version of the replication engine, which I'm super excited about. That's gonna be in our upcoming release. We've spent a lot of time doing hardening. I think we're gonna continue to enhance custom functions. I want to start rolling out, like, a library of prebuilt stuff for the community on custom functions that are sort of ready to go, so that's something I'm excited about. We've had some great sort of live streams and web streams recently that I think were pretty exciting. We have several more coming there, and we're announcing some really big partnerships that I think are gonna make HarperDB even easier to use and give developers more choice about where they deploy their applications.
So we're very excited about that as well. And those are from all the way from the 5G edge to sort of hyperscaler cloud and everywhere in between, and so I'm pretty excited about those as well.
[00:46:00] Unknown:
Are there any other aspects of the work that you're doing at HarperDB or the database market or the use cases of building your application logic in the database engine that we didn't discuss yet that you'd like to cover before we close out the show? I did think of something just now, sort of relevant to that. It's kind of an overlooked feature, but that also
[00:46:21] Unknown:
ties into this question. I think the other thing that people don't realize, and that I even forget, is HarperDB is decoupled from its storage, and it treats containers as a first-class citizen. That is very unique to HarperDB in the market. And so from a deployment perspective, when you wanna deploy your app, you can deploy HarperDB on Kubernetes and attach it to storage, and it starts in a few milliseconds; detach it and deploy it somewhere else and reattach it to storage, and it's up and running in a few milliseconds. So as the world moves to infrastructure as code and containers, you know, become the norm, I think HarperDB is probably the most container-friendly database in the world. And I think that that gets often overlooked, and I think will become, like, a very interesting part of our story in the near future.
[00:47:10] Unknown:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:26] Unknown:
That's a great question. Storage is the answer, and it's an easy answer. You know, compute is very flexible now, but storage is not. And storage needs to be disrupted. And, honestly, if I wasn't doing this, I think I'd do that, but I didn't know that before this. So I'm just learning that, like, it's very frustrating, the available storage options in the market. And unless you're inside of AWS or GCP or, like, you know, Linode, it's very hard to have flexible storage options outside of the hyperscaler clouds.
I think that makes data management really hard. And right now, it's only affecting, like, the very large players in the space who have huge volumes of data. But as everyone starts to have huge volumes of data, I think that's gonna be a problem that needs to be solved. Highly recommend someone start a startup and disrupt that space.
[00:48:17] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at HarperDB and your overall perspective on the database market. It's definitely a very interesting project and a great product. Definitely excited to see the capabilities that you're offering to the community. So I appreciate all the time and energy that you and your team have put into that, and I hope you enjoy the rest of your day. Thank you so much, and I really appreciate it. And thank you for having me here. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Stephen Goldberg: Introduction and Background
Founding and Vision of Harper DB
Challenges and Learnings in Building a Database
Unique Capabilities of Harper DB
Developer-Friendly Design and Features
Database Trends and Future Outlook
Implementation and Evolution of Harper DB
Replication and Deployment Topologies
Open Source and Commercial Strategy
Use Cases and Fit for Harper DB
Future Plans and Exciting Projects