Summary
Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
- Your host is Tobias Macey and today I'm interviewing Tanya Bragin about her views on the database products market
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the aspects of the database market that keep you interested as a VP of product?
- How have your experiences at Elastic informed your current work at ClickHouse?
- What are the main product categories for databases today?
- What are the industry trends that have the most impact on the development and growth of different product categories?
- Which categories do you see growing the fastest?
- When a team is selecting a database technology for a given task, what are the types of questions that they should be asking?
- Transactional engines like Postgres, SQL Server, Oracle, etc. were long used as analytical databases as well. What is driving the broad adoption of columnar stores as a separate environment from transactional systems?
- What are the inefficiencies/complexities that this introduces?
- How can the database engine used for analytical systems work more closely with the transactional systems?
- When building analytical systems there are numerous moving parts with intricate dependencies. What is the role of the database in simplifying observability of these applications?
- What are the most interesting, innovative, or unexpected ways that you have seen ClickHouse used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on database products?
- What are your predictions for the future of the database market?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- ClickHouse
- Elastic
- OLAP
- OLTP
- Graph Database
- Vector Database
- Trino
- Presto
- Foreign data wrapper
- dbt
- OpenTelemetry
- Iceberg
- Parquet
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Miro: ![Miro Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/1JZC5l2D.png) Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at [dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date.
With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today I'm interviewing Tanya Bragin about her views on the database products market. So, Tanya, can you start by introducing yourself?
[00:01:30] Unknown:
Thank you, Tobias, and it's great to be on the show. So as you mentioned, my name is Tanya Bragin. I've been in the data space for roughly a decade and a half now. My beginnings were really coming into the space more from a consulting perspective. I was a student of computer science, and I worked for Deloitte and then went back to grad school. And kind of how I got into the data space is I was looking for my next job out of grad school, and the advice I got was to go and interview for product management jobs. And I happened to land at a startup in the Seattle area called ExtraHop Networks, and this was my first data startup. It was specifically in the networking niche, but I learned a lot about building analytics for, you know, large amounts of data. And from there, I went on to Elastic, the company behind Elasticsearch, and this is really where I would say the majority of my experience in the data space has formed.
And in the past couple of years, I moved on to a company called ClickHouse, which is another company, similarly to Elastic, focused on data analytics.
[00:02:29] Unknown:
And you mentioned a bit about your history. Do you remember where in that journey you first started working in the data space and what it is about it that made you wanna keep going in that trajectory?
[00:02:40] Unknown:
Yeah. So at ExtraHop, I didn't think of myself as really working in the data space because we were building a solution specifically for network engineers. But, of course, a big aspect of it was capturing all this networking data. And we actually had a custom database that we built specifically to run on these network appliances. This was in the era when, really, a lot of companies still were on premise, and how they captured network data was in these big appliances. And to run efficiently inside that appliance, ExtraHop built a custom database. And I knew, of course, a lot about it, but it wasn't something that we sold to the general market. With Elastic, things were very different. Elastic was one of the first, I would say, really popular analytical databases that was open source and just widely adopted, first for search and then for logging. And that's when I really got very interested in what a database, simply just a database, can enable in terms of use cases, because the kind of use cases Elastic enabled were really, really broad and wide. And this is also where I really started enjoying working with open source technologies and communities. For me, this was a big revelation, how much you can learn from just somebody picking up your product and using it for something unexpected.
And that was a large reason for why I joined ClickHouse. This is also an open source database and growing in popularity primarily due to the open source distribution.
[00:03:54] Unknown:
And as somebody working on the product side of a database vendor, what are some of the aspects of the database market and the technology that you're focused on, and what are the pieces of the technology and the ecosystem
[00:04:12] Unknown:
that are most relevant to your specific role and the types of end users that you're interacting with to get feedback on the product? So as you kind of pointed out, I think even just by asking this question, a database in the end is simply infrastructure. It enables storing data. In the end, what users want to do with it is enable real world use cases, something that they're building, an application that they're building. And those are the things that I really look at. What are people building? Why are they building it? Why does this specific technology and not that become a lever for them to build it faster, better, and why does this sometimes cause a completely new technology to come to market? At Elastic, the interesting part was search. Right? This was in the era when websites were still kind of new to having search as an experience. Of course, now we're all very used to having a search bar. Like, if you come to a website and there's no search bar, you'll be like, this is nuts. Like, everybody must have a search bar. But when Elasticsearch became popular, it wasn't yet the case. Right? And so the explosion in interest in building search technologies, or search experiences rather, enabled by search technologies, is what really caused Elasticsearch to appear as, you know, a really prominent player there. And for me, like, I continue to watch new applications. To me, what's really interesting is what is the next trend? What is the next application that everyone is going to build, and what will they need for that? Because that's what ultimately a database technology enables.
[00:05:37] Unknown:
And going from Elastic to ClickHouse, they're very different engines, very different target use cases. I'm sure that there's some overlap in terms of the ways that they're being applied. I'm wondering what are some of the aspects of your learnings from your time at Elastic that you've been able to bring into ClickHouse to help inform some of the product direction that you want to drive towards? Yeah. It's interesting that you say that Elastic and ClickHouse are different. They're actually very similar in many ways. Elastic started off
[00:06:07] Unknown:
as, again, primarily the search technology. So the main data structure that it used was an inverted index to get a bunch of documents indexed for very fast search. But then very quickly, it added a columnar store to enable analytics. And why? It's because a search bar usually then results in an experience of looking at, you know, the actual results that are brought back and analyzing them. So it made sense to pair this inverted index with a columnar store for analytics. And so during my time at Elastic, I was actually responsible for what was then called the logging product line. We really thought of analytics as just analyzing logs. Any event was a log. And that's where the biggest overlap is with technologies like ClickHouse and other OLAP databases. So while Elastic didn't call themselves an OLAP database, they were absolutely one, and they still are. Right? They just called themselves a search engine, and it kinda stuck with them. They called everything a search use case. But in reality, they had, and they still have, a very popular analytics solution. In terms of ClickHouse, I know I'll get back to it, but kind of going to your original question, what aspects of my Elastic experience apply now at ClickHouse? Again, a lot. Both databases are open source. And so what I find is that in product management, working with open source products versus fully commercial products is a very different ballgame. In open source, you have this community of users that you may never meet and cannot necessarily interview. So it's almost like elements of consumer oriented product management come in. You have to almost measure the sentiment in your user base as opposed to knowing every, you know, commercial user of your product. You have to look at adoption trends versus buying trends, and it's really interesting. Certainly, my learnings there from Elastic map very much onto my experience currently at ClickHouse. The second part that maps very well is working for a venture backed, fast growing company. Once you have venture investment, it's just a very different ballgame versus, say, bootstrapping a company or simply working on an open source project that doesn't have that aspect. At Elastic, again, this was a really great learning. It was just a rocket ship in terms of growth. And so learning how to keep up with the pace of the company's growth, how to evolve during that time, was something that I took forward with me. And the last part is leading teams, which I think kinda comes with growth. If you work for a high growth company, often you are in a position to step into a leadership role if you wish. Certainly, there's opportunity. And then how do you bring new talent into the company? How do you motivate new people
[00:08:34] Unknown:
to take on the challenges that maybe you're doing today? Those aspects absolutely map. And another interesting aspect of this particular area of the industry is that databases are kind of their own category of product. There are a lot of pieces of data infrastructure, but the database is typically something that requires a certain amount of time and diligence before just bringing it into an infrastructure, because it is likely going to outlast pretty much every other aspect of the application that's being built on top of it because of the weight of the data that is stored there. And for people who are thinking about database technologies and how they want to structure their applications, can we start by just enumerating the overarching categories within the database product market as it exists today?
[00:09:24] Unknown:
Yeah. And, by the way, you're absolutely right about databases being so sticky. Right? Like, being the center of gravity, almost, of the infrastructure. So, yeah, where to start? First of all, I would say transactional databases are still the workhorse of just a typical data workload. And why? Because a lot of the data is well served by transactional databases. This is why Postgres, MySQL, and also traditionally the document databases that have evolved to have more capabilities, like MongoDB, are commonplace. If you're picking up a new application, this is what you start with today. However, I do see, and have seen already over the past 10 years, the analytical database evolve as, you know, the second database that you likely put next to a transactional database. And the reason for that is more and more users are building interactive data driven applications. I mentioned search before. This was the first interaction. You don't want your search to be slow. Right? If you type something in the search bar, you press enter, you expect results immediately. And so these interactive data driven applications sometimes are not well served by transactional databases that are really meant to store and retrieve individual rows very quickly.
Analytical workloads that are meant to be interactive need to analyze a lot of historical data and yet bring a result back quickly. And so OLAP databases, where data is organized in columns, serve these workloads much better. But I would say that in a typical company where you're building anything at all today, you will need both a transactional and an analytical database. And then there are, of course, specialized databases. I would say that graph databases are still quite specialized. For some use cases, you may need a graph database. In the evolving space of Gen AI applications, an open question right now is will a vector database become the next, like, third database that everybody needs, or will those capabilities for vector search actually be built into either OLTP or OLAP databases? So that's one big open question that I think everybody is asking. I think those are the four major categories of databases that you might kind of wanna think about today. And then even within those categories, there are subcategories. So within transactional, there are document stores, which often don't actually have transactions,
[00:11:40] Unknown:
which is kind of ironic, and you might also use those for analytical purposes. And as you alluded to, row versus column based storage for transactional versus analytical engines. Yeah. It's an interesting market. And then there are also the questions of, is it scale out? Is it scale up? Is it embedded? Is it cloud native? That's right. In terms of the overall trends of the industry, you mentioned that when you started at Elastic, it was still fairly early on. Search was an up and coming experience that consumers were starting to grow accustomed to and expect. I'm wondering what are some of the major trends in the industry, both as far as the consumer patterns and the ways that databases are being incorporated into applications and infrastructure, that have driven the development and growth of some of these new and emerging categories, particularly for the very niche use cases? Mhmm.
[00:12:33] Unknown:
Yeah. So I think, you know, in addition to search, as I mentioned, even during the Elasticsearch days this area of analyzing data was already becoming big, and there are so many sub use cases there. And the trend of needing an analytical database for some of these interactive applications continues. I'll give a couple of examples. And, actually, here, I'll start with ClickHouse just because, again, it's a newer technology driven a little bit by some of the newer trends. So, originally, ClickHouse, and the name stands for clickstream data warehouse, was developed for a web analytics workload, basically. So Google Analytics is probably the most common example that might come to mind. If you want to analyze the performance of your website, you put, you know, something in your website, like a snippet of JavaScript, and that sends events back as to who visits your website and why, and you can go and analyze that data. So that kind of data is append mostly, right, and not changing. Usually, again, it's like a log of data.
But it comes at a really high rate, and the results and the kind of analysis that you do looks both at the most recent data and historical data and asks questions of just a few columns of the data. So it's a very typical kind of OLAP workload. This is the workload for which ClickHouse was originally built. But interestingly, the kind of use cases and applications I see now being built on top of ClickHouse and similar technologies are really driven by this trend to build, I would say, productivity tools across all industry verticals. Marketing professionals are an example. More and more tools are being built to make marketing professionals more effective. And why? Because ad tech continues to grow. There are so many things that a marketer needs to do today to optimize spend in terms of, like, driving leads. It is absolutely a data oriented job. There's no way for you to do a good job as a marketer without having access to data and effective tools on top of that data to make decisions. Basically, it's a must.
So everybody that has an effective marketing department is buying these tools, and it drives development of all of these SaaS startups in the marketing space. Same in the sales space. If you're a seller today, the way to be effective and to have an edge over the competition, again, is to use data, to really understand the trends in your region, to really understand some of the view that maybe your marketing colleagues have, but with kind of a lens of a salesperson. So, again, all of these sales productivity startups need to analyze a lot of data, and they have to choose a database to do it at scale and also efficiently. Because if these are SaaS services, it's not just about delivering fast results. The database has to be optimized for your workload for you to have positive margins. And so this is why more specialized analytical databases are getting adopted for building some of these very data intensive interactive applications that ultimately drive ROI for many businesses. And I can talk about more applications, but I wanted to hit on that because, again, it's really data intensive applications that need interaction, need, like, real time decision making.
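To make the shape of that workload concrete, here is a minimal sketch in ClickHouse SQL of the append-mostly clickstream pattern described above. The table and column names are hypothetical, not from the episode; the point is the sort key plus an aggregation that touches only a few columns of a large event log.

```sql
-- Hypothetical append-mostly clickstream table.
CREATE TABLE page_events
(
    event_time DateTime,
    site_id    UInt32,
    url        String,
    visitor_id UInt64
)
ENGINE = MergeTree
ORDER BY (site_id, event_time);

-- A typical interactive query: scan recent and historical rows,
-- but read only a few columns and aggregate heavily.
SELECT
    toStartOfDay(event_time) AS day,
    count() AS page_views,
    uniq(visitor_id) AS unique_visitors
FROM page_events
WHERE site_id = 42
  AND event_time >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```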
[00:15:31] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. Absolutely. And within the different, particularly newer, segments of the database market, what are the pieces that you see growing most rapidly, or at least gaining the most attention and potentially leading to accelerated growth? Mhmm. Yeah. So,
[00:16:34] Unknown:
again, going back to some of the newest trends, unless you've been under a rock, you know, you've heard of OpenAI, you've heard of ChatGPT, you've heard of Gen AI applications. I think a lot of people are asking themselves right now, first of all, like, how much attention should I be paying to this trend? Like, is this something that's gonna completely change, you know, the way I build products in my sector, or is it just incremental? And if it's more disruptive, like, does it mean that I need to change the way I build applications? Like, what does it mean to consume results from a large language model? Do I have to actually train models myself? So a lot of people are asking those questions. And, in terms of application building, what's becoming really clear is that while hosted large language models are quite adept, in order to get really good results for any particular domain, you do have to fine tune those results. And in order to fine tune those results, at some point, and if you know the space, you'll know the terminology, you have to develop these embeddings based on the data that you have and combine that with results that are coming back from a pre trained model that maybe you're consuming.
So there's a question right now of whether, to build an application that is somehow powered by an LLM, you have to have a way to host your own embeddings, or can you do this in some other hosted scenario? So it becomes kind of a question for a lot of engineers and developers out there: do I need a specialized vector store, or can I just use Postgres and the built in Postgres kind of vector store? Is that gonna be enough? Same, you know, with an OLAP database. If you're using ClickHouse, the question becomes, well, is ClickHouse vector search sufficient for my purposes, or do I need something even more specialized like Pinecone? I believe it's still an open question. However, if there's anything I've seen in terms of trends in the technology space in general, it is usually toward simplicity and consolidation. So I think if it's possible for existing databases to build in those capabilities in a way that's sufficiently performant and resource efficient, then it will happen. If it's simply impossible, if the architectures are so divergent and these workloads are that important, there may be a third class of databases that gets developed. But I think it's an open question.
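To illustrate the "built into an existing engine" side of that open question, here is a hedged sketch of brute-force vector search in ClickHouse using its array distance functions. The table, the column names, and the three-element query vector are made up for the example; real embeddings would have hundreds of dimensions.

```sql
-- Hypothetical table storing documents next to their embeddings,
-- which are computed by an external model before insert.
CREATE TABLE docs
(
    id        UInt64,
    body      String,
    embedding Array(Float32)
)
ENGINE = MergeTree
ORDER BY id;

-- Brute-force nearest-neighbor search: rank all rows by cosine
-- distance to a query vector from the same embedding model.
SELECT id, body
FROM docs
ORDER BY cosineDistance(embedding, [0.12, -0.07, 0.33])
LIMIT 10;
```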
[00:18:47] Unknown:
Yeah. It's definitely interesting and early days for the vector database market. And, yes, everybody has their opinions as to which one is going to win out, particularly if you happen to work for a vector database vendor. For sure. For sure. And, again, the way I see it is, like, certainly, again, I think
[00:19:06] Unknown:
transactional and analytical databases should be developing these capabilities. Because if it's possible for you to serve even a fraction of that market, somebody doesn't have to get a new database. I'll give you an example of why our customers ask for it. So we have customers in the fraud analytics space where they're analyzing a lot of information in order to make a decision as to, say, whether a transaction is fraudulent or some behavior is undesirable. And they do it based on heuristics, so they have an analytical database for that purpose, and it was working very well for them. And now they want to augment it with a couple of, sort of, fraud detection methods that are maybe reliant on LLMs. They don't want to move all of this data, and ideally they don't want to host two databases with overlapping data.
If possible, they just wanna host embeddings in ClickHouse and combine that with the data they already have in ClickHouse. So if we can deliver them performance that is sufficient for their use case, of course, we will try to do that. Does it mean that there's no, like, you know, even more advanced use case for which a vector database is necessary? No. It doesn't mean that. So it's possible that both need to exist, that existing databases need to add embeddings and vector search capabilities, but still, for more specialized use cases, you may need a dedicated vector database.
[00:20:13] Unknown:
Circling back to the stickiness of databases as a piece of infrastructure, we've touched on a few of the types of questions that teams should be thinking about in that selection process, but I'm wondering if you can just talk through some of the core elements of performing proper due diligence on this technology selection. Some of the technology concerns, some of the organizational concerns, and just some of the ways that teams should be approaching this step of identifying, do I even need a new database? Do I need a database at all? And if so, which is the right one for this particular use case?
[00:20:50] Unknown:
Right. I was thinking about this question ahead of time, and it's a tough challenge actually, because in order to select a database, you have to really understand your workload, and sometimes you don't. Like, you start building an application, and you don't yet know what the shape of your workload is gonna look like until you've built the app or prototyped the app or really gotten to a point where real world usage is driving certain shapes of your workload. Like, you may not know ahead of time exactly how many columns you're gonna have in your data or which column, for instance, will end up having a certain cardinality of values. So you just simply don't know. You can have a hypothesis, but you may not know. So one thing I will say upfront: you probably will make the wrong decision at some point, and you'll have a database that simply doesn't scale. The question then is how quickly you can migrate or move some of that workload to another technology. This is why at ClickHouse, we actually do focus specifically on making that part of the journey easier. We just anticipate that, of course, a lot of existing Postgres users at some point will hit a scaling limitation, and they will need to quickly onboard onto ClickHouse. And making that path very simple is important. And then, you know, as far as trying to do it upfront, I guess I would say that, yes, just knowing that there is even a transactional versus analytical workload is important, because they are quite different. Transactional workloads ultimately are more static. Right? You have rows. Of course, you're kind of growing, but you're updating existing data in place. It's a slower growing workload, I would say, whereas analytical workloads are more like changes. Like, imagine you've got a more static inventory of products. Your analytical workload would be anything that has to do with changes in inventory. And of course, that dataset, you know, kind of time indexed, is gonna grow a lot faster. So anything that grows really fast because it's really more about changes in some other static dataset, that is an analytical workload. So knowing that is the case, I would say that from early on, establishing this pattern where you have both a transactional and an analytical database is valuable, and then basing your technology decisions on that and kind of anticipating that that is the case. I'm also seeing, increasingly, database vendors and database technologies anticipating that for users in the first place. So there are transactional databases building more and more foreign data wrappers for analytical databases and even almost helping their users detect when they hit some sort of scaling limits in the transactional database and saying, okay, move it to an analytical database, and we'll still give you the ability to query across both. And vice versa, analytical data stores build CDC, change data capture, capabilities to very quickly detect changes in transactional databases and onboard those workloads. So hopefully that helps. I would say just even knowing that transactional versus analytical workloads exist already helps a lot.
[00:23:36] Unknown:
And another interesting aspect of this overall question of which engine do I need, particularly in that divide between OLTP, or online transactional processing, and online analytical processing, is, do you need both? If so, how do you make them work better together? Transactional engines have long been the solid workhorse of application development. For a long time, they were even the engines used for data warehousing, in the early days of data warehousing before we got column stores and MPP databases. And now we do have column stores available, and we do have MPP databases for being able to parallelize that analytics.
What do you see as the major motivators for having that be a separate set of technologies, separate pieces of infrastructure,
[00:24:24] Unknown:
and some of the inefficiencies and complexities that are driven as a result of that. It's true. Right? The only thing I can think of is just that the size of analytical workloads grew, you know, exponentially, or grew to such a point where transactional databases became just not feasible for the type of analysis that people wanna do. And, also, the expectations of the type of applications you wanna build changed, because I think for a while, when it came to analytics, it was sufficient to have the kind of experience where you produce a report. Right? Like, you analyze something and you produce a report and it gets emailed to you every day or every week or even every month. So imagine kind of an internal workload that is analytics focused. That was just kind of how internal teams worked for a long time. And that, of course, would not work for any sort of, you know, SaaS application where an interactive experience is required. So I think the revolution actually started with the SaaS part. People wanted to build more interactive experiences on their websites, and that kind of introduced technologies, first, like Elasticsearch, you know, and many others that powered these applications. And now the question being asked by internal teams is why shouldn't we adopt the same for internal users? Why should they wait for, you know, a report? Or why should they have the kind of query that you run and then kind of go away and come back to in many minutes? Those questions are being asked. And so what we're seeing now is, I think, some of the things that have made some of these SaaS services successful, internal teams are asking themselves, why shouldn't our internal users have that experience? Because if they don't, they actually will go and try to consume those SaaS services. Right? And internal teams are seeing more and more demand for interactive dashboards, interactive applications. I would say with internal teams, where this started, at least in my experience, was on the financial side. The financial sector for a while really led in terms of just having high expectations for internal users. If you're a trader, at the end of the day, you need to have an interactive application that helps you make a decision about what bets to, you know, place the next day, and you can't wait for the following day. Like, you need that decision now. So any internal stakeholders where they needed to consume that data very quickly and interactively, I think this is what really introduced the need for more specialized databases and data stores for analytical workloads that could support these interactive use cases.
[00:26:42] Unknown:
Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and tool chains, even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro.
That's three free boards at dataengineeringpodcast.com/miro. One of the shortcomings that is introduced by virtue of splitting out the analytical engine, for its speed of analysis and computation, from the transactional store that is getting the data as it is generated is the need to either say, we're going to batch this, and this is how long you can expect to have data delayed when you're running this report, or you need to bring in something like change data capture or some other streaming technology to be able to feed the data directly over to the analytical system. And a third approach that I've seen applied in some cases is federation of queries, where this is where things like Trino and Presto come in. Yeah. I know ClickHouse has some support for things like foreign data wrappers. I'm wondering what you see as the overall trade-offs, and some of the ways that teams should be thinking about how best to make the analytical system work as closely as possible with the transactional store without introducing arbitrary breakage when network connections fail. Mhmm.
[00:28:27] Unknown:
Yeah. There are, like, several very interesting topics here. So on the change data capture side, I believe this needs to be, again, just a built in capability of analytical databases. ClickHouse handles it by, we have this concept of a materialized Postgres and a materialized MySQL engine, where we can create almost like a logical view of your MySQL or Postgres database and just, you know, query it as well as capture changes from these databases using these engines that basically act as our CDC. I believe it just needs to be built in, and vice versa, OLTP databases should have foreign data wrappers for the most popular analytical databases that they see in their ecosystem. But you mentioned object stores and kind of the data lake use case. This is another really interesting evolution that we're seeing. So, again, primarily on the internal analytics side, what we've seen is that cloud data warehouses like Snowflake, Redshift, and BigQuery have, of course, come to prominence in the past, you know, say, five years, and their big accomplishment was moving all of these on premise, more traditional data warehouse workloads from Teradata, Oracle, and so on into the cloud. And it's great, because now that these workloads are in a cloud environment, teams, and, again, primarily these might be internal teams working on internal analytical use cases, are asking themselves, well, does it make sense to keep these workloads in a monolithic data warehouse?
Or does it make sense, for instance, to put some of these workloads into a data lake and query them using different engines? I would say that what I'm seeing is really more of a trend toward unbundling these cloud data warehouses. Again, not every organization has bought into it yet, but we're definitely seeing that trend in some of the organizations that we work with, where they're saying, okay, now that we have this data in a more open environment, in a cloud provider of choice, we can start moving the pieces where they belong. And the way ClickHouse fits into it is it's becoming more like a real time engine to work on top of data lakes, as well as next to data lakes, and helping kind of that trend of unbundling what has become kind of a monolithic version of an on premise data warehouse, but in the cloud: the cloud data warehouse.
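For a sense of what that built-in CDC path can look like, here is a rough sketch using ClickHouse's MaterializedPostgreSQL database engine, which was experimental at the time of this episode; the host, database name, and credentials are placeholders.

```sql
-- May be required on versions where the engine is experimental.
SET allow_experimental_database_materialized_postgresql = 1;

-- Snapshot a Postgres database, then follow its logical replication
-- stream so the ClickHouse copy stays current.
CREATE DATABASE pg_replica
ENGINE = MaterializedPostgreSQL('pg-host:5432', 'appdb', 'replica_user', 'secret');

-- Replicated tables are then queryable like local ones, and can be
-- joined against native ClickHouse tables.
SELECT count() FROM pg_replica.orders;
```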
[00:30:42] Unknown:
Another element of database engines, and the ways that they fit into, in particular, analytical use cases, is that they're not the only operator in that space. There's typically a complex web of dependencies between different systems. Data is flowing in and flowing out for different use cases, and so it can be difficult to understand what is actually happening at any moment in time when I need to debug something. This brings in the question of data observability, and that is a whole other market. But from the perspective of somebody working with teams building database engines, what do you see as the role of the database itself in cooperating with and enabling the observability aspects from an analytical perspective, so that people who are operating these infrastructures can have more confidence that they're looking at the right things, that they understand what's going on, and that they can tune the workloads as needed?
[00:31:36] Unknown:
As you mentioned, I'm more on the side of a database vendor, like, working with those observability tools. So the first thing I will mention is just how important data observability tools are starting to become to stakeholders. It does seem like there's been an inflection where it's just an expectation. And this is in addition to other data management tooling that we see, you know, data versioning, data orchestration. So with that tooling, I would say we're seeing a movement where it starts to be used much earlier in the adoption of a data store, especially, again, for internal analytics when you've got many stakeholders, and they all need to understand what is the data catalog, what is data lineage, like, how are changes propagated.
Even for our own internal data warehouse team, you know, as you might imagine, our commercial focus is around our cloud offering, so our finance team just lives and dies by this MRR number, monthly recurring revenue. Well, this number gets generated from many sources of data, and any change, right, may affect how this number gets calculated. It's critical for us to understand if there's anything that occurs that may taint, like, how we view this number. We report it to the board, but, of course, we report it internally too. So companies have similar important metrics and data fields whose integrity they need to understand. And this is driving adoption of the tools that I've already mentioned, specifically around data orchestration, data versioning, and data observability.
So what do we do as the database vendor, right, to enable these tools? And there are tools that, you know, integrate with ClickHouse. Some of them work, by the way, on top of other tools. So, for instance, dbt is a pretty big player in this space, and some of them work very natively on top of that. What they ask of us is a few things. One is really good, kind of, self observability. So every time, you know, anything changes in a database, it needs to be observable. And within ClickHouse, the way it's accomplished is, you know, we're a database. Where would we put data about ourselves? We'll put it in ourselves. Like, when you spin up ClickHouse, it has these system tables, as we call them. Everything is in there. Any DDL statement that you run, any, like, log about anything that happens is in our internal system tables. You can query it. It's very easy. It's right there. And we just happen to be also very efficient at storing them, so it's not a big overhead on the database itself. But that is what makes it very easy for data observability partners to integrate with us. There's nothing we have to add for them. All the data is there, and they can query it on day one. And then the second part, I would say, is the ability to go deeper if need be. So there needs to be some ability to turn on more advanced tracing and profiling if something goes wrong.
This is where, you know, ClickHouse and other vendors are starting to build in open standards based ways to self monitor more internals of the database. So OpenTelemetry is an increasingly popular way of monitoring, specifically, say, traces within a database product. That is something you would turn on optionally and, you know, use only if needed.
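As a small illustration of that self-observability point, these are the kinds of introspection queries the system tables support out of the box; exact column names can vary between ClickHouse versions, so treat this as a sketch.

```sql
-- Recent queries and their cost, recorded by the database itself.
SELECT event_time, query_duration_ms, read_rows, query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;

-- Storage footprint per table, useful for spotting runaway workloads.
SELECT
    table,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS disk_size
FROM system.parts
WHERE active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
```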
[00:34:41] Unknown:
And then, from somebody who's working on the product side, dealing with people who are trying to understand how a given database engine fits within their stack and within their use case, what are some of the elements of customer education that you find yourself coming back to the most, or areas of misunderstanding or misconceptions that people have going into the tool selection process?
[00:35:05] Unknown:
So that's a really interesting question. And this one may surprise you a little bit. Elasticsearch was a little bit different, because we were a search technology and our terminology was very search oriented, and people came to Elasticsearch with an expectation that it was a search engine first and everything else second. With ClickHouse, you know, we're mostly, like, ANSI SQL compatible. And from a syntax perspective, for the most part, you can kind of take your queries and just port them over. We do have some SQL extensions for analytics. That's extra. But if you're coming over from the transactional world, you might look at ClickHouse and say, ah, you know, I'll just take my workloads there and everything's fine. But where things kind of break down a little bit, and this is something to pay attention to when adopting any new database, is that in the end, the devil's in the details when it comes to data organization and semantics. So I'll give you one example. We have a concept of a primary key. We call this a primary key in ClickHouse. What it means in ClickHouse is actually the key by which we sort data. And why is it important? It's because for analytical workloads, how you've sorted the records, based on which key the data is organized in order versus not, has a huge effect on how fast you can query columns back, like, for specific types of aggregation. So for analytical databases like ClickHouse, the sorting order of records on disk is very important. So when you create a table, we say you should use a primary key, and that primary key should be something by which you will query.
And that's what we say. But of course, in the transactional world, a primary key means something completely different. Right? It's all about constraints, you know, and so users get very confused. They say, you look like SQL, and you walk like SQL, but you have this primary key that means something completely different. So I guess for folks building databases, my advice is don't take terms that mean something in very popular databases and make them mean something else entirely in your database. That's going to be confusing. For us, I think it's too late to unroll that one. But if, like, I was a creator of ClickHouse back in the day, I probably would have made a different decision on the name of, like, primary key. And there are a few other small examples like this.
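A short, hypothetical DDL example may make the distinction concrete: in ClickHouse, the primary key of a MergeTree table defines the on-disk sort order that accelerates scans, and it enforces no uniqueness. The table and columns here are invented for illustration.

```sql
-- "Primary key" here means sort order, not a uniqueness constraint.
-- Rows are stored sorted by (tenant_id, event_time), so queries that
-- filter on those columns can skip most of the data.
CREATE TABLE events
(
    tenant_id  UInt32,
    event_time DateTime,
    payload    String
)
ENGINE = MergeTree
ORDER BY (tenant_id, event_time);

-- Duplicate "keys" are perfectly legal; nothing is rejected.
INSERT INTO events VALUES
    (1, now(), 'first'),
    (1, now(), 'second');
```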
[00:37:04] Unknown:
Yeah. Naming things is hard.
[00:37:06] Unknown:
Always very hard. But back to education, like, how do we educate users? So, yes, we educate them on some of these nuances. But actually, yeah, a lot of the education goes into them understanding that, ultimately, when you're adopting an analytical database, there's some thought that has to happen. Some thought has to go into how you actually organize the workload. Because do you really just wanna take your, like, highly relational workload from the transactional world into analytics? Most likely not. You could. It would work. But, actually, this is not how you get the most out of an analytical database. You typically will do a little bit more flattening of the data. Not completely, like, ClickHouse supports joins, but to get the most out of your use case, you may do a little bit more, again, processing of that data before querying it. And this is where ClickHouse has a concept of materialized views. We can actually take highly normalized data and then help you, almost like using ELT, and this is where dbt becomes important, to transform it into something you would actually want to query. So that is kind of built in, but you have to understand that you have to do that, and that's where a lot of the education happens.
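As a rough sketch of that flattening step, here is what an incrementally maintained rollup might look like with a ClickHouse materialized view; the orders source table and all names are assumptions for illustration, not an example from the episode.

```sql
-- Maintain a flattened, query-friendly rollup as raw rows arrive,
-- instead of joining normalized tables at query time.
CREATE MATERIALIZED VIEW daily_orders_mv
ENGINE = SummingMergeTree
ORDER BY (customer_id, day)
POPULATE AS
SELECT
    customer_id,
    toDate(order_time) AS day,
    count() AS orders,
    sum(amount) AS revenue
FROM orders
GROUP BY customer_id, day;

-- Dashboards then query the small rollup rather than raw orders;
-- re-aggregating here accounts for parts not yet merged.
SELECT day, sum(revenue) FROM daily_orders_mv GROUP BY day ORDER BY day;
```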
[00:38:08] Unknown:
And in your experience of working in this space, working with end users now at ClickHouse, also with Elastic, what are some of the most interesting or innovative or unexpected ways that you've seen people applying database technologies, whether specific to the tools that you worked on or just more generally?
[00:38:27] Unknown:
So with Elastic, there was actually a very interesting use case. I remember it struck me. At our first user conference for Elasticsearch, we had somebody from NASA present on the Mars rover use case, and that just blew my mind. Right? I mean, like, the telemetry that was created on Mars got sent to Earth and put into Elasticsearch, and that was just like that. It was just very surprising to me that a search technology, or an analytical technology, would get adopted in that context. It shouldn't surprise me. And in the end, from a technical perspective, that workload probably wasn't even the most challenging, because you don't have that much bandwidth to transmit that much data. But it was just very cool and very exotic. Let's just put it that way. You know, for ClickHouse, what ultimately blows my mind is just the scale at which this product can run. As I mentioned, it was developed for an Internet scale, kind of web analytics use case. It can ingest billions and billions of rows. Just today, we published a case study with Ahrefs, which is, again, another vendor that basically crawls the whole Internet and stores that data in ClickHouse. And it's just amazing the scale at which you can run. But it doesn't mean that you don't need it at a small scale. You still do. Right? And there are still these inflection points where, you know, even for a much smaller dataset, you need an analytical database just based on the types of queries and interactive experiences that you can run.
[00:39:41] Unknown:
And in your own experience of working in this space, what are the most interesting or unexpected or challenging lessons that you've learned?
[00:39:47] Unknown:
Unexpected lessons for me. I think the main one, maybe, I mentioned in the beginning, which was that when you transition from commercial to open source databases, as a product person, you do have to think very differently. And, like, how you leverage the community is something that you shouldn't underestimate; it's a huge, huge value. The community is not just a free kind of distribution channel, a bunch of free users. It's a big channel, first of all, for innovation. You just mentioned interesting use cases. A lot of these users just come from downloading the product. Somebody just has an idea, and they just wanna download a product and use it for free to prove out their idea. They don't have any budget. Often, it's a passion project. So these types of community users are just gold, and this is something that I love about working with open source products, that these types of individuals and their ideas get nurtured by the fact that the technology is free at scale. Like, this is a difference from a freemium product. A freemium product typically is sort of scale limited, whereas with the open source distribution model in databases, which, by the way, I think has won out, I think it's pretty clear, the typical distribution is an at scale solution you can run. So that's one thing that was kind of surprising to me. The second thing was actually at Elastic, when we got to be an at scale company, we had this kind of fork in the road in terms of how do we grow. Like, from a platform perspective, we were a really popular platform for search and certain types of analytics. But how do we grow as a company? And the direction that the company ultimately took was to add more vertical solutions based on this open platform. And so if you look at Elastic's website right now, they talk about observability and security and what they call enterprise search. And the way you do this kind of growth is you actually need to build out a solution based on this database.
You can try to build it organically, but typically, actually, you pursue an acquisition strategy. And what was surprising to me was that with an open source product, when you do M&A, when you look at companies that build products and solutions, you can actually try to find companies that have already built a product based on your open database. And then the integration costs are very low, because you just bring in this team. They already know your technology. They've already built a solution on your technology stack, and so then the integration play is much faster. And that really helped us out at Elastic.
[00:41:58] Unknown:
And as you continue to iterate on the product that you're involved with, and as you keep an eye on the broader database market from a competitive standpoint and from an educational standpoint, what are some of the predictions that you have for the future trends in the database market?
[00:42:15] Unknown:
Okay. So a couple of things. We talked about OLAP versus OLTP, and my prediction is that OLAP does continue to grow in prominence. Still today, I think that most users start with OLTP and then, almost through trial and error, arrive at needing OLAP. I do think that in the course of a few years, we'll see more of a pattern where you just simply start with both. So that's one of my predictions. I don't know that it's gonna happen this year, but I do believe the amount of investment that's happening in the OLAP space, and by the way, right now, usually folks call it, like, the real time analytics space, I think it's going to lead to a lot more awareness.
Again, that's not only specifically ClickHouse. There are so many other technologies in this space. But I think, generally, the space of OLAP and real time analytics is going to lead to developers starting with both. They're gonna start with OLTP and OLAP, and this is how they build out their products. So that's number one. My second prediction, more on the internal team side, is this cloud data warehouse unbundling trend continues. I do think that data lakes will continue to rise in prominence, just because it makes sense. Like, there are so many things that make sense about a data lake. You have one kind of object store that's powering, you know, many use cases, and you can leverage different open technologies on top of it. Just that pattern makes sense to me. This is why it's important for us to invest in it. It doesn't mean that you won't have some specialized storage, because in the end, even with ClickHouse, we work pretty fast on top of object stores, say, with Parquet or Iceberg formats. But in the end, our native format is even faster. So for some workloads, you may still leverage a specialized store. But for other use cases, you probably don't want to. Like, if you have a use case where you want both a data scientist and an app to have access to the same data, why would you duplicate it? You'll want to just keep it in one place and have two kind of analytical engines pointing to it. So I think that trend is going to continue. And finally, on the vector stores and Gen AI question we talked about, I mean, something's going to happen. I don't think the hype is gonna completely flame out, and we're just all gonna say, like, this was nothing. I think it's gonna lead to new applications. I don't know that it's gonna be quite as disruptive as some sometimes say. I think in the end, it comes back to, like, what experiences do we wanna build? So, again, say I'm building a product for marketing professionals.
Okay. Like, I'm going to leverage large language models to, again, incorporate more aspects of natural language into kind of my suggestions. But I don't think it's going to be everything. I think there's still gonna be a lot of domain knowledge that remains outside of a large language model, and I think that it's gonna be kind of a blend of approaches. But, again, that is the third thing I would watch, obviously: what's happening with Gen AI use cases.
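Tying back to the data lake point in that answer, here is a hedged sketch of querying Parquet files on object storage directly from ClickHouse with the s3 table function; the bucket URL and column names are placeholders.

```sql
-- Query open-format files in place on object storage, without
-- importing them into ClickHouse's native format first.
SELECT
    toStartOfMonth(event_date) AS month,
    count() AS events
FROM s3('https://example-bucket.s3.amazonaws.com/events/*.parquet', 'Parquet')
GROUP BY month
ORDER BY month;
```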
[00:44:56] Unknown:
So for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:11] Unknown:
Biggest gap. We talked about data observability, and I actually think this is where, right now, while more and more use cases have data observability tools built in, I think one of the bigger gaps is just the maturity of these solutions. I do think that they need to get even more mature and deliver even more value. It's clear that the promise of these tools is pretty great, but I think it's early days for this tooling. And there are a few players, but I think that there's still a lot more that these tools can do. And flipping it more to the side of database vendors, I think database vendors need to have more built in observability of the database itself, so it's easier to build these tools across offerings. So that's, I would say, one of the bigger gaps that I would note.
[00:45:58] Unknown:
Well, thank you very much for taking the time today to join me and share your perspective and experience and expertise on database product development and ways to be thinking about the incorporation of databases into applications and infrastructure. It's definitely a very interesting problem domain, and it's great to see the trajectory of ClickHouse. So I appreciate the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thank you for having me, Tobias. Have a good day.
[00:46:31] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Tanya Bragin: Introduction and Background
Exploring the Database Market and Technology Trends
Categories and Use Cases of Databases
Trends in Analytical Databases and Real-Time Applications
Vector Databases and Gen AI Applications
Challenges and Solutions in Data Infrastructure
Data Observability and Database Integration
Customer Education and Misconceptions
Innovative Use Cases and Lessons Learned
Future Trends in the Database Market
Biggest Gaps in Data Management Tooling
Closing Remarks and Additional Resources