Summary
The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Dremio is and the story behind it?
- What are some of the notable changes in the Dremio product and related ecosystem over the past ~4 years?
- How has the advent of the lakehouse paradigm influenced the product direction?
- What are the main benefits that a lakehouse design offers to a data platform?
- What are some of the architectural patterns that are only possible with a lakehouse?
- What is the distinction you make between a lakehouse and an open lakehouse?
- What are some of the unique features that Dremio offers for lakehouse implementations?
- What are some of the investments that Dremio has made to the broader open source/open lakehouse ecosystem?
- How are those projects/investments being used in the commercial offering?
- What is the purchase/usage model that customers expect for lakehouse implementations?
- How have those expectations shifted since the first iterations of Dremio?
- Dremio has its ancestry in the Drill project. How has that history influenced the capabilities (e.g. integrations, scalability, deployment models, etc.) and evolution of Dremio compared to systems like Trino/Presto and Spark SQL?
- What are the most interesting, innovative, or unexpected ways that you have seen Dremio used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dremio?
- When is Dremio the wrong choice?
- What do you have planned for the future of Dremio?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Dremio
- Dremio Sonar
- Dremio Arctic
- DML == Data Manipulation Language
- Spark
- Data Lake
- Trino
- Presto
- Dremio Data Reflections
- Tableau
- Delta Lake
- Apache Impala
- Apache Arrow
- DuckDB
- Google BigLake
- Project Nessie
- Apache Iceberg
- Hive Metastore
- AWS Glue Catalog
- Dremel
- Apache Drill
- Arrow Gandiva
- dbt
- Airbyte
- Singer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
You wake up to a Slack message from your CEO who's upset because the company's revenue dashboard is broken. You're told to fix it before this morning's board meeting, which is just minutes away. Enter Metaplane, the industry's only self serve data observability tool. In just a few clicks, you identify the issue's root cause, conduct an impact analysis, and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14 day free trial. And if you mention the podcast, you get a free "In Data We Trust World Tour" t-shirt. Your host is Tobias Macey. And today, I'm interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse. So, Jason, can you start by introducing yourself?
[00:01:46] Unknown:
Yeah. Thanks, Tobias. Yeah. My name is Jason Hughes, and I'm a director of product management here at Dremio, where I'm running the developer advocacy function. I've been at Dremio for just over 4 years. And before that, a few different roles at Teradata, and before that, building data-driven applications, like a custom CRM for auto dealers in Michigan.
[00:02:05] Unknown:
And do you remember how you first got started working in data?
[00:02:08] Unknown:
Yeah. It actually was at that auto dealer. The off-the-shelf CRMs weren't meeting their needs, and the CEOs wanted to get far more control and customization over it, so they decided to build it custom. They brought me in to build that from the ground up. So after building the application, the questions immediately became, alright, well, what about the data? Right? So building all of that for a very non-tech-savvy group was interesting. And I saw all the problems firsthand, probably made all the mistakes myself. And then, yeah, ever since then, I just always found it really interesting, the stuff that you can actually do with data.
Right? I think I saw at small scale what can go well and what can go poorly, but also the impact that it can actually have on folks. And then when I went to Teradata after that, I really saw what it can do at scale for large organizations as well.
[00:02:58] Unknown:
So in terms of the Dremio project, I'm wondering if you can give your pitch about what it is and maybe some of the ways that that pitch has changed since you first joined 4 plus years ago?
[00:03:09] Unknown:
Dremio currently — we really call ourselves the easy and open lakehouse platform, to really enable you to run data warehousing workloads directly on your data lake, but also with some additional benefits that traditional data warehouses and even cloud data warehouses really don't have. But also around really being able to leverage your data lake as a true lakehouse. So that's also where, traditionally, Dremio Sonar comes in — and I know you've talked to Tomer about the new product as well, called Dremio Arctic, which really manages the kind of storage engine aspects of a data warehouse, but directly on the data lake, in a fully open way that anybody can leverage. As far as how we've changed over the past, yeah, 4 plus years — certainly quite a bit. I'd say we started out as kind of trying to be, you know, we called it a data-as-a-service platform, which is really trying to bridge the gap between the end users and disparate data sources. And that worked fairly well for some things, but what we found was that we were very hamstrung by the source systems themselves. So, like, if you issue a query against Oracle and Postgres and you need to join that data, well, you can only run as fast as the slowest one of those machines, and a lot of times those are bogged down. So what we really saw, and what companies are really trying to get towards, is the centralization on the data lake. And once we really saw that, and saw all the advantages and the value that we could bring to organizations when we are running directly on the data lake — which originally was Hadoop, but now, and especially for a while now, is cloud data lakes like, you know, S3 — we were really able to provide them a lot more value when their data is in that data lake. So really, since then, it's been a march down that route, for probably 3 years now. So we've been heavily invested very much in the data lake, and now, you know, really bringing data warehouse capabilities directly to the data lake in terms of the lakehouse.
So now really being able to provide all of that directly in the data lake. So we've been focused on that now for quite a while.
[00:04:53] Unknown:
As far as the advent of the lakehouse architecture as a paradigm, you mentioned that your sort of early days were just, we wanna be able to federate queries across different data sources. And I'm curious how the kind of introduction of lake house as a paradigm changed the way that you thought about the Dremio product and the direction that you were trying to push it into and the kind of markets and users that you were trying to sell to? For sure. So I would say that at a high level,
[00:05:23] Unknown:
while that has definitely changed some things, it hasn't changed too much, I think, when you look at it from a value perspective. The way we've always viewed Dremio is really being able to provide the user experience that you have with data in your consumer life. Right? So using your smartphone, Google, weather, all that kind of stuff — it's just at your fingertips. Being able to provide that on corporate data. So the way that we saw to make that easiest before was, alright, we got data everywhere, let's provide that bridging of the gap. Right? But really, now what we're seeing is all of the advantages, the technical advantages, to really being able to do that in a data lake — and really now a lakehouse — are now far outweighing that. So I would say, like, the mission hasn't changed a whole lot, but the way that we get there certainly has.
And so I would say that it certainly has changed it from a technical and road map perspective around where we focus. Right? Like, what we invest in technologically, and therefore also the kind of different organizations, and where they are in their journey, that we're able to meet them. But I would say at a high level, it's changed certainly some. Like, for instance, technically — like, we started doing, like, DML directly on the data lake. Right? Like, capabilities like that. But as far as our overall vision and the value that we ultimately bring to the users, it hasn't changed a lot. I think it's just a much, much better way to actually deliver on that.
[00:06:43] Unknown:
In terms of that lake house pattern and paradigm, I'm curious what you see as some of the main benefits that that design approach offers to a data platform, particularly as compared to just federated queries or, you know, a cloud data warehouse or some of the other architectural patterns that teams might be relying on?
[00:07:03] Unknown:
Yeah. For sure. So I think each of those has different trade-offs, I would say, or achieves them differently. The data virtualization, data-as-a-service platform, that kind of approach — it makes the self-service and the actual delivering of value, like, much more realistic. Like, it actually makes it much easier. Before, again, you'd be beholden to all these different sources and their workloads — which, especially if you're connecting to a big EDW like Teradata, meant you were going pretty slow because there was far too much load on that system almost always. So I'd say it makes it much more realistic there.
I would say the other area, especially on both the data virtualization and really the cloud data warehouse or data warehouse approach, is that it really makes it much better from a self-service standpoint. But also, one of the big things is that it really eliminates or significantly reduces the amount of data copies that are around. So even in, you know, a data warehouse, usually you're landing that data somewhere like a cloud data lake, and then you're usually loading that data into it. So now you already have 2 copies. But then once you're in the data warehouse, generally you're creating even more copies within that, whether for, you know, semantic modeling purposes or, more commonly, for performance reasons — whether you can't run it directly on the raw data because of cost on something like Snowflake, or capacity on something more on-prem like Teradata.
You know, your users think in terms of, like, sales. Right? Like, individual sales, and I wanna sum up all of them and I wanna group it by month. Right? Well, in these platforms, what you almost always end up doing is creating a sales monthly, a sales weekly, a sales daily. Right? So you have all of these different ones and you're like, alright, which one do I actually pick from? And also, if you're picking, let's say, monthly, but then you actually wanna join that with items, you know, or something like that — well, you can't do that at the aggregation level. So now you gotta go back down to the raw level, and then you gotta put in a ticket with data engineering, and they're swamped, so that takes forever. Right? And this is just the traditional architecture and process that we've been doing for a very long time. So I think the lakehouse architecture, and especially when you bring Dremio into that lakehouse architecture, really enables you to provide this kind of self-service of capabilities without needing to create all of these data copies, and actually being able to deliver that performance directly on the data.
[00:09:08] Unknown:
In terms of the implementation of lakehouse patterns, there are a few different technologies that are driving at that sort of focus area, most notably things like Spark and Databricks, and then projects like Trino and Presto. And I'm wondering if you can talk to some of the ways that the Dremio architecture or implementation or feature set differentiates it from some of those other platforms, and some of the ways that the Dremio kind of interface influences the way that you think about the broader platform development and platform architecture around this lakehouse pattern.
[00:09:47] Unknown:
Glad that you brought that up. Because one of the things, as soon as I stopped talking, I was like, oh, yeah — one other big piece to that, to the last question, is that the lakehouse has really enabled multi-engine access directly on the same data. Right? So you're not exporting your data. So exactly to that point — to answer your question, I would say that Dremio Sonar especially, one of the big things that it offers is truly enabling, like, interactivity for all your SQL workloads on the lakehouse. Right? So, like, Spark, for instance, is okay at some ad hoc SQL if you can wait a lot. I've worked with a bunch of different customers in a lot of different industries. I have never heard anyone be successful with Spark SQL for, like, all of your SQL workloads — like interactivity and ad hoc and all that stuff — at any sort of scale, unless the user is willing to wait a while. Like, it just wasn't built for that. Right? It was built to replace MapReduce.
I would say Presto and Trino, one of the areas is, like, they're pretty good at, you know, things like ad hoc SQL. They're okay at that stuff. I would say that the main area where Dremio really excels is trying to tackle all of your, like, SQL user-facing workloads. Right? And generally, what that workflow is like is, alright, you start out doing some exploratory analytics. Right? So we are very good at ad hoc and exploratory analytics. In fact, there are also benchmarks out there, which I won't get into here, showing that we're very good at ad hoc and exploratory analytics, which is usually the first step. Right? And that's very iterative. Right? You run a query, you kinda get the results back, and you need to dive into this, or maybe you need to join it here, or you need to filter it differently — and usually, maybe all 3 of those, right, many times. But, eventually, you get to the point where you get more and more tailored, and now you're ready to kind of productionalize whatever this thing is. Right? It's a report. It's a dashboard. It's a user-facing app. It's whatever. Right? So then, actually getting that productionalized in Dremio is super easy. You basically just look at the logic and you can build up the chain of views, and then you can use a technology, a capability that we call data reflections, which you can think about as materialized views, but they're transparently substituted.
So you build up your logic in the logical kind of way. But then when you need to physicalize that, because let's say you're working on a 100 terabyte table, right, at some point, you cannot, like, just based on physics, scan that fast enough to provide a sub second response. Right? So what we do behind the scenes is you just say, hey. This is the query I wanna optimize. This is the kind of workloads. Cool. We go ahead and behind the scenes build these materialized views. But then the key is that that's all behind the scenes. So then your users or the queries and the reports that you've just developed don't change at all. You just run your queries and now instead of taking 10 seconds or whatever it may be, it now takes sub second with 0 application changes and 0, like, user changes.
And the other thing is that if any other user comes in and issues a query somewhat like that, just based on, like, relational algebra, we'll actually reuse the same thing as well. I think that's one of the biggest differentiators there — really, if you look at it not just from, like, a, hey, I wanna run this query and I want it to go fast. It's like, well, yeah, that's part of it. But really looking at the user's workflow, of what their experience is gonna be like with the platform — that's one thing that's always excited me about Dremio, you know, over the past, again, 4 and a half or whatever I'm at years here.
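To make the transparent-substitution idea concrete, here is a toy Python sketch of the matching concept described above. It is not Dremio's planner — the real matching is far more sophisticated and based on relational algebra — and every dataset, column, and reflection name here is invented for illustration. The point is that the user keeps writing queries against the logical dataset, and a covering materialization is used when one exists:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AggReflection:
    """A materialized aggregate kept behind the scenes (names are illustrative)."""
    name: str
    source: str
    dimensions: frozenset
    measures: frozenset

@dataclass(frozen=True)
class Query:
    """A user query expressed against the logical dataset, never the reflection."""
    source: str
    group_by: frozenset
    aggregates: frozenset

def plan(query: Query, reflections: list[AggReflection]) -> str:
    """Pick a covering reflection if one exists; otherwise scan the raw table.

    The user's SQL never changes -- the substitution happens in the planner.
    """
    for r in reflections:
        covers = (
            r.source == query.source
            and query.group_by <= r.dimensions
            and query.aggregates <= r.measures
        )
        if covers:
            return f"scan materialization '{r.name}' (roll up to {sorted(query.group_by)})"
    return f"full scan of '{query.source}'"

reflections = [
    AggReflection(
        name="sales_agg_reflection",
        source="sales",
        dimensions=frozenset({"order_month", "region", "item_id"}),
        measures=frozenset({"sum(amount)", "count(*)"}),
    )
]

# Monthly revenue by region: covered, so it is served from the materialization.
print(plan(Query("sales", frozenset({"order_month", "region"}),
                 frozenset({"sum(amount)"})), reflections))

# A query needing a column the reflection lacks falls back to the raw table
# until someone adds that dimension to the reflection.
print(plan(Query("sales", frozenset({"customer_id"}),
                 frozenset({"sum(amount)"})), reflections))
```

The same coverage check is also why a second user's similar query can reuse the existing materialization without knowing it exists.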
[00:13:03] Unknown:
Digging into that interactive workflow, I'm wondering if you can talk to some of the ways that that translates into some of the kind of tooling and supporting systems that people are likely to integrate with and build on top of Dremio as compared to maybe the way that they would interact with a Trino lakehouse or a Databricks lakehouse?
[00:13:29] Unknown:
For sure. So there's definitely some differences. So I'd say that there's some similarities, some differences. I would say that in general, for ad hoc, you know, exact tooling — like, you might use, for instance, Databricks' UI since they bought Redash. Right? You could also use, like, DBeaver or DbVisualizer on any of these. Right? So you could use that and you could run SQL. And depending on what your SLAs are, how performant you need it to be, that aspect is fairly similar. Right? It looks like a database, effectively. But I would say the major difference is when you look at getting more operational and more productionalized, app-like analytics.
Right? So, like, for instance, if you were to try to build a dashboard on either of those, you're generally gonna start building it, like, in your SQL tool, because if you try to use Tableau on that, it's gonna be too slow. So when you're doing your iterative, kinda exploratory analytics, building this dashboard, you're generally gonna use their query editor or something like, you know, DbVisualizer or DataGrip, something. And then you might run your queries, but then periodically, you likely need to physicalize that because it's getting too slow. Or certainly at the end, you're gonna need to physicalize that to get that really truly interactive response time for your users. So then it's, alright, well, now I need to build an ETL pipeline.
Right? Alright. So I have my logic and I need to go build this. Well, alright. I need to manage scheduling. I need to manage failure. I need to manage swapping of the new ones. I need to manage Bluegreen deployments if I really want to provide that. That would be like CTAS, for instance, versus inserts. So now you need another tool to do that. And so now you need to worry about, alright, these ETL pipelines that I'm building. And not to mention that this is for 1 application, you already likely have 100, if not 1, 000 of these depending on the size of your organization already running, possibly duplicating 90% of what you're trying to do. So then you finally get that scheduled and then you point Tableau at it and either you've got to the point where the physicalized version is small enough where you can do it Or what we often see is that usually people are still pulling that into a Tableau extract. So now you still are bringing that in whether it's on desktop or server and then your Tableau queries actually run off that extract.
I would say that that's probably the most common architecture we see for something like a dashboard. Versus in Dremio, where you would do something like that — and, again, you can put Tableau directly against Dremio, and it's pretty performant. But at a certain scale, right, you may want to do it yourself in SQL. You can certainly do that. And you go through somewhat similar steps of building the logic. But, really, the key is that in Dremio, you're focused on building the logical logic. Right? You're actually focused less on the actual physical nature. Because if at any point you do need to physicalize it — let's go to the example that we were talking about before, where it's just at the end — you build up your logic, and then what you can do is look at that job that it ran, and Dremio will actually tell you, hey, there's actually already this optimization that exists that pretty much matches what you were trying to do. It just needs one more column.
Right? Like, this one column isn't included as a dimension. Like, okay, cool, I can go change that. Obviously, it's permission based. You can build process around this — a lot of our customers do. You can just go add that one column to what we call the reflection, the materialized view, and then you click okay, you wait for that thing to build, and you're done. So then you can just go in Tableau and issue all those queries, and they're all sub second. And now you have that optimization decoupled from your BI tool as well. So if another user wants to come in — because especially in organizations at scale, pretty rarely does the entire organization use a single BI tool. Right? And if they're using a single BI tool, they're not using a single SQL tool, right, as far as interactive, like, writing their own queries. So that's the other advantage here: your optimizations are, a, logical. So you just don't need to know which, you know, sales, or sales monthly, sales weekly, sales whatever, or whatever really custom aggregation table they have to use. They can just operate with a logical data model. Right? They can just interact with sales, and they can join that with items, and then they can aggregate it, and they get the performance options. So they can do that from any tool, which also means that any BI tool leverages that performance without even needing to duplicate this work. And it means that in the future, you can change BI tools.
You're not locked into these BI tools at all. You can use multiple today or tomorrow.
[00:17:38] Unknown:
So that's sounding a lot like the work that's being done on tools like Metriql or Transform, trying to introduce the metrics layer on top of the warehouse to be able to have that kind of semantic representation so that it can span across the different BI tools. And I'm curious what you see as the juxtaposition
[00:17:57] Unknown:
of what you've built into Dremio versus some of the ways that those tools are representing that idea. I think at a high level, what they call a metrics layer, I think we've been doing for 4 years. No one just called it that. A lot of our customers, like, buy us specifically for what we used to call the semantic layer. Right? So building the business rules into it. And I guess you could split the business rules versus the KPIs, but at a certain level, how I generally view it is that if you can put it in the engine — obviously, I'm biased — but if you can put it in the centralized access layer, great. And these tools might work out. Right? If you can create this decoupled version where they have integration with everything, great. That may work pretty well. But it's gotta be aware of a lot of stuff. Because it's not just the KPIs and these physical tables. It's, alright, well, what if you wanted to do something like access to a logical data model, so in something like Dremio? Well, now you need those mappings as well. Right? So it sits at that junction where, I think, if it can work out, great. I'd be perfectly happy with that. But I don't see that happening very soon, at least.
At least, I don't see it happening well soon. I think for small POCs and small kind of isolated things, it can work well, as with how most new tech starts out. Right? But I think enterprise-wide is where you hit issues. And so, for instance, that's basically what we've been talking internally about calling, you know, the semantic layer, because everyone has different terms for it. But that is one thing that people do build in Dremio — what those companies would call a metrics layer. So there's a couple advantages to that, mainly that you have centralized that access, now you have centralized governance, and you have this performant access, but you can still access that from any tool. You don't have to just use Dremio. Like, we just had one of our customers actually present to us Friday about how they use Spark and leverage the Dremio kind of, like, blessed, you know, business-level datasets and integrate that into their machine learning platforms on Spark. So they pull in parallel from Spark — they pull that from Dremio — and they still get the performance that they need. So they're able to pull in parallel, but they're still able to leverage the core business definitions like you would with a metrics layer, without having to recreate that. So people are already doing that today with Dremio.
[00:20:13] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. In terms of the kind of lakehouse terminology — at the beginning, I mentioned that Dremio is focused on supporting the open lakehouse. And so I'm curious if you can draw some distinctions about the nomenclature there. Like, what differentiates a lakehouse as a pattern from an open lakehouse as an implementation?
[00:21:05] Unknown:
This is a very topical one. And so I would say the distinction there is, if you really look at what people wanted to get away from with the data warehouse. Right? They wanted to get away from this closed system where it always ended up being expensive — even if it wasn't at the beginning, we see it time and time again, it ends up getting there just based on usage — and that was fairly limited in what workloads you could actually run. Right? You were beholden to that vendor, or you were exporting all of your data, or a lot of it, out after it had been transformed, and then you end up with all those data drift problems. Like, all of these problems that we've been dealing with for the past, you know, 20, 30 years, whatever you wanna call it. With the lakehouse, a lot of those things have gone away. Right? You've really been able to run multiple different workloads on the same data, which is great. You've really been able to do this scalably and cost efficiently. There's not really a cheaper storage out there right now than cloud object storage. But what I would say is that how Databricks ended up defining the lakehouse was good at the time. Right? We were talking about it, and I think it was a good shift. And, like, it was something Dremio was already kinda doing, but it wasn't really called that. But I would say now it's really evolved into this true kind of architecture that you can run really all your business analytical workloads on. But I would say the distinction there is that one of the things that Databricks and their lakehouse hasn't really gotten away from is that kind of being tied to one vendor.
And that's really where I think the distinction is that we make between the open lakehouse and the lakehouse — with Delta Lake, that's really where Databricks is owning that term lakehouse. We're always trying to make it distinguishable without trying to confuse more people by being, like, lakehouse but, you know, open or something. So I should come up with a new term. But Cloudera is also talking about the term, like, it's starting to get traction. It's basically that you're not beholden to a single vendor. It's truly vendor agnostic. So regardless of what that vendor decides to do tomorrow, basically, the community drives the actual direction of the project. And so if the community wants to take it, not necessarily a different way, but in an additional way, for instance, there's nothing stopping anyone — like, no one is going to stop them from doing that. Versus with Delta Lake, that's really where Databricks fully controls that project. They said they open sourced it a while ago, and then, you know, what, 3 months ago, they were like, no, no, no, but for real this time, we're open source again. I'm like, oh, I thought it already was.
So, and we've actually looked at the commits. We've done the analysis on, like, the GitHub, you know, repo and everything. And, like, at least 80% of all commits are by Databricks. I think it's probably 85, because there's, like, 10% that's unknown, and at least some of that, if not more, is Databricks. So, like, it's clear they can completely control the development of the project. But also, the equivalent of an Apache project management committee — for Delta Lake, the Linux Foundation's TSC, the technical steering committee — is all Databricks people too. Even though it's open source, they completely control the direction. So I think it's kinda open source in the way that, like, Impala was open source — how, like, technically it is, but not really in the ways that matter. So that's really the distinction we're drawing. I've heard from multiple prospects and customers — they do view Databricks as just another lock-in. It's more open, don't get me wrong. But they view it as, Delta Lake wants to be everything for everyone — right, you can see that clearly in what they do, and I don't think anyone can argue that — but they even view it as, yeah, it's just another way to lock me in, like I had with Teradata, like I just had with Snowflake. Right? And they're really trying to get away from that, and that's really one of the aspects of the lakehouse, or Databricks' version of the lakehouse, that hasn't really gotten away from the traditional definition of a data warehouse. So that's really what our distinction is. And, again, other people are picking this up now — Cloudera and others — that real distinction of the open lakehouse versus the lakehouse.
[00:24:52] Unknown:
In terms of the work that Dremio is doing, obviously, there's a commercial project and a commercial layer. I know you recently introduced a cloud service. But in terms of that kind of open aspect and the open source foundational pieces, what are some of the investments that Dremio has made to that broader ecosystem?
[00:25:11] Unknown:
Quite a few, really — starting all the way back when we first started Dremio. So, you know, we co-created Apache Arrow. It was actually originally, like, Dremio's internal memory format, but then we released that, not just as open source, but as a full-on Apache project. That's now just taken off wildly, even beyond what we expected or hoped — really, a lot faster than we expected or hoped. So that's been really great to see. Now it's like the de facto standard. You see any new tool that's coming in, like DuckDB or anything like that — even Google's BigLake, they're standardizing on Arrow too, which is pretty cool. Even tools like Snowflake are leveraging it for their data. So that's been really cool to see. So we still contribute heavily to that — like, new projects, things that we build internally, we contribute to the community. The other one — and this one we actually created — was Project Nessie. So really being able to manage your data as code, right, on the data lake. So we fully open sourced that as well. That's been a really cool one to see as well because it's a fundamentally new thing. Us figuring out, like, how do we talk about it, what actually makes sense to people, has been interesting and an iteration for sure as well.
I would say the third big one is we really contribute a lot to Apache Iceberg. I think we're now, like, the top 4, maybe 5, contributor to Apache Iceberg, which is, I think, really cool because it's also a foundational component for us. And so we're not just building on top of it and doing, like, you know, enterprise features directly in Dremio. Like, we are fully bought into this open lakehouse, contribute-to-the-community aspect, which you've seen, you know, as I mentioned before. And even most recently, we just contributed some Arrow drivers, which we'll be talking about more pretty soon, but contributing that directly to the community, fully open source. So it's something that we've always done, but I think it's always been well received. And especially driving some of these projects has been fun and interesting to see. Taking a moment to dig into the Nessie project for my own edification,
[00:27:04] Unknown:
having looked at it a few times, I guess the first question is, how would you juxtapose that with the LakeFS project, which is aimed at providing sort of branching and versioning at the S3 layer? And, also, the other direction it looks like Nessie is aiming at is the Hive Metastore or AWS Glue catalog. And so I'm wondering if you can kind of draw a picture about how it fits in both of those directions.
[00:27:31] Unknown:
I'd say the biggest fundamental difference between Nessie and LakeFS is where the Git-like capabilities come in. So Nessie approaches it from the catalog layer of, like, an Iceberg table or a Delta Lake table or whatever. In that, you actually have these very lightweight pointers to whatever the current version of that table is, or the location of it. It's truly table aware. So you can say, alright — like, in Nessie, you have tables, but in LakeFS, you really have a file system. Like, it's just buckets and folders. It's less tables than it is files, and that's where the fundamental difference comes in for what you're really able to do. And then as far as the kind of utility
[00:28:15] Unknown:
as a table catalog or table reference, is it something that's aimed at supplanting the Hive Metastore or the AWS Glue catalog, or is it something that is intended to sit alongside and augment those services?
[00:28:28] Unknown:
Yeah. I would say that it is aimed at the same layer as Hive Metastore and Glue and those kinds of things. So we do view it as a sort of replacement for those — and really, we're looking at it certainly as a replacement for Hive Metastore. I think everyone's kind of trying to find a new version of that anyway, but you just end up being, like, Hive compatible, right, like Glue is. But, really, Glue is actually just backed by, like, DynamoDB — it's actually not even Hive Metastore behind the scenes. So I would say that, yes, it is aimed at providing that kind of centralized catalog, because that's really the main area where you can provide that. And we do end up providing it. You can run Nessie if you want and go ahead and run it that way. Or you can actually run our cloud service, which is now in public preview, which is Dremio Arctic — which is basically targeting that, more like, alright, if you use Nessie, you're running it yourself anyway. So, like, you can use Nessie or Arctic if you want. But Glue — like, Glue makes it very easy to get started. Right? You don't need to run anything. And that's really where the SaaS service, Dremio Arctic, comes in — actually being able to provide that as a service. So you can start very easily, very quickly. Right? You just go click a button, and then you're in. So that's what that does aim to do. And so, yeah, I would say that that's definitely the layer that Nessie and Arctic sit at.
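For a rough sense of what "a catalog at the Hive Metastore/Glue layer" plus "data as code" looks like from an engine's point of view, here is a minimal PySpark sketch of pointing an Iceberg catalog at a Nessie server and working on a branch. The package coordinates, versions, endpoint URL, warehouse path, and the exact SQL-extension syntax are assumptions for illustration; check the Nessie and Iceberg docs for the right values for your releases.

```python
from pyspark.sql import SparkSession

# Illustrative coordinates/versions -- adjust to your Spark/Iceberg/Nessie releases.
packages = ",".join([
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1",
    "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0",
])

spark = (
    SparkSession.builder
    .appName("nessie-catalog-sketch")
    .config("spark.jars.packages", packages)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # Register a catalog named "nessie" that keeps table metadata in Nessie
    # instead of a Hive Metastore or Glue.
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")   # assumed endpoint
    .config("spark.sql.catalog.nessie.ref", "main")                            # branch to start on
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse") # assumed location
    .getOrCreate()
)

# "Data as code": make changes on a branch, then merge them once validated.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.db")
spark.sql("CREATE BRANCH IF NOT EXISTS etl_fix IN nessie FROM main")
spark.sql("USE REFERENCE etl_fix IN nessie")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.db.sales (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO nessie.db.sales VALUES (1, 9.99)")
spark.sql("MERGE BRANCH etl_fix INTO main IN nessie")
```

Because the catalog is table-aware rather than file-aware, the branch and merge operate on table versions, which is the distinction drawn above versus a filesystem-level tool like LakeFS.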
[00:29:37] Unknown:
From a kind of product and commercialization standpoint, what are some of the kind of purchase models, or the purchase and usage model, that customers are typically looking to engage with when they are buying lakehouse components and deciding, I don't want to run this myself because running open source is hard and time consuming, and that's not my focus — I just wanna be able to buy this as a service? What are some of those kind of consumption and pricing models that they're aiming for, that they expect to be able to get from a product like Dremio or, you know, Trino or Presto or what have you? Yeah. I'd say it's certainly, as you mentioned, the consumption model. That's what basically everyone —
[00:30:18] Unknown:
even some customers on-prem — we usually don't do that, but they're, like, really big customers — like, they really wanted the consumption model. I was like, okay, like, yeah, we'll make it work. So I'd say for sure consumption, because, you know, especially when you're talking the cloud — yeah, you're not running 10 nodes all the time. Right? You're scaling it up, scaling it down. You're bursting beyond what you have licenses for and then scaling it back down. Right? So I would say that, for sure, that's one area, from a billing model perspective, that they're interested in. But the other piece is that once you get to a certain point on that, they do end up kinda wanting — or we've certainly had discussions around — a kind of more fixed rate, or certainly a fixed rate in the beginning, where they're like, hey, we don't know the exact usage, and it might be a lot, it might be a little. And so, like, I'd say probably some want to still do the consumption in that way. But others — we have some customers that are like, listen, do we gotta go through this whole thing? And, like, no — just go ahead, get an ELA, like, unlimited, use what you need. And then at, you know, 6 months, a year, whatever it may be, let's kind of assess where you guys are at, what your growth is, and kinda figure out what actually makes sense. So, again, in the cloud, just kind of figuring it out — not doing a big, almost waterfall kind of planning style where it's, how much do we need, but more of just kind of agile. Like, alright, let's get you at least up and running, let's get you up and moving quickly, and then we can figure out what makes sense for both of us from a billing perspective.
The only thing I would say is, yeah, definitely hosted as a service is what basically everyone is coming to expect. More and more we see it. Right? Also, what we are seeing often is that users don't want to run it themselves, but they basically wanna have it running in their own account, where they actually have full control, whether it's them or security and compliance. Right? They want all this data in their account versus having it in a different vendor's account. So we're also seeing that pretty commonly as well.
[00:32:08] Unknown:
From the kind of fundamental technical aspects of Dremio, my understanding is that it actually has its ancestry in the Drill project, which, you know, originated at Google as its first stab at federated querying — being able to kind of federate data access and do sort of the data virtualization approach — around the same time that Facebook was building the Presto project, which then, you know, forked into Trino, and now they've got their own whole story to go into. But I'm wondering if you can maybe dig into some of the ways that that ancestry influences the types of integrations and scalability models and deployment approaches, and some of the ways that that manifests in the Dremio product
[00:32:53] Unknown:
as compared to some of the Presto or Trino installations that other folks might have experience with? There are only a couple areas where it really still manifests itself. And I would say, if you kinda look at the history of, yeah, like, the Dremel paper and all that stuff — and especially with Drill — it was that all sources are kinda equal. Right? You have a pluggable API model. It's also what Presto did. Right? And so you connect to any sources. So I'd say that that led us down that kind of route of data as a service, where, like, people wanted to run SQL on Elasticsearch because you couldn't really do that. Right? So it definitely led us down that model in the beginning, and some people wanted that. It worked out okay. But, again, the data virtualization thing, because of the limitations, ended up just not being successful at scale. Like, it works okay at small scale. Once you hit data, user, application, concurrency scale, it just falls apart, regardless of whether you have the best data virtualization system in the world. So I'd say it kinda started us out down that path, and there are a couple areas left, but really since then, we've basically ripped out almost everything else. Like, there's a couple of things really just around the edges — like, for instance, the old profile that we have, like, profiling your query performance. That's been in there because it's pretty good. It's very detailed.
It was at least good for a power user. But, like, for instance, we've actually already created a new profile that's much easier to read for the lay user. Right? So I would say that there are a couple areas like that that are still around just because we haven't needed to replace them. But, basically, the innards of everything else, we've replaced. Like, we have columnar execution now, which Drill didn't have. Another example would be, like, Gandiva — really being able to not execute everything in Java, because Java is good at a lot of things, but not good at executing a single instruction a billion or a trillion times. Right? So, like, compiling those instructions directly to C++, right, to the machine code, and actually executing that. So I'd say there's a couple places around the edges where you'll see it, and I'd say it influenced us in the beginning, but we've basically ripped out everything important.
That's been the case for a while now.
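As a small illustration of the expression-compilation idea mentioned above, Gandiva is also exposed through PyArrow as an optional module. This sketch builds a tiny projection expression and evaluates it over record batches; it assumes a pyarrow build with Gandiva enabled (not all prebuilt wheels include it, and the module has been dropped from some recent releases), and the column names are invented.

```python
import pyarrow as pa
import pyarrow.gandiva as gandiva  # requires a Gandiva-enabled pyarrow build

table = pa.table({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Build the expression tree: a + b, projected as a new float64 column.
builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
node_b = builder.make_field(table.schema.field("b"))
sum_node = builder.make_function("add", [node_a, node_b], pa.float64())
expr = builder.make_expression(sum_node, pa.field("a_plus_b", pa.float64()))

# The projector is the JIT-compiled evaluator; it runs the compiled kernel
# over each record batch instead of interpreting the expression row by row.
projector = gandiva.make_projector(table.schema, [expr], pa.default_memory_pool())
for batch in table.to_batches():
    (result,) = projector.evaluate(batch)
    print(result)
```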
[00:35:00] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. From a kind of ecosystem integration perspective, there has been a lot of investment over the past couple of years — 2 to 5 years — in things like dbt and this, you know, growth of analytics engineering, integrations with things like, you know, Fivetran and then Airbyte and Singer for more of the open source approach of how to do data integration, and then obviously sort of BI tools, which you alluded to earlier. But more from kind of the tooling and process and, you know, ecosystem integration perspective.
What are some of the investments that you've been focused on in that space?
[00:36:41] Unknown:
So I'd certainly say, you know, I've been focused on a couple of main areas. One of which is certainly making it easier to get data into the data lake — or really into the lakehouse — so in table formats, like, in Iceberg format. We've been partnering with Fivetran on that, for instance. I've been talking to Airbyte about it as well. Fivetran, I think, already rolled it out, or they are very soon. But you can see that we're talking about, you know, Iceberg and all this stuff externally. So that's either done or very close — really getting data into, like, a format that you can start making sense of, where you can just start doing data warehousing workloads directly on it — as far as, like, think, like, DML. Right? But anyway, that's one area. But also on the transformation perspective, we are also partnering with dbt, and we've had a community connector for a long time that people have been using, but we're getting the first-class connector fairly soon — if it's not already out, like, very soon.
And so that's another area that we're really focusing on. One of the things that we kinda believe in is, one, we certainly have an opinion about how you should do things, but we also believe in meeting people where they are. So, like, hey — in general, like, I was just working with a startup at the end of the day that does all the transformation in Dremio. Like, they do all of that, the T — because if you really look at it, it's the T for business logic — all in Dremio. And they just use reflections where they need to physicalize it for whatever performance reasons. Right? But that's behind the scenes. But we have a lot of customers that also want to do it the dbt route, right? Which is building the T in there, but also physicalizing a lot of things. Right? There's trade-offs to each, but, again, a lot of our customers are interested in the dbt style as well as the other styles. So that's why we're focusing on that integration as well. So, again, I think that's, if not already out, should be out, you know, very soon.
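To give a feel for the "data warehousing workloads, like DML, directly on the lakehouse" idea mentioned above, here is a minimal sketch of an upsert against an Iceberg table from Spark SQL. It assumes a SparkSession already configured with an Iceberg catalog and the Iceberg SQL extensions (for example, the hypothetical `nessie` catalog from the earlier sketch); the table and column names are illustrative, and other Iceberg-aware engines can run equivalent statements.

```python
from pyspark.sql import SparkSession

# Assumes an existing session with an Iceberg catalog + SQL extensions configured.
spark = SparkSession.builder.getOrCreate()

incoming = spark.createDataFrame([(1, 120.0), (2, 75.5)], ["id", "amount"])
incoming.createOrReplaceTempView("staged_sales")

# Upsert straight into the lakehouse table -- no copy into a separate warehouse first.
spark.sql("""
    MERGE INTO nessie.db.sales AS t
    USING staged_sales AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount)
""")

# Row-level deletes work the same way on Iceberg tables.
spark.sql("DELETE FROM nessie.db.sales WHERE amount < 0")
```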
[00:38:27] Unknown:
From the kind of workflow perspective of somebody who's doing all of that transformation in Dremio, I'm wondering if you can talk to some of the kind of tactical elements of managing the kind of transformations, making it composable, how they manage kind of the logical complexity and organizational complexity of building those transformations and making them accessible and understandable to the people who are working with them, and then kind of managing the, like, testing and deployment aspects of it as well. Couple practices that are data warehousing practices
[00:39:03] Unknown:
but aren't specific to data warehouses. They're just kinda how you work with, you know, SQL and data. So, like, for instance, whatever you wanna call it — we call it preparation, business, application. I forget what Teradata used to call it, but I think it was landing, standardized, curated, or something like that. Databricks calls it bronze, silver, gold. Like, whatever you wanna call it. Right? It's basically that concept of landing your data and then doing some, like, light cleaning and transformation to get a kind of standardized business representation.
And then any application-specific views of data — like, kinda think about the tables as an API, how they need to interact with that data. Right? So that stuff doesn't change at all. And in Dremio, a lot of times they build that directly in the UI. You can write the SQL yourself. You can use some of the visual aspects. You can use things like SQL UDFs if you have some, you know, compartmentalized logic that you wanna reuse over time or in different places. But in general, it's a layered approach, so you maximize reuse there. The other aspect is, once you kinda get that built up, different users obviously access different layers — again, same as you would in any other data warehousing practice, not specific to the technology.
And then the other piece, once you have it built, is doing changes — and that's the same thing as really any other system, certainly around software: dev and test and prod. Right? This might be the same physical system, but different logical areas. We have seen people do that. We have some people that do dev and test in one area and prod in another physical system, or all three physical. Again, there's trade-offs for each of them. But really it's developing in dev, then promoting that to test after some checks — and we have automation built in to move it between environments — and then doing some more checks, maybe even a manual check where someone needs to go look at it, and then actually getting that promoted to prod eventually once all the checks pass and everyone's good with it. And then also as part of that process, from a data steward perspective, you have the capabilities to really look at things like dataset tagging as well as dataset catalogs.
So you can really kind of build these things out — once you have these kinds of aspects, you're starting to treat data, you know, as a product, which is now really coming up big, which I think is great. You can provide those kinds of SLAs and quality checks while you're promoting. But then also from a discoverability perspective, really making it a full-fledged kind of product where people can find it easily and it's easy to use, there's also that kind of documentation aspect — a wiki, in which you can do images, you know, whatever — as well as tags for discoverability as well.
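As a rough illustration of that layered approach, the sketch below expresses each layer as a view defined on top of the one beneath it, so business logic lives in one place and consumers pick the layer they need. The layer names, dataset names, and DDL are invented, generic SQL rather than anything Dremio-specific, and the deploy helper is only a stand-in for whatever promotion automation an actual team uses.

```python
# Hypothetical layered semantic model: each layer is a view over the previous one.
LAYERS = {
    # Preparation: light cleaning/standardization over the raw landed data.
    "prep.sales": """
        CREATE OR REPLACE VIEW prep.sales AS
        SELECT CAST(sale_id AS BIGINT) AS sale_id,
               CAST(sold_at AS DATE)   AS sold_at,
               UPPER(region)           AS region,
               amount
        FROM raw.sales_landing
        WHERE amount IS NOT NULL
    """,
    # Business: shared definitions/KPIs everyone agrees on.
    "biz.sales": """
        CREATE OR REPLACE VIEW biz.sales AS
        SELECT sale_id, sold_at, region, amount,
               DATE_TRUNC('month', sold_at) AS order_month
        FROM prep.sales
    """,
    # Application: shaped for one consumer (a dashboard, an API, a report).
    "app.revenue_by_region": """
        CREATE OR REPLACE VIEW app.revenue_by_region AS
        SELECT order_month, region, SUM(amount) AS revenue
        FROM biz.sales
        GROUP BY order_month, region
    """,
}

def deploy(run_sql, environment: str = "dev") -> None:
    """Apply the view definitions to one environment (dev/test/prod) in order."""
    for name, ddl in LAYERS.items():
        print(f"[{environment}] creating {name}")
        run_sql(ddl)

# Example: dry run that skips executing the DDL against a real engine.
deploy(run_sql=lambda ddl: None)
```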
[00:41:30] Unknown:
as well. In your experience of working at Dremio and working with your customers and seeing how people are building around this lakehouse paradigm, what are some of the most interesting or innovative or unexpected ways that you've seen the Dremio product and the surrounding ecosystem around it applied?
[00:41:48] Unknown:
I would say probably the most unexpected was one of our customers built an IoT platform. And generally, Dremio is a full MPP system. Right? You're running it on many nodes — a lot of times, you know, anywhere from 5 to 700 or a thousand nodes. Right? Large systems, MPP. One of our customers actually leverages it as a single node and runs it in each of their little IoT edge platforms. So, again, that was more unexpected than anything. But then they bring it into their actual platform, and the key is that it's the exact same Dremio. So it's the exact same software. It's the same SQL. It's the same structure. It's the same everything. So it just makes things a lot easier for them. They leverage our connectors with things like MongoDB, and then they can leverage different connectors in prod. But all of the SQL, all of the semantic layers, can be exactly the same between those two.
Even though it's a single node versus their platform, which is a bunch. So to me, that was probably the most unexpected, and they've actually been doing that for a bit now. I think one of the most innovative — and this is more off the top of my head — is that one of our customers is actually leveraging Spark to do parallel reads from Dremio, leveraging Arrow Flight to read that data in very quickly. So they can actually do machine learning on core business data, which usually requires you, like, exporting data out. Right? Because, depending on, again, where you're at in your lakehouse journey, your core business data — like, for instance, your finance data of, like, revenue or sales or whatever — may not be in the data lake. It may be in something else, or it may not be fully physicalized. It may be virtualized.
So what these people do is they actually leverage Spark to read fully in parallel, like, executor to executor, from Dremio. So they can actually leverage this core business data directly in machine learning and then augment that with, you know, images or video or whatever kind of unstructured data they can actually leverage. That was pretty cool.
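The parallel-read pattern described here can be sketched with PyArrow's Flight client. The coordinator address, port, credentials, and table name below are assumptions for illustration, and in the Spark case each executor would fetch one of the returned endpoints rather than a single process looping over them.

```python
import pyarrow.flight as flight

# Illustrative coordinator address and credentials.
client = flight.FlightClient("grpc+tcp://dremio-coordinator:32010")
bearer = client.authenticate_basic_token("analyst", "secret")
options = flight.FlightCallOptions(headers=[bearer])

# Ask the server how to fetch the result of a query against a "blessed" dataset.
descriptor = flight.FlightDescriptor.for_command(
    "SELECT order_month, region, SUM(amount) AS revenue "
    "FROM biz.sales GROUP BY order_month, region"
)
info = client.get_flight_info(descriptor, options)

# Each endpoint/ticket can be read independently -- this is what lets a fleet of
# Spark executors pull Arrow batches in parallel instead of funneling through one node.
tables = []
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket, options)
    tables.append(reader.read_all())

print(tables[0] if tables else "no results")
```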
[00:43:58] Unknown:
In your work of helping to manage the product development, working with your customers, and understanding the overall ecosystem and the problems that people are trying to solve with tools like Dremio and the lakehouse pattern, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:12] Unknown:
I would say, if I focus on challenging, and it's less unexpected because you just come to relearn it every time, it's that most problems people have with analytics are not technological. They end up being, you know, process driven, organization driven. But a lot of times the foundation of those problems is technological, or if you can solve the technology problems, it makes solving those process and organization problems much easier. So providing the building blocks matters, and I think that's just something we learn time and time again. And the other thing, an interesting, unexpected lesson that I guess I've learned quite a few times at Dremio, and maybe a more uncommon one, is that Dremio is a very powerful tool. It can be used for a lot of different things, like that single-node IoT case I mentioned.
And it works for them. So you can use it for a lot of things, which means it's also a question of where we want to focus, right, from a corporate and direction perspective. Like, where are we providing the most value, and where can you really use Dremio best, where you can't really use other things as well for it? I would say that's the other part we've learned a few times: really focus not necessarily on where customers want us to go, but on where they need us to go. A lot of times, you know, there never has been a silver bullet for absolutely everything, and the likelihood that there ever is one is pretty low.
[00:45:35] Unknown:
And for people who are interested in implementing a lakehouse, or in being able to build on top of some of these open projects and open standards, what are the cases where Dremio is the wrong choice?
[00:45:47] Unknown:
I would say it's the wrong choice if you already have a very traditional architecture and you're happy with it. Again, I'm generally one of those people who says don't implement technology to solve problems that don't exist just because it's the cool new thing. What we see a lot of times is that it works okay. Right? The traditional architecture has been around for, you know, 10, 20, 30 years, depending on your definition of it, and it works okay and can provide some business value. Now, there are a lot of trade-offs that you make, and some of those you may not realize, because it's just what you've always been used to, so you don't really think there is an alternative. So I'd say, if that's the case and you're happy and you have a million other projects you want to focus on, then bringing in Dremio maybe isn't the right call there.
[00:46:35] Unknown:
In terms of the future direction of Dremio and the areas of focus where you're spending your time and energy on the open lakehouse principles, what are some of the things you have planned for the near to medium term?
[00:46:49] Unknown:
Our primary goal, I would say, is making it easy, because now we've gotten to the point where lakehouses are functionally possible and doable. Right? Netflix, for instance, doesn't even call it a lakehouse. They just call it a data warehouse; it just happens to be on S3. So I would say it's been functional, and it's certainly possible now and has been for a bit. But I think now it's really: how do we make it easy? How do we make it so that you don't need an army of data engineers like Netflix or Facebook has, right, to actually make this possible, while also making it truly self-service?
That's really the primary focus for the short to near term, and even longer term: making it not just doable and possible, but easy, as well as staying true to that open message. So we have things like Dremio Arctic coming out that provide that data-as-code capability, which is another thing we're really excited about, and a lot of the people we talk to are really excited about it too. Even though we're going to do things behind the scenes for you like a storage engine would, it's still all going to be in open formats, and you can still use any engine you want. You can use Dremio Sonar with or without Arctic. You can use Arctic without Sonar. We fully believe that, hey, if you use Dremio Sonar and another SQL engine comes out that works even better for you, great, go for it. Right? We fully believe that you should be able to swap these things in and not repeat the same mistakes of data warehouses, which I saw firsthand many times at Teradata.
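To give a feel for what "data as code" means in practice, here is a hedged Python sketch of a branch-and-merge workflow against a Nessie-style catalog of the kind Arctic is built around. The run_sql() helper is the same hypothetical placeholder used in the earlier promotion sketch, and the branch-related SQL is illustrative of the Git-like model rather than exact Arctic syntax; catalog, table, and branch names are made up.

```python
# Hypothetical "data as code" workflow against a Nessie-style catalog.
# run_sql() is a placeholder; the branch SQL below illustrates the Git-like
# model (branch, load, validate, merge) and may not match exact product syntax.

def run_sql(query: str):
    raise NotImplementedError("wire this to your SQL connection")

def load_with_branch(table: str, staging_file: str, branch: str = "etl_batch_42") -> None:
    # Create an isolated branch of the catalog, like a Git feature branch.
    run_sql(f"CREATE BRANCH {branch} IN catalog")

    # Write new data on the branch; readers on main never see it yet.
    run_sql(f"COPY INTO catalog.{table} AT BRANCH {branch} FROM '{staging_file}'")

    # Validate on the branch before anything is exposed to consumers.
    rows = run_sql(f"SELECT COUNT(*) FROM catalog.{table} AT BRANCH {branch}")
    if rows[0][0] == 0:
        raise ValueError("branch load produced no rows; aborting merge")

    # Atomically publish the change by merging the branch into main.
    run_sql(f"MERGE BRANCH {branch} INTO main IN catalog")
```

The design point is the same as in Git: work happens on an isolated branch, checks run there, and consumers only see the data once the branch is merged.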
[00:48:24] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:39] Unknown:
I would say that you've seen the explosion of all these different tools, with everyone doing this one piece, but there's now also a lot of overlap. Right? Like, data catalogs do certain things that data observability tools also do. So I think the biggest gap right now is some level of integration or consolidation, where you don't need to have relationships with, like, 40 vendors. You don't want all of your relationships with one vendor either, because then you're completely beholden to them; you do want that flexibility. But at this point, it's about using fewer tools and really building that integration. And ultimately, what it really comes down to is making it easy.
Right? And that's one thing we're focused heavily on: currently it's possible, but we want to make it easier and easier, to go back to our mission of making data as easy in your business life as it is in your personal life.
[00:49:33] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and the other folks at Dremio are doing in investing in the open lakehouse ecosystem and paradigm. It's definitely a very interesting product, and it's great to see some of the developments that have been coming out from you all over the past couple of years. I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thanks a lot, Tobias. Appreciate
[00:49:59] Unknown:
it.
[00:50:03] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Jason Hughes from Dremio
Dremio's Evolution and Open Lakehouse Concept
Lakehouse Architecture and Benefits
Dremio vs. Other Lakehouse Technologies
Open Lakehouse vs. Proprietary Solutions
Dremio's Contributions to Open Source
Customer Use Cases and Product Applications
Ecosystem Integrations and Future Directions
Challenges and Lessons Learned
When Dremio Might Not Be the Right Choice
Future Plans for Dremio and Open Lakehouse
Closing Remarks