Summary
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is to craft pipelines that extract the data from source systems and load it into a data lake or data warehouse. To make this situation more manageable, and to allow everyone in the business to gain value from the data, the folks at Dremio built a self-service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Tomer Shiran about Dremio, the open source data as a service platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Dremio is and how the project and business got started?
- What was the motivation for keeping your primary product open source?
- What is the governance model for the project?
- How does Dremio fit in the current landscape of data tools?
- What are some use cases that Dremio is uniquely equipped to support?
- Do you think that Dremio obviates the need for a data warehouse or large scale data lake?
- How is Dremio architected internally?
- How has that architecture evolved from when it was first built?
- There are a large array of components (e.g. governance, lineage, catalog) built into Dremio that are often found in dedicated products. What are some of the strategies that you have as a business and development team to manage and integrate the complexity of the product?
- What are the benefits of integrating all of those capabilities into a single system?
- What are the drawbacks?
- One of the useful features of Dremio is the granular access controls. Can you discuss how those are implemented and controlled?
- For someone who is interested in deploying Dremio to their environment what is involved in getting it installed?
- What are the scaling factors?
- What are some of the most exciting features that have been added in recent releases?
- When is Dremio the wrong choice?
- What have been some of the most challenging aspects of building, maintaining, and growing the technical and business platform of Dremio?
- What do you have planned for the future of Dremio?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Dremio
- MapR
- Presto
- Business Intelligence
- Arrow
- Tableau
- Power BI
- Jupyter
- OLAP Cube
- Apache Foundation
- Hadoop
- Nikon DSLR
- Spark
- ETL (Extract, Transform, Load)
- Parquet
- Avro
- K8s
- Helm
- Yarn
- Gandiva Initiative for Apache Arrow
- LLVM
- TLS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat.
[00:00:48] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Tomer Shiran about Dremio, the open source data as a service platform. So, Tomer, could you start by introducing yourself? Sure. Yeah. My name is Tomer Shiran. I'm the cofounder and CEO of Dremio. And Dremio is a three-year-old company that's enabling lots of companies around the world to kinda deliver data as a service within their organizations, really making it so that it's not an engineering project every time a data consumer like an analyst wants to do something with their data. And do you remember how you first got involved in the area of data management? Sure. Yeah. I was actually one of the first employees and the VP of product at a company called MapR.
That was one of the Hadoop companies back in 2009.
[00:01:35] Unknown:
That's when Hadoop was kind of the cool thing. And so can you start off by explaining a bit about what the Dremio product is and how both the project and the business got started?
[00:01:46] Unknown:
Sure. Absolutely. So, you know, I think the goal for Dremio really is to empower business analysts and data scientists, we call these data consumers, to be more self-sufficient, and to enable data engineers to be a lot more productive. And if you think about it, you know, in our personal lives, we have this great experience with data. Even my kids, they go online, they ask Google a question, they get an answer, and life is great. But then, all of us, when we come to work, and especially in larger companies, getting anything done with data is basically impossible. Right? It's an engineering project every single time. And so you think about what ends up happening: you know, data lives in all these different places, and it has to be loaded into some kind of a, maybe a data lake, and, you know, can't be queried fast enough there. So a subset of that has to be moved into various data warehouses and data marts, and those aren't fast enough. So then you end up building cubes and aggregation tables and all these different structures, you know, BI extracts, and then finally the user maybe can do something with the data. But because of that complexity and all the copies of data, they really can't do anything on their own. And so they're dependent on, you know, a very small group of data engineers to do anything. That usually means that it takes weeks or months to actually answer a question.
And, of course, it's a very frustrating experience for the people that are responsible for the data within IT. So what Dremio does is, it's a system that companies can deploy within their organization. About 60% of our customers run it either on Amazon or on Azure, and the other 40% run it on prem. But it really connects to your data sources. And, typically, there's kind of one primary data lake that it's connecting to or running on. And then potentially there are other sources like relational databases that also have some data. And Dremio basically exposes that data and makes it so that people can discover data and query it at speeds that they were just not able to do before. And
[00:03:38] Unknown:
from looking at the product and tinkering with it a little bit, one of the things that really makes it stand out in my mind from a BI platform that can also connect to some of these various data sources is the ability to merge the outputs of some of those data sources, or be able to join across them within a single query, because of some of the reflection capabilities in the platform. So I'm wondering if you can talk a bit about some of the features of Dremio that are unique to it as a platform that you don't necessarily find in some of these other products, such as, as I mentioned, some of these BI platforms or things like Presto or some of these
[00:04:17] Unknown:
SQL engines that are able to connect to various data sources? Yeah. Absolutely. So if you think about ultimately what users are trying to accomplish, you know, they have data in potentially one or more places, and they wanna query that data. And they wanna do it fast. But unfortunately, sometimes the reality is just that, due to the size of the data or the speed of the data source, whether that's an S3 data source or, you know, a SQL Server database, there's only so much speed you can get by, you know, scanning the raw data to answer the query, right, and to process it. And we have an extremely fast engine. We actually pioneered a project called Apache Arrow to make that faster than it's ever been. But, at the end of the day, to really get kind of orders of magnitude, maybe, you know, thousands of times better performance on large datasets, you really have to avoid scanning the entire dataset as a query engine for every single query. Right? It's very expensive, and it's very slow if you do that. So if you think about kinda traditionally in databases, we had things like indexes and cubes and OLAP technologies and materialized views in some systems.
So all of those types of technologies are really about making it so that there are better data structures inside the system that enable the system to answer queries much faster, right, without having to scan all the records. And that's fundamentally what Dremio does with data reflections: we can maintain various aggregations or various sorts or projections of the data behind the scenes, in a way that's transparent to the user that's interacting with the data. And then the optimizer can automatically leverage those reflections to accelerate the queries. And the effect of that is that a query that might take you, let's say, 5 or 10 minutes or more than that in a solution like Presto or, you know, Hive or something like that, with Dremio could run in less than a second. And that means that the user who's consuming the data, whether they're kind of a less technical user using something like Tableau, Microsoft Power BI, or something like that, or maybe they're more technical using something like Jupyter Notebooks, they need an interactive response time. Because if every click of the mouse or every kind of, you know, step in the notebook takes many minutes to run, it just becomes impossible to do. Right? They're not gonna wait, you know, spend 4 hours, you know, building some visualization. Right? It's just not realistic. And what ends up happening is they just give up. And so with Dremio, for the first time, you're able to make querying of large datasets, of big data, basically interactive.
[00:06:39] Unknown:
And so the reflections are one of the ways that Dremio handles that interactive speed, as well as the fact that you're using Arrow for being able to manage the datasets in memory in an efficient format. But for the first time that somebody goes to run a specific aggregation against a data source, do they still have that initial latency of having to scan the source data and build the aggregation so that it can be cached in that reflection? And I'm also curious if you have any mechanisms in place for being able to keep those aggregates fresh as new data flows into the various sources. So let's take a step back here for a second.
[00:07:22] Unknown:
When you think about how companies, organizations use Dremio, at the core, we have an extremely fast, distributed execution engine. Right? So we've looked at, okay, what are the ways in which we can kinda take what have traditionally been pretty slow systems, you know, like Hive, right, and make that extremely fast, irrespective of these reflections. Right? So we actually started the Apache Arrow project in large part to make better use of CPUs, right, to be very efficient, to leverage kind of vectorized processing, which is now available on modern Intel CPUs and in the public clouds. So that was the first thing, and Arrow has really taken off and become an industry standard. So over a million downloads a month now for Apache Arrow. And that's kind of the engine. Right? If you think of Dremio being the car, Apache Arrow is the engine. It's what allows us to process kind of raw data extremely fast. Right? If it's data in HDFS or data in S3 or ADLS.
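For a feel of what the Arrow format gives the engine, here is a minimal sketch using the PyArrow bindings for Apache Arrow. This is application-level PyArrow, not Dremio's internal engine code; it just shows why a columnar in-memory layout lends itself to the vectorized processing described here: each column is one contiguous buffer, so kernels operate on whole columns at once.

```python
# Minimal illustration of Arrow's columnar in-memory format via PyArrow
# (pip install pyarrow). Not Dremio's engine code -- just the layout idea:
# each column is a contiguous buffer, so compute kernels are vectorized.
import pyarrow as pa
import pyarrow.compute as pc

# Build an in-memory Arrow table; each column is stored contiguously.
table = pa.table({
    "region": ["us-east", "us-west", "us-east", "eu-west"],
    "revenue": [120.0, 340.5, 99.9, 210.0],
})

# Vectorized kernels operate on whole columns at once.
total = pc.sum(table["revenue"])
by_region = table.group_by("region").aggregate([("revenue", "sum")])

print(total.as_py())          # 770.4
print(by_region.to_pydict())  # per-region revenue sums
```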
Beyond that, you know, the reality is that sometimes the physics are such that, you know, if you have a trillion records or hundreds of billions of records in your environment, it's just gonna be too slow no matter how fast the engine is. And for those cases, it's all about creating these data structures, these reflections. Right? And the way it happens typically in Dremio is, unlike traditional systems that are very IT driven, right, where you'd have to, you know, go design and build cubes and spend weeks or months kind of trying to figure out what questions people are gonna ask so you can build the right cubes. And by the time you built the cube, their questions were not the same anymore. And it was obviously a big problem in terms of agility, and people hated that. And so we take a different approach, which is much more like, it's almost like comparing Agile to kinda waterfall development models. Right? So our users will start querying data, usually it's extremely fast, and then if they need something to be even faster, then reflections can be defined kind of behind the scenes, in a way that's transparent to the user. So the user doesn't have to change their, you know, their Tableau dashboard, for example. That stays the same. It's the responsibility of the Dremio optimizer to rewrite the query plan internally to leverage the reflection instead of the raw data. So that's kind of how it works. If you've just started using the system right now, there are no reflections in the system, and, you know, the speed will be what we can achieve using the Apache Arrow based execution engine, which should give you several factors of speedup over a typical SQL engine like Hive, let's say.
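To make the reflection idea concrete, here is a toy sketch in Python. It is emphatically not Dremio's optimizer, just the substitution concept: maintain a precomputed aggregate behind the scenes and answer matching queries from it instead of rescanning the raw rows.

```python
# Toy sketch of the idea behind aggregation reflections. Dremio's real
# optimizer does cost-based plan rewriting internally; this only shows
# the substitution concept: keep a precomputed aggregate up to date and
# answer matching queries from it rather than scanning the raw data.
from collections import defaultdict

raw_events = [  # imagine billions of rows sitting in a data lake
    {"region": "us-east", "amount": 120.0},
    {"region": "us-west", "amount": 340.5},
    {"region": "us-east", "amount": 99.9},
]

def build_reflection(rows):
    """Precompute SUM(amount) GROUP BY region once, behind the scenes."""
    agg = defaultdict(float)
    for r in rows:
        agg[r["region"]] += r["amount"]
    return dict(agg)

reflection = build_reflection(raw_events)  # refreshed as new data lands

def query_sum_by_region(region):
    # The "optimizer": serve the answer from the reflection, not a raw
    # scan, so the user's query is a lookup instead of a full pass.
    return reflection.get(region, 0.0)

print(query_sum_by_region("us-east"))  # 219.9
```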
[00:09:51] Unknown:
And taking a step back to the product level itself, one of the factors that makes Dremio very approachable for businesses of all sizes is the fact that you have chosen to release it as open source. So I'm curious what your motivation was for that decision, and what sort of governance model you have in place to shepherd the open source aspects of the project. Yeah. Absolutely. So,
[00:10:12] Unknown:
when we thought about kind of how we wanted to make Dremio available, it was pretty clear to us that the right approach was to kinda have that open core model. Right? So you think a lot of data infrastructure is open source, and it makes it easier for companies to consume. And actually many of our existing customers now, be it, you know, Microsoft or UBS, started by, you know, downloading the community edition of the product. And so that's really proven to be a very successful model for us. Dremio community edition, the code is actually in Dremio's GitHub. Right? And people can come see that and use the code. But it's not an Apache project. Right? So it's more similar to Elasticsearch or to MongoDB than it is to, say, things like Hadoop, in that, you know, we own the copyright and we're the primary developers of it. That said, we do receive contributions from other organizations.
So, for example, with data sources, we've seen contributions of new connectors, like a KDB connector from UBS, or, I just saw on GitHub, Software AG published a connector to one of their databases. So we're starting to see a lot of momentum around companies wanting to contribute various aspects of it. And that's, of course, in addition to the over 200 developers outside of Dremio that now work on
[00:11:33] Unknown:
Apache Arrow, which is basically our engine. And so we touched a little bit on some of the unique aspects of Dremio as it compares to things such as Presto or BI tools. But I'm wondering if you can dig a bit more into how you view the way that Dremio fits into the overall landscape of data tools?
[00:11:53] Unknown:
Yeah. You know, if you think about the situation that companies are in, right, there are a variety of tools out there and systems that you can adopt. Right? There are query engines, and there's data prep tools, and there's data catalogs, and there are some, you know, cube type products. And at the end of the day, companies can try to kinda stitch all these things together and try to provide some ability for data consumers to be able to access and utilize data. The reality is that to be successful, you end up having to hire basically a data engineer for every data analyst, for every business analyst that you have in the company. And that's just not realistic for, you know, 99% of the companies in the world. Right? And so if you think about what Dremio does, we've actually created a new type of a product. Right? We think of it as a new category, and that's why we call it a data as a service platform. And you will see elements of other kind of categories in the solution here. But, fundamentally, it's something very different. And when people ask us, how does this compare to something else? Typically, what you see there is that it's a little bit like comparing, you know, an iPhone to, like, an SLR. Right? Of course, they compete in some way because, you know, as a customer, I have to choose whether I'm gonna take pictures of my kids with my phone or with my SLR. I've got a nice, kinda high-end Nikon SLR, or do I take pictures using my phone? And more often than not, these days, I'm using my phone because I can easily, you know, send the picture on WhatsApp to, you know, my family, or post them on Facebook, and they get backed up automatically on Google Photos, and all this integrated experience that I get by using the phone as opposed to kind of that SLR. And in a way that's kinda how Dremio fits. Right? More often than not, you know, if we're replacing something, we're really replacing kind of a complex process that involves creating lots of copies, preprocessing the data in, like, a data warehouse, you know, building cubes, building, you know, aggregation tables, building BI extracts. That whole complex pipeline is what we focus on. So we're not replacing something like, let's say, Spark, where you're primarily using that for kind of the batch processing. But we're definitely the platform and the layer that is
[00:14:08] Unknown:
enabling access to data and providing kind of the query execution. And as you mentioned, there are a lot of different aspects to Dremio that cause it to overlap with a number of other products, but not necessarily with the same level of depth. But given the fact that it's all integrated, it makes the overall system easier to use. So I'm curious if you think that, at least for a particular scale of company, having Dremio installed would obviate the need for either a large scale data warehouse or data lake that they would otherwise have to maintain to be able to get the same sort of functionality? Yeah. So I think people can see the types of customers we work with on our website, primarily
[00:14:50] Unknown:
enterprises across various different verticals, ranging from tech companies like, you know, Microsoft, to more retail focused companies, Royal Caribbean and so forth. And when it comes to the data lake, we don't replace the data lake, but I think what we do is we make it useful for the first time, right, to a broad set of users. So in most companies, I think there's been a lot of investment in the data lake, millions of dollars poured into these data lakes. And at the end of the day, I think the benefit has been limited to the technical people in the company, the engineers. Right? So if you're a Java engineer or a Scala engineer, you've probably gotten pretty good value from having the data kind of in that data lake. But if you're a Tableau user, or even maybe a data scientist, you know, kind of that range of the more of the data consumer, I think it's been a struggle, because you've probably tried to, you know, go in there and run a Hive query, and it was frustrating how slow it was. And as a result, companies try to kinda get the data out of the data lake and put it somewhere else so that you can get reasonable performance on your queries. And so Dremio comes in, we sit on the data lake, and we make it so that for the first time, you can actually get an ability to query data with interactive speed, and to get a response time that actually works for kind of a BI user. Right? Somebody who's not technical. And I think that's one of the key value propositions. When it comes to, do we replace the data warehouse, I think there are two use cases for data warehouses.
One is as kind of the historical store of record for data. For example, you know, storing all my historical financials, right, for a company, or HR data. And we don't replace that. So, you know, people would still want to use the data warehouse for that. But when it comes to using kind of the data warehouse more just like an MPP database, right, high performance queries for kind of ad hoc analytics and BI, we're definitely helping companies replace the data warehouse for those use cases.
[00:16:50] Unknown:
And given the fact that there are so many different concerns built into the Dremio product, I imagine that that can lead to some measure of complexity in terms of development and managing the overall growth and maintenance of the project. So I'm curious if there are any particular strategies that you have as a business and as a development team to be able to manage that integration of all of these different pieces into the whole. So things like the governance capabilities, data lineage, the built in data catalog, etcetera. Yeah. You know, there's
[00:17:25] Unknown:
two parts to that answer. I think, first of all, you're absolutely right that to build something that can do what this does, and provide that kind of an integrated experience, requires developing a significant amount of IP. Right? It's everything from kind of query optimizer technology, to, you know, query execution, to, you know, having a great user experience. Right? And that's typically not found within a single company. And so it comes down to kind of hiring, you know, top talent from an engineering standpoint. You know, people that have expertise in these areas and have built kind of execution engines, have worked on the Oracle kernel, have been committers on various successful open source projects. So that's been kind of our focus when it comes to building the engineering team: really hiring people that can build these types of things. And then also, just from a product standpoint, this isn't about building basically multiple different unrelated technologies. Right? The fact that it is part of one solution gives us a tremendous amount of benefit. So, for example, if you think about Dremio, the product includes, like, basically a semantic layer. Right? So users can create new virtual datasets, because often what happens in companies is the gold copy of the data is not exactly what the user wants. And today, they have to go to IT and request kind of some project to get the data into their desired shape. With Dremio, they can actually create a new virtual dataset, either visually or by defining a SQL select statement, and that new virtual dataset can then be queried from any external tool. But if you compare that to, like, say, a data prep tool, a data prep tool is actually now having to execute kind of a batch process, a nightly batch process, and create another copy of the data. With Dremio, that doesn't happen. The virtual dataset is basically a logical definition, which means that there's no cost to creating those things, and the kind of data curation layer in Dremio builds on top of that, the semantic layer in Dremio. So we actually get all these benefits from having kind of an integrated platform, both from a development standpoint and a QA standpoint. And then, of course, the customer benefits from, you know, not having hundreds of data copies in the company, for example. In a lot of cases, some of those different concerns
[00:19:33] Unknown:
are available in dedicated products for that specific use case, whether it be the data curation or the data governance. And in some ways, that makes it easier to integrate multiple systems around that particular concern. Whereas with Dremio, having it all built into the same system, I'm curious if you have any
[00:19:53] Unknown:
external integration points for other tools to be able to take advantage of the data catalog or the data lineage that you store within Dremio itself, for then being able to leverage in other systems that are serving different concerns within the business? Yeah. For sure. So our customers, they frequently use, like, you know, an enterprise data catalog or a data prep tool. Those serve kind of a different use case. Right? So while we're focused on enabling access and data exploration and analysis on your data, you know, either on the data lake or across multiple sources, a data catalog is focused on, you know, taking all the data in the enterprise, across potentially thousands of databases, and kind of exposing that in a collaborative interface. And to that tool, Dremio looks just like any other database. So we expose ODBC and JDBC, as well as REST APIs.
And, you know, if it's a data catalog, for example, just like the BI tools can connect to us, the data catalogs can connect to us and immediately see all the information in our system. In addition to that, everything that we have in Dremio is exposed through REST APIs. So it's very easy to integrate. But again, we're really focused, you know, if you think about kind of the data life cycle, we're really focused on that last mile. Right? So, you know, the customers are still gonna use ETL tools. They're still going to use kind of Spark jobs for their heavy duty batch processing, the things that have to happen, you know, every night to clean the data or land it or whatever.
Dremio is really focused on that last mile. So you think about it: the data exists somewhere, and it's largely been kind of organized in some way that seems reasonable to the company. But then, you know, every user wants things to be a little bit different. Right? And Dremio enables users to create their own virtual datasets that are derived from other datasets in the company, and kinda provides almost like a Google Drive or Google Docs type of an experience for data.
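As a concrete illustration of that integration surface, here is a hedged sketch of submitting SQL to Dremio over REST using Python's requests library. The endpoint paths follow Dremio's published v3 REST API as I understand it, and the host, credentials, and dataset names are placeholders, so treat this as a sketch and check the documentation for your version.

```python
# Hedged sketch of querying Dremio over REST with the `requests` library.
# Endpoints follow Dremio's published v3 API as I understand it; the host,
# credentials, and dataset names are placeholders, not a real deployment.
import requests

BASE = "http://dremio.example.com:9047"  # hypothetical coordinator address

# Authenticate and grab a session token.
login = requests.post(f"{BASE}/apiv2/login",
                      json={"userName": "analyst", "password": "secret"})
headers = {"Authorization": "_dremio" + login.json()["token"]}

# Submit a SQL query; Dremio runs it as an asynchronous job.
job = requests.post(f"{BASE}/api/v3/sql", headers=headers,
                    json={"sql": 'SELECT * FROM sales."orders" LIMIT 10'})
job_id = job.json()["id"]

# Fetch results (a real client would poll job status until completed).
rows = requests.get(f"{BASE}/api/v3/job/{job_id}/results", headers=headers)
print(rows.json())
```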
[00:21:49] Unknown:
And in terms of the overall architecture of the system, you mentioned that Arrow serves as a primary component of the core execution engine, but I'm curious how the rest of the system is architected, particularly as far as integrating some of these various concerns that we were just discussing, and being able to manage the ongoing development of them without introducing
[00:22:07] Unknown:
conflicts at the source level? Sure. Yeah. The architecture of Dremio is, kinda at the bottom, we have the connectors to data. Right? And so we obviously have very sophisticated connectors to things like HDFS, S3, ADLS that are highly parallelized, high performance, and support a variety of file formats like Parquet and ORC. When it comes to kind of more of the NoSQL and relational databases, we have connectors for those as well, and those are very focused on how do you push down processing, be it aggregations, you know, filters, projections, window functions, into the underlying data sources. So we have this connector architecture that's very extensible, and people can build new connectors.
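The pushdown idea that the connector layer implements can be sketched in a few lines. This toy planner is purely illustrative; the function and parameter names are hypothetical, not Dremio's actual connector API.

```python
# Toy sketch of predicate pushdown (hypothetical names, not Dremio's API).
# Instead of pulling every row back and filtering in the engine, a
# relational connector translates the filter into the source's own SQL,
# so the source returns only the matching rows.
def plan_scan(table: str, filter_expr: str, supports_pushdown: bool) -> str:
    if supports_pushdown:
        # Push the predicate into the source: less data crosses the wire.
        return f"SELECT * FROM {table} WHERE {filter_expr}"
    # Fallback: full scan; the execution engine filters afterward.
    return f"SELECT * FROM {table}"

print(plan_scan("orders", "amount > 100", supports_pushdown=True))
# -> SELECT * FROM orders WHERE amount > 100
```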
On top of that, you have kind of the execution layer, which is all about, you know, running distributed query execution, right, utilizing Apache Arrow as kind of an in-memory columnar format, and really kind of focusing on just raw performance there. The reflections kind of sit to the side of that, where the engine can query and read the reflections if that's what the optimizer decides is the best path to kind of the lowest cost query. So that's kind of another piece of that. And then on top of all this, we have kind of the user interface. And the user interface is really about making things simple for nontechnical users. So when you create a new virtual dataset, which is kinda similar to a view, you can do that through SQL. You can say, this is the view, and it's a join between my Hadoop data or my S3 data and my Oracle database. But for a nontechnical user, doing that is too hard. Right? Writing that kind of a SQL join. So they can interact with the data kind of visually. You know, you could look at your Hive table and click the join button, and the system will say, hey, you know what, we recommend you might wanna join this with this other dataset, based on what other people are doing. So that's kind of the user interface. And when you interact with datasets, you can then save that as a virtual dataset, which basically translates into just a SQL select statement that gets saved as metadata. And now that becomes another dataset in the system that people can look at. And because all these virtual datasets are SQL statements, and, you know, we have an optimizer and a SQL engine, we understand the relationship between all these datasets. So that gives us the ability to very easily do things like lineage, and show you what exactly depends on what. And if you want to do data masking and row and column level permissions, that's also very easy to do, because underneath the hood, there's a SQL statement that defines every dataset.
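Because a virtual dataset is, at bottom, a saved SELECT statement, a sketch of one is just SQL. The CREATE VDS syntax reflects Dremio's SQL as I understand it, and the space and source names are made up for illustration.

```python
# Illustrative only: the CREATE VDS syntax is Dremio SQL as I understand
# it, and the space/source names (analytics, s3_lake, oracle_crm) are
# hypothetical. A virtual dataset stores just this SELECT as metadata --
# no data is copied -- and reflections can still accelerate queries on it.
VIRTUAL_DATASET_SQL = """
CREATE VDS analytics.enriched_orders AS
SELECT o.order_id, o.amount, c.segment
FROM s3_lake."orders" AS o
JOIN oracle_crm.customers AS c
  ON o.customer_id = c.customer_id
"""
# Submitting this (for example, through the REST sketch earlier) makes
# analytics.enriched_orders queryable from any connected BI tool, and the
# lineage back to both sources falls out of the SQL definition itself.
```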
[00:24:43] Unknown:
And that last piece that you mentioned is one of the things in particular that I think adds a lot of value to Dremio: that granular access control and governance of the source data without having to push that concern down to the specific source engines. Because I know that for some of the different business intelligence platforms, the permissions are actually managed at the database level, which can increase the overall complexity and the difficulty of being able to get a unified view of who has access to what. So I'm wondering if you can talk a bit about some of the ways that you manage those granular controls and some of the implementation of that? So Dremio,
[00:25:09] Unknown:
well, we can definitely respect the underlying permissions of the data sources. But on top of that, we allow companies to really control who can access what data. Right? And so if you think about Dremio as being that platform that's used by the consumer of data, by the end user, to discover, to explore, to query datasets regardless of what data source they're in, that means that Dremio can now basically be in a position where it can restrict access to specific datasets. And that can be as simple as, oh, these datasets are accessible to this LDAP group and not this other group of users.
And it could be more sophisticated. It could be, if you're a member of this group, you can only see the last four digits of this column, say the Social Security number column, but if you're a member of this other group, you can actually see the whole thing, and it's really the same dataset that's being accessed. And so just having this one platform through which people are accessing the data gives us the ability to be that layer of control. Right? Whereas if you were just sitting to the side, yeah, you could have data masking and all these things in your product, but anytime somebody went to the data through some other tool, you know, a BI tool or a notebook, there would be no access control. Right? And so that's kind of what Dremio brings to the table here. On top of that, I think just the fact that you have kind of a system that enables self-service means that people don't have to download the data into kind of these disconnected copies, like spreadsheets and servers under their desks, which is where all the data governance problems start. Right? If you don't give people access to data and don't empower them, they will find a way to do it themselves, and most likely they're going to be downloading data onto their laptops. And at that point, you've totally lost control as an organization, and you have no idea what people are doing with the data, where it's living, what's being shared. And having a system like Dremio enables IT to then kind of empower the end users to have kind of this self-service experience, but at the same time, to really understand what people are doing with data, who they are sharing it with, how often various datasets are being accessed, and whether people are doing things that they shouldn't be doing.
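As a sketch of what that group-based column masking can look like when expressed as a virtual dataset definition, assuming a membership function along the lines of Dremio's is_member() (check your version's SQL reference); the schema and group names here are illustrative.

```python
# Illustrative column masking as a virtual dataset definition. Assumes a
# group-membership function like Dremio's is_member(); the hr schema and
# the group name are hypothetical.
MASKED_DATASET_SQL = """
CREATE VDS hr.employees_masked AS
SELECT
  name,
  CASE
    WHEN is_member('pii_readers') THEN ssn
    ELSE CONCAT('***-**-', SUBSTR(ssn, 8, 4))  -- last four digits only
  END AS ssn
FROM hr."employees"
"""
# Every tool that queries hr.employees_masked -- Tableau, Power BI, or a
# notebook -- gets the same masking, because it lives in the dataset
# definition rather than in any one client.
```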
[00:27:20] Unknown:
And for somebody who is interested in deploying Dremio into their own environment and putting it to use, what is involved in actually getting it set up, and what are the different scaling factors for being able to add capacity to the installation? Yeah. So Dremio is typically deployed in one of two ways. You know, for people running it on Azure or AWS,
[00:27:37] Unknown:
we provide kind of a Kubernetes option with Helm charts that makes it very easy to deploy the product. When it comes to people that are on prem, that kind of have Hadoop clusters, we just integrate with Yarn, which is kind of the scheduler on Hadoop, and that makes it really easy to scale out Dremio on top of the data lake. The Dremio product is composed of kind of two different roles. One is the Dremio coordinator, and one is the Dremio executor. And the coordinator is basically, you know, kind of deployed on your edge node. That's what, say, the BI tools would be connecting to, and it's also the node that kind of serves the user interface. So you'd wanna have maybe one, or maybe several, coordinators from a concurrency standpoint, and you can scale that out independently.
So that's kind of one aspect, the coordinators. And then the executors are responsible for kinda query execution, and that's more driven by how much data you have and what's kind of the volume of data being processed by the queries. And that too can be scaled out independently. So you could have 10 executors, or you could have 100 executors. And, you know, if you're running with Yarn, it's as simple as saying how many executors you want, and we automatically take care of kind of provisioning and deprovisioning those. So Dremio itself has been public for a relatively short period of time. But in that span, you have released a number of different iterations and releases of it. So I'm curious, what are some of the most exciting features that have been added recently, and some of the things that your customers are gaining the most benefit from? Yeah. We try to do a major release once every 6 months. So we launched the company and the product just over a year ago, and we did our 3.0 release last week. So if you look at the website in terms of what's new with 3.0, we introduced a number of new things. We introduced kind of an integrated catalog experience, where users can tag datasets and also kinda have this collaborative wiki to describe various datasets and the spaces in which datasets live. So that's kind of one example of something that was added. We also added kind of sophisticated workload management capabilities, which allow you to control what different users and groups are allowed to do and how much resources they get. So if you wanted to say something like, you know, the interns are not allowed to use the system between 6 PM and 6 AM the next morning, you could do that. Or maybe you wanna say that they can only get 10% of the resources at specific times. So you have all sorts of flexibility around workload management. So that's something we released in preview in the 3.0 release. We also announced a new initiative called Gandiva, which we integrated into Dremio and also contributed to Arrow.
That basically leverages a compiler called LLVM to generate vectorized code, so very kind of advanced columnar processing of data. And for queries that have kind of complex filters and projections especially, so, you know, big case statements and various UDFs, that can actually increase performance substantially; in some cases we've seen very significant speedups on some of the queries that our customers run. So that's been very important to our customers. We've also added various security related capabilities, whether it's kinda end-to-end encryption on the wire of all data using TLS, or using EC2 instance profiles to access data, and a variety of things like that. And then one other thing is we created a new framework for connectors in Dremio, which we call ARP, advanced relational pushdown, and that basically sets us up to be able to deliver many new connectors over the coming months. So you'll see the number of connectors really expand as a result of that, and also the quality of the existing relational database connectors will really improve as well. So in terms of how much gets pushed down: all the window functions get pushed down, all sorts of different aggregations, you know, correlated subqueries. There's just a rich set of things that now all get pushed down into the underlying source, if the underlying source supports that and if that's the best plan, right, compared to, say, using a reflection.
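Returning to the Gandiva point above: Gandiva JIT-compiles expression trees into vectorized machine code using LLVM inside the engine. As a rough analogy only (this is NumPy, not Gandiva), here is the difference in execution shape between per-record, interpreted evaluation and column-at-a-time, vectorized evaluation of the same CASE-like expression.

```python
# Rough analogy only: Gandiva generates LLVM-compiled vectorized kernels;
# NumPy stands in here to show the execution shape, not the technology.
import numpy as np

amounts = np.random.rand(1_000_000)

# Per-record, interpreted evaluation (one Python-level branch per row):
taxed_slow = [a * 1.08 if a > 0.5 else a for a in amounts]

# Column-at-a-time, vectorized evaluation of the same expression:
taxed_fast = np.where(amounts > 0.5, amounts * 1.08, amounts)
```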
[00:31:38] Unknown:
And it's easy, when discussing all of the different capabilities and features of Dremio, to start to think that it's the sort of cure-all for your data problems. But I'm curious, what are some of the situations where it's actually the wrong tool for the job, where something else would be a better fit? Yeah. You know, it's interesting. In fact, sometimes people see the demo, and they're like, well,
[00:32:01] Unknown:
this is too good to be true. Because it really, you know, for the first time, enables people that could not be productive with data to be productive, and makes it so that data engineers don't have to kinda do that very reactive, you know, kind of tedious work that they hate doing. Right? So I think that's all great. But like you're saying, it's important to really understand what the system is great at and what it's not. And I think there are two key things to keep in mind, to make sure that you're aware of, when you're adopting Dremio. The first one being, it's not an ETL tool. Right? So if your use case is, you know, I want to run 5 hour jobs every night, then there are systems that are designed for batch processing.
You know, they do a lot of checkpointing as the queries are running. They're very slow, but they'll get the job done, right, that 5 hour query. So if you think of things like Spark and Hive, those are really good for batch processing, and that's what we'd recommend for those 5 hour jobs. And then the other thing I think that's important is also understanding that, you know, this is a distributed system, and you wanna make sure that you are giving it the resources that are required to be successful. So, you know, every so often, somebody will come to us and they'll try to run it on one server, despite the volume of data that they have, or they wanna support thousands of users. And, you know, the system is designed to scale out, so you just gotta make sure that you're kinda sizing it for whatever you're trying to accomplish. And so
[00:33:33] Unknown:
with the scope
[00:33:34] Unknown:
and capabilities of the project, I'm sure that there are some aspects of it that have proven to be rather challenging. So I'm curious to hear, what have been some of the most challenging aspects of building and maintaining and growing both the technical and business aspects of Dremio? So we launched about a year ago, and, you know, the last year has been incredible for us, way beyond anything that we had imagined. I mean, obviously, you can see the types of companies that have already become customers, whether it's the largest banks in the world, and the largest tech companies, and largest alcohol manufacturers, and cruise lines, and so forth. Right? So it's been kind of a great experience for us. That has required a lot of effort from the entire team here. And so, you know, as we've kind of scaled up the organization, and I think we've grown 3x as a company in the last year, you know, addressing the needs of these types of companies, addressing their feature requests, and making sure that we're doing a great job supporting them 24/7.
Those are obviously challenges that we as an organization have to deal with, right, and execute on. So I think that's something that we've been doing. We've opened up other sites internationally so that we can provide that kind of around-the-clock experience to our customers. Certainly, the breadth of the product, if you think of, you know, just the kind of capabilities and what's needed to deliver this kind of an experience that's, you know, magical to the end user: it requires a lot of investment in kind of development, but an equal amount of investment in kind of QA and testing. And so we have tens of thousands of tests that run every day, because if you think about all the different types of queries that people could run in all these different environments, and data sources, and BI tools, and so forth, there's no way that can be done manually, right, by somebody sitting there and running a few queries and testing things out. So we have really incredible kind of test suites that we run regularly to help us make sure that the system is working as it should in all these different combinations.
And it's not just working, but it's making sure the performance did not degrade because somebody checked in some code that potentially introduces some problem. And it's making sure that memory consumption of the system hasn't changed, you know, from one release to another in a negative way. So those are the kinds of investments we make to make sure that we can kinda deliver on this. And as you look forward to the future of Dremio, what are some of the things that you have planned that you're most excited for? So we typically don't talk a lot about the future, but, you know, there's a huge amount of engineering going on right now in terms of new capabilities in the product, and those span kinda many different areas.
One thing you can expect to see from us over the next year is dozens of new connectors to data sources, ranging from a variety of SQL, kinda relational, systems to, you know, SaaS applications and other sources. I think you can expect to see Dremio, you know, becoming easier and easier to consume for a company. So even if I do not have a data engineering team, I should still be able to take advantage of this. Or if I have a very small one, like, you know, a lot of companies, you know, the data engineering team is really understaffed, and they can't find the number of people that they need in order to be successful. You'll see us continue to make kind of engineering investments to make it possible for these small teams to kinda do the work of, you know, teams that are 10 times the size. So those are other things, in terms of autonomous and kind of AI driven capabilities inside of the product. And are there any other aspects
[00:37:07] Unknown:
of Dremio and data self-service, and just the data space in general, that we didn't discuss yet which you think we should cover before we close out the show? It's always hard to answer what we didn't do,
[00:37:19] Unknown:
because we would have done it otherwise. No, I think
[00:37:24] Unknown:
we've really covered a lot of ground here. So for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I think we're solving the biggest gap,
[00:37:46] Unknown:
but there's obviously a lot more to do. And as I look at, you know, really the years to come, we as an industry have to make it easier. Right? I think companies understand that they have to be data driven. I think that's not a question anymore. Right? If you're, you know, a credit card company, the most valuable asset you have is not the credit cards, it's the data that you have about all these people that are buying things, and what are they buying, and how are they interacting with your systems. And that's the advantage that you have over, you know, say, your competitors. Right? And being able to tap into your data and use it for everything from security to marketing to kind of delivering a better experience, I think, requires everybody in the company to be data driven. And there's a lot of work, I think, remaining to accomplish that. Right? I think it's great in our personal lives. Like I said, you know, my kids are in elementary school, and, you know, they go online all the time and ask questions and get answers, and I think it's still too hard within companies to work with kind of business data. And that's kinda what we're focused on, but, obviously, there's a lot more to do. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing at Dremio. It's definitely a very interesting product,
[00:38:59] Unknown:
and one that I've been keeping an eye on for a while now, and something that I plan to take advantage of in my own systems shortly. So thank you again for that, and I hope you enjoy the rest of your day. Yeah. Thank you. Thanks
[00:39:12] Unknown:
so much.
Introduction and Guest Introduction
Tomer Shiran's Background and Journey
Overview of Dremio
Dremio's Deployment and Customer Use Cases
Unique Features of Dremio
Open Source Strategy and Community Contributions
Dremio's Position in the Data Tool Landscape
Replacing Data Warehouses and Data Lakes
Managing Complexity and Integration
External Integration Points
System Architecture and User Interface
Granular Access Control and Governance
Deployment and Scaling
Recent Features and Customer Benefits
Limitations and Use Cases
Challenges in Building and Growing Dremio
Future Plans and Exciting Developments
Closing Thoughts and Industry Gaps