Summary
The core of any data platform is the centralized storage and processing layer. For many teams that is a data warehouse, but to support a diverse and constantly changing set of uses and technologies, the data lakehouse is a paradigm that offers a useful balance of scale and cost with performance and ease of use. To make the data lakehouse available to a wider audience, the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can focus on answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24/7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24/7 support.
- Your host is Tobias Macey and today I’m interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Iomete is and the story behind it?
- The selection of the storage/query layer is the most impactful decision in the implementation of a data platform. What do you see as the most significant factors that are leading people to Iomete/lakehouse structures rather than a more traditional db/warehouse?
- The principle of the Lakehouse architecture has been gaining popularity recently. What are some of the complexities/missing pieces that make its implementation a challenge?
- What are the hidden difficulties/incompatibilities that come up for teams who are investing in data lake/lakehouse technologies?
- What are some of the shortcomings of lakehouse architectures?
- What are the fundamental capabilities that are necessary to run a fully functional lakehouse?
- Can you describe how the Iomete platform is implemented?
- What was your process for deciding which elements to adopt off the shelf vs. building from scratch?
- What do you see as the strengths of Spark as the query/execution engine as compared to e.g. Presto/Trino or Dremio?
- What are the integrations and ecosystem investments that you have had to prioritize to simplify adoption of Iomete?
- What have been the most challenging aspects of building a competitive business in such an active product category?
- What are the most interesting, innovative, or unexpected ways that you have seen Iomete used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iomete?
- When is Iomete the wrong choice?
- What do you have planned for the future of Iomete?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Iomete
- Fivetran
- Airbyte
- Snowflake
- Databricks
- Collibra
- Talend
- Parquet
- Trino
- Spark
- Presto
- Snowpark
- Iceberg
- Iomete dbt adapter
- Singer
- Meltano
- AWS Interface Gateway
- Apache Hudi
- Delta Lake
- Amundsen
- AWS EMR
- AWS Athena
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's A-T-L-A-N, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform. So, Vusal, can you start by introducing yourself?
[00:01:37] Unknown:
Yes. Sure. Thanks, Tobias, for having me here. I'm Vusal, cofounder of Iomete. Iomete is a lakehouse platform. My background, I'm a systems engineer. I've worked at companies like Uber, Careem, OLX Group, and a telecom company in the past. Yeah. That's my background.
[00:01:58] Unknown:
And do you remember how you first got started working in data?
[00:02:01] Unknown:
It was probably 10 years ago, while I was working at a telecom company. Since my background is distributed systems engineering, and data infrastructure is one of the areas where distributed systems are heavily used, I started out consulting data engineers on how to build data platforms and infrastructure. That's how I got started. Then I got heavily involved over time and found myself building data infrastructure over and over again.
[00:02:35] Unknown:
As far as the Iomete project, I'm wondering if you could describe a bit about what it is that you're building and some of the story behind how you decided you wanted to create it, and why you decided that this was a problem that you wanted to spend your time and focus on?
[00:02:48] Unknown:
Iomete is the most affordable and open source lakehouse platform. The reason we started it is actually my previous experience. As I said, I've been involved many times in building data infrastructure in the past, and I saw some repeating problems as a customer, as a user building that infrastructure. The first thing was that big data platforms are expensive. In the past that wasn't a problem, because only big companies were using big data platforms, like the telecom company I worked at, and they were ready to pay $1,000,000.
But as time passed, it became more common. Data is not an expensive thing anymore; it's a commodity now. Some platforms have made improvements on the cost aspect, but it's still expensive. And as engineers, as builders of this infrastructure, we always got pushed: how can we optimize, how can we make it more cost efficient? That was one reason. The world is becoming data intensive. All companies, in order to stay competitive in the market, should use data, and data is becoming a usual thing. That's why we should provide a platform that's inexpensive and accessible to everyone.
That's one problem. Another one is transparency. With some vendors we were using, and I guess I can name a few well known names that people can easily relate to, like Snowflake, there's a lack of transparency on the storage side, because you just put your data somewhere else, in their own format, plus the compute you use. You don't know what actual compute engine they use. They could optimize their platform but still use cheaper hardware and keep higher margins. So there's a lot of opacity there. I was wishing: hey, I should own my own data. It shouldn't go somewhere else. And there are also really great open data formats for the analytical world, like Parquet or ORC, and the data should stay on my side. Then I feel comfortable that I own my data.
And whenever I don't like the vendor's service, I can switch to a different platform, which also pushes the vendor to provide the best service because they're not locking in their customers. Behind this transparency issue there's also lock-in; with all of this, they're kind of locking in their customers. Another thing is integrations. Every time we build what we today call the modern data stack, the components are very well known: you need ingestion, you need a data analytics or processing layer, data governance, BI, etcetera. These vendors are well known, and the modern data stack has matured a lot, so that any company, regardless of its business type, is more or less building the same thing. Most likely, to give names from the well known vendors, it's Fivetran for ingestion, or Airbyte, or a similar technology; the data processing side is Databricks, Snowflake, BigQuery, this type of product.
For data governance you need to go to another vendor; you need to bring in Collibra or, I don't know, Talend or similar platforms. And usually all these vendors have really great features, but it costs a lot of engineering effort to integrate all of this. Every company building its infrastructure has the same components with different vendor names and is doing the same integrations. And I remember at that time thinking: if you look at the world as our home, inside this home we are repeatedly doing the same thing, integrating all these different pieces, but we could centralize this effort in one company; we shouldn't do this repeatedly.
Then we can save those resources to do something that is innovative, not just very repetitive. So that was the idea. That's another thing I wished for: all these components are well known, so why wouldn't we have a platform that brings all of this together? Because this is not rocket science. We need ingestion, we need the analytics layer, we need data governance. And that's all. I saw this at the telecom company; OLX Group was a big ecommerce company; Careem was like a smaller version of Uber, which was bought by Uber.
And at Uber it was the same set of technologies; the names change, but the concept is the same. So all of these things kind of triggered me. Let's start with the cost efficiency aspect. I can give an analogy; actually, it's a bit of a coincidence before this podcast. I saw an interview with the Figma cofounders, and their pitch was so boring. They were asked what the difference was compared to Photoshop, Adobe XD, this kind of stuff, and they said they wanted to provide something easy to use and inexpensive. At the time, that didn't sound like much; just trying to sell inexpensive software wouldn't keep you in the market, you'd get crushed. But 10 or 12 years later they were sold for $20,000,000,000, though that's maybe not such good news for customers, because the reason customers loved it is that design had become an everyday thing and they were searching for an inexpensive solution.
I think we can use the same analogy. We want to stay in that spot on the map, whatever that map is: we want to provide something accessible and easy to use. And one reason we use open source is, of course, that building this type of platform from scratch is very hard; it's not really possible. But another reason we continue with open source is the transparency aspect. We want to say: hey, this is open source, we store data on the customer's side, we are transparent about that. We show which nodes you use; we are transparent about that. We don't add any markup, because since we manage the whole infrastructure, our business model is to get reserved instances from AWS and earn our margin there, not from the customer. We are trying to solve all of this with this model.
And integration wise, we built most of it, but our vision is to provide an all-in-one platform. We have data governance in place, authorization, the analytics layer, and to some extent the ingestion part. We're working towards having an all-in-one, affordable, easy to use platform for customers. Sorry if this became a very long introduction, but, yeah, that's how I can summarize what we do. No. Absolutely.
[00:10:38] Unknown:
There are a lot of different pieces that go into building a system like this, and a lot of people will say, oh, I just need a lakehouse, so I'll just use Spark or Trino. And then you start digging into it and realize, oh, I actually need these other half dozen capabilities, as you mentioned. And it's the same thing if you say, oh, I just need a data warehouse, so I'll choose Snowflake. And, oh, now I actually need governance and a data catalog and all these other pieces. And, oh, it wasn't as simple or as inexpensive as I thought it was, because I forgot about everything else that goes into it. And to the point of the selection criteria, the core of Iomete is this principle of the lakehouse architecture, the lakehouse platform.
You brought up the topic of the modern data stack, which has been oriented around the disaggregation of these discrete capabilities. And a lot of the focus and center of gravity for the modern data stack has been the cloud data warehouses, so Snowflake, BigQuery, Redshift, etcetera. And the lakehouse architecture is a bit of a response to the popularity of those systems, because the lakehouse or the data warehouse is where all of your data is stored. It becomes that center of gravity, and it becomes one of the most impactful choices that you can make as you're designing your data platform, because it impacts all of the other integrations that you're able to select from. If you're using Snowflake, there's a fairly large ecosystem of tools and plugins to be able to work with it, because they've invested in building that ecosystem around themselves.
Whereas if you're building around the open data lake slash lakehouse, depending on whether you're using Spark or Trino or Presto or what have you, you're going to have a highly variable experience as far as the level of integration and the level of off-the-shelf pieces you can use. And I'm wondering what you see as the largest motivating factors that push people toward that lakehouse architecture, even knowing that they're not going to have as easy of a time as if they were to just, say, buy Snowflake or buy BigQuery?
[00:12:44] Unknown:
In general, I think the cost aspect; Databricks is also pushing the lakehouse architecture. Integration wise, they're also behind what Snowflake provides. In our case, yeah, our integration with the ecosystem is not at the Snowflake level, and our customers don't expect us to be at that level. But what we give them instead: eventually we'll provide a richer ecosystem, but they can save 5 to 10 times on their cost by going with us. Plus, the data is stored in their account, in Parquet format, which gives additional confidence that, hey, this company can only serve us better, because otherwise we would leave. That's also good for us; it keeps us healthier. We know that when we close the deal and they move 10 terabytes or 100 terabytes of data, it's not done. We have to keep serving them well.
And we have data governance and other things, like Spark jobs. Snowflake has Snowpark, but I doubt it's going to hit the same maturity that Spark has. So you get additional stuff that's not in Snowflake. Or there are features like Snowpipe, which is not actually a feature; that's just a shortcoming from the fact that they don't have a Spark streaming kind of platform, so they built that feature in order to provide some way to move the data that's in files.
Anyway, integration wise we are a little behind. With open source ones like Spark, you can build your own data lakehouse; back to your question, in 15 minutes, maybe in a day. That's where the open source platforms are good: you can use Spark or Trino, and then you put Apache Iceberg on top of that to get ACID transactions, etcetera. But the main problem is the one you explained: the ecosystem integration is not there. Plus, the open source products are good, but they're not enterprise ready. They don't have an authentication layer, an authorization layer. They don't have a good UI where you manage all these things, manage your users, all of that.
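To make that "build it yourself in a day" path concrete, here is a minimal sketch, assuming Spark with the Apache Iceberg runtime and a Hadoop-style filesystem catalog on object storage. The bucket, catalog name, and package version are illustrative assumptions, not details from the episode.

```python
# Minimal roll-your-own lakehouse: Spark plus the Apache Iceberg runtime.
# Catalog name, warehouse bucket, and package version are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("diy-lakehouse")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Tables created through the Iceberg catalog get ACID semantics on top of
# plain Parquet files sitting in object storage.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    ) USING iceberg
""")
```

As he notes, this gets you a working table format in minutes; the authentication, authorization, UI, and integration layers discussed next are the parts that take the real effort.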
But, yeah, we are building that. Two months ago we released a dbt adapter, which is huge, because all engineers love dbt. They do all their SQL based transformations with it, and dbt now supports dataframe based transformations too; we are working to integrate that as well. On the ingestion side, we opened a pull request to Airbyte for an Airbyte integration. There is also Singer, which is backed by Meltano; we finished that integration too. Those are the vendors that have open source versions. Now we are starting on the closed source vendors, like Fivetran.
Data ingestion is important because if you don't have easy data ingestion, then you won't have data inside your platform, and then customers won't use you. And on the consumption side, we support all the mainstream BI tools. I can say that, integration wise, customers won't miss a lot compared to Snowflake; we cover all the mainstream integrations. But probably there will be some nuances they could miss. With the open source ones, though, they will definitely miss all of that. They would have to build what we have been building for the last 2 and a half years if they want to build that kind of platform in house.
[00:16:45] Unknown:
To your point of the work that you're doing to build it: if somebody says, I'm just going to pull all of the off-the-shelf open source projects and integrate them on my own, what are some of the hidden difficulties and incompatibilities that they're likely to run into, and some of the things that you've come up against in the process of building out this fully integrated platform in the shape of Iomete?
[00:17:06] Unknown:
A few things I mentioned: the authentication and authorization layers, and building some interface around all of this. And I think the most difficult part is the infrastructure. Spark is good, but it has a thousand different variables that you have to get the right values for. And infrastructure wise, where are you going to run it? Are you going to run on plain virtual machines or on Kubernetes? How are you going to handle scale out and scale down? Spark has its own scheduler; how is it going to work with the Kubernetes scheduler? Are you going to use the standard scheduler, or a custom scheduler?
There are lots of nitty-gritty details. I can give a few examples. One is that once you run for some time, you realize, oh, you have to run all your cluster nodes in the same AZ, because cross-AZ traffic costs a lot. We take care of this because we are trying to optimize all these cost aspects for multiple customers, not only for ourselves; that's why we care about these kinds of nuances a lot. But if you're building this type of platform just for your own business, and it's not the core of your business, you wouldn't care about all these details, and it costs a lot, like those cross-AZ charges.
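As a rough illustration of the single-AZ point, and assuming Spark running on Kubernetes (EKS or similar), one way to keep driver and executor pods in a single availability zone is a node selector on the standard topology label. The zone value, container image, and API server address below are placeholder assumptions.

```python
# Pin Spark pods to one availability zone so shuffle and node-to-node traffic
# never crosses AZs (cross-AZ data transfer is billed). Values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("single-az-cluster")
    .master("k8s://https://kubernetes.default.svc")
    .config("spark.kubernetes.container.image", "my-registry/spark:3.3.0")
    # Standard Kubernetes topology label; all driver and executor pods are
    # scheduled onto nodes in us-east-1a only.
    .config("spark.kubernetes.node.selector.topology.kubernetes.io/zone",
            "us-east-1a")
    .getOrCreate()
)
```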
Another aspect: normally when you just install and run, I've seen this many times because I did it too, and only realized later that we hadn't done much cost optimization, though we do now because we care about reducing that cost as much as possible. When you just install and use S3, traffic goes through the public Internet. And since you're going to put all your nodes in a private subnet, it goes through a NAT gateway, right? And NAT also charges a lot. But there is a specific service for S3 in AWS, the S3 gateway endpoint; you configure it, and then traffic goes over the backbone network of AWS to S3, which removes that charge since it doesn't go through the NAT. Plus, you stay compliant because you don't cross the public Internet.
And that service doesn't have any additional charges. All these small things, if you're a new company building this, you only notice after you get a huge bill. This is learning from the mistakes we made before. But there are a lot of these kinds of problems, especially around handling scaling. It's not a huge problem, but we're still fighting with it.
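For reference, here is a hedged boto3 sketch of creating that S3 gateway VPC endpoint so traffic from private subnets reaches S3 over the AWS backbone instead of a NAT gateway. The region, VPC ID, and route table ID are placeholder assumptions.

```python
# Create a (free) gateway VPC endpoint for S3; S3 traffic from the associated
# route tables then bypasses the NAT gateway and its per-GB processing charges.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # route table of the private subnets
)
```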
[00:19:58] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month. For more information on Prefect, go to dataengineeringpodcast.com/prefect today.
That's Prefect. As far as the lakehouse architecture paradigm, we've discussed a bit about some of the challenges in the ecosystem and the integration capabilities, but from the processing, storage, optimization, and basic functionality layer, what are some of the shortcomings that you see in how lakehouses are able to operate as compared with a fully integrated data warehouse architecture?
[00:21:11] Unknown:
The shortcomings right now, I can say, are sub-second latency queries or point lookup queries. There are also problems if you want to do real time ingestion or real time analytics, because there's the small files issue, which is famous, I think, in every data lake and lakehouse platform. We do have a way to optimize that, but we recommend not ingesting data at intervals lower than 15 minutes. I think in many cases that's acceptable. But the world is moving. For analytics needs, for reporting, 15 minutes is more than enough.
But there are also real time needs, and that should be a different technology. People ask: hey, if you have this platform, can't we run both real time and analytics needs on the same platform? It's converging, but right now the lakehouse has these shortcomings, like the small files issue, that prevent that from happening soon; it's going to happen eventually. In traditional data warehouses that's not a problem, because the files are structured in a different way, so they don't have these problems, but they have other problems.
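Since the small-files issue keeps coming up, here is a brief sketch of the usual mitigation on an Iceberg table: periodically compacting small files with Iceberg's rewrite_data_files procedure. The catalog and table names follow the earlier illustrative sketch rather than anything from the episode.

```python
# Compact small data files written by frequent ingestion into ~512 MB files.
# Assumes `spark` is a SparkSession with the Iceberg extensions enabled and a
# catalog named `lake`, as in the earlier sketch.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```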
[00:22:40] Unknown:
Digging more into the lakehouse functionality and capabilities, you mentioned things like governance and the transformation layer. I'm wondering if you can talk through some of the decision process you went through: which pieces were mature enough to incorporate into the Iomete platform, which pieces were worth the effort of building from the ground up, and some of the build versus buy decisions that go into building a platform like Iomete versus doing it in house for your own company?
[00:23:16] Unknown:
I think first, when you choose a vendor, build versus buy, and we do this for our own company too, most of the time we should go with buy. I mean a sensible buy option, not just any buy. Because if it's not the core of your business, building a platform will cost you more than just buying and paying someone else, because that company has already centralized the building cost. But if it's the core of your business, let's say you're a data company like us, imagine we were buying a service from Snowflake and then selling that service; that wouldn't make sense for us. Right?
Or if we were providing an ELT or ETL type of service, again we should build it, because that's the core of our business. Also, for companies, even if they assume they're going to reach Facebook, Google, or Uber level, at the beginning it makes sense to buy, to save the initial building cost and save on hiring and maintaining data engineering resources. Then once they get bigger, they can evaluate the situation: if having an internal team and building everything in house makes them more cost efficient, that's the time to decide that. But I think most companies won't need that, even bigger companies like banks and insurance companies.
Since they don't have enough know-how, or a big enough team, or since it's not the core of their business, they're not very interested in having that type of team, even if having a team internally would be cost efficient. They're just buying the platform and paying someone, basically outsourcing the job. If I speak from the engineering perspective, this has happened to me a few times: you start with the open source one, but at some point you get frustrated, because many open source platforms are designed in a way that will frustrate you. And then you go to the vendor: can you give me a ready to use platform?
Also, sometimes you have problems, but you don't have enough resources to fix those bugs. After some time you realize, oh, it would be nice to have a vendor to bring your problems to, to fix some urgent bugs, etcetera. Yeah, I think most of the time, in today's world, it makes sense to mostly buy versus build your own.
[00:26:21] Unknown:
So digging more into the Iomete platform itself, I'm wondering if you can talk through the architecture that you have built. You already mentioned that you're using Spark as the core of it, but I'm curious if you can talk through the overall implementation and some of the pieces that you decided you wanted to build from scratch, because it either was not cost effective to run as is off the shelf or was completely missing from the ecosystem?
[00:26:49] Unknown:
The core is Apache Spark and Iceberg; we are also planning to support Apache Hudi and Delta. Besides that, of course we cannot build the platform without leveraging these open source projects, because that's a huge part of it. But there are other parts, like the data catalog. We evaluated many platforms, like Amundsen from Lyft; there are many products that I don't remember right now. I've also used the data catalog at Uber, so we have that experience. The other engineer I started working with came from Google, and he brought his own experience from different companies, of course from Google too. We put together all these features and how we want to see the platform.
Many of those things are actually implemented in Amundsen. But since it's an open source product, many of them are built in a more generic way, trying to support many platforms, not only Spark. That's why the code base has grown huge, and from our perspective it's not easily maintainable. In those types of situations, we decided to build our own. We built the data catalog, data governance, tagging, all of that in house. For the SQL editor, we also evaluated some open source ones; the famous one is Hue, which has a really nice UI, but it's extremely heavy, and the code base is a total mess.
And it doesn't have multi-tenancy; we would have to start a new, separate node for different customers. Another area where we had to build our own is the authorization layer. There's no good authorization for Spark; we had to build everything in house. And we had bold ideas, because we saw in previous companies how important a fine grained authorization service is. Usually the other vendors don't provide it either; you have to get another vendor, and we didn't want to involve another vendor just for the authorization use case. That's why we built table level, column level, even data masking; everything is managed from a single place using policy rules. We put a lot of effort into that, because as users we have seen many times how important it is.
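As a rough, hypothetical illustration of what column-level masking means in practice (not Iomete's actual policy engine), the view below masks a sensitive column in plain Spark SQL; a policy-rule system applies this kind of rule automatically per user or role instead of through hand-written views. Table and column names are made up.

```python
# Expose a masked variant of a table for users without access to raw emails.
# Assumes `spark` is an existing SparkSession; names are hypothetical.
spark.sql("""
    CREATE OR REPLACE VIEW analytics.customers_masked AS
    SELECT
        customer_id,
        concat(repeat('*', 8), substr(email, -4)) AS email,  -- keep last 4 chars
        country
    FROM analytics.customers
""")
```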
That's something we also built in house. For Spark, there is the serverless Spark part. We used AWS EMR before, and Databricks. EMR was the lowest cost one, but the UI is not great, somewhat archaic. Databricks also speaks to deep technical people; the user interface was not that user friendly. That's why we took a little bit of a different approach, a very minimalistic UI, because it comes with sensible default values. But for experienced users, you can go into the details and tweak all the different attributes. Usually even non-Spark users can easily deploy their Spark jobs and schedule how frequently they want to run them.
All of these parts are being developed from scratch, but for the main engine, Spark and Apache Iceberg, which could take more than 10 years to build, you don't have a choice. You have to take the ready open source solution.
[00:30:41] Unknown:
In terms of the selection of Spark as the core, I'm curious what you see as the relative strengths of that from a lakehouse perspective as compared to things like Presto and Trino, or the work that the Dremio folks are doing?
[00:30:56] Unknown:
We chose Spark actually 2 years ago. At the time, many people were surprised when we talked about the SQL functionality in Spark: really? Maybe you should use Trino or Presto. But the problem with Presto and Trino is that they're more memory sensitive products. We have seen this; even Athena, which is the AWS service backed by Presto, we have seen crash many times when there is not enough memory. Reliability wise, we always had lots of problems with Presto. And also, at large scale, Presto's performance decreased compared to Spark.
And the second reason: at the time, 2 years ago, there were some signs, but today it's very obvious. These ACID table formats, the lakehouse formats like Apache Iceberg, Hudi, and Delta Lake, support Apache Spark first, and then Presto, Trino, and others. And the support they provide for the other platforms is not at the same level as for Spark. You can get better integration with the ecosystem through Spark.
There has always been huge development on Spark, but in the last 2 or 3 years there has been a huge push to improve Spark SQL functionality, especially the optimization rules; there are huge changes there. We did some benchmarks and we see huge improvements: compared to a benchmark we ran 6 months before, there is at least a 2 times improvement between Spark version 3.2 and Spark version 3.3. That's the reason we chose Spark, and I think it's still the right choice. But in general, we call Iomete a data platform; maybe in the future, if it's necessary for specific use cases, we could integrate Trino as well.
But Spark right now covers that too. Trino works well; it has more data integrations, and Spark is catching up in that space.
[00:33:27] Unknown:
And you mentioned a little bit of the investments that you're making to make it easier to adopt Iomete, things like the dbt adapter and the work that you're doing with the Airbyte team to create an integration there. I'm wondering if you can talk through how you think about prioritizing which elements of the ecosystem to invest in to manage that adoption curve, and some of the signals that you're looking to in order to understand what are the biggest pain points or most important pieces to solve for early on to help your customers manage that transition or manage the initial adoption?
[00:34:08] Unknown:
We have gotten many customers through dbt. They saw on the dbt page that there's an integration and that we provide good performance for the value. For now, our main focus is mostly on the data ingestion side. Once we release Airbyte and Singer, we want to really focus on Fivetran, because many customers are also coming in already using Fivetran. We try to make the migration as smooth as possible. As a vendor, I want to say: hey, you can keep using those vendors and just use us to replace Snowflake or something else. That's why having more integrations with the well known vendors is important.
Yeah. Fivetran and Segment.io are the next ones scheduled for us; that's very important. And for the BI side, I think we are good there, because we have almost all the integrations. It's mostly the ingestion side, I can say.
[00:35:17] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day, every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and spreadsheets are already available as plug and play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines.
You get real time data flow visibility with fail-safe mechanisms and alerts if anything breaks; preload transformations and auto schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus transparent pricing and 24/7 support, makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14-day trial that also comes with 24/7 support. As far as building a platform that is targeting this ecosystem: the core storage, querying, and processing layer is the most important or most impactful choice that can be made for a data platform.
I'm wondering what have been some of the most interesting or challenging aspects of building an entrant into that space: figuring out what competitive advantages you're looking to offer, how to talk to customers to help them understand what that competitive landscape looks like, and also, to the point of what you were saying earlier about building on these open source systems so that you don't have vendor lock-in, how to communicate with customers about that fact as well.
[00:37:13] Unknown:
Interestingly, many times when we get that first customer interview, they do a better pitch than us, because they saw all of this, and many times we have heard them say: this all sounds too good to be true, do you really provide all of this? We are getting a lot of Snowflake customers, by the way; not current Snowflake customers, but people who already tried Snowflake, or did due diligence, or were considering Snowflake. I can give one example: one of our customers, a big payments customer, had something like 100k in credits for Snowflake, but after moving 5% of their data, they realized it was going to be very expensive.
And they started looking for an alternative, not necessarily to replace it, but maybe to use some other technology next to Snowflake, because they cannot put all their data into Snowflake. And they were looking at Databricks or a similar solution like Iomete. So I think people really appreciate the cost aspect first. The second is that they're happy with storing the data on their side, because we didn't build this just assuming people need it; we saw it as users, we saw our managers asking for it, and we saw in many different companies that it's a real need. People want their data accessible; it shouldn't be somewhere else.
And we hear the same pitch from the customers: they're really happy to own their data. And we also give this message: hey, you own the data, you can go anytime. The only way we lock you into our platform is by providing a better service; there is no other way. You can just stop the cross-account access and go to a different vendor with your data. So, yeah, I think the first thing is the cost part. The second, having their data controlled by them and accessible by them, is very appreciated by the customers.
[00:39:37] Unknown:
Another interesting aspect of the lakehouse paradigm is that it's intended to draw on the benefits of both the data lake and the data warehouse. The data lake is often seen as a place to experiment with data, work with it, and figure out which pieces I actually want to use so that I can clean it up and model it into a more data warehouse style, or load it into maybe an OLAP store so that I can get faster interactivity with it. And with the lakehouse architecture, I'm wondering what you see as the realities of people actually going that next step of saying, okay, I've got it in the data lake, I'm able to use the warehouse for the modeling, but I actually need to also use this data in another avenue, where maybe I need faster query times or maybe I need to load this into another operational system; just some of the realities about how people are actually using the data that originates in and is owned by the lakehouse and feeding that into some of these other types of systems.
[00:40:39] Unknown:
That's the usual pattern people use: they move the data to the lakehouse, they do the aggregation, clean-up, and transformation, and then write back to some other platform like HubSpot, or to MySQL or other databases for fast access. But I want to give some clarification on the difference between a data lake and a lakehouse, because these 2 terms are used interchangeably, but there's a difference. The data lake got popular with the decoupling of storage and compute, because that allows people to scale those resources separately.
You can even shut down your compute cluster, scale out, scale down; you can even create multiple clusters to isolate different teams or different use cases. But the data lake has the problem that you have to deal with files. With the metadata you have a table abstraction, but if you want to provide column level authorization, you can't; you can't even give table level authorization, you have to give file level authorization: this person has access to this file, and then they get access to the table. You cannot do transactional inserts. You cannot do updates, deletes, or merges.
You have to do all of that with additional scripting tools, etcetera. So with all the good benefits, people started missing the good parts of traditional data warehouses: the table level abstraction, inserts, updates, deletes, transactional changes. The lakehouse is basically the same data lake; that's what's running behind the scenes, but there's an additional layer that goes on top of the data lake and presents it as a data warehouse to the outside world. Now you don't need to deal with files; the lakehouse layer handles that.
ACID comes into the picture: you have insert, update, delete, and merge functionality, even in a transactional way. You have your version history; you can do time travel across your data versions. All of these things are brought by the lakehouse layer, this additional layer. So that's the difference between a data lake and a lakehouse. And with that change, you have more flexibility here, and the need to use 2 different systems is fading away.
People mostly stay in the lakehouse; they don't need another OLAP database, they can do all of this in the lakehouse. But for specific use cases, if you need sub-second latency, you can write the aggregated data back to MySQL or Postgres, and then your ML or AI systems can utilize that data, which is an aggregated form of your data from the lakehouse.
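To ground that distinction, here is a short sketch of the warehouse-style operations the lakehouse layer adds on top of plain files: a transactional upsert and a time-travel read against a hypothetical Iceberg table. It assumes `spark` is a SparkSession with the Iceberg extensions and `lake` catalog configured as in the earlier sketch, and that `staging_updates` is an existing temporary view of incoming changes.

```python
# Transactional upsert: something a raw file-based data lake cannot do directly.
spark.sql("""
    MERGE INTO lake.analytics.customers AS t
    USING staging_updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the same table as of an earlier point in time
# (timestamp given in epoch milliseconds).
as_of = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", "1672531200000")
    .load("lake.analytics.customers")
)
```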
[00:43:59] Unknown:
In your experience of building Iomete and working with your customers and integrating with this ecosystem, what are some of the most interesting or innovative or unexpected ways that you've seen your platform used?
[00:44:12] Unknown:
Yeah. Usually we don't have such cases; people are mostly data engineers, and they really know what they're doing. One interesting case: we got a bug report that the result grid in the SQL editor just froze. Once we checked the situation, it turned out the table had 1,700 columns; we had completely missed that kind of situation. In the analytics world, having a large number of columns is usual, but even I wasn't expecting that number of columns in a single table. Sometimes we also get customers asking: do you have Flink?
And when they explain their situation, it's basically more of an operational use case than analytics. In that case, I recommend they go to a different vendor or take a different path, because the system is designed for analytics use cases, not operational ones. But that's something to think about for the future, because that area is also being consolidated; people want to use one platform for analytics and even for operational use cases that have some analytics aspects.
[00:45:33] Unknown:
In your work of building the platform and building the business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:45:41] Unknown:
I'm not sure I could be a second-time founder, because if I had known it was going to be this hard, I probably wouldn't have started. I was told that building a business is very hard, but I didn't know I needed to multiply that. The part I didn't know is that there's huge intellectual satisfaction as you build stuff that helps other people. When you see those results, you forget about all the hard moments you've had. So, yeah, I think it's kind of fun; it depends which perspective you're looking from. But I've understood it's not for everyone.
I lost a few of my friends along the journey because they couldn't handle it, and I totally understand. Before, I thought everyone could start a business, but now I understand it's not for everyone; it requires really tough mental energy. That's something I learned, but I think I enjoy the process. I like it.
[00:47:00] Unknown:
And so for people who are interested in building their data platform and think that a data lakehouse is the right approach, what are the cases where Iomete is the wrong choice?
[00:47:12] Unknown:
Sometimes, if it's not an analytics use case but more of an operational use case, we are not a good fit. Also, in some cases people are looking for sub-second or really low latency use cases, and Iomete is not the right choice there. And not just Iomete; the similar technologies, like Presto, Trino, Databricks, Snowflake, are not the right choice in those kinds of situations either.
[00:47:39] Unknown:
And as you continue to build and iterate on the Iomete product, what are some of the things you have planned for the near to medium term, or any particular problem areas or projects that you're excited to dig into?
[00:47:51] Unknown:
For the coming few years, first, to complete our vision of providing an all-in-one platform, so that without going to 3 to 5 different vendors and spending time on integration, customers can get everything in one platform at an affordable cost. That's our dream: to be the go-to platform for anyone who cares about cost. And the next milestone is becoming not only the platform, but a smart platform. I can give an example with the SQL editor: not just autocomplete and IntelliSense, we're also going to provide smart suggestions.
When you write a query, the editor can suggest: hey, this table is usually joined with these tables, you can use these columns, this kind of stuff. We have data governance; we collect this metadata, and we want to use it actively, not just have it sit there. Metadata shouldn't be something you only see when you go and search for it; it should help in different places. As I said, in queries it can suggest: this might be the wrong column you're using in the join, or it can use the glossary to suggest additional things. On the data governance side, maintaining data assets is hard; it requires a lot of user input. But you can imagine having an intelligent assistant next to you who knows all the ins and outs of the data and suggests things: hey, what about this data set? It seems it hasn't been refreshed in the last few hours; there might be a problem.
Or: hey, I suggest these tags for these columns. We did this, by the way; we built a machine learning slash regex type of engine that scans data and puts out auto-suggested tags, but we want to bring it to the next level. Anyway, the goal is not just to provide a platform, but a kind of smart platform. Whoever uses the platform should feel there is someone else constantly assisting them in the process.
[00:50:16] Unknown:
Are there any other aspects of the work that you're doing at Iomete, or the overall ecosystem around lakehouses and that architecture paradigm, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:27] Unknown:
In my opinion, data sharing is going to be a thing for the future; all these platforms are working on it, Snowflake has something, and Databricks built something for their data lake. And there is data mesh coming in the future. We don't announce ourselves as a data mesh just because it's a buzzword, but we do care that the architecture we build is compatible with the data mesh approach, which is not new. When I worked at Uber, everything was built exactly the way the data mesh architecture is explained; we just didn't call it data mesh. That's the only difference.
I use a lot of my experience from Uber in building this platform, which means we're also going to be compatible with the data mesh architecture. Many companies are still missing the centralized data platform; they still have to catch up to that phase. But the next phase, most likely, is data mesh, which is solving not an infrastructure problem but more of an organizational problem. It's the same way that microservices brought some problems while solving other problems. For people who want to adopt a data mesh, I would suggest not considering it until you have that organizational problem, because it's costly.
It solves the organizational problems, but it's costly. That's why, if you can live with the centralized data storage approach, I think that's the way to go. But once you hit these organizational problems, and you want to separate different organizations and treat them as different domains, that's when you have to think about data mesh. We provide all the functionality that the data mesh architecture requires; we just haven't started announcing ourselves that way. Trino is advertising themselves as a data mesh solution.
I think it's a small part of it; I don't understand why they're advertising themselves as a kind of whole solution for data mesh. And, back to the data sharing part, that's also going to be very important as part of data mesh, because data mesh is drawing the boundaries and then creating a protocol for data sharing. And so far there is no easy protocol for data sharing. Databricks brings its own protocol, Snowflake has its own protocol, but probably we need to have something like HTTP for data.
That's something we are doing our research on; probably soon we can also bring something to that space too.
[00:53:26] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:53:41] Unknown:
Yeah. The biggest gap, I think, is having multiple vendors for the standard data platform. I think all of this should be in one place; it shouldn't be 3 to 5 different vendors. Separately, there are good tools out there; I'm not talking about the cost aspect, but there are good tools, each in its own specific area. The missing part is having everything in one place, a unified experience.
[00:54:12] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing at Iomete. It's definitely a very interesting platform, and I'm excited to see it out there as an option for people who want to build their own lakehouses. I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks a lot. Again, thanks for having me here. I enjoyed the process. I enjoyed the conversation.
[00:54:43] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
The Iomete Project: Vision and Motivation
Challenges and Solutions in Data Integration
Building a Fully Integrated Platform
Lakehouse Architecture: Strengths and Shortcomings
Technical Choices and Implementation
Adoption and Ecosystem Integration
Customer Experiences and Feedback
Lakehouse Architecture in Practice
Innovative Uses and Lessons Learned
When Iomete is Not the Right Choice
Future Plans and Smart Platform Vision
Data Sharing and Data Mesh
Closing Remarks