Summary
The core of any data platform is the centralized storage and processing layer. For many teams that is a data warehouse, but to support a diverse and constantly changing set of uses and technologies, the data lakehouse is a paradigm that offers a useful balance of scale and cost with performance and ease of use. To make the data lakehouse available to a wider audience, the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can focus on answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24/7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24/7 support.
- Your host is Tobias Macey and today I’m interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Iomete is and the story behind it?
- The selection of the storage/query layer is the most impactful decision in the implementation of a data platform. What do you see as the most significant factors that are leading people to Iomete/lakehouse structures rather than a more traditional db/warehouse?
- The principle of the Lakehouse architecture has been gaining popularity recently. What are some of the complexities/missing pieces that make its implementation a challenge?
- What are the hidden difficulties/incompatibilities that come up for teams who are investing in data lake/lakehouse technologies?
- What are some of the shortcomings of lakehouse architectures?
- What are the fundamental capabilities that are necessary to run a fully functional lakehouse?
- Can you describe how the Iomete platform is implemented?
- What was your process for deciding which elements to adopt off the shelf vs. building from scratch?
- What do you see as the strengths of Spark as the query/execution engine as compared to e.g. Presto/Trino or Dremio?
- What are the integrations and ecosystem investments that you have had to prioritize to simplify adoption of Iomete?
- What have been the most challenging aspects of building a competitive business in such an active product category?
- What are the most interesting, innovative, or unexpected ways that you have seen Iomete used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iomete?
- When is Iomete the wrong choice?
- What do you have planned for the future of Iomete?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Iomete
- Fivetran
- Airbyte
- Snowflake
- Databricks
- Collibra
- Talend
- Parquet
- Trino
- Spark
- Presto
- Snowpark
- Iceberg
- Iomete dbt adapter
- Singer
- Meltano
- AWS Interface Gateway
- Apache Hudi
- Delta Lake
- Amundsen
- AWS EMR
- AWS Athena
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's A-T-L-A-N, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform. So, Vusal, can you start by introducing yourself?
[00:01:37] Unknown:
Yes. Sure. Thanks, Tobias, for having me here. I'm Vusal, cofounder of Iomete. Iomete is a lakehouse platform. My background, I'm a systems engineer. I've worked at companies like Uber, Careem, OLX Group, and a telecom company in the past. Yeah. That's my background.
[00:01:58] Unknown:
And do you remember how you first got started working in data?
[00:02:01] Unknown:
It was probably 10 years ago, while I was working at a telecom company. Since my background is distributed systems engineering, and data infrastructure is one of the areas where distributed systems are heavily used, I started out consulting data engineers on how to build data platforms and infrastructure. That's how I got started. Then I got heavily involved over time and found myself building data infrastructure over and over again.
[00:02:35] Unknown:
As far as the Iomete project, I'm wondering if you could describe a bit about what it is that you're building and some of the story behind how you decided you wanted to create it, and why you decided that this was a problem that you wanted to spend your time and focus on?
[00:02:48] Unknown:
Iomete is the most affordable and open source lakehouse platform. The reason we started it is actually my previous experience. As I said, I've been involved many times in building data infrastructure in the past, and I saw some repeating problems as a customer, as a user building that infrastructure. The first thing was that big data platforms are expensive. In the past that wasn't a problem, because only big companies were using big data platforms, like the telecom company I worked at, and they were ready to pay $1,000,000.
But as time passed, it became more common. Data is not an expensive thing anymore; it's a commodity now. Some platforms have made improvements on the cost aspect, but it's still expensive. And as engineers, as builders of this infrastructure, we always got pushed: how can we optimize, how can we make it more cost efficient? That was one reason. The world is becoming data intensive. All companies, in order to stay competitive in the market, should use data, and data is becoming a usual thing. That's why we should provide a platform that's inexpensive and accessible to everyone.
That's one problem. Another one is transparency. With some vendors we were using, and I guess I can name a few well known names that people can easily relate to, like Snowflake, there's a lack of transparency on the storage side, because you just put your data somewhere else, in their own format, plus the compute you use. You don't know what actual compute engine they use. They could optimize their platform but still use cheaper hardware and keep higher margins. So there's a lot of opacity there. I was wishing: hey, I should own my own data. It shouldn't go somewhere else. And there are also really great open data formats for the analytical world, like Parquet or ORC, and the data should stay on my side. Then I feel comfortable that I own my data.
And whenever I don't like the vendor's service, I can switch to a different platform, which also pushes the vendor to provide the best service because they're not locking in their customers. Behind this transparency issue there's also lock-in; with all of this, they're kind of locking in their customers. Another thing is integrations. Every time we build what we today call the modern data stack, the components are very well known: you need ingestion, you need a data analytics or processing layer, data governance, BI, etcetera. These vendors are well known, and the modern data stack has matured a lot, so that any company, regardless of its business type, is more or less building the same thing. Most likely, to give names from the well known vendors, it's Fivetran for ingestion, or Airbyte, or a similar technology; the data processing side is Databricks, Snowflake, BigQuery, this type of product.
For data governance you need to go to another vendor; you need to bring in Collibra or, I don't know, Talend or similar platforms. And usually all these vendors have really great features, but it costs a lot of engineering effort to integrate all of this. Every company building its infrastructure has the same components with different vendor names and is doing the same integrations. And I remember at that time thinking: if you look at the world as our home, inside this home we are repeatedly doing the same thing, integrating all these different pieces, but we could centralize this effort in one company; we shouldn't do this repeatedly.
Then we can save those resources to do something that is innovative, not just very repetitive. So that was the idea. That's another thing I wished for: all these components are well known, so why wouldn't we have a platform that brings all of this together? Because this is not rocket science. We need ingestion, we need the analytics layer, we need data governance. And that's all. I saw this at the telecom company; OLX Group was a big ecommerce company; Careem was like a smaller version of Uber, which was bought by Uber.
And at Uber it was the same set of technologies; the names change, but the concept is the same. So all of these things kind of triggered me. Let's start with the cost efficiency aspect. I can give an analogy; actually, it's a bit of a coincidence before this podcast. I saw an interview with the Figma cofounders, and their pitch was so boring. They were asked what the difference was compared to Photoshop, Adobe XD, this kind of stuff, and they said they wanted to provide something easy to use and inexpensive. At the time, that didn't sound like much; just trying to sell inexpensive software wouldn't keep you in the market, you'd get crushed. But 10 or 12 years later they were sold for $20,000,000,000, though that's maybe not such good news for customers, because the reason customers loved it is that design had become an everyday thing and they were searching for an inexpensive solution.
I think we can use the same analogy. We want to stay in that spot on the map, whatever that map is: we want to provide something accessible and easy to use. And one reason we use open source is, of course, that building this type of platform from scratch is very hard; it's not really possible. But another reason we continue with open source is the transparency aspect. We want to say: hey, this is open source, we store data on the customer's side, we are transparent about that. We show which nodes you use; we are transparent about that. We don't add any markup, because since we manage the whole infrastructure, our business model is to get reserved instances from AWS and earn our margin there, not from the customer. We are trying to solve all of this with this model.
And integration wise, we built most of it, but our vision is to provide an all-in-one platform. We have data governance in place, authorization, the analytics layer, and to some extent the ingestion part. We're working towards having an all-in-one, affordable, easy to use platform for customers. Sorry if this became a very long introduction, but, yeah, that's how I can summarize what we do. No. Absolutely.
[00:10:38] Unknown:
There are a lot of different pieces that go into building a system like this, and a lot of people will say, oh, I just need a lakehouse, so I'll just use Spark or Trino. And then you start digging into it and realize, oh, I actually need these other half dozen capabilities, as you mentioned. And it's the same thing if you say, oh, I just need a data warehouse, so I'll choose Snowflake. And, oh, now I actually need governance and a data catalog and all these other pieces. And, oh, it wasn't as simple or as inexpensive as I thought it was, because I forgot about everything else that goes into it. And to the point of the selection criteria, the core of Iomete is this principle of the lakehouse architecture, the lakehouse platform.
You brought up the topic of the modern data stack, which has been oriented around the disaggregation of these discrete capabilities. And a lot of the focus and center of gravity for the modern data stack has been the cloud data warehouses, so Snowflake, BigQuery, Redshift, etcetera. And the lakehouse architecture is a bit of a response to the popularity of those systems, because the lakehouse or the data warehouse is where all of your data is stored. It becomes that center of gravity, and it becomes one of the most impactful choices that you can make as you're designing your data platform, because it impacts all of the other integrations that you're able to select from. If you're using Snowflake, there's a fairly large ecosystem of tools and plugins to be able to work with it, because they've invested in building that ecosystem around themselves.
Whereas if you're building around the open data lake slash lakehouse, depending on whether you're using Spark or Trino or Presto or what have you, you're going to have a highly variable experience as far as the level of integration and the level of off-the-shelf pieces you can use. And I'm wondering what you see as the largest motivating factors that push people toward that lakehouse architecture, even knowing that they're not going to have as easy of a time as if they were to just, say, buy Snowflake or buy BigQuery?
[00:12:44] Unknown:
In general, I think the cost aspect; Databricks is also pushing the lakehouse architecture. Integration wise, they're also behind what Snowflake provides. In our case, yeah, our integration with the ecosystem is not at the Snowflake level, and our customers don't expect us to be at that level. But what we give them instead: eventually we'll provide a richer ecosystem, but they can save 5 to 10 times on their cost by going with us. Plus, the data is stored in their account, in Parquet format, which gives additional confidence that, hey, this company can only serve us better, because otherwise we would leave. That's also good for us; it keeps us healthier. We know that when we close the deal and they move 10 terabytes or 100 terabytes of data, it's not done. We have to keep serving them well.
And we have data governance and other things, like Spark jobs. Snowflake has Snowpark, but I doubt it's going to hit the same maturity that Spark has. So you get additional stuff that's not in Snowflake. Or there are features like Snowpipe, which is not actually a feature; that's just a shortcoming from the fact that they don't have a Spark streaming kind of platform, so they built that feature in order to provide some way to move the data that's in files.
Anyway, integration wise we are a little behind. With open source ones like Spark, you can build your own data lakehouse; back to your question, in 15 minutes, maybe in a day. That's where the open source platforms are good: you can use Spark or Trino, and then you put Apache Iceberg on top of that to get ACID transactions, etcetera. But the main problem is the one you explained: the ecosystem integration is not there. Plus, the open source products are good, but they're not enterprise ready. They don't have an authentication layer, an authorization layer. They don't have a good UI where you manage all these things, manage your users, all of that.
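To make that "build it yourself in a day" path concrete, here is a minimal sketch, assuming Spark with the Apache Iceberg runtime and a Hadoop-style filesystem catalog on object storage. The bucket, catalog name, and package version are illustrative assumptions, not details from the episode.

```python
# Minimal roll-your-own lakehouse: Spark plus the Apache Iceberg runtime.
# Catalog name, warehouse bucket, and package version are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("diy-lakehouse")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Tables created through the Iceberg catalog get ACID semantics on top of
# plain Parquet files sitting in object storage.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    ) USING iceberg
""")
```

As he notes, this gets you a working table format in minutes; the authentication, authorization, UI, and integration layers discussed next are the parts that take the real effort.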
But, yeah, we are building that. Two months ago we released a dbt adapter, which is huge, because all engineers love dbt. They do all their SQL based transformations with it, and dbt now supports dataframe based transformations too; we are working to integrate that as well. On the ingestion side, we opened a pull request to Airbyte for an Airbyte integration. There is also Singer, which is backed by Meltano; we finished that integration too. Those are the vendors that have open source versions. Now we are starting on the closed source vendors, like Fivetran.
Data ingestion is important because if you don't have easy data ingestion, then you won't have data inside your platform, and then customers won't use you. And on the consumption side, we support all the mainstream BI tools. I can say that, integration wise, customers won't miss a lot compared to Snowflake; we cover all the mainstream integrations. But probably there will be some nuances they could miss. With the open source ones, though, they will definitely miss all of that. They would have to build what we have been building for the last 2 and a half years if they want to build that kind of platform in house.
[00:16:45] Unknown:
To your point of the work that you're doing to build it: if somebody says, I'm just going to pull all of the off-the-shelf open source projects and integrate them on my own, what are some of the hidden difficulties and incompatibilities that they're likely to run into, and some of the things that you've come up against in the process of building out this fully integrated platform in the shape of Iomete?
[00:17:06] Unknown:
A few things I mentioned: the authentication and authorization layers, and building some interface around all of this. And I think the most difficult part is the infrastructure. Spark is good, but it has a thousand different variables that you have to get the right values for. And infrastructure wise, where are you going to run it? Are you going to run on plain virtual machines or on Kubernetes? How are you going to handle scale out and scale down? Spark has its own scheduler; how is it going to work with the Kubernetes scheduler? Are you going to use the standard scheduler, or a custom scheduler?
There are lots of nitty-gritty details. I can give a few examples. One is that once you run for some time, you realize, oh, you have to run all your cluster nodes in the same AZ, because cross-AZ traffic costs a lot. We take care of this because we are trying to optimize all these cost aspects for multiple customers, not only for ourselves; that's why we care about these kinds of nuances a lot. But if you're building this type of platform just for your own business, and it's not the core of your business, you wouldn't care about all these details, and it costs a lot, like those cross-AZ charges.
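As a rough illustration of the single-AZ point, and assuming Spark running on Kubernetes (EKS or similar), one way to keep driver and executor pods in a single availability zone is a node selector on the standard topology label. The zone value, container image, and API server address below are placeholder assumptions.

```python
# Pin Spark pods to one availability zone so shuffle and node-to-node traffic
# never crosses AZs (cross-AZ data transfer is billed). Values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("single-az-cluster")
    .master("k8s://https://kubernetes.default.svc")
    .config("spark.kubernetes.container.image", "my-registry/spark:3.3.0")
    # Standard Kubernetes topology label; all driver and executor pods are
    # scheduled onto nodes in us-east-1a only.
    .config("spark.kubernetes.node.selector.topology.kubernetes.io/zone",
            "us-east-1a")
    .getOrCreate()
)
```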
Another aspect: normally when you just install and run, I've seen this many times because I did it too, and only realized later that we hadn't done much cost optimization, though we do now because we care about reducing that cost as much as possible. When you just install and use S3, traffic goes through the public Internet. And since you're going to put all your nodes in a private subnet, it goes through a NAT gateway, right? And NAT also charges a lot. But there is a specific service for S3 in AWS, the S3 gateway endpoint; you configure it, and then traffic goes over the backbone network of AWS to S3, which removes that charge since it doesn't go through the NAT. Plus, you stay compliant because you don't cross the public Internet.
And that service doesn't have any additional charges. All these small things, if you're a new company building this, you only notice after you get a huge bill. This is learning from the mistakes we made before. But there are a lot of these kinds of problems, especially around handling scaling. It's not a huge problem, but we're still fighting with it.
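For reference, here is a hedged boto3 sketch of creating that S3 gateway VPC endpoint so traffic from private subnets reaches S3 over the AWS backbone instead of a NAT gateway. The region, VPC ID, and route table ID are placeholder assumptions.

```python
# Create a (free) gateway VPC endpoint for S3; S3 traffic from the associated
# route tables then bypasses the NAT gateway and its per-GB processing charges.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # route table of the private subnets
)
```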
[00:19:58] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month. For more information on Prefect, go to dataengineeringpodcast.com/prefect today.
That's Prefect. As far as the lakehouse architecture paradigm, we've discussed a bit about some of the challenges in the ecosystem and the integration capabilities, but from the processing, storage, optimization, and basic functionality layer, what are some of the shortcomings that you see in how lakehouses are able to operate as compared with a fully integrated data warehouse architecture?
[00:21:11] Unknown:
The shortcomings right now, I can say, are sub-second latency queries or point lookup queries. There are also problems if you want to do real time ingestion or real time analytics, because there's the small files issue, which is famous, I think, in every data lake and lakehouse platform. We do have a way to optimize that, but we recommend not ingesting data at intervals lower than 15 minutes. I think in many cases that's acceptable. But the world is moving. For analytics needs, for reporting, 15 minutes is more than enough.
But there are also real time needs, and that should be a different technology. People ask: hey, if you have this platform, can't we run both real time and analytics needs on the same platform? It's converging, but right now the lakehouse has these shortcomings, like the small files issue, that prevent that from happening soon; it's going to happen eventually. In traditional data warehouses that's not a problem, because the files are structured in a different way, so they don't have these problems, but they have other problems.
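Since the small-files issue keeps coming up, here is a brief sketch of the usual mitigation on an Iceberg table: periodically compacting small files with Iceberg's rewrite_data_files procedure. The catalog and table names follow the earlier illustrative sketch rather than anything from the episode.

```python
# Compact small data files written by frequent ingestion into ~512 MB files.
# Assumes `spark` is a SparkSession with the Iceberg extensions enabled and a
# catalog named `lake`, as in the earlier sketch.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```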
[00:22:40] Unknown:
Digging more into the lakehouse functionality and capabilities, you mentioned things like governance and the transformation layer. I'm wondering if you can talk through some of the decision process you went through: which pieces were mature enough to incorporate into the Iomete platform, which pieces were worth the effort of building from the ground up, and some of the build versus buy decisions that go into building a platform like Iomete versus doing it in house for your own company?
[00:23:16] Unknown:
I think first, when you choose a vendor, build versus buy, and we do this for our own company too, most of the time we should go with buy. I mean a sensible buy option, not just any buy. Because if it's not the core of your business, building a platform will cost you more than just buying and paying someone else, because that company has already centralized the building cost. But if it's the core of your business, let's say you're a data company like us, imagine we were buying a service from Snowflake and then selling that service; that wouldn't make sense for us. Right?
Or if we were providing an ELT or ETL type of service, again we should build it, because that's the core of our business. Also, for companies, even if they assume they're going to reach Facebook, Google, or Uber level, at the beginning it makes sense to buy, to save the initial building cost and save on hiring and maintaining data engineering resources. Then once they get bigger, they can evaluate the situation: if having an internal team and building everything in house makes them more cost efficient, that's the time to decide that. But I think most companies won't need that, even bigger companies like banks and insurance companies.
Since they don't have enough know-how, or a big enough team, or since it's not the core of their business, they're not very interested in having that type of team, even if having a team internally would be cost efficient. They're just buying the platform and paying someone, basically outsourcing the job. If I speak from the engineering perspective, this has happened to me a few times: you start with the open source one, but at some point you get frustrated, because many open source platforms are designed in a way that will frustrate you. And then you go to the vendor: can you give me a ready to use platform?
Also, sometimes you have problems, but you don't have enough resources to fix those bugs. After some time you realize, oh, it would be nice to have a vendor to bring your problems to, to fix some urgent bugs, etcetera. Yeah, I think most of the time, in today's world, it makes sense to mostly buy versus build your own.
[00:26:21] Unknown:
So digging more into the Iomete platform itself, I'm wondering if you can talk through the architecture that you have built. You already mentioned that you're using Spark as the core of it, but I'm curious if you can talk through the overall implementation and some of the pieces that you decided you wanted to build from scratch, because it either was not cost effective to run as is off the shelf or was completely missing from the ecosystem?
[00:26:49] Unknown:
The core is Apache Spark and Iceberg; we are also planning to support Apache Hudi and Delta. Besides that, of course we cannot build the platform without leveraging these open source projects, because that's a huge part of it. But there are other parts, like the data catalog. We evaluated many platforms, like Amundsen from Lyft; there are many products that I don't remember right now. I've also used the data catalog at Uber, so we have that experience. The other engineer I started working with came from Google, and he brought his own experience from different companies, of course from Google too. We put together all these features and how we want to see the platform.
Many of those things are actually implemented in Amundsen. But since it's an open source product, many of them are built in a more generic way, trying to support many platforms, not only Spark. That's why the code base has grown huge, and from our perspective it's not easily maintainable. In those types of situations, we decided to build our own. We built the data catalog, data governance, tagging, all of that in house. For the SQL editor, we also evaluated some open source ones; the famous one is Hue, which has a really nice UI, but it's extremely heavy, and the code base is a total mess.
And it doesn't have multi-tenancy; we would have to start a new, separate node for different customers. Another area where we had to build our own is the authorization layer. There's no good authorization for Spark; we had to build everything in house. And we had bold ideas, because we saw in previous companies how important a fine grained authorization service is. Usually the other vendors don't provide it either; you have to get another vendor, and we didn't want to involve another vendor just for the authorization use case. That's why we built table level, column level, even data masking; everything is managed from a single place using policy rules. We put a lot of effort into that, because as users we have seen many times how important it is.
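As a rough, hypothetical illustration of what column-level masking means in practice (not Iomete's actual policy engine), the view below masks a sensitive column in plain Spark SQL; a policy-rule system applies this kind of rule automatically per user or role instead of through hand-written views. Table and column names are made up.

```python
# Expose a masked variant of a table for users without access to raw emails.
# Assumes `spark` is an existing SparkSession; names are hypothetical.
spark.sql("""
    CREATE OR REPLACE VIEW analytics.customers_masked AS
    SELECT
        customer_id,
        concat(repeat('*', 8), substr(email, -4)) AS email,  -- keep last 4 chars
        country
    FROM analytics.customers
""")
```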
That's something we also built in house. For Spark, there is the serverless Spark part. We used AWS EMR before, and Databricks. EMR was the lowest cost one, but the UI is not great, somewhat archaic. Databricks also speaks to deep technical people; the user interface was not that user friendly. That's why we took a little bit of a different approach, a very minimalistic UI, because it comes with sensible default values. But for experienced users, you can go into the details and tweak all the different attributes. Usually even non-Spark users can easily deploy their Spark jobs and schedule how frequently they want to run them.
All of these parts are being developed from scratch, but for the main engine, Spark and Apache Iceberg, which could take more than 10 years to build, you don't have a choice. You have to take the ready open source solution.
[00:30:41] Unknown:
In terms of the selection of Spark as the core, I'm curious what you see as the relative strengths of that from a lakehouse perspective as compared to things like Presto and Trino, or the work that the Dremio folks are doing?
[00:30:56] Unknown:
We chose Spark actually 2 years ago. At the time, many people were surprised when we talked about the SQL functionality in Spark: really? Maybe you should use Trino or Presto. But the problem with Presto and Trino is that they're more memory sensitive products. We have seen this; even Athena, which is the AWS service backed by Presto, we have seen crash many times when there is not enough memory. Reliability wise, we always had lots of problems with Presto. And also, at large scale, Presto's performance decreased compared to Spark.
And the second reason: at the time, 2 years ago, there were some signs, but today it's very obvious. These ACID table formats, the lakehouse formats like Apache Iceberg, Hudi, and Delta Lake, support Apache Spark first, and then Presto, Trino, and others. And the support they provide for the other platforms is not at the same level as for Spark. You can get better integration with the ecosystem through Spark.
There has always been huge development on Spark, but in the last 2 or 3 years there has been a huge push to improve Spark SQL functionality, especially the optimization rules; there are huge changes there. We did some benchmarks and we see huge improvements: compared to a benchmark we ran 6 months before, there is at least a 2 times improvement between Spark version 3.2 and Spark version 3.3. That's the reason we chose Spark, and I think it's still the right choice. But in general, we call Iomete a data platform; maybe in the future, if it's necessary for specific use cases, we could integrate Trino as well.
But Spark right now covers that too. Trino works well; it has more data integrations, and Spark is catching up in that space.
[00:33:27] Unknown:
And you mentioned a little bit of the investments that you're making to make it easier to adopt Iomete, things like the dbt adapter and the work that you're doing with the Airbyte team to create an integration there. I'm wondering if you can talk through how you think about prioritizing which elements of the ecosystem to invest in to manage that adoption curve, and some of the signals that you're looking to in order to understand what are the biggest pain points or most important pieces to solve for early on to help your customers manage that transition or manage the initial adoption?
[00:34:08] Unknown:
We have gotten many customers through dbt. They saw on the dbt page that there's an integration and that we provide good performance for the value. For now, our main focus is mostly on the data ingestion side. Once we release Airbyte and Singer, we want to really focus on Fivetran, because many customers are also coming in already using Fivetran. We try to make the migration as smooth as possible. As a vendor, I want to say: hey, you can keep using those vendors and just use us to replace Snowflake or something else. That's why having more integrations with the well known vendors is important.
Yeah. Fivetran and Segment.io are the next ones scheduled for us; that's very important. And for the BI side, I think we are good there, because we have almost all the integrations. It's mostly the ingestion side, I can say.
[00:35:17] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day, every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and spreadsheets are already available as plug and play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines.
You get real time data flow visibility with fail-safe mechanisms and alerts if anything breaks; preload transformations and auto schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus transparent pricing and 24/7 support, makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14-day trial that also comes with 24/7 support. As far as building a platform that is targeting this ecosystem: the core storage, querying, and processing layer is the most important or most impactful choice that can be made for a data platform.
I'm wondering what have been some of the most interesting or challenging aspects of building an entrant into that space: figuring out what competitive advantages you're looking to offer, how to talk to customers to help them understand what that competitive landscape looks like, and also, to the point of what you were saying earlier about building on these open source systems so that you don't have vendor lock-in, how to communicate with customers about that fact as well.
[00:37:13] Unknown:
Interestingly, many times when we get that first customer interview, they do a better pitch than us, because they saw all of this, and many times we have heard them say: this all sounds too good to be true, do you really provide all of this? We are getting a lot of Snowflake customers, by the way; not current Snowflake customers, but people who already tried Snowflake, or did due diligence, or were considering Snowflake. I can give one example: one of our customers, a big payments customer, had something like 100k in credits for Snowflake, but after moving 5% of their data, they realized it was going to be very expensive.
And they started looking for an alternative, not necessarily to replace it, but maybe to use some other technology next to Snowflake, because they cannot put all their data into Snowflake. And they were looking at Databricks or a similar solution like Iomete. So I think people really appreciate the cost aspect first. The second is that they're happy with storing the data on their side, because we didn't build this just assuming people need it; we saw it as users, we saw our managers asking for it, and we saw in many different companies that it's a real need. People want their data accessible; it shouldn't be somewhere else.
And we hear the same pitch from the customers: they're really happy to own their data. And we also give this message: hey, you own the data, you can go anytime. The only way we lock you into our platform is by providing a better service; there is no other way. You can just stop the cross-account access and go to a different vendor with your data. So, yeah, I think the first thing is the cost part. The second, having their data controlled by them and accessible by them, is very appreciated by the customers.
[00:39:37] Unknown:
Another interesting aspect of the lakehouse paradigm is that it's intended to draw on the benefits of both the data lake and the data warehouse. The data lake is often seen as a place to experiment with data, work with it, and figure out which pieces I actually want to use so that I can clean it up and model it into a more data warehouse style, or load it into maybe an OLAP store so that I can get faster interactivity with it. And with the lakehouse architecture, I'm wondering what you see as the realities of people actually going that next step of saying, okay, I've got it in the data lake, I'm able to use the warehouse for the modeling, but I actually need to also use this data in another avenue, where maybe I need faster query times or maybe I need to load this into another operational system; just some of the realities about how people are actually using the data that originates in and is owned by the lakehouse and feeding that into some of these other types of systems.
[00:40:39] Unknown:
That's the usual pattern people use: they move the data to the lakehouse, they do the aggregation, clean-up, and transformation, and then write back to some other platform like HubSpot, or to MySQL or other databases for fast access. But I want to give some clarification on the difference between a data lake and a lakehouse, because these 2 terms are used interchangeably, but there's a difference. The data lake got popular with the decoupling of storage and compute, because that allows people to scale those resources separately.
You can even shut down your compute cluster, scale out, scale down; you can even create multiple clusters to isolate different teams or different use cases. But the data lake has the problem that you have to deal with files. With the metadata you have a table abstraction, but if you want to provide column level authorization, you can't; you can't even give table level authorization, you have to give file level authorization: this person has access to this file, and then they get access to the table. You cannot do transactional inserts. You cannot do updates, deletes, or merges.
You have to do all of that with additional scripting tools, etcetera. So with all the good benefits, people started missing the good parts of traditional data warehouses: the table level abstraction, inserts, updates, deletes, transactional changes. The lakehouse is basically the same data lake; that's what's running behind the scenes, but there's an additional layer that goes on top of the data lake and presents it as a data warehouse to the outside world. Now you don't need to deal with files; the lakehouse layer handles that.
ACID comes into the picture: you have insert, update, delete, and merge functionality, even in a transactional way. You have your version history; you can do time travel across your data versions. All of these things are brought by the lakehouse layer, this additional layer. So that's the difference between a data lake and a lakehouse. And with that change, you have more flexibility here, and the need to use 2 different systems is fading away.
People mostly stay in the lakehouse; they don't need another OLAP database, they can do all of this in the lakehouse. But for specific use cases, if you need sub-second latency, you can write the aggregated data back to MySQL or Postgres, and then your ML or AI systems can utilize that data, which is an aggregated form of your data from the lakehouse.
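To ground that distinction, here is a short sketch of the warehouse-style operations the lakehouse layer adds on top of plain files: a transactional upsert and a time-travel read against a hypothetical Iceberg table. It assumes `spark` is a SparkSession with the Iceberg extensions and `lake` catalog configured as in the earlier sketch, and that `staging_updates` is an existing temporary view of incoming changes.

```python
# Transactional upsert: something a raw file-based data lake cannot do directly.
spark.sql("""
    MERGE INTO lake.analytics.customers AS t
    USING staging_updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the same table as of an earlier point in time
# (timestamp given in epoch milliseconds).
as_of = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", "1672531200000")
    .load("lake.analytics.customers")
)
```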
[00:43:59] Unknown:
In your experience of building Iomete and working with your customers and integrating with this ecosystem, what are some of the most interesting or innovative or unexpected ways that you've seen your platform used?
[00:44:12] Unknown:
Yeah. Usually we don't have such cases; people are mostly data engineers, and they really know what they're doing. One interesting case: we got a bug report that the result grid in the SQL editor just froze. Once we checked the situation, it turned out the table had 1,700 columns; we had completely missed that kind of situation. In the analytics world, having a large number of columns is usual, but even I wasn't expecting that number of columns in a single table. Sometimes we also get customers asking: do you have Flink?
And when they explain their situation, it's basically more of an operational use case than analytics. In that case, I recommend they go to a different vendor or take a different path, because the system is designed for analytics use cases, not operational ones. But that's something to think about for the future, because that area is also being consolidated; people want to use one platform for analytics and even for operational use cases that have some analytics aspects.
[00:45:33] Unknown:
In your work of building the platform and building the business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:45:41] Unknown:
I'm not sure I could be a second-time founder, because if I had known it was going to be this hard, I probably wouldn't have started. I was told that building a business is very hard, but I didn't know I needed to multiply that. The part I didn't know is that there's huge intellectual satisfaction as you build stuff that helps other people. When you see those results, you forget about all the hard moments you've had. So, yeah, I think it's kind of fun; it depends which perspective you're looking from. But I've understood it's not for everyone.
I lost a few of my friends along the journey because they couldn't handle it, and I totally understand. Before, I thought everyone could start a business, but now I understand it's not for everyone; it requires really tough mental energy. That's something I learned, but I think I enjoy the process. I like it.
[00:47:00] Unknown:
And so for people who are interested in building their data platform and think that a data lakehouse is the right approach, what are the cases where Iomete is the wrong choice?
[00:47:12] Unknown:
Sometimes, if it's not an analytics use case but more of an operational use case, we are not a good fit. Also, in some cases people are looking for sub-second or really low latency use cases, and Iomete is not the right choice there. And not just Iomete; the similar technologies, like Presto, Trino, Databricks, Snowflake, are not the right choice in those kinds of situations either.
[00:47:39] Unknown:
And as you continue to build and iterate on the Iomete product, what are some of the things you have planned for the near to medium term, or any particular problem areas or projects that you're excited to dig into?
[00:47:51] Unknown:
For the coming few years, first, to complete our vision of providing an all-in-one platform, so that without going to 3 to 5 different vendors and spending time on integration, customers can get everything in one platform at an affordable cost. That's our dream: to be the go-to platform for anyone who cares about cost. And the next milestone is becoming not only the platform, but a smart platform. I can give an example with the SQL editor: not just autocomplete and IntelliSense, we're also going to provide smart suggestions.
When you write a query, the editor can suggest: hey, this table is usually joined with these tables, you can use these columns, this kind of stuff. We have data governance; we collect this metadata, and we want to use it actively, not just have it sit there. Metadata shouldn't be something you only see when you go and search for it; it should help in different places. As I said, in queries it can suggest: this might be the wrong column you're using in the join, or it can use the glossary to suggest additional things. On the data governance side, maintaining data assets is hard; it requires a lot of user input. But you can imagine having an intelligent assistant next to you who knows all the ins and outs of the data and suggests things: hey, what about this data set? It seems it hasn't been refreshed in the last few hours; there might be a problem.
Or: hey, I suggest these tags for these columns. We did this, by the way; we built a machine learning slash regex type of engine that scans data and puts out auto-suggested tags, but we want to bring it to the next level. Anyway, the goal is not just to provide a platform, but a kind of smart platform. Whoever uses the platform should feel there is someone else constantly assisting them in the process.
[00:50:16] Unknown:
Are there any other aspects of the work that you're doing at Iomete, or the overall ecosystem around lakehouses and that architecture paradigm, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:27] Unknown:
In my opinion, data sharing is going to be a thing for the future; all these platforms are working on it, Snowflake has something, and Databricks built something for their data lake. And there is data mesh coming in the future. We don't announce ourselves as a data mesh just because it's a buzzword, but we do care that the architecture we build is compatible with the data mesh approach, which is not new. When I worked at Uber, everything was built exactly the way the data mesh architecture is explained; we just didn't call it data mesh. That's the only difference.
I use a lot of my experience from Uber in building this platform, which means we're also going to be compatible with the data mesh architecture. Many companies are still missing the centralized data platform; they still have to catch up to that phase. But the next phase, most likely, is data mesh, which is solving not an infrastructure problem but more of an organizational problem. It's the same way that microservices brought some problems while solving other problems. For people who want to adopt a data mesh, I would suggest not considering it until you have that organizational problem, because it's costly.
It solves the organizational problems, but it's costly. That's why, if you can live with the centralized data storage approach, I think that's the way to go. But once you hit these organizational problems, and you want to separate different organizations and treat them as different domains, that's when you have to think about data mesh. We provide all the functionality that the data mesh architecture requires; we just haven't started announcing ourselves that way. Trino is advertising themselves as a data mesh solution.
I think it's a small part of it; I don't understand why they're advertising themselves as a kind of whole solution for data mesh. And, back to the data sharing part, that's also going to be very important as part of data mesh, because data mesh is drawing the boundaries and then creating a protocol for data sharing. And so far there is no easy protocol for data sharing. Databricks brings its own protocol, Snowflake has its own protocol, but probably we need to have something like HTTP for data.
That's something we are doing our research on; probably soon we can also bring something to that space too.
[00:53:26] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:53:41] Unknown:
Yeah. The biggest gap, I think, is having multiple vendors for the standard data platform. I think all of this should be in one place; it shouldn't be 3 to 5 different vendors. Separately, there are good tools out there; I'm not talking about the cost aspect, but there are good tools, each in its own specific area. The missing part is having everything in one place, a unified experience.
[00:54:12] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing at Iomete. It's definitely a very interesting platform, and I'm excited to see it out there as an option for people who want to build their own lakehouses. I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks a lot. Again, thanks for having me here. I enjoyed the process. I enjoyed the conversation.
[00:54:43] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
The Iomete Project: Vision and Motivation
Challenges and Solutions in Data Integration
Building a Fully Integrated Platform
Lakehouse Architecture: Strengths and Shortcomings
Technical Choices and Implementation
Adoption and Ecosystem Integration
Customer Experiences and Feedback
Lakehouse Architecture in Practice
Innovative Uses and Lessons Learned
When Iomete is Not the Right Choice
Future Plans and Smart Platform Vision
Data Sharing and Data Mesh
Closing Remarks