Summary
Cloud services have made highly scalable and performant data platforms economical and manageable for data teams. However, they are still challenging to work with and manage for anyone who isn’t in a technical role. Hung Dang understood the need to make data more accessible to the entire organization and created Y42 as a better user experience on top of the "modern data stack". In this episode he shares how he designed the platform to support the full spectrum of technical expertise in an organization and the interesting engineering challenges involved.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
- Your host is Tobias Macey and today I’m interviewing Hung Dang about Y42, the full-stack data platform that anyone can run
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Y42 is and the story behind it?
- How would you characterize your positioning in the data ecosystem?
- What are the problems that you are trying to solve?
- Who are the personas that you optimize for and how does that manifest in your product design and feature priorities?
- How is the Y42 platform implemented?
- What are the core engineering problems that you have had to address in order to tie together the various underlying services that you integrate?
- How have the design and goals of the product changed or evolved since you started working on it?
- What are the sharp edges and failure conditions that you have had to automate around in order to support non-technical users?
- What is the process for integrating Y42 with an organization’s data systems?
- What is the story for onboarding from existing systems and importing workflows (e.g. Airflow DAGs and dbt models)?
- With your recent shift to using Git as the store of platform state, how do you approach the problem of reconciling branched changes with side effects from changes (e.g. creating tables or mutating table structures in the warehouse)?
- Can you describe a typical workflow for building or modifying a business dashboard or activating data in the warehouse?
- What are the interfaces and abstractions that you have built into the platform to support collaboration across roles and levels of experience? (technical or organizational)
- With your focus on end-to-end support for data analysis, what are the extension points or escape hatches for use cases that you can’t support out of the box?
- What are the most interesting, innovative, or unexpected ways that you have seen Y42 used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Y42?
- When is Y42 the wrong choice?
- What do you have planned for the future of Y42?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Y42
- CDTM (Center for Digital Technology and Management)
- Meltano
- Airflow
- Singer
- dbt
- Great Expectations
- Airbyte
- Grouparoo
- Terraform
- OpenTelemetry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- PostHog: ![Post Hog](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/K-hligJW.png) PostHog is an open source, product analytics platform. PostHog enables software teams to understand user behavior – auto-capturing events, performing product analytics and dashboarding, enabling video replays, and rolling out new features behind feature flags, all based on their single open source platform. The product’s open source approach enables companies to self-host, removing the need to send data externally. Try it out today at [dataengineeringpodcast.com/posthog](https://www.dataengineeringpodcast.com/posthog)
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-c-r-y-l. Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs.
Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Hung Dang about Y42, the full stack data platform that anyone can run. So, Hung, can you start by introducing yourself?
[00:01:28] Unknown:
So, yeah. My name is Hung, and I'm the founder and also the product slash technical lead here at Y42. I've, yeah, been in the data industry for a long time by now. Actually, I started my PhD in statistics and consumer behavior in 2013, so it was, yeah, quite mathematical and statistical.
[00:01:50] Unknown:
And, yeah, my mom has a PhD in statistics, my dad is a professor in sociology and statistics. So, yeah, it's in the family. I've worked with data, I love it, and that's where it all started. And you mentioned that you first started working with data in your PhD studies. I'm wondering if you can speak to what it is about the overall space and the industry that has kept you interested, and how you got started in industry after your studies, working in the data ecosystem.
[00:02:21] Unknown:
It's very important to highlight that I come from an entrepreneurial background, in the sense that I studied in an entrepreneurial study program in parallel to my normal studies. It's called the CDTM. It's an extracurricular program that's almost like a full time study, so it took more time to study at the CDTM than to actually do my regular studies at the time in Germany. We were just 25 people in one class. And out of the 25 people in my class, roughly 70% started a company, and we have 5 unicorns, billion-dollar companies, just within a class of 25 people. And hopefully, if we don't fuck it up with Y42, we'll probably be one also soon. So that being said, I come from an entrepreneurial background, but my main studies were, yeah, statistics.
That's why I actually started a company at the end of 2012, which was an event company. And yes, it has nothing to do with data, but I was running full scale events in 20 different countries. So that was my first commercial success in my early twenties. We ran events, you know, from LA to Singapore to Dubai to everywhere in Europe. And we made a million in revenue in the first year of operation, but the margin was, like, extremely thin. So naturally, you know, as somebody who studied data, it was my task to just collect all the data, analyze that data, and try to basically improve our margin.
And at the time, there was no Airflow, there was no Kubernetes. Facebook didn't really have an API, so we had to download CSV files for our marketing spend, collect data from our ticketing platform by scraping the website, and I had to, you know, do a lot of stuff that was very unconventional in 2012, 2013. But somehow I glued together a data infrastructure that really helped us to improve our margin from 8% to 12%. And that literally increased the profit by 50%. And so I really fell in love with the power of data. Like, I fell in love with understanding how data can help the user understand the world and also, you know, improve efficiency. And this is why I decided to start a software company in 2016, really focusing on the live event industry. Because if we managed to do that with our own events that are only 6 figures to set up, why don't we try to do it with events, you know, that are 8, 9, 10 figures to set up? And this is something, yeah, I got into, and we built a very specialized software tool to collect data, to transform data, and to build visualizations and predictions, like, very tailored to the live event industry, in 2016.
And that's also the first time when I really learned how programming works, because before, I was just the guy that would script, you know, with R, with Python, or with Stata even, to run what we called regression models back then and nowadays call machine learning. That was, you know, a very hard journey. I mean, I started out not knowing what a constructor is and, you know, object oriented programming and so on. And I had to learn that along the way just because I feel very strongly that I need to understand things on a fundamental level before I can make good decisions. And there's no way that we would build a technical company without me understanding the whole process behind it, why we have to do certain things.
And I wasn't able to find a CTO back in the days who would, you know, join that kind of niche product or niche industry. But yeah, we built it out. Those were very hard years, but at some point, we were analyzing, you know, one third of all live events in Europe. For every major Ed Sheeran or Justin Bieber concert, for major festivals, we were the tool that would analyze the data for these events. And so I was forced to serve a very non technical audience who has a lot of data because, you know, imagine 100,000 people coming to an Ed Sheeran concert producing millions of data points. And so you have this oxymoron.
It needs to be very easy, but it needs to be very scalable at the same time. And, yeah, I did that until the end of 2019. And I built up a lot of technical debt during that time. We weren't able to cross the chasm and really introduce that product to industries other than the live event industry. So I was, you know, decently successful. Right? But then again, I already had a vision in my head of what we could build that could serve basically any industry. But for that, we would really have to build the product again from scratch with a fundamentally different architecture. And this is why I decided to sell that company in 2019 to a, you know, German Fortune 100 company.
I took the money that I made and started a new company in early 2020, called Datos Intelligence back in the days, but we rebranded to Y42 along the way. And, yeah, that's our story.
[00:07:27] Unknown:
So can you describe a bit more now about what it is that you're building at Y42 and some of the impetus behind that particular area of
[00:07:36] Unknown:
focus? We are trying to serve the audience that has been pretty much underserved until now: more nontechnical users. But that's just, you know, a very first step of what we're trying to do, because coming from the live event industry, like I said, it needs to be a very easy and scalable solution. So what we do is we operate on top of a data warehouse, such as Snowflake or BigQuery, and we help our users integrate with over a hundred different data sources. So we integrate with other SaaS tools and pull the data into their data warehouse. We help them transform the data, with a UI model or with SQL. We help them visualize that data.
We help the user orchestrate the whole flow, and we help the user also send the data back to the tools where they need it. So basically, yeah, an end to end data pipeline that was very optimized for the non technical folks to use in our very first version. But that was never our end game. So for the end game, you have to remember, I come from an enterprise background. Right? Like, I sold my last company to a major German Fortune 100 company, and I built the data pipeline there, or at least I had a big influence on it. And during that time, I've seen that in the data industry, you have a lot of different personas working together. I think you've heard it, you know, a lot of times by now. You have the business users, and the name already implies business intelligence. Right? It's for the business to make better decisions at the end of the day. And then you have the wonderful data warehouses coming up. I was one of the very first users of Google BigQuery, as a matter of fact. And I fell in love with the power of it. And that's when I knew, hey, the Y42 vision slowly comes up in my head. So with that being said, we have to serve, you know, all these different personas. With the data warehouse, you know, the data engineers come into play.
Now there's the analytics engineer role that dbt coined. There's the data analyst. There's the business user. For us, it's all about collaboration. We want people to work together seamlessly, to talk the same language. We want people to understand each other. We want to be efficient. We want to create value. I think that's very important. We want to create value. And for value to be created, we are betting on the horse that we want to align different entities. We want processes that are superior. And I must say, there are software development best practices that business users haven't adopted that are just plainly superior. And we want to bring that to the business user. So we want to take the business user and bring them into processes that really scale, that engineers have, but at the same time, we want business users and less technical folks to express their ideas, their wishes, in a self serve manner that the engineers can really understand: okay, that's what the end user wants, and I can build it now in a more scalable way in the infrastructure under the hood. And so this is what Y42 v2, which we're about to release at the end of June, will be all about. It's about collaborating, bringing the data engineers together with the business users and the data analysts in one frictionless platform.
[00:10:58] Unknown:
And you mentioned that the core focus of Y42 is to be able to use these scalable and well practiced paradigms from engineering and make them available and applicable to nontechnical end users. And there have definitely been a number of attempts to serve that market of nontechnical users, largely in the form of low code or no code, you know, visual drag and drop workflow builders. Wondering what you see as some of the limitations of that approach and some of the ways that you're aiming to improve upon that paradigm and make it easier for these nontechnical end users to be able to take advantage of these powerful data systems without having to educate them on all of the, you know, conditional and Boolean logic that tends to bleed into some of these visual sort of drag and drop workflow builders.
[00:11:52] Unknown:
So, personally, I learned to understand data structures very well by using Tableau for the first time 8, 9 years ago. It really taught me more about data structure. And I feel that Y42 is doing the same thing for its users, not just in terms of visualization, but truly building out data pipelines. And along the way, we are very opinionated about how to do things. So for example, in v2, and I think that's one of the major innovations that we are introducing to the market, we built our own Git engine within the web browser. So we're using WebAssembly and we built the first extremely scalable Git engine.
We abandoned all the MySQL databases that we had in the v1. So basically, we have all of these tools: we have integrations, we have visualizations, UI modeling, SQL modeling, etcetera. And they were all separate microservices, and they would save their settings inside the MySQL database, and they would save the jobs. Every time a job runs for, let's say, an integration, we would save the job status: hey, is it pending? How long did this job take, etcetera? We would save all of that inside a MySQL database. And that worked well for the v1, but we ran into a lot of limitations, such as: hey, templating becomes, you know, very hard. A rollback becomes hard with a MySQL database. We would have to build some version control system on top of MySQL, and that's, you know, very, very hard.
And also environments, like setting up a dev and master branch and merging them and so on. And that was a major drawback of the v1 that we had. And so we knew that now we have to build a new system, and everybody, you know, listening to this problem that I'm describing, everybody would say, well, then just use Git for it. Right? Like, that's such an obvious answer. But it's not, because we want to make the experience truly seamless. Most people would just think, well, then I'm just gonna save settings in Git, like settings such as which columns am I gonna import from which source, or what's my SQL model. It's just settings. But we abandoned all these MySQL databases and asked the user to provide us a Git repo, and we save everything inside Git. So we save all the job statuses in Git. And that means there's an interplay between the user creating all of the settings and committing them to the repo.
And then there is our Y42 bot, or our back end, that would also commit these jobs inside the repo. So let's say I'm on the dev branch or I'm on a feature branch. Right? Like, I set up a new integration, I run this integration, and it might take literally 3 hours because it's a huge Shopify or MySQL table, so this job runs for 3 hours. I have it inside the data warehouse. The job links to a table inside the data warehouse. If it's successful, the back end commits that job inside the dev or feature branch. And I merge that feature branch into the main branch, and this job is instantly available without waiting for 3 hours. And so by doing that, you have the full power, yeah, of Git for, you know, version control, for branching out. You have the full power of, like, auditing, who did what.
And obviously, you know, there was a lot of engineering, yeah, things that we had to do in order to make it work. You would think of performance problems, right? Like, this is massively big. We run thousands of jobs a day, and we keep committing these jobs to the repo at the same time. Right? But then it's an end to end system, so that if I screw something up, I can roll back everything, like from the visualization back to the integration, because everything is as code, since we are an all in one data platform. So I can, let's say, create a new feature branch, create a new integration, then transform that data in that branch with my existing model, and then visualize that data, and then commit that to the main branch. And then I'm on the main branch. But then if I really screw up something, I can just roll back, and the whole visualization, like, the whole platform rolls back to that state. And that way, we are also capable of version controlling all the tables within the data warehouse. Meaning, you can roll back to any table inside any data warehouse instantly and the whole system would just work. So we took all of this power of data engineering best practices and hid it away extremely well from the business user, and there's, like, a business user mode. They don't even know; the engineers would set it up: oh, all the business users that enable business user mode, they are working on this branch, like, only on the business user branch. Or they may be committing straight to the main branch with the visualizations, but that's probably not the best practice.
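To make the idea of committing job state to Git more concrete, here is a minimal sketch of what persisting a job run as a file in a repository might look like. The file layout, field names, and use of the GitPython library are illustrative assumptions, not Y42's actual implementation.

```python
# Hypothetical illustration: persist a job run as a file in a Git repo so that
# merging a branch also "merges" the already-materialized table it points to.
import json
import time
from pathlib import Path

from git import Repo  # GitPython


def commit_job_status(repo_path: str, job_id: str, table: str, status: str) -> str:
    """Write a job-status file and commit it, returning the commit SHA."""
    repo = Repo(repo_path)
    job_file = Path(repo_path) / "jobs" / f"{job_id}.json"
    job_file.parent.mkdir(parents=True, exist_ok=True)
    job_file.write_text(json.dumps({
        "job_id": job_id,
        "status": status,              # e.g. "pending", "success", "failed"
        "warehouse_table": table,      # the table this run materialized
        "finished_at": time.time(),
    }, indent=2))
    repo.index.add([str(job_file)])
    commit = repo.index.commit(f"bot: job {job_id} -> {status}")
    return commit.hexsha


# After a long integration run succeeds on a feature branch, a bot commits the
# pointer to the materialized table; merging that branch into main then makes
# the table available immediately, without re-running the job.
```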
And that way, as you can see, we are enforcing a lot of engineering best practices so that the business user has to adopt them, but we hide it extremely well. And then on top of that, we do have the low-code/no-code layer to transform data, and it just blends in perfectly with the pipeline, with, you know, the pipeline that's being written in SQL. We are also in the process of integrating perfectly with dbt. So you don't write dbt in a separate dbt, you know, folder structure. It's nested within the Y42 file system. When we, let's say, have an integration set up, well, we also automatically write it into the source .yml file for dbt. It's a very, yeah, big interplay between our system and other systems.
And so we allow the user, the business, the nontechnical user to use low code/no code, but then we slowly teach them the proper way of doing things and let them graduate to a more sophisticated system when they're ready for it, when they want to enable the next step. So we basically peel back layer after layer for these users so they become more sophisticated along the way. But then again, we give all the power to the engineers: I could just use Y42 in my IDE. I don't need to use the UI for it. I can code everything in Y42 directly. Yeah. Just clone the repo and code it in there.
[00:18:17] Unknown:
Digging further into the actual Y42 platform, can you talk through some of the overall architecture of the system and some of the underlying services and functionality that you rely on from other platforms and components? Because I know that it is, to some extent, stitching together a number of the different components of the so called modern data stack, so being able to plug in things like a Snowflake warehouse and your dbt models and then wire that into your data visualization and business intelligence layer. I'm just wondering if you can talk through some of the overall architecture and extension points that are available for being able to integrate across this ecosystem.
[00:18:58] Unknown:
So to understand that, I want to highlight the persona that we are after. And the first persona that we are after is the first data hire in any company; Y42 is the natural choice for them. So they don't need to stitch together 5, 6 different tools. You know, it comes pretty much out of the box. It doesn't matter if you're a data engineer. It doesn't matter if you're a data analyst. It basically comes right out of the box. And we don't want companies to outgrow us. That's basically our strategy. So with that in mind, we do want to build a very frictionless experience.
And there are many tools out there that stitch together the modern data stack and try to build a service, maybe like Meltano, which is very code driven. But they're using Airflow, they're using Singer, they're using dbt, and so on and so forth. We didn't go down that route. We do some of it. For example, we were always one of the guys that pushed forward the Singer standard for integrations. So we contributed a lot to the Singer standard. Meltano was in fact using quite a lot of our integrations that, you know, we build and maintain in house. And then, of course, Airbyte comes along, also using the Singer standard.
So it's just plug and play for us because their architecture is pretty much the same. We have roughly 50 of our integrations written on our own, then there are roughly, you know, another 60, 70 integrations written within the Singer community that Airbyte also supports, and then the other integrations from Airbyte. So, yes, we do support these integrations from Airbyte, we do support these integrations from Singer, and, yeah, we have written a lot of our own custom integrations. For orchestration, we tried to use Airflow in our first version. That didn't work out because it didn't scale to the extent we wanted it to, and we just wanted to rely on Kubernetes jobs at the end of the day. So we really built out our own orchestrator.
Yeah. Because we were forced to: Airflow didn't really have an API that we were able to use, so we couldn't, you know, plug it into our system. Then we were really focused on capturing the data analysts first, and now we are in the process of serving the data engineers very well within the process so they can finally collaborate together. But our first data transformation layer was there to support business users and data analysts with, yeah, little technical background, to do basically everything you can do within SQL with a no code/low code data transformation. Obviously, there are not many players out there, not even any open source software, so we had to build it on our own. And now we're in the process of including dbt under the hood, but it's not a nice user experience.
At least for what we're trying to accomplish, it's not a nice user experience. For example, with dbt, you would need to create 2 different datasets for 2 different environments. Right? You cannot merge them; dbt doesn't commit the jobs, or everything else, inside Git such as we do. So it makes the experience quite broken compared to, you know, what we have right now: when you wanna merge from one branch to another, you have to rebuild the tables again by running the dbt jobs. So we do integrate, because dbt is a great tool and we want to support all the engineers that already work with dbt and can just plug dbt together with the Y42 ecosystem. But it's not as nice of a user experience as if we, let's say, build our own. Yeah. Our SQL model also supports Jinja.
So it's pretty much similar to dbt, but dbt has, you know, things on top like snapshots. dbt has, for example, testing. We use Great Expectations, so we don't use dbt tests. Like, dbt testing is great, but we think Great Expectations just has more, so that was a very natural choice to connect to again. And then on the visualization side, well, we wanted to have everything in code, so we use ECharts from Baidu, which Superset is also now using under the hood in its next iteration. So that's an open source library from Baidu. And, yeah, it's still open source, but one level of abstraction lower than using any modern data stack tool such as, you know, Superset. And then on reverse ETL, you know, it's an 80/20 kind of game. For most of the users, like, with 10 reverse ETL integrations, you literally cover 90 percent of all the use cases. That's what we're seeing right now. So now Airbyte also, you know, has invested in the reverse ETL space by just recently buying — I forgot the name of the company.
But then we will also integrate with them there. So, you know, we pick and choose whatever makes sense, and we're really optimizing for a frictionless experience. And sometimes tools make sense, such as Airbyte or Great Expectations. Other times, it doesn't make sense, and we have to build it on our own.
[00:24:00] Unknown:
And in the process of doing that selection of what to take off the shelf and what to build in house, I'm wondering what are some of the most difficult decisions to make. Are there any particular layers of the stack that are more challenging than others to either select a useful tool for or understand what are the trade offs of actually building this ourselves? And then in that process, particularly the sort of integration layer or the components that you've had to build in house, what are some of the biggest engineering challenges that you've had around that? So we are always very product and customer centric. So we always think backwards. And
[00:24:36] Unknown:
tooling is just a means to an end, really. Like, I think that became clear in the last 5 minutes. So I think the hardest decision for us is what to do with dbt. Because we can integrate with dbt, but like I said, the experience is not on the level that I want it to be. So I would rather build out everything dbt kinda has — Jinja is already there. We already run all the jobs. Our orchestrator knows all the dependencies already, so I don't need to trigger a model from dbt upstream and know that it has to materialize all the tables beforehand. And so it's not as native as I want it to be. But at the same time, dbt is a great tool and it has such a great community behind it that there's no way for us not to do it. And this is why we're going with something where it's not the standard I want it to be. And so over time, I'm pretty confident that we will still build 2 versions alongside each other. One is way more native, not as feature rich as dbt yet, but a much more native experience that our users enjoy more within our platform.
And, yeah, I think that's, like, the hardest decision that we have to make in terms of picking off the shelf or integrating or building it in house. And then in terms of challenges, there are many, many challenges. First of all, the product surface that we have is huge. So that alone is a huge challenge, but we have the advantage that we were able to raise a lot of money early on, and with Germany, we don't have to pay $200,000 for a mid-to-senior engineer. I think the price range is rather around $70,000. We have all the East, which we can collaborate with, you know, from Eastern Europe all the way to Asia, where the time zone difference is not 12 hours compared to the US. So there's a huge advantage to being a tech first company in Europe with a lot of funding to attract great talent. And, yeah, we are way more engineers than we are anything else in the company. So, yeah, a lot of product surface. Then our latest challenge is finally not in the back end. Well, I mean, the back end is super, super hard. Right? Like, you have to, like, move terabytes of data. We run hundreds of thousands of jobs. So obviously, we have to master infrastructure, and we invested a lot into our infrastructure as code using, you know, Terraform, and OpenTelemetry to, you know, observe the stack, and, you know, all the engineering best practices we just have to adopt. Otherwise, we would die within that process. And then, yeah, auto scaling is massively hard, because when we run an integration, it might just take, like, 1 day to import. Right? Like, we have to do resumable imports because we have to be able to stop and scale up and down. And that's not an easy task. We have to remember the state. So there are many challenges across different tools that we have to solve.
You know, naturally, you would think, oh, all these challenges are in the infrastructure and in the back end. Yes, they are. But at the same time, we face a lot of challenges on the front end side, like building our own, you know, Git engine on the client side, building a UX that people would understand, hiding the complexity of it, dealing with a huge repo that might be, like, 12 gigabytes that you have to clone within the front end, and yeah. So I think there are equal challenges across the board because we are a UX first company. We are a company that enables collaboration, and that needs to take into account that UX, yeah, has to play into the game. So to sum it up: really master our infrastructure, master the back end, master the front end side, master UX, master product.
So a lot of things that we have to take care of on a constant base.
[00:28:26] Unknown:
Particularly with that focus on nontechnical users and the fact that these are very engineering heavy systems, what are some of the sharp edges and failure modes that you've had to automate around in order to be able to maintain that positive experience for the end user and some of the cases where you have had to maybe simplify the capabilities of the system to prevent people from ending up in failure modes or some of the ways that you're able to turn some of the failure conditions into a useful message that is understandable by somebody who doesn't have deep knowledge of all these systems?
[00:29:04] Unknown:
So let's say you're a business user and you use our no code/low code layer. It's quite type safe. So if you, you know, change the name or the type or whatsoever, we know that it will have implications on the downstream models when you change something here. Obviously, we also have to run, you know, tests or so called expectations: hey, we expect this column to not be null. And so those are, like, the processes that we have for the user to trace back the error. I mean, the advantage that we have is we're truly end to end. So we understand, like, that something might break in integrations or during the transformation process or maybe during the visualization process or whatsoever.
And we built out a lot of data lineage and observability functionality, because it's an all in one data platform where we own all the metadata. So it's really easy to trace back errors within our platform and find the source, like, the true source of, yeah, failure.
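The "expect this column to not be null" check described here maps naturally onto Great Expectations, which was mentioned earlier as the testing tool of choice. A minimal sketch using the older pandas-backed convenience API follows; the dataframe, column names, and alerting stub are illustrative, and the API surface may differ in newer Great Expectations releases.

```python
# Minimal sketch of a "column must not be null" check with Great Expectations,
# using the legacy pandas-backed convenience API (ge.from_pandas). The data and
# column names are made up for illustration; Y42 wires this in differently.
import pandas as pd
import great_expectations as ge

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, None, 12]})
dataset = ge.from_pandas(orders)

validation = dataset.expect_column_values_to_not_be_null("customer_id")
if not validation.success:
    # In a pipeline this is where an alert (Slack, email, webhook) would fire
    # and downstream models would be blocked from running on bad data.
    print("Expectation failed:", validation.result.get("unexpected_count"), "null values")
```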
[00:30:07] Unknown:
With the fact that you do have that whole end to end experience and you own the metadata across those stages, for people who might have an existing data platform, or maybe they start at Y42 and they start to outgrow what you offer out of the box, what are some of the capabilities that you have for being able to either integrate with those external systems and be able to maintain the metadata internally, or being able to integrate or extend from your platform to be able to support some of those external systems without losing that sort of single experience or single pane of glass to be able to view what the state of the world is across your entire platform?
[00:30:46] Unknown:
So that is definitely more difficult, because we are at the mercy of the different tools and what metadata they give us. Right? So let's say there are some integrations that Airbyte doesn't have or we don't have, so we need to use Fivetran for it. And so we would build that inside our orchestrator. There's, like, one task or one node in the orchestrator, in your DAG, where you say, hey, I'm triggering the Fivetran API to trigger the table, and I need to keep calling because they don't have a webhook. So I need to keep calling: hey, is that done? Is that done? Like, is the integration finished? And then afterwards, we will do the next step, maybe a, you know, transformation that runs using the table that, you know, Fivetran created. So we integrate other tools by, you know, within our orchestrator, calling an API and polling for the status. We can do the same the other way — within our DAG, we can trigger an Airflow run and just keep checking, hey, was that Airflow run successful or not? Or an Airflow run could trigger our orchestrator and keep polling, or we can even, you know, after it finishes or fails or whatever — there's a webhook — do further actions with that as well.
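The trigger-then-poll pattern described here is easy to picture as a single orchestrator task. A hedged sketch is below: the base URL, endpoint paths, and response fields are hypothetical stand-ins, not Fivetran's or any vendor's real API, so the actual integration would follow the vendor's documentation.

```python
# Hypothetical orchestrator task that triggers an external sync (e.g. a managed
# connector) and polls for completion because the service exposes no webhook.
# Endpoint paths, payloads, and field names below are made up for illustration.
import time
import requests

API = "https://api.example-connector.com/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder auth


def run_external_sync(connector_id: str, timeout_s: int = 3 * 3600) -> None:
    # Kick off the sync.
    requests.post(f"{API}/connectors/{connector_id}/sync", headers=HEADERS).raise_for_status()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{API}/connectors/{connector_id}", headers=HEADERS).json()["status"]
        if status == "succeeded":
            return                      # downstream transformation can start
        if status == "failed":
            raise RuntimeError(f"sync {connector_id} failed")
        time.sleep(60)                  # poll once a minute
    raise TimeoutError(f"sync {connector_id} did not finish in time")
```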
So I think the main thing here is to integrate well with other systems within our orchestrator, and also that other systems can call us. So all our APIs are, yeah, available. And as a matter of fact, it's funny that one of our major focuses now is all of these aggregated use cases. So there are companies that need to programmatically set up their pipelines — like, they need to onboard 50 different clients a week or offboard them — and there's no way they can do that by, like, stitching together 5 different tools. So they use our system to replicate that. So we have to provide them all the APIs, you know, to trigger the jobs. We use Git for everything, so they can template the entire data pipeline very easily and then programmatically set it up for every client.
And the only thing the client needs to do is then to authenticate themselves with Facebook Ads, with Google Analytics, whatsoever. So we save the token, and then we encrypt that token and save it inside the Git repo so the jobs can run successfully and pull the data and then transform the data and visualize that data for them. So that's, like, a huge use case for us now, that we can replicate the data pipelines programmatically en masse. So Y42 is basically very extensible and you can program it. But if we leave our ecosystem — let's say you start to create tables and then you're using Tableau to connect to these tables that we create inside your data warehouse —
obviously, it's very hard for us to trace at that point what further actions are being done in Tableau. Maybe a calculated field got added, or, you know, they joined different tables together. And that's something that we haven't been able to manage too well yet. So yeah.
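The programmatic client-onboarding workflow described above — template a pipeline, store an encrypted credential, commit everything to Git — could look roughly like the following sketch. The file names, config shape, and Fernet-based encryption are assumptions made for the example, not Y42's actual format or security model.

```python
# Hypothetical sketch: onboard a new client by templating a pipeline config and
# committing it to Git, with only an encrypted credential stored in the repo.
import copy
from pathlib import Path

import yaml
from cryptography.fernet import Fernet
from git import Repo  # GitPython

PIPELINE_TEMPLATE = {
    "sources": [{"type": "facebook_ads", "account_id": None, "token": None}],
    "models": ["stg_facebook_ads.sql", "marketing_kpis.sql"],
    "dashboards": ["marketing_overview"],
}


def onboard_client(repo_path: str, client: str, account_id: str, token: str, key: bytes) -> None:
    config = copy.deepcopy(PIPELINE_TEMPLATE)
    config["sources"][0] = {
        "type": "facebook_ads",
        "account_id": account_id,
        # Store only the encrypted credential in the repo.
        "token": Fernet(key).encrypt(token.encode()).decode(),
    }
    client_dir = Path(repo_path) / "clients" / client
    client_dir.mkdir(parents=True, exist_ok=True)
    (client_dir / "pipeline.yml").write_text(yaml.safe_dump(config))

    repo = Repo(repo_path)
    repo.index.add([str(client_dir / "pipeline.yml")])
    repo.index.commit(f"onboard client {client}")
```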
[00:33:42] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. For people who are actually building with Y42, can you just talk through the overall process of maybe getting set up initially and then starting to onboard onto the platform and build an example workflow? So maybe I want to start ingesting data from my application databases and my SaaS platforms, and I want to build a business dashboard to understand, you know, what is the status of the KPIs for my organization. Or I want to pull in all of the data from my application databases, and then I want to do a reverse ETL workflow to be able to export my users into Mailchimp so that I can update them about different product updates.
[00:34:58] Unknown:
Yeah. I mean, with the modern data stack, you would be stitching together, you know, Fivetran, dbt, Snowflake, Looker, Hightouch, Airflow, whatsoever — what you just described. In Y42, the general process would be: you create a new company with Y42, and the first thing that you have to do is generate a new space. For the space, we ask you for 3 things. First of all, your Git repo, either hosted on GitHub or GitLab; then we need a database, so it can be Snowflake or BigQuery; and we need storage, so it can be S3, Azure Blob Storage, or Google Cloud Storage. These are the 3 ingredients that the user needs to provide us in order to work with Y42.
So now let's say you set it all up; then you never have to touch your data warehouse again if you don't want to. You would then go to the integrations tab. You integrate the data source in a couple — yeah, in a few clicks, you would configure integration settings such as: which column, which table do I want? Do I want it to be an incremental update? Do I want it, you know, to just append, upsert, whatsoever? Then next, you would create a model. And you can decide: hey, do I want to build a UI model with drag and drop — and then it's our JSON format that compiles down to highly efficient queries — or do I wanna just write SQL for that, and, yeah, join and manipulate all the source tables that you got in? And then you have these tables and you have the lineage; you can trace it back. Then you can go to the visualization tab. You create a new dashboard, you create a new visualization, and then you select the tables — it could be a model or it could be a source table that you select — and then you would just create, you know, visualizations on top of that with a drag and drop UI. And soon, you can write SQL commands there as well, because it just returns a table, and then you can just plot it on the x and y, or the z axis if you want it three-dimensional.
And then the next step is you wanna automate that, so you go to the orchestrator and you say, hey, I wanna automate the whole process for this dashboard. So you would just import all the dependencies from the dashboard all the way back to the source. You set up, you know, a cron expression, which obviously is hidden in a very nice UI, where I select at what time it should run. Then after you're done with that, probably you go to alerts, and you set up a couple of alerts: if an orchestration or an integration fails — like, status failed — then send me a Slack message, send an email to this team, or send a webhook, or do the same when it succeeds, you know, whatsoever. And then on top of that, you also set up tests. Like you say, hey, I expect this column to not be null. And then you build all of that within the orchestrator, and then you can be sure that, you know, the pipeline is as stable as you want to invest in. You can provide a lot of tests and a lot of alerts, etcetera.
And then at the end, you can say, hey, I want to have a — we call it data export; we don't want to sound too fancy with reverse ETL. So you kinda export that data back to where you need it. And obviously, we don't have as many destinations as Hightouch, right, or Census. And we don't have to. Like I said, around 10 exports cover pretty much 90% of the use cases. And soon, obviously, we can do the same thing we do with Fivetran: we just call the API from Hightouch and tell them, hey, this table is, like, fully done now — the revenue table or the customer table — please send it to a destination we don't have yet, maybe such as, I don't know, yeah, Mailchimp or whatsoever.
So that would be the process.
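For readers who think better in code than in UI walkthroughs, the end-to-end flow just described can be summarized as a plain data structure. The keys and values below are purely illustrative assumptions about what such a Git-stored configuration might contain; they are not Y42's real schema.

```python
# Purely illustrative: the end-to-end flow described above, written as a plain
# Python structure. Y42 configures this through its UI and stores it in Git;
# the field names and values here are assumptions, not the platform's schema.
pipeline = {
    "space": {
        "git_repo": "github.com/acme/analytics",
        "warehouse": "bigquery",
        "storage": "gcs://acme-y42-artifacts",
    },
    "integrations": [
        {"source": "shopify", "tables": ["orders"], "mode": "incremental"},
        {"source": "facebook_ads", "tables": ["ad_spend"], "mode": "append"},
    ],
    "models": ["stg_orders.sql", "stg_ad_spend.sql", "kpi_dashboard_model.sql"],
    "tests": [{"model": "kpi_dashboard_model", "column": "order_id", "expect": "not_null"}],
    "dashboard": "company_kpis",
    "schedule": "0 6 * * *",  # daily at 06:00, expressed as a cron string
    "alerts": [{"on": "failure", "channel": "slack", "target": "#data-alerts"}],
    "exports": [{"destination": "mailchimp", "table": "active_customers"}],
}
```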
[00:38:38] Unknown:
As far as the collaboration aspect, you mentioned that some of the core product focus is to be able to support this interaction of very technical users and engineering, code focused people who are working on the platform with these business users, whether they're, you know, product support, analysts, C-suite executives, and being able to have native interfaces for people on both sides of that technical/nontechnical divide, or sort of along that continuum. As far as the actual collaboration interfaces and abstractions that you've built in, I'm curious if you can talk through some of the ways that you've thought through the interactions among those different end users and how they might use the platform differently, but being able to say, for instance: I am a data engineer, I'm going to add in some governance controls to ensure that this PII data is not accessible to an end user to be able to expose in a dashboard.
And then as an end user, I want to be able to request additional access for being able to build a private dashboard for people who have the necessary roles or attributes to be able to view it — and just some of that overall collaboration across these different roles in the organization, and the kinds of controls and interaction patterns that they might have as they're building these various workflows.
[00:40:09] Unknown:
Unfortunately, we are not super advanced there yet in terms of optimizing the user interface or different success paths for different personas. We do have 2 different modes. One is the business user mode, where we hide away all the complex Git functionalities, and the other is the engineering mode, where you can use the command line tool within our app. You can use, you know, all the hardcore Git functionalities, or even use our CLI tool to generate new integrations, models, and so on. And so we just have these 2 modes for now. But I think what's very interesting is that we should look at it through the lens of: hey, what is the modern cloud era bringing us in terms of collaboration possibilities?
So in the cloud, I think something like SAML — you know, like using Okta, using single sign on — helps greatly to predefine your teams, your users. Very basic stuff such as: hey, I have comment and resolve functionality in the cloud. I have single sign on. I have access control — like, this user or this team has access to this table or this visualization or this orchestration. There are, you know, a lot of, let's say, fundamental things that a commercial product should offer in order for the user to collaborate well. And that comes, yeah, with the advancement of being able to work in the cloud. And we have awesome functionalities such as ownership of, you know, assets. Like, hey, I'm the owner of this integration, and I need to recheck it every 3 months or whatever interval — is it still, you know, up to date, like, verifying it?
And these are just, like, small things that make a huge difference in the way people collaborate. And I think the core of collaboration there is really: hey, you can trust your pipeline. You can always roll back, you know, you have version control. You can, you know, do the pull request and see what the other person is doing, and approve that or not. So you have mechanisms to, like, hold each other accountable, but also, in case of failure, to roll back. So that makes collaboration easier because people are more confident to move ahead. And then, you know, access control is super, super important. Access control on everything. But we're not access controlling on the column level. That's important.
We're only access controlling on the table level, and that's, like, our single source of, you know, truth, because, like, a widget is accessing this table, or even a data export or a reverse ETL is accessing a certain table. So we just protect the tables: this user or this team has permission to read or write this table. Then, yeah, for collaboration, we're also trying to take the best practices of the modern data stack, making things very modular — like a very modular setup for integration and, you know, modeling. And then you can have the staging and the mart layers. In the staging layer, the engineers, you know, work with SQL. In the mart layer, you have the data analysts using either SQL or the UI model. And then the business users, they do some visualization or the data export to wherever they need it to be. And then also, for the technical users, you can use your IDE to code it or the command line, like I've said, or use the front end, a graphical user interface — both contribute to the same Git repository. We try to make, you know, models reusable and do data tagging very well, data cataloging.
What does that mean? Like, this metric belongs to this model, and you shouldn't build it again — like, make it reusable. And, yeah, there are a lot of, you know, different dimensions where we can really accelerate the collaboration process.
[00:43:52] Unknown:
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all in one product analytics suite, including product analysis, user funnels, feature flags, experimentation, and it's open source so you can host it yourself or let them do it for you. You have full control over your data, and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog. Another interesting area to dig into is, because of the fact that you have moved to storing everything about the platform in source control and this Git repository, there are a couple of potential edge cases that come in. One, with the ability to branch and then other people committing changes to the main line, you end up with potential merge conflicts and need to be able to reconcile those. But also, because of the fact that you're dealing with data systems and you're driving modifications to these systems through this source controlled project, and you might create a branch to be able to experiment with different modifications to the table structure, for instance, how are you able to reconcile those side effects of changes with this sort of snapshot in time? If, for instance, I create a branch, I experiment with a different set of statements to be able to create a new table or modify an existing table, and then I decide that I don't actually want to go down that path, how do you make sure that you actually clean up some of these externalities so that it matches what the repository says it should say? So, you know, I create a new table in this branch. I say, actually, I'm gonna get rid of that. I delete the branch. How do you make sure that you then actually go and clean up the additional tables that were created by that, since there is now no longer any reference to those tables in the repository as it exists after the sets of changes? I mean, you already answered the question. That's exactly what we do.
[00:45:53] Unknown:
So you can imagine that we run jobs. Everything on our platform is a job: an integration, a model, an alert, a test, an export — it's all a job. It takes the settings of the user, and it runs a job. And when it's a job that creates tables in the data warehouse or updates them, if it's a successful job, we reference it to a table. It's basically a hash of the timestamp — basically a signature of that job on that table. And we basically work in the same dataset; it doesn't matter which branch you are on, you work in the same dataset. So I can run a job on the develop branch, and, you know, we run this job and it has the signature. And so if I then delete this job out of my Git repo, then we would also delete that table inside your data warehouse. So there's a cron job that basically checks all the different branches, and it checks the repo and sees: hey, which tables are still active? Which tables can we basically delete out of the Git repo and, on top of that, then delete inside the data warehouse to keep it slim?
And what we also do — we had to modify the Git protocol a little bit, because we want to really clean up the history for the jobs. Like, we wanna keep the jobs for, let's say, 3 months. So we have to go within the Git protocol and see all the commits that, you know, the Y42 back end or the bot did. And we could do that because it's just one commit, and then we can clean up all the commits that have been made for jobs that are older than 3 months and that are not being used. So it's cleaned out of the Git repo, and it's cleaned out of the data warehouse as well. That's the part that answers the cleaning up part of the question. And then the merge conflicts. Well, let's say I do have, you know, a merge conflict. Now that's interesting. I absolutely agree with you. We haven't solved it great yet. So the simplest case is you just accept yours or accept theirs if you have a merge conflict within a SQL model. So you are basically — yeah, you're merging.
So you ran one SQL statement on one branch and you ran that job, and it created a table inside the data warehouse. Now you ran another job on the other branch, and it created another table inside the data warehouse. So you have 2 tables now inside the data warehouse, and then you're merging the 2 jobs together. And we're basically also hashing the settings. The settings in this case would be the SQL command, or the settings would be the columns that have been selected, and so on. And if we see that this hash doesn't match the signature of the table that we saved, then you need to rerun the job again, and we would know that this job is not a valid job for the current settings that you have. I hope that answers your question.
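The "settings hash as table signature" idea can be sketched with nothing but the standard library. The dataclass fields and function names below are assumptions made for the example; the point is only that a materialized table stays valid as long as its recorded signature matches the hash of the settings currently in the repo.

```python
# Illustrative sketch: a job's settings are hashed, and a materialized table is
# only reused if its recorded signature matches the hash of the settings that
# are currently in the repo after a merge. Names are assumptions for the example.
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class JobSettings:
    sql: str                      # the SQL model text, or
    selected_columns: tuple = ()  # the columns picked in an integration


def settings_hash(settings: JobSettings) -> str:
    canonical = json.dumps(asdict(settings), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def table_is_valid(current: JobSettings, recorded_signature: str) -> bool:
    """After a merge, decide whether the existing table can be reused
    or the job has to be rerun against the merged settings."""
    return settings_hash(current) == recorded_signature
```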
[00:48:48] Unknown:
Yeah. It definitely does. It just points to the complexity that's inherent in all of these systems. And so as you have been building and growing the Y42 product and working with some of your early design partners and customers, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:49:07] Unknown:
We have, like, a lot of different types of clients, from a car wash in Brazil all the way to the biggest, you know, tax agency in New Zealand. So really, really different types of companies that weren't able to access, you know, the modern data stack before using our tool. And, yeah, they build, like, you know, really crazy use cases with their data, but it's still mostly reporting use cases. Right? Like, they're not sophisticated enough to run machine learning models on top of some tables that they create. But it basically allows them to understand the business in ways they haven't been able to before. And when we get access to the systems, because they ask us to, you know, debug stuff for them, then we also understand their business in ways that we never thought would be possible. So it's more on a content level rather than, wow, look at this crazy — and actually, let me take that back. There is some funny stuff that happens also. It's not a best practice. Right? But in our UI model, you can have, like, drag and drop nodes in there, and it's a big JSON that we compile out to SQL commands.
Clearly, if you build, you know, like, 20 join nodes and 10 union nodes within that UI model, well, this is not gonna work in BigQuery if you were to run that as one big SQL command — like, BigQuery exceeding limits, whatsoever. So then we would still make it work. I'm not saying that it's best practice. Right? Like, you should split it out into smaller models and abstract that well. But we see some users that are not experienced doing that, which is, you know, not necessarily a good thing. But we offer the support, because we build intermediate tables in between when we see, okay, at this point, it would fail. Then we materialize a table, maybe for 10 minutes, and then we would materialize the next table. So then you run the SQL command on the already materialized table, so it wouldn't fail. And so we would still be able to get them the flat table that they want at the end. And so there are, like, enormous — you know, like, literally 1,000 node — models building out the one flat table.
[00:51:13] Unknown:
That's the one flat table they run the entire company on. In your own experience of building the Y42 product and platform and growing the business around it, what are some of the most interesting, unexpected, or challenging lessons that you've learned personally?
[00:51:27] Unknown:
I always underestimate the enormous amount of work to, you know, cover such a big product surface. It's a lot for us, yeah, to cover, but at the same time, it's very rewarding to try to solve problems on a fundamental level by naturally thinking: hey, how would I solve this problem with a clean slate? And that's very challenging — trying to always be first principles driven, how to design a system that makes sense and that ties together, making decisions about, you know, integrating with a tool like dbt or not, and really thinking through all the implications. Interestingly, I think there's one core innovation that we do. We don't innovate on being the best orchestrator or being the best visualization tool. We can't, and that's not what we want, but we innovate on how we collaborate fundamentally well together.
And, yeah, there are a lot of interesting things happening. I'm just super excited about WebAssembly lately, as you can probably hear from my voice. And I think the way we're using Git as a NoSQL database could be applied to so many different industries, for example, content management systems like Contentful, I don't know if you know that tool. Tools that sit on a database like that would, I think, benefit so much from this approach. So that's super interesting for the team.
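As a rough illustration of the "Git as a NoSQL database" idea, here is a small sketch, not Y42's implementation: the repository path, the `put`/`get` helpers, and the use of the `git` CLI are assumptions for the example, and it presumes Git is installed with a commit identity configured.

```python
# Sketch: store JSON documents as files in a Git repository so every write is
# versioned, diffable, and revertible for free.
import json
import subprocess
from pathlib import Path

REPO = Path("doc_store")  # hypothetical repository backing the "database"


def _git(*args: str) -> None:
    """Run a git command inside the repository directory."""
    subprocess.run(["git", "-C", str(REPO), *args], check=True)


def init_store() -> None:
    """Create the repository that backs the document store."""
    REPO.mkdir(exist_ok=True)
    subprocess.run(["git", "init", str(REPO)], check=True)


def put(collection: str, doc_id: str, document: dict) -> None:
    """Write a document as a JSON file and commit it, so history comes for free."""
    path = REPO / collection / f"{doc_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(document, indent=2))
    _git("add", str(path.relative_to(REPO)))
    _git("commit", "-m", f"put {collection}/{doc_id}")


def get(collection: str, doc_id: str) -> dict:
    """Read a document back from the working tree."""
    return json.loads((REPO / collection / f"{doc_id}.json").read_text())


if __name__ == "__main__":
    init_store()
    put("models", "orders", {"type": "sql", "query": "select * from raw.orders"})
    print(get("models", "orders"))
```

The appeal for a CMS-style tool is that branching, diffing, and rollback come with the storage model instead of having to be built separately on top of a conventional database.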
[00:52:51] Unknown:
So for people who are looking for a way to manage their data flows, and maybe they need to support nontechnical users, what are the cases where Y42 is the wrong choice and they're better suited to building something in-house, integrating the various systems that are out there, or using a different vendor?
[00:53:10] Unknown:
So I think big enterprises that have a lot of different use cases should go with very specialized tools such as Airflow, Dagster, and so on. But at a departmental level within these big enterprises, it makes a lot of sense to use something like Y42 that can play along with the data stack that already exists in the enterprise. We really enable these underserved users to be very productive and to collaborate with the engineers that are using hardcore specialized tooling. Then on the other end, there are companies that are just way too small. They don't have a fully dedicated data person; they just have the founders.
And in these early stages, where the product direction is not clear and the requirements change often, I think these companies are even better off not using a modern data stack, and just sticking with Google Sheets or whatever for as long as they can while they understand what's happening. The same goes for companies that are running on-premise: we operate in the cloud, so if you don't want to be on Snowflake or BigQuery, we cannot serve you either. So to sum it up: very small clients that don't have a data owner or at least one dedicated data hire are not a good fit, nor are companies that are on-premise, nor extremely complex use cases where we could be a part of the stack but shouldn't be running the core infrastructure and orchestrating millions of jobs at the same time.
[00:54:47] Unknown:
As you continue to build and iterate on the platform and expand the capabilities, what are some of the things you have planned for the near to medium term or any areas that you're particularly excited to dig into?
[00:54:58] Unknown:
We are excited to release our V2, which is all about the idea of collaboration and about serving the data engineers much better as well, because they can collaborate with the business users and the data analysts. So that's something we're very, very excited about for the V2. We keep moving in a direction where we want to go upmarket. Meaning, like I said, it all started with me trying to make sense of serving a big enterprise and building out a scalable architecture to serve everyone in that big company.
And now we're slowly moving back there and saying, hey, we are the one tool that helps you define your metrics, that connects with all the other tools, and that enables everyone to really work with their data effectively. We want to embark on that journey of building a very collaborative tool.
[00:55:52] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:07] Unknown:
I think the advancement in the data industry reminds me a bit of the front-end space six or seven years ago, when AngularJS came out; now it's Angular and React. The space is moving in a direction where it becomes more sophisticated, but people are not yet at the level to fully capture or understand that sophistication. That's what I truly believe, and it's why we're building Y42: to serve the early adopters, or all the laggards that are slowly coming into the game of data, who will need that sophistication sooner or later but don't yet have the power to harness it with the modern data stack.
This is the one gap that I keep seeing, and it's the gap we're trying to double down on. But on the engineering best-practices side, I think the way software development works is superior in many ways, and that needs to be adapted to the data engineering space as well. We probably need more observability, and CI/CD should be a big part of data work. Again, if it's not us, it has to be something else. There cannot be so much friction between all the different tools; something needs to unify everything together in one layer of version control. Some sort of unification, taking the whole game and embedding all the software development best practices into the process of data engineering as it is right now.
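As a hedged illustration of what "CI/CD for data" can look like in practice (not any specific product's feature), here is a small check script a CI job could run on every pull request; the table names, the checks, and the fake in-memory warehouse are invented for the example, and in a real pipeline you would wire `run_query` to your BigQuery or Snowflake client.

```python
# Sketch: fail the CI build when a transformed table violates simple expectations.
import sys
from typing import Callable

# name -> (SQL that returns one number, predicate that number must satisfy)
CHECKS = {
    "orders table is not empty": (
        "SELECT COUNT(*) FROM analytics.orders",
        lambda n: n > 0,
    ),
    "no orders without a customer": (
        "SELECT COUNT(*) FROM analytics.orders WHERE customer_id IS NULL",
        lambda n: n == 0,
    ),
}


def run_checks(run_query: Callable[[str], int]) -> int:
    """Run every check; return a non-zero exit code if any of them fail."""
    failures = 0
    for name, (sql, is_ok) in CHECKS.items():
        value = run_query(sql)
        status = "PASS" if is_ok(value) else "FAIL"
        print(f"{status}  {name}  (value={value})")
        failures += status == "FAIL"
    return 1 if failures else 0


if __name__ == "__main__":
    # Fake warehouse so the sketch runs end to end; replace with a real client.
    fake_results = {
        "SELECT COUNT(*) FROM analytics.orders": 1200,
        "SELECT COUNT(*) FROM analytics.orders WHERE customer_id IS NULL": 0,
    }
    sys.exit(run_checks(lambda sql: fake_results[sql]))
```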
[00:57:49] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Y42. It's definitely a very interesting product and platform, and I appreciate the approach that you've taken, so I'm definitely excited to see how it progresses. I appreciate all of the time and energy that you and your team have put into it, and I hope you enjoy the rest of your day. Thank you so much. Have a wonderful day. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Hung Dang's Background and Journey into Data
Building Y42 and Its Evolution
Y42's Core Focus and Innovations
Y42's Architecture and Integration
Challenges and Decisions in Building Y42
User Experience and Collaboration Features
Setting Up and Using Y42
Collaboration and Access Control in Y42
Handling Source Control and Merge Conflicts
Customer Use Cases and Lessons Learned
Future Plans and Exciting Developments
Biggest Gaps in Data Management Tools