Summary
Data Engineering is still a relatively new field that continues to evolve as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers
Interview
- Introduction
- How did you get involved in the area of data management?
- What is your current working definition of a data engineer?
- How has that definition changed since your article on the "rise of the data engineer" and episode 3 of this show about "defining data engineering"?
- How has the growing availability of data infrastructure services shifted foundational skills and knowledge that are necessary to be effective?
- How should a new/aspiring data engineer focus their time and energy to become effective?
- One of the core themes in this current spate of technologies is "democratization of data". In your post on the downfall of the data engineer you called out the pressure on data engineers to maintain control with so many contributors with varying levels of skill and understanding. How well is the "modern data stack" balancing these concerns?
- An interesting impact of the growing usage of data is the constrained availability of data engineers. How do you see the effects of the job market on driving evolution of tooling and services?
- With the explosion of tools and services for working with data, a new problem has evolved of which ones to use for a given organization. What do you see as an effective and efficient process for enumerating and evaluating the available components for building a stack?
- There is also a lot of conversation around the "modern data stack", as well as the need for companies to build a "data platform". What (if any) difference do you see in the implications of those phrases and the skills required to compile a stack vs build a platform?
- How do you view the long term viability of templated SQL as a core workflow for transformations?
- What is the impact of more accessible and widespread machine learning/deep learning on data engineers/data infrastructure?
- How evenly distributed across industries and geographies are the advances in data infrastructure and engineering practices?
- What are some of the opportunities that are being missed or squandered during this dramatic shift in the data engineering landscape?
- What are the most interesting, innovative, or unexpected ways that you have seen the data ecosystem evolve?
- What are the most interesting, unexpected, or challenging lessons that you have learned while contributing to and participating in the data ecosystem?
- In episode 3 of this show (almost five years ago) we closed with some predictions for the following years of data engineering, many of which have been proven out. What is your retrospective on those claims, and what are your new predictions for the upcoming years?
Contact Info
- @mistercrunch on Twitter
- mistercrunch on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- How the Modern Data Stack is Reshaping Data Engineering
- The Rise of the Data Engineer
- The Downfall of the Data Engineer
- Defining Data Engineering – Data Engineering Podcast
- Airflow
- Superset
- Preset
- Fivetran
- Meltano
- Airbyte
- Ralph Kimball
- Bill Inmon
- Feature Store
- Prophecy.io
- Ab Initio
- Dremio
- Data Mesh
- Firebolt
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end-to-end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more.
The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo swag box. Your host is Tobias Macey. And today, I'm welcoming back Maxime Beauchemin to talk about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers. So Max, for anybody who isn't familiar with you, can you give a brief introduction?
[00:02:10] Unknown:
For sure. Yeah. And thank you for having me on the show again. Excited to be here. So how to best introduce myself? I think at this point in my career, I'm probably best known for the work that I've done around Apache Airflow and Apache Superset. I started both these open source projects when I was at Airbnb back in 2014 and 2015. Since then, I went on to start a company called Preset where we essentially offer Apache Superset as a service. For those not familiar with Apache Superset, it is very much a data visualization and exploration platform that caters to all of the business intelligence type use cases and beyond.
And then talking a tiny bit more about my career: over the past 20 years, I've been, you know, a business intelligence engineer and a data warehouse architect. Around the time I joined Facebook in 2012, I believe, is when we started calling ourselves data engineers internally at Facebook, and that's the title that followed me for much of the decade to come. And since then I've just been building a lot of data tools. I really enjoy building tooling around data, so that's really my passion.
[00:03:22] Unknown:
And in terms of your introduction to data, we've gone over that a couple of times in past episodes you've been on, so we won't make you rehash that. I'll just add a link in the show notes for people who wanna go back and hear about it. But as far as the topic at hand today, you recently had a post talking about how the modern data stack is reshaping data engineering. And before we dig too much into that, I'll also call out the previous articles you had done almost 5 years ago on the rise and the downfall of the data engineer, and then we also did an episode all the way back in episode 3 talking about defining data engineering, because it was still very early in the journey of data engineering being a dedicated role and being something that people would actually go out and get as their job title.
So now almost 5 years later, it's almost incomprehensible that that was even the state of the world at the time. So I'm wondering if you can just give your current working definition of what a data engineer is and does given how much it has shifted since the last time we covered that?
[00:04:26] Unknown:
Things have changed quite a bit, right, over the past 4 or 5 years, and that's really what I was interested in in that latest blog post. So I highly encourage people to even, like, pause this podcast and read the post, because I think today we're gonna talk about a lot of these trends and how they're shaping and changing the modern data team and the modern data engineer. But I think the definition of what a data engineer is at its core hasn't really changed, right? It's the practice of designing and building systems and processes for collecting, storing, and analyzing data at scale. At a high level, you know, a data engineer is someone who builds systems and processes around data and metadata.
It's a super broad field, right, that touches just about every industry, every team, every department. One thing that has changed quite a bit, I think, over the past decade is just how mainstream data has become. So, you know, 20 years ago when I was a data warehouse architect, it was the craft of a very small group of people in the company to be kind of the librarians of the data. It was very focused and targeted to a small group of people to take care of that. And now it's like every company wants to be data driven. Every team's a data team. People are investing a lot in their data teams and that kind of skill set. So things have become much more mainstream over the past decade, and then there have been, like, new roles that have been kinda shaping up. So there's some specialization, some new tooling.
There's some tooling that turns things that used to take a lot of time into something that doesn't take much time anymore. So it will be interesting to talk about all of these things and all these changes.
[00:06:13] Unknown:
And to your point, at the time that we first visited the idea of what is data engineering, it was still very much a low level operation where you had to be, as you said, well versed in the mechanical aspects of how data was laid out on disk, how the processing engines were going to work with it, whereas now a lot of that has been pushed into the software layer. You don't need to think about it. You just say, I wanna take the data and put it from point a to point b. There are services to do that. You don't need to think about all of the retry logic and error conditions that go into all of those things that wasted a lot of time and caused a lot of headaches. So I'm wondering how the growing availability of these data infrastructure services and the utilities that are built on top of and around them have shifted the foundational skills and knowledge that are necessary to be effective as a data engineer and some of the ways that new and aspiring data engineers should think about spending their time and energy to actually break into that role? There is a lot in this question
[00:07:16] Unknown:
to unpack. One thing is, like, the rise of the services that automate a lot of what a data engineer used to do. So on one front, there's these cloud data warehouses commoditizing kind of the database administrator type workloads. Right? Even, like, the infrastructure load of the data engineer. I think in the rise of the data engineer, I talked about how some people include the infrastructure work of, like, setting up your data infrastructure as part of the data engineering role. And I think that, with the cloud services that exist today, you don't need to go and kind of set up your own data warehouse. Right? Like, all you need to do is kinda create a Snowflake account or a BigQuery account, and you're up and running fairly quickly. You pay as you go, and you don't even have to necessarily size and kinda grow your cluster based on needs. Like, all that stuff is done for you.
That does mean, though, that there's still a burden around provisioning, like choosing the technology that you're gonna use and giving access to people, and then procurement. Right? Like, making sure you're choosing the right thing for the right reason, and perhaps containing costs in some ways, or monitoring costs, is becoming maybe more of a concern over time. There's another part of, like, the squeeze in some way: on one end, right, the data engineer doesn't have to do some infrastructure work, and maybe doesn't have to do as much of the scripting to hoard data. We're in a phase where, like, data warehousing is, like, a lot of it is about hoarding the data from all of the different systems and subsystems in your company into a central place. Nowadays, that means getting a bunch of data from your SaaS services. Right? A modern company uses hundreds of SaaS services to operate, whether it's around, you know, recruiting or customer CRM type things.
In all areas of business nowadays, we use, you know, targeted, specialized SaaS systems, whether it's payroll or pretty much anything else. So bringing all this data, hoarding the data into the data warehouse, is also something that's getting commoditized by tooling with things like Fivetran and Meltano and Airbyte. It's becoming easier to bring all of the data into a central place, so then some of the workload becomes, like, setting that stuff up, making sure it works, you know, monitoring these things, and the procurement and the operation and the selection of that tooling is still big.
Another area where we've seen changes is, like, the rise of the analytics engineer. That means now we have a data analyst who speaks SQL and knows Git pretty well, so that means they can start automating some of the T in ELT. Right? So these people are able now to kind of solve their own problems and automate their pipelines. So that pushes the data engineer a little bit further away from this. So back to your question: what does that mean in terms of what skills you need as a data engineer nowadays? I think it's an interesting question. Right? I think, like, there's a question around specialization, like, how broad do you wanna go? Do you wanna be more full stack, right, and be able to cover some of the data analyst to data infrastructure spectrum, or do you wanna really focus and be the person who manages core datasets? There's also an area that seems to be just as complex and still very much attributed to data engineers: the streaming pipelines. Clearly, if your company needs to have streaming data pipelines, stream data processing, you know, I think that stays under the realm of specialty of the modern data engineer as well.
[00:11:05] Unknown:
With the introduction of these managed services, where a lot of the work to set up the foundational data platform is just sign up for the service, put in the, you know, credentials, and do some of the integration work, that starts to sound a lot more like an infrastructure engineer than a data engineer responsibility. And I'm wondering what you see as the potential for at least some of the more service and infrastructure level work to be pushed into the domain of the DevOps engineer or the platform engineer and less so in the realm of the data engineer, where the data engineer is maybe just the one making the selection of which tools to actually purchase and integrate and less of doing that actual integration work.
[00:11:50] Unknown:
I think that's clear. Right? Like, we can kinda hand that over to the cloud infrastructure team. They can handle it just as they handle the other systems and pieces of infrastructure that they do handle. So that means the procurement process and, you know, even doing, like, a security review. Right? Is that tool matching our security type requirements? Is it SOC 2 compliant? That's something that can be done in tandem with, or even led by, your normal cloud infrastructure team. Then, in terms of, like, wiring all these things together: so we buy all of these, you know, services and tooling, and there's still, like, a responsibility of making these things work and then tying them together. Right? I think that's true of infrastructure in general. It certainly is true with data infrastructure too. Right? It's not just, like, buying Fivetran and dbt Cloud or Astronomer Cloud or whatever and then saying, okay, we've bought them, now we're done. You still need to go and make these things work very well together. So there's, like, you know, metadata integration.
There is essentially, like, just the duct tape and the chicken wire that's required to get all these things to work together. And I think the reality of the modern software engineer, just as much as, like, the modern, you know, data engineer, is to get all these services to work well together and then to take, you know, all the business rules and the things that are specific to your company and, you know, make sure that those things are reflected in those systems.
[00:13:25] Unknown:
1 of the big themes in the idea of the modern data stack and all the services that are becoming available and the sort of areas of focus for data engineers and data product managers is the idea of democratization of data where you want to make data access more universal throughout the organization. You want to lower the barrier of entry, lower the level of sophistication that's necessary to be able to actually explore these different datasets that are powering the business. And in your post about the downfall of the data engineer, you called out the pressure on data engineers to maintain control with so many different contributors with varying levels of skill and understanding. And I'm wondering how you see the modern data stack balancing some of those concerns of giving everybody access to the data, you know, empowering them to actually ask and answer questions, but at the same time, not overwhelming the data engineer or not introducing sort of uncontrolled manipulation of the data in a way that actually causes there to be invalid assumptions based on the, you know, unknown quality of the datasets or people who are creating new datasets without necessarily understanding what the original context was?
[00:14:40] Unknown:
Yeah. So I think, like, if we think about this problem in the abstract, of, you know, giving more access to more things to more people, you know, there's this danger of, like, people getting lost or, you know, shooting themselves in the foot. And I think that's a general problem. Once something becomes more accessible to more people, there's a risk that it might be misused or misunderstood. So a big thing is the education gap. Right? Do we make sure that we provide resources for these people to use the tooling right? And is the tool well structured and organized to provide all the context that people need to succeed in what they're trying to achieve with the tool?
So one big thing is probably, like, metadata accessibility. Right? So if you're building a dashboard from a dataset, how do you know all the context on this dataset? Like, is it fresh? Who's the owner? Is it reliable? Is it certified? So I think some of these questions can be answered through the use of, like, good metadata and metadata management and maybe, like, a data dictionary, that kind of space of, call it, the data graph or the metadata graph: understanding where is it coming from, who owns it, how reliable is this, who are the other users of this dataset. Right? If it's very popular and used every day by many of your colleagues, it's probably, you know, reliable.
So that's, like, somewhat beyond the tribal knowledge of going to the Slack channel called data questions and asking people, like, hey, does anyone know about this dataset and whether I should use it? Metadata accessibility is important. Educating your workforce too. So at previous companies, we had programs to make sure that we push data literacy forward internally and make that accessible to just about everyone within the company. So at Airbnb, we had Data University where we taught, and I'm sure the program is still going on, maybe has changed over time, but my memory of it is we trained people on the tooling that we had, and there was, like, a progression of, you know, learning Airflow 101 and Airflow 201 and Airflow 301.
There were also classes around data structures and the tables and the datasets that are most popular, and then just also, like, orientation of, like, how do you ask a question? How do you find out whether a dataset is one you might wanna use or might not wanna use? Similarly, at Facebook, there was something called Data Camp, and that was a little bit more intensive. Instead of being, you know, a series of classes, maybe with a commitment of a few hours a week, which was the approach at Airbnb, at Facebook, I believe, it was a full week called Data Camp, and you would just almost kinda check out of your team for a whole week and then go sit through a bunch of classes.
And it was, like, classes and exercises too. Right? So they would ask some questions, give you little projects, and you'd play, you know, data analyst for a whole week, which was pretty exciting. And they made it pretty fun too, where you would learn about the datasets. You would go and answer some really kinda key, intricate questions of, like, hey, how does engagement work for different age groups? And are teens as engaged on Facebook as, you know, different groups of people? And then go and run your own analysis and learn about all the tooling that we had available internally. So there's this education gap. I think that's a big one. And then for the tooling to show more context, I think, is one way to help with that. I'll open up on, like, the topic of, call it data literacy or call it democratization of access to data. There are some bigger topics there, like, do we wanna democratize the entire analytics process? Right? Do we want to make it possible for more people to write pipelines, for more people to go and instrument more events in the application, or for more people to go and define, you know, business rules and things like that too? I think the answer is yes. And then the question is, like, what is the right set of guardrails and education we need to enable more people to do more of this?
[00:18:48] Unknown:
Yeah. The democratization is definitely something that's worth kind of enumerating, where it could just mean giving people access to read it. But as you said, maybe you wanna be able to give everybody access to write their own pipelines, to be able to build their own datasets that power their specific segment of the business. You know, one of the areas that's seeing a lot of attention most recently is the idea of the metrics layer, where you wanna bring the business users in to understand the definitions of what that metric is supposed to mean semantically and some of the ways that the data that we have can be used to actually formulate that metric, because the sales manager is more likely to know what the actual semantics around a conversion should be versus a data analyst or a data engineer. Because, you know, we're working at the layer of the data, so we can see, okay, these are all the numbers, these are the different events that tie together. But from the business perspective, what does it actually mean to be a conversion, and how is that being used in the data? So we wanna make sure that everybody's working together on that. From the pipeline perspective, you know, we have the core set of data pipelines that are kind of protected, and you don't just grant access to everyone. But from that base set of datasets that we're pulling in from the, you know, application databases, from Salesforce, whatever it might be, we then wanna be able to give people access to build their own downstream pipelines, downstream datasets.
But to the point of guardrails, maybe we say here is the kind of cookie cutter template of your dbt model to say, you know, these are the core datasets you're able to pull from. Here is the initial set of operations to build a new transformation or build a new table so that it's using maybe the agreed upon vocabulary as far as column names. But you can now go and build some other view on this dataset that you can consume in this dashboard. So you have kind of the templated out set of steps in the pipeline so that it all ties together with your kind of paved path. And then if they go afield of that, then they're kind of on their own, and you make no guarantees about the validity of the datasets that they're building.
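To make that "cookie cutter" idea a little more concrete, here is a minimal Python sketch of a paved-path model template. It is not tied to dbt's actual project structure, and the approved sources, column vocabulary, and generated SQL are all hypothetical; a real setup would live in a reviewed dbt project or macro library.

```python
# Minimal sketch of a "paved path" model template. The approved sources,
# column vocabulary, and generated SQL are illustrative only.
APPROVED_SOURCES = {
    "core.orders": ["order_id", "customer_id", "order_ts", "revenue_usd"],
    "core.customers": ["customer_id", "signup_ts", "region"],
}

def build_model(source: str, columns: list[str], where: str = "1 = 1") -> str:
    """Generate a downstream view that only selects approved columns."""
    if source not in APPROVED_SOURCES:
        raise ValueError(f"{source} is not an approved source dataset")
    unknown = set(columns) - set(APPROVED_SOURCES[source])
    if unknown:
        raise ValueError(f"columns not in the shared vocabulary: {unknown}")
    column_list = ",\n    ".join(columns)
    return f"""
CREATE OR REPLACE VIEW analytics.my_team_view AS
SELECT
    {column_list}
FROM {source}
WHERE {where}
""".strip()

print(build_model("core.orders", ["order_id", "revenue_usd"], "order_ts >= '2021-01-01'"))
```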
[00:20:55] Unknown:
Yeah. I mean, I'm interested to talk about, like, the metrics layer and, like, what it's after and what's novel about it and what's maybe not so novel about it too. Because, like, you know, if you go back to the original data warehousing books that are, you know, 25 years old now, the Ralph Kimball books, the Bill Inmon books, it was always about metrics and dimensions. It was about, you know, conformed dimensions, conformed metrics, conformed facts, getting consensus, defining these things very, very well, having these things be a reflection of the business.
So I think that these ideas are not new. There's even, like, metric centric data modeling, to say, like, oh, the metric is the most important thing, you know, at the heart of data modeling, or even from a data governance standpoint. You know? I think it does kinda make sense. But what really jumps out to me, you know, looking at these metrics layers and the different people and companies going to emerge in the space, is that they talk about different things when they say metrics layer, too. But I think there are some common things and themes that we see. One is, like, beyond the dbt world of templated SQL, we need higher level abstractions that maybe come with more constraints and guarantees than just, like, your raw templated SQL. So templated SQL is too free form. You know, anybody can do anything.
It's kind of the Wild West, so maybe the metrics layer is a little bit more prescriptive in what you can and cannot do, and how you have to define, say, ownership of things, or how things are derived, or how things like time windows, you know, are expressed more semantically instead of, like, writing these more complex, you know, unreadable mountains of SQL. So I think there's, like, this higher level abstraction with more constraints and guarantees. One thing that makes a lot of sense to me that people are not necessarily talking about too much is this idea of, like, more entity centric data modeling. So when you think about, like, metric centric data modeling, that means, like, hey, we're gonna make the metric the kind of unit, that really strong entity in the information architecture around how we manage data and metadata.
Right? I think that makes sense. Like, if you think at Airbnb about, like, bookings, you know, bookings is a really important thing. Let's define, like, who owns certain subsets of dimensions around how bookings are defined. And then we all need to align on a definition of this stuff. I like entity centric data modeling, which, when you think about it, like, the Kimball book is all about dimensional modeling, and a dimension is an entity. Right? So it is very much entity centric data modeling. I like to push this idea further and say, beyond dimensional modeling, look at, like, feature stores nowadays that are emerging in the field of, like, ML and feature engineering.
I think it's really interesting to bring a lot more metrics inside dimensional modeling, or call it entity centric data modeling, to bring things like, you know, 7 day visits and 7 day clicks and 7 day page views and 28 day equivalents, to pivot these metrics inside these entity centric datasets. So that's happening quite a bit in the field of ML and, you know, historically in dimensional modeling. Going back to, like, the metrics layer, I think to me it's a little bit of a misnomer because metrics are still not useful without dimensions. Right? So we still live in a world of, like, metrics and dimensions. I guess now we're just looking to add, like, more governance kinda constructs and ideas around the metric.
And then these higher levels are, like, less SQL and more YAML. There's, like, kind of this tweak of, like, oh, let's be more configuration driven and less, like, in-the-code, low level declarative transformation. Like, let's operate at a higher level a little bit, which I think is a great idea. Something we could talk a lot more about.
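As a rough illustration of the "less SQL, more configuration" idea, here is a small Python sketch of a config-driven metric definition being compiled into SQL. It is not modeled on any particular metrics-layer product; the metric, dimensions, and table names are made up.

```python
# Hypothetical, hand-rolled sketch of a metrics-layer entry: the metric is
# declared once (owner, expression, allowed dimensions) and queries are
# generated from it instead of everyone re-writing the SQL by hand.
METRICS = {
    "bookings": {
        "owner": "growth-team",
        "table": "fct_bookings",
        "expression": "COUNT(DISTINCT booking_id)",
        "dimensions": ["ds", "region", "device"],
    }
}

def metric_query(name: str, dimensions: list[str]) -> str:
    spec = METRICS[name]
    bad = set(dimensions) - set(spec["dimensions"])
    if bad:
        raise ValueError(f"dimensions not allowed for {name}: {bad}")
    dims = ", ".join(dimensions)
    return (
        f"SELECT {dims}, {spec['expression']} AS {name}\n"
        f"FROM {spec['table']}\n"
        f"GROUP BY {dims}"
    )

print(metric_query("bookings", ["ds", "region"]))
```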
[00:25:04] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Yeah. To your point about operating at the higher level, one of the other interesting things that's been happening lately is the reemergence of these visual pipeline builders and low code slash no code solutions, where, you know, maybe 10 or 15 years ago, it was the world of, you know, SQL Server Integration Services and Pentaho, and everything was a drag and drop builder for defining your pipelines.
And then with the advent of Airflow and the series of tools that followed it, we went back to everything is software. So it was software defined pipelines, so you needed to be able to write code and reason about the flows, along with the, you know, MapReduce world of Hadoop. And now we're starting to build back up to this higher level where you can, you know, take these visual elements and drag and drop them together, but then you have a way to actually drop down into the code layer. So I think it was Prophecy.io that is one of the interesting entrants in that space, where you have this visual mapper. But then when you want to actually dig in and maybe tweak things specifically, it actually generates the Spark code so that you can modify it yourself if you have sufficient knowledge. And so it's an interesting world where we have this kind of hybrid of low code visual builders, but also the ability to drop down into the software level.
[00:26:55] Unknown:
Yeah. It's really interesting to see these cycles too. I think both use cases are valid. I think, like, one realization, you know, as the person who originally created Airflow, is that the pipeline world is, like, too complicated to kinda express, version, diff, and review any other way. It's so complex that it has to be represented as code at a certain level. When you start getting into, like, those GUIs and you try to do, like, source control type things that now are kind of a given, right, like reviewing a pipeline, seeing what it looked like before and after, and forking and testing and CI/CD type things, that stuff to me feels like, as you go up the level of complexity, there's more need to be in that very kind of version controlled and as-code environment. We also see, like, infrastructure as code. Right? Like, it is also a movement and seems to be pretty well settled. There's tension there between, like, declarative and templated too. Right? Like, you express it as YAML.
And then you parameterize it, and, you know, I've seen, like, a lot of YAML with a lot of logic in it, you know, to a point where it doesn't feel like a static declaration of anything. It's very much more like code. So, yeah, there's that tension. You know? So to me, I like the idea of, like, being able to have it both ways. If you could have, you know, the drag and droppiness of, say, Informatica and the code orientation of something like Airflow, and have a bidirectional workflow and be able to, you know, pivot from one to the other and vice versa, maybe that's the best of all worlds. But if you can only have it one way, it probably should be code. Right? Like, I don't know. At a certain level of complexity, these GUIs just seem to break down.
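For readers who haven't seen it, this is roughly what "pipeline as code" looks like in practice: the whole definition lives in a Python file that can be diffed, reviewed, and versioned. This is a minimal sketch that assumes Airflow 2.x is installed; the DAG name, tasks, and commands are hypothetical.

```python
# Minimal Airflow-style DAG, illustrating the "as code" side of the tension
# described above: the pipeline definition is reviewable, diffable Python.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_bookings_rollup",        # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_bookings",
        bash_command="python extract_bookings.py --ds {{ ds }}",  # Jinja-templated run date
    )
    aggregate = BashOperator(
        task_id="aggregate_bookings",
        bash_command="python aggregate_bookings.py --ds {{ ds }}",
    )
    extract >> aggregate  # dependency expressed in code
```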
[00:28:46] Unknown:
Absolutely. I think that, as you said, if you need to go one direction or the other, it should be software, because at a certain point, you can't express the necessary logic in these constrained environments without having a very long iteration cycle of needing to say, okay, well, now I need to, you know, go in and define a completely new visual block with some different input types that will map to the specific use case that I have. And then, you know, you end up with a proliferation of blocks that are very similar to each other with slight tweaks. And so then it's just a different explosion of complexity, where you'd be better off, you know, having just defined a function that takes a few parameters and, you know, does these different conditional steps.
[00:29:26] Unknown:
I think the best one that I've seen, in terms of the GUI that I prefer, is Ab Initio. It's not well known because it was very kind of special purpose and, I believe, very, very expensive, but it was very good in the visual drag and drop realm. And you could pivot from code to visual and visual to code bidirectionally pretty well. And then the parallelism specification, like the way that you could monitor the pipeline as it executed, was pretty great. You could see the visual flow of rows through intricate, you know, transformation phases. So it felt a little bit like when you look at a query execution plan, you know, from a complex, like, parallel database, where you can kinda see all the blocks and the different phases of your query. They kinda just exposed that as an API. So it could be, I'm gonna have, like, a group by, you know, and I'm gonna have a, you know, parallelization phase with a round robin, so you could define all these things very, very well and visually. And for me, it helped me early in my career to think in parallel. Just the fact of seeing it, seeing the rows flow through, and declaring the parallel phases and the computation really helped me understand, like, data processing on distributed architectures early on, because the visuals were so great.
[00:30:44] Unknown:
Another element of the kind of guardrails, and you hinted at it earlier, is the idea of data governance. And that has also gone through a few different shifts. Earlier on, it was a very sort of process oriented, manual endeavor where you had to have the data dictionaries, and you said, you know, these are the different data owners, and, you know, maybe you had very coarse grained access layers to say you can only access this dataset if you have this role in the LDAP system. And now with more code driven and more flexible data systems and layers on top of that, we're thinking in terms of things like the introduction of tools like Immuta, which have more sort of attribute oriented access controls versus just role based access controls, and some more of these flexible metadata layers to be able to understand, as the data flows through the systems, that these permissions need to flow with it, and being able to do sort of just in time access control where somebody wants to query a given table, but it has maybe somebody's address in it, so you need to request access to it, and then that propagates to somebody else to say yes or no. Rather than having to, you know, go through a very manual process of trying to, you know, submit a request to the IT department, waiting for them to turn it around in a week or so before you could run the query, at which point you've forgotten what you were trying to figure out, you know, we can have these more flexible processes to manage data access so that people are constrained and they're not just gonna query everything if they don't have the necessary context or they don't have the necessary access. But in the cases where they do need to be able to run a query across a set of data, they can do so. Another interesting element of that is some of the more sophisticated sort of data privacy algorithms and cryptographic algorithms to be able to actually run queries on encrypted data without ever having to decrypt it in flight, to be able to do aggregations without ever actually seeing, you know, the individual values.
I'm curious to get your thoughts on some of the more modern sort of data governance aspects, how that plays into the data engineering role, and some of the ways that that also manifests in this sort of data democratization play.
[00:33:04] Unknown:
There are, like, many things to unpack here. One topic is kinda inheritance in the data schemas or data access policy. Right? So you mentioned data governance, and to me there are subfields there. One is data access policy, like who can access what, and then there's data governance more broadly, like who created what, who owns what, who's responsible for, say, the change management, the SLAs around a certain dataset. So let's get into the data access policy part. I think things are pointing in the direction that if the database is aware of the dataset graph, right, like, which column is coming from where, then you can apply a good inheritance kind of scheme to, like, data access policy, and there seems to be value in that. It's interesting to see, with ELT kind of winning pretty significantly, right, that the bulk of the batch processing nowadays is written in SQL and done by the database engine. So that means the database engine should be able to track the provenance of any given column and dataset and have some inheritance rules around that. Right? And then in databases like Dremio, for instance, the transformations and the derivatives are baked in, so the database is aware of where things are coming from, and I think there's a need for that. So that means maybe, you know, over time, what we see is, like, the database engine being very aware of the dataset kinda semantics.
Right? Like, so whatever is done in something like dbt inside the database, the database knows about and surfaces that information. There's a lot you can do with that beyond just data access policy. Right? There's, like, aggregate awareness. Right? I could ask a question to the database and it would be like, hey, I know I have an aggregate that's fresh here that will better serve your query. There's a dataset that I know about that you don't know about that I'm gonna use to answer your query more efficiently. So I think, like, we will see the rise of database engines that are aware of the ELT or the transformation semantics and leverage that for all sorts of considerations.
And Dremio is probably the only example I can think of for that. I mean, Vertica kinda does that with projections, you know, and Dremio with reflections, where it's aware of different, perhaps, like, projections or different shapes of the same dataset. That's one thought. I don't know if I wanna get... again, there were two aspects to your question. I'm kinda tempted to go into data governance. That's more the kinda ownership, validity, SLA, SLO type thing, but that's a very large unsolved problem right now, and I think, like, a lot of the interest in data mesh currently is around the fact that data mesh is talking about data governance. Like, who owns what, and what is private, and what is public in terms of datasets, and what's the API to the data warehouse, and what are the guarantees? Like, you know, treating data assets as, they call it, data products. Right? Like, treating your datasets like they're little products with a little kinda API, with binding contracts around them. So I think that's an interesting area where we, you know, haven't figured out as an industry, you know, the answers there. There are interesting parallels in this area with, like, microservices and the DevOps world, where, like, you know, a microservice is, like, a really kinda clear contract and service.
The question is, like, could we have, like, data marts or datasets, you know, with some similar kinda contracts and guarantees in the data world around sets of datasets.
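As a rough sketch of what a "dataset as a product" contract might capture, here is a small, self-contained Python example. It is not taken from data mesh tooling or any specific framework; the dataset name, SLA, and schema fields are illustrative only.

```python
# Hypothetical data contract for a published dataset: who owns it, what the
# freshness guarantee is, and what schema consumers can rely on. Checking rows
# against the contract stands in for the "binding contract" idea.
CONTRACT = {
    "dataset": "analytics.daily_bookings",
    "owner": "core-data-team",
    "freshness_sla_hours": 6,
    "schema": {"ds": str, "region": str, "bookings": int},
}

def validate_rows(rows: list[dict]) -> list[str]:
    """Return a list of violations of the contract's schema."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected_type in CONTRACT["schema"].items():
            if column not in row:
                problems.append(f"row {i}: missing column {column!r}")
            elif not isinstance(row[column], expected_type):
                problems.append(f"row {i}: {column!r} is not {expected_type.__name__}")
    return problems

sample = [{"ds": "2021-11-01", "region": "EMEA", "bookings": 1200}]
print(validate_rows(sample) or "contract satisfied")
```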
[00:36:39] Unknown:
In terms of the actual data engineering position, as the usage of data has become more widespread, more data sources have become available, it's easier to actually get a data infrastructure set up with all these different services. It has caused an increase in demand for data engineers, which has made it difficult for companies to be able to actually hire for it because there are so many opportunities out there. And I'm curious what your sense of the sort of order of dependency has been in the sort of rise of the modern data stack and the demand for data engineers as to which has driven the other more prominently?
[00:37:21] Unknown:
I think on one front, with the analytics engineer, you know, as this materializes, we can get enough people that have those skills: the analytical mindset and kind of the curiosity of the data analyst, for someone who's, like, vertically aligned, right, that sits in a product team and wants to solve problems with data while being able to write pipelines, check things into source control, and have decent kind of data engineering hygiene. And I think maybe that creates a new role that removes some of the load and the pressure to have so many data engineers. And the fact that they're vertically aligned, I think their odds of succeeding are probably better than a data engineer maybe trying to do that across verticals. So that removes some of the pressure there.
I think, like, as any discipline matures, it becomes more the essence of itself. Right? So everything that is automatable in the role becomes, you know, served by tooling and by practices. And what is left is the things that cannot, you know, be solved with a single solution or, like, a one size fits all type of solution. So what does that mean in the world of the data engineer? What's the essence of it once everything that is automatable is automated? I think there's, like, less and less left. There's probably a page to read from the DevOps movement there too. Like, you know, there's still very much a need for DevOps engineers even, like, kinda 10 years into the DevOps movement.
One question is, like, what are some of the things that every data engineer does that are gonna go away maybe in the next, you know, 5 years, and what services are gonna pop up? So there are, like, these common patterns in data pipelines. Right? And in the past, I've been talking about what I call, like, parametric pipelines, which is this idea of these higher level abstractions we were talking about a little bit earlier. So everyone does, like, sessionization, for instance, to provide answers around, you know, clickstream analysis and segmentation. Right? So we all do this stuff.
And then companies, as they mature, build their own A/B testing framework that computes, you know, p-values, confidence intervals, and does all sorts of magic and complex computation around, you know, subjects and experiments and metric sets and all this stuff. So that's another one. You know, I've seen people build and rebuild, you know, cohort analysis frameworks. And for all of these, I think we're gonna see companies or people or open source projects or abstractions that help people solve these problems without having to reinvent the wheel, instead of every single company essentially building a variation on the same theme. Like, I would love to see these abstractions come into existence so that next time I need to do sessionization, I can just, like, you know, download the package and solve that problem.
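Sessionization is a good example of one of these recurring primitives. Here is a minimal, self-contained Python sketch (not from any existing package) that assigns session IDs to click events using a 30-minute inactivity gap, the kind of logic every company tends to rewrite.

```python
# Toy sessionization: events within 30 minutes of the previous event for the
# same user share a session. Real pipelines usually do this with SQL window
# functions or Spark, but the core rule is this simple.
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(events):
    """events: list of (user_id, datetime) tuples; returns (user_id, ts, session_id)."""
    out = []
    last_seen = {}  # user_id -> (last timestamp, current session number)
    for user_id, ts in sorted(events, key=lambda e: (e[0], e[1])):
        prev = last_seen.get(user_id)
        if prev is None or ts - prev[0] > SESSION_GAP:
            session = (prev[1] + 1) if prev else 0  # start a new session
        else:
            session = prev[1]                       # continue current session
        last_seen[user_id] = (ts, session)
        out.append((user_id, ts, f"{user_id}-{session}"))
    return out

events = [
    ("u1", datetime(2021, 11, 1, 9, 0)),
    ("u1", datetime(2021, 11, 1, 9, 10)),
    ("u1", datetime(2021, 11, 1, 11, 0)),   # > 30 min gap, new session
]
print(sessionize(events))
```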
We're not quite there yet. Right? Like, I think we might see, like, you know, Airflow DAGs or Airflow libraries or, like, dbt projects as reference implementations. But we're in a phase where it's even hard to find good reference implementations for the things I talked about. Right? If I go today and I'm like, I wanna write my A/B testing framework or I wanna do some sessionization, like, what are the resources? You'll probably find some reference implementation that, if you're lucky, you might be able to reuse tiny portions of it
[00:40:57] Unknown:
and kind of bend it into submission to get to where you need to be. Right? Yeah. And to your point of the selection of these different, you know, prebuilt packages, but also at the level of the different services that are being built and composed together, you know, that has definitely become one of the responsibilities of the data engineer to say, okay, you know, do I wanna use Fivetran? Do I wanna use Stitch? Do I wanna use Meltano? You know, which data warehouse do I need? There are multiple offerings for each of these different layers of the stack, and so a big part of it is just tool selection and integration of those systems. And I'm wondering what you have found to be some of the useful strategies for approaching that selection process and being able to understand how well each of those different layers integrates together, some of the potential edge cases that might come about where, you know, maybe I want to use Fivetran, but it doesn't work with Firebolt yet kind of a thing, and being able to discover some of those edge cases before you get too far down the road of trying to get it integrated and find out that it actually doesn't work yet. Part of the beauty of the modern data stack,
[00:42:04] Unknown:
I think, you know, as we try to define it, like, one of the properties that we've seen is the pay as you go and try at will, or at least, like, try for cheap. So if it's pay as you go and you wanna do a proof of concept, you're able to self serve into that. Whereas in the past, you might have to, like, spin up some infrastructure to do a POC, or pay, or deal with a vendor process and have an official POC approach. And then the POC becomes an institution where, like, now you have to involve 3 or 4 vendors. If you wanna do a horizontal or vertical kinda integration through it too, you would have to involve multiple vendors for each layer and then align them, and then the combinations just become really heavy. So at least now, I think, you can go from having nothing to having a pretty decent proof of concept with a kind of full stack integration pretty quickly.
So if you wanna try Fivetran today, I think it's pretty easy to get started and to get some data starting to flow. And similarly, I think, with, like, reverse ETL or some of these things that used to be very non trivial. Then you probably want to, you know, talk to peers, similar companies, like, you know, tap into the collective wisdom of people that are kinda like you, and then make sure it works for you. And, hopefully, you can get going. I think our story with reverse ETL at Preset is we're like, hey, you know, we need to send data back to HubSpot, some product analytics back to HubSpot. You know, how are we gonna do this? I was like, oh, let me just try one. I'm just gonna try Hightouch, and within, you know, like, 25 minutes, I was connected to my database and sending data over, and everything was working pretty well. And we only needed one integration, which fits under the freemium plan, and we're like, okay, well, problem solved. You know?
So we built confidence very quickly, and I think that's where the more old school vendors need to worry a little bit. For this generation of people, you know, we wanna self serve. We wanna run our own POC, and we wanna get, like, time to value down to, like, sub one hour, and that's just not compatible with the more traditional sales cycle, where you gotta talk to someone, and they're gonna ask you a bunch of questions, they're gonna qualify you, and they're not gonna be interested in selling you anything unless, like, your contract value is gonna be above, you know, 20 or 50 thousand dollars. So I think the more traditional vendors are gonna miss out on these opportunities.
So you kinda sneak up on them.
[00:44:38] Unknown:
Another interesting element of wordplay is that the idea of the modern data stack has gained a lot of popularity, as well as the idea of building a data platform. And I'm wondering if you see those as disjoint concerns or something where you start with the modern data stack and then you have to build the platform on top of it, and some of the sort of skills and responsibilities that are implied in each of those phrases.
[00:45:06] Unknown:
I don't know exactly what the data platform is and what the modern data stack is. They're, like, both a little bit unanswered. But, like, one way I would paint a picture for me is, like, my data platform at the startup that I'm part of is the collection of building blocks that we selected from the modern data stack and made work together with our business logic. Right? So we picked a certain number of things and invested in making them work together. They're all modern data stack building blocks, I would say. And then our data platform is, like, the fabric or the mesh of services and the logic that we built on top of it.
[00:45:43] Unknown:
Going back to what you were saying earlier about the role of templated SQL and the current prominence that it has in the form of dbt and a few other systems. But as you were saying, we need some higher level constructs to be able to have appropriate guardrails and appropriate kind of proofs around the validity of the workflows that we're trying to build, where SQL is a little bit too free form because it's just text. You know, it's parsable, you can make some assumptions about it, but it's very easy to kinda shoot yourself in the foot without necessarily having some advance warning of that fact. And I'm curious what you see as the long term viability of tools like dbt and the idea of templated SQL
[00:46:29] Unknown:
as a core workflow, and maybe some of the ideas that might succeed it. Is it a more fitted abstraction, maybe? Right? And so there's a question as to whether, you know, dbt or Airflow or templated SQL can be the building block of these higher level constructs, and I'd like to shoot that down. I think it's not. So I think, like, dbt or templated SQL is a great way, I would say, to express ETL primitives. And by ETL primitives, I mean, like, you wanna source from a dataset, you wanna apply filters, you wanna group by, you wanna join. So ETL primitives, or data transformation primitives, are these simple things that are very, very well expressed in SQL.
And with a little bit of YAML in there or templating, a little bit of Jinja and YAML and parameterization, you can achieve a lot, and it's great. I think that's a lot of the rise of dbt. And by the way, I would say, like, Airflow has templating, like Jinja, baked into it very deeply too. So you can achieve very similar things with Airflow. Right? So Airflow, I would say, is a superset of what you can do with dbt in many ways. Right? So you can also have, like, all these other operators and your SQL operators and the Jinja templating. But I would say dbt does a better job at, you know, showing you exactly just the subset of what you need if all you care about is a single data warehouse and using just templated SQL.
I think, like, dbt is just very elegant in terms of, like, coordinating a lot of SQL very, very well. It solves that in a very good way. So now if you wanna build these higher level constructs, let's just take one, and I don't know which one is the best one, but we could take, like, the A/B testing framework. Right? So you can go and write an A/B testing framework in dbt today. Right? Like, with YAML, you could say, like, go and define your metrics, your metric groups, where you have, like, your subjects, right, your user IDs and all these metrics, and then what are your experiments and your exposure tables.
And you can go and build all of that. But then what you're building is really hard to reuse for a variety of reasons. One is that as you become kinda logic heavy, you have a lot more Jinja than SQL, and then that's just not very expressive. Like, SQL with a lot of Jinja in it, where every field list is a for loop on a collection of fields stored somewhere else, is just, like, very hard to read and reason about, and it's not expressive enough to do that well, to have, like, these very dynamic pipelines. And then there's the other core issue, which is, like, dbt doesn't really solve the fact that you're writing a certain dialect, so you're only solving the problem for people who use the same dialect as you. Or if you're trying to say, like, oh, you know, I'm gonna write something that works kinda across SQL dialects, then your template is gonna become even more overloaded with Jinja. Right? So you would not use something like LIMIT directly.
You would use some sort of, like, more intricately complex abstraction on top of it. So I think, like, dbt doesn't seem like the right place to build these higher level constructs. Right? You know, maybe it's a great place to do a reference implementation and say, like, I have this simple dbt project where I do sessionization. I'm gonna share this in a GitHub repo, and you can take it and reuse what you want and alter it to kinda fit your needs, which might be the first phase. Like, I think we need people doing that today so we can identify the patterns and share and talk about these things and have, like, a good library of reference implementations so people can compare and try things. So it's a good place to start. Spark, maybe? It seems like a better place to do some of these things: the way you can write these more dynamic pipelines, it can be more DRY, it's, like, more expressed as code. It seems more like a natural place for some of these frameworks, these higher level constructs, to be expressed.
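To illustrate the readability problem with logic-heavy templates mentioned above, here is a small sketch that uses the Jinja library directly (it assumes the `jinja2` package is installed; the metric list and table name are made up). Once every field list is a for loop over a collection defined somewhere else, the actual SQL is only visible after rendering.

```python
# Hypothetical example of Jinja-heavy SQL: the template is mostly control flow,
# and the query itself only appears once it is rendered. Requires jinja2.
from jinja2 import Template

METRICS = [
    {"name": "visits_7d", "expr": "SUM(visits_7d)"},
    {"name": "clicks_7d", "expr": "SUM(clicks_7d)"},
]

TEMPLATE = Template(
    """
SELECT
    user_id
    {%- for m in metrics %},
    {{ m.expr }} AS {{ m.name }}
    {%- endfor %}
FROM fct_user_activity
GROUP BY user_id
""".strip()
)

print(TEMPLATE.render(metrics=METRICS))
```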
It seems more like a natural place for some of these frameworks to be in a higher level construct to be expressed. I don't know. There's a real question there of, like, if you're trying to build these abstractions today, right, reusable kinda high level transformations, And I called them parametric pipelines or competition frameworks in the past. It's like kinda this idea of these higher level construct that solves certain, like, data engineering high level challenges. Like, what's the right tool set if you're trying to build, like, a 1 size fits all solution or reusable component that all companies can use to solve these problems?
I don't know. I think I'd use Spark as probably what I would try to use if I was to work on that. Does that work for everyone? Like, does everyone has a Spark cluster or is able to run a Spark workload? Does it make sense for people to get data out their warehouse to compute it somewhere else and send it back in the ELT heavy world? Maybe. I don't know. It's unclear.
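To ground the "parametric pipeline" idea, here is a minimal sketch of what a reusable, parameterized transformation could look like in PySpark: a sessionization step expressed as code, with the subject column, timestamp column, and session gap as parameters. This is an assumption-laden illustration, not a framework referenced in the conversation; all column names and defaults are hypothetical.

```python
# A sketch of a parameterized, reusable transformation ("parametric pipeline")
# expressed as PySpark code rather than templated SQL. Column names, the
# 30-minute gap default, and the session_id scheme are illustrative only.
from pyspark.sql import DataFrame, SparkSession, Window
from pyspark.sql import functions as F


def sessionize(events: DataFrame, user_col: str, ts_col: str,
               gap_seconds: int = 1800) -> DataFrame:
    """Assign a session_id per event: a new session starts whenever the gap
    since the same user's previous event exceeds gap_seconds."""
    w = Window.partitionBy(user_col).orderBy(ts_col)
    prev_ts = F.lag(F.col(ts_col).cast("long")).over(w)
    gap = F.col(ts_col).cast("long") - prev_ts
    is_new_session = (gap.isNull() | (gap > gap_seconds)).cast("int")
    return events.withColumn(
        "session_id",
        F.concat_ws("-", F.col(user_col),
                    F.sum(is_new_session).over(w).cast("string")),
    )


if __name__ == "__main__":
    spark = SparkSession.builder.appName("sessionize-demo").getOrCreate()
    events = spark.createDataFrame(
        [("u1", "2021-09-01 10:00:00"),
         ("u1", "2021-09-01 10:10:00"),
         ("u1", "2021-09-01 12:00:00")],
        ["user_id", "event_ts"],
    ).withColumn("event_ts", F.to_timestamp("event_ts"))
    sessionize(events, "user_id", "event_ts").show()
```

Because the logic lives in a function rather than a template, the same building block could in principle be tested, composed, and shared without rewriting SQL for each dialect, which is the trade-off being weighed here.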
[00:51:37] Unknown:
Just put everything into Delta Lake, and then problem solved.
[00:51:42] Unknown:
Put it into a lake and let people write MapReduce to solve it, and we're done. That's the way we used to do it a long time ago, like 20 years ago. Yeah. But, yeah, I mean, I would love to see a lot more of these reference implementations, people sharing, like, hey, this is how this team at this company solved it, built a core analysis framework on top of Airflow, and here's what you might wanna try to reuse, right, and alter and make sense of. And you could have more people sharing more of these things.
I think it would become clearer what the different variations on that topic are, and it would make it easier for someone to eventually solve that problem once and for all, for everyone.
[00:52:28] Unknown:
There are a number of other sort of hot take topics that it would be fun to dig into. Maybe we'll have to slate those for another interview to go a little deeper on them. So I guess just quickly, in your work participating and contributing to the data ecosystem, what have been some of the interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:50] Unknown:
So one thing that's interesting is to see how there are these cycles. If you've been around long enough in any given discipline, you'll see new people come in and take a fresh look at old problems without necessarily having the context of some of the failures of the past. I think there's a beauty in that. Right? The innocence of giving an old problem a completely new shot with a new environment and a new set of tools and solutions. The world has changed, so you don't have to think about the problem the same way, and you can get really creative, fresh ideas.
On the other side, there's the stupidity of missing out, kind of a teenage innocence of not being able to leverage previous experience, missing out on learning from previous achievements and struggles. So it's interesting to be that person, the one who points out technologies or methodologies that were born or existed 10, 15, 20 years ago and went pretty far. In some cases, those efforts are very notable; they didn't solve exactly the same set of problems in the same way, but sometimes they optimized for a different kind of outcome or a different facet of the problem, and did much better on that facet than what we're doing now. So it's been really interesting to watch everyone rebuild everything on new premises.
Like, you know, everything's gotta be in the cloud, everything is as a service, everything is pay as you go, and everything is distributed first. But in terms of user experience and some of the expressivity of how we solve the problem, there are shortcomings on that side as we optimize for new kinds of constraints.
[00:54:43] Unknown:
To close out the show, as the final question: in episode 3, when we took a crack at defining data engineering, we closed out with some predictions for the following years of what would come for the data engineering role. Many of those have actually been proven out pretty well, so you were very prescient in that. So now that we're recapping some of those ideas and the definition of data engineering, I'm interested in what your next set of predictions are for the upcoming years. I think there's a question around, like, how are data engineers gonna
[00:55:16] Unknown:
work with analytics engineers. And that's a similar question, I think, to how a DevOps specialist works with developers or engineers elsewhere in the company. You know, it's kind of the trade-off of vertically aligned versus horizontally aligned. But I think in the short term, we're gonna see a little bit of a struggle and tension in identifying the border between the two roles. And maybe the data engineer is gonna feel like they're herding slightly more reckless analytics engineers who wanna solve business problems first, are maybe oriented a little more short term, and don't care as much about performance and cost and naming conventions and best practices and hygiene. Right? So we're gonna see some tension form there until we can create all of the tooling and the rules and the guidelines and the best practices that are required to keep these people in check and make sure that they're not accumulating debt as they solve problems in their respective verticals.
[00:56:17] Unknown:
Alright. Well, thank you very much for taking the time again today to talk through the current definition of data engineering. I appreciate all the time and energy that you have put into contributing to the data ecosystem and your continued thought leadership, if you will. It's always a pleasure to have you on the show, and we'll definitely have to have you back again sometime. So thank you again for all of that, and I hope you enjoy the rest of your day.
It's been a pleasure, and I know there were a lot more questions on your list that we did not cover, so I'm happy to come back on the show at some point and push the conversation forward.
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Maxime Beauchemin and His Work
The Evolution of Data Engineering
Impact of Data Infrastructure Services
Democratization of Data
The Metrics Layer and Data Modeling
Visual Pipeline Builders and Low Code Solutions
Modern Data Governance
Demand for Data Engineers
Tool Selection and Integration
Modern Data Stack vs. Data Platform
Future of Templated SQL and Higher-Level Constructs
Lessons Learned in the Data Ecosystem
Predictions for the Future of Data Engineering