Summary
Data Engineering is still a relatively new field that continues to evolve as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers
Interview
- Introduction
- How did you get involved in the area of data management?
- What is your current working definition of a data engineer?
- How has that definition changed since your article on the "rise of the data engineer" and episode 3 of this show about "defining data engineering"?
- How has the growing availability of data infrastructure services shifted foundational skills and knowledge that are necessary to be effective?
- How should a new/aspiring data engineer focus their time and energy to become effective?
- One of the core themes in this current spate of technologies is "democratization of data". In your post on the downfall of the data engineer you called out the pressure on data engineers to maintain control with so many contributors with varying levels of skill and understanding. How well is the "modern data stack" balancing these concerns?
- An interesting impact of the growing usage of data is the constrained availability of data engineers. How do you see the effects of the job market on driving evolution of tooling and services?
- With the explosion of tools and services for working with data, a new problem has evolved of which ones to use for a given organization. What do you see as an effective and efficient process for enumerating and evaluating the available components for building a stack?
- There is also a lot of conversation around the "modern data stack", as well as the need for companies to build a "data platform". What (if any) difference do you see in the implications of those phrases and the skills required to compile a stack vs build a platform?
- How do you view the long term viability of templated SQL as a core workflow for transformations?
- What is the impact of more accessible and widespread machine learning/deep learning on data engineers/data infrastructure?
- How evenly distributed across industries and geographies are the advances in data infrastructure and engineering practices?
- What are some of the opportunities that are being missed or squandered during this dramatic shift in the data engineering landscape?
- What are the most interesting, innovative, or unexpected ways that you have seen the data ecosystem evolve?
- What are the most interesting, unexpected, or challenging lessons that you have learned while contributing to and participating in the data ecosystem?
- In episode 3 of this show (almost five years ago) we closed with some predictions for the following years of data engineering, many of which have been proven out. What is your retrospective on those claims, and what are your new predictions for the upcoming years?
Contact Info
- @mistercrunch on Twitter
- mistercrunch on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- How the Modern Data Stack is Reshaping Data Engineering
- The Rise of the Data Engineer
- The Downfall of the Data Engineer
- Defining Data Engineering – Data Engineering Podcast
- Airflow
- Superset
- Preset
- Fivetran
- Meltano
- Airbyte
- Ralph Kimball
- Bill Inmon
- Feature Store
- Prophecy.io
- Ab Initio
- Dremio
- Data Mesh
- Firebolt
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end-to-end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more.
The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo swag box. Your host is Tobias Macey. And today, I'm welcoming back Maxime Beauchemin to talk about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers. So Max, for anybody who isn't familiar with you, can you give a brief introduction?
[00:02:10] Unknown:
For sure. Yeah. And thank you for having me on the show again. Excited to be here. So how to best introduce myself? I think at this point in my career, I'm probably best known for the work that I've done around Apache Airflow and Apache Superset. I started both these open source projects when I was at Airbnb back in 2014 and 2015. Since then, I went on to start a company called Preset where we essentially offer Apache Superset as a service. For those not familiar with Apache Superset, it is very much a data visualization and exploration platform that caters to all of the business intelligence type use cases and beyond.
And then talking a tiny bit more about my career: over the past 20 years, I've been, you know, a business intelligence engineer and a data warehouse architect. Around the time I joined Facebook in 2012, I believe, is when we started calling ourselves data engineers internally at Facebook, and that's the title that followed me for much of the decade to come. And since then I've just been building a lot of data tools. I really enjoy building tooling around data, so that's really my passion.
[00:03:22] Unknown:
And in terms of your introduction to data, we've gone over that a couple of times in past episodes you've been on, so we won't make you rehash that. I'll just add a link in the show notes for people who wanna go back and hear about it. But as far as the topic at hand today, you recently had a post talking about how the modern data stack is reshaping data engineering. And before we dig too much into that, I'll also call out the previous articles you had done almost 5 years ago on the rise and the downfall of the data engineer, and then we also did an episode all the way back in episode 3 talking about defining data engineering, because it was still very early in the journey of data engineering being a dedicated role and being something that people would actually go out and get as their job title.
So now almost 5 years later, it's almost incomprehensible that that was even the state of the world at the time. So I'm wondering if you can just give your current working definition of what a data engineer is and does given how much it has shifted since the last time we covered that?
[00:04:26] Unknown:
Things have changed quite a bit, right, over the past 4 or 5 years, and that's really what I was interested in in that latest blog post. So I highly encourage people to even, like, pause this podcast and read the post, because I think today we're gonna talk about a lot of these trends and how they're shaping and changing the modern data team and the modern data engineer. But I think the definition of what a data engineer is at its core hasn't really changed, right? It's the practice of designing and building systems and processes for collecting, storing, and analyzing data at scale. At a high level, you know, a data engineer is someone who builds systems and processes around data and metadata.
It's a super broad field, right, that touches just about every industry, every team, every department. One thing that has changed quite a bit, I think, over the past decade is just how mainstream data has become. So, you know, 20 years ago when I was a data warehouse architect, it was the craft of a very small group of people in the company to be kind of the librarians of the data. It was very focused and targeted to a small group of people to take care of that. And now it's like every company wants to be data driven. Every team's a data team. People are investing a lot in their data teams and that kind of skill set. So things have become much more mainstream over the past decade, and then there have been, like, new roles that have been kinda shaping up. So there's some specialization, some new tooling.
There's some tooling that turns things that used to take a lot of time into something that doesn't take much time anymore. So it will be interesting to talk about all of these things and all these changes.
[00:06:13] Unknown:
And to your point, at the time that we first visited the idea of what is data engineering, it was still very much a low level operation where you had to be, as you said, well versed in the mechanical aspects of how data was laid out on disk, how the processing engines were going to work with it, whereas now a lot of that has been pushed into the software layer. You don't need to think about it. You just say, I wanna take the data and put it from point a to point b. There are services to do that. You don't need to think about all of the retry logic and error conditions that go into all of those things that wasted a lot of time and caused a lot of headaches. So I'm wondering how the growing availability of these data infrastructure services and the utilities that are built on top of and around them have shifted the foundational skills and knowledge that are necessary to be effective as a data engineer and some of the ways that new and aspiring data engineers should think about spending their time and energy to actually break into that role? There is a lot in this question
[00:07:16] Unknown:
to unpack. One thing is, like, the rise of the services that automate a lot of what a data engineer used to do. So on one front, there's these cloud data warehouses commoditizing kind of the database administrator type workloads. Right? Even, like, the infrastructure load of the data engineer. I think in the rise of the data engineer, I talked about how some people include the infrastructure work of, like, setting up your data infrastructure as part of the data engineering role. And I think that, with the cloud services that exist today, you don't need to go and kind of set up your own data warehouse. Right? Like, all you need to do is kinda create a Snowflake account or a BigQuery account, and you're up and running fairly quickly. You pay as you go, and you don't even have to necessarily size and kinda grow your cluster based on needs. Like, all that stuff is done for you.
That does mean, though, that there's still a burden around provisioning, like choosing the technology that you're gonna use and giving access to people, and then procurement. Right? Like, making sure you're choosing the right thing for the right reason, and perhaps containing costs in some ways, or monitoring costs, is becoming maybe more of a concern over time. There's another part of, like, the squeeze in some way: on one end, right, the data engineer doesn't have to do some infrastructure work, and maybe doesn't have to do as much of the scripting to hoard data. We're in a phase where, like, data warehousing is, like, a lot of it is about hoarding the data from all of the different systems and subsystems in your company into a central place. Nowadays, that means getting a bunch of data from your SaaS services. Right? A modern company uses hundreds of SaaS services to operate, whether it's around, you know, recruiting or customer CRM type things.
In all areas of business nowadays, we use, you know, targeted, specialized SaaS systems, whether it's payroll or pretty much anything else. So bringing all this data, hoarding the data into the data warehouse, is also something that's getting commoditized by tooling with things like Fivetran and Meltano and Airbyte. It's becoming easier to bring all of the data into a central place, so then some of the workload becomes, like, setting that stuff up, making sure it works, you know, monitoring these things, and the procurement and the operation and the selection of that tooling is still big.
Another area where we've seen changes is, like, the rise of the analytics engineer. That means now we have a data analyst who speaks SQL and knows Git pretty well, so that means they can start automating some of the T in ELT. Right? So these people are able now to kind of solve their own problems and automate their pipelines. So that pushes the data engineer a little bit further away from this. So back to your question: what does that mean in terms of what skills you need as a data engineer nowadays? I think it's an interesting question. Right? I think, like, there's a question around specialization, like, how broad do you wanna go? Do you wanna be more full stack, right, and be able to cover some of the data analyst to data infrastructure spectrum, or do you wanna really focus and be the person who manages core datasets? There's also an area that seems to be just as complex and still very much attributed to data engineers: the streaming pipelines. Clearly, if your company needs to have streaming data pipelines, stream data processing, you know, I think that stays under the realm of specialty of the modern data engineer as well.
[00:11:05] Unknown:
With the introduction of these managed services, where a lot of the work to set up the foundational data platform is just sign up for the service, put in the, you know, credentials, and do some of the integration work, that starts to sound a lot more like an infrastructure engineer than a data engineer responsibility. And I'm wondering what you see as the potential for at least some of the more service and infrastructure level work to be pushed into the domain of the DevOps engineer or the platform engineer and less so in the realm of the data engineer, where the data engineer is maybe just the one making the selection of which tools to actually purchase and integrate and less of doing that actual integration work.
[00:11:50] Unknown:
I think that's clear. Right? Like, we can kinda hand that over to the cloud infrastructure team. They can handle it just as they handle the other systems and pieces of infrastructure that they do handle. So that means the procurement process and, you know, even doing, like, a security review. Right? Is that tool matching our security type requirements? Is it SOC 2 compliant? That's something that can be done in tandem with, or even led by, your normal cloud infrastructure team. Then, in terms of, like, wiring all these things together: so we buy all of these, you know, services and tooling, and there's still, like, a responsibility of making these things work and then tying them together. Right? I think that's true of infrastructure in general. It certainly is true with data infrastructure too. Right? It's not just, like, buying Fivetran and dbt Cloud or Astronomer Cloud or whatever and then saying, okay, we've bought them, now we're done. You still need to go and make these things work very well together. So there's, like, you know, metadata integration.
There is essentially, like, just the duct tape and the chicken wire that's required to get all these things to work together. And I think the reality of the modern software engineer, just as much as, like, the modern, you know, data engineer, is to get all these services to work well together and then to take, you know, all the business rules and the things that are specific to your company and, you know, make sure that those things are reflected in those systems.
[00:13:25] Unknown:
1 of the big themes in the idea of the modern data stack and all the services that are becoming available and the sort of areas of focus for data engineers and data product managers is the idea of democratization of data where you want to make data access more universal throughout the organization. You want to lower the barrier of entry, lower the level of sophistication that's necessary to be able to actually explore these different datasets that are powering the business. And in your post about the downfall of the data engineer, you called out the pressure on data engineers to maintain control with so many different contributors with varying levels of skill and understanding. And I'm wondering how you see the modern data stack balancing some of those concerns of giving everybody access to the data, you know, empowering them to actually ask and answer questions, but at the same time, not overwhelming the data engineer or not introducing sort of uncontrolled manipulation of the data in a way that actually causes there to be invalid assumptions based on the, you know, unknown quality of the datasets or people who are creating new datasets without necessarily understanding what the original context was?
[00:14:40] Unknown:
Yeah. So I think, like, if we think about this problem in the abstract, of, you know, giving more access to more things to more people, you know, there's this danger of, like, people getting lost or, you know, shooting themselves in the foot. And I think that's a general problem. Once something becomes more accessible to more people, there's a risk that it might be misused or misunderstood. So a big thing is the education gap. Right? Do we make sure that we provide resources for these people to use the tooling right? And is the tool well structured and organized to provide all the context that people need to succeed in what they're trying to achieve with the tool?
So one big thing is probably, like, metadata accessibility. Right? So if you're building a dashboard from a dataset, how do you know all the context on this dataset? Like, is it fresh? Who's the owner? Is it reliable? Is it certified? So I think some of these questions can be answered through the use of, like, good metadata and metadata management and maybe, like, a data dictionary, that kind of space of, call it, the data graph or the metadata graph: understanding where is it coming from, who owns it, how reliable is this, who are the other users of this dataset. Right? If it's very popular and used every day by many of your colleagues, it's probably, you know, reliable.
So that's, like, somewhat beyond the tribal knowledge of going to the Slack channel called data questions and asking people, like, hey, does anyone know about this dataset and whether I should use it? Metadata accessibility is important. Educating your workforce too. So at previous companies, we had programs to make sure that we push data literacy forward internally and make that accessible to just about everyone within the company. So at Airbnb, we had Data University where we taught, and I'm sure the program is still going on, maybe has changed over time, but my memory of it is we trained people on the tooling that we had, and there was, like, a progression of, you know, learning Airflow 101 and Airflow 201 and Airflow 301.
There were also classes around data structures and the tables and the datasets that are most popular, and then just also, like, orientation of, like, how do you ask a question? How do you find out whether a dataset is one you might wanna use or might not wanna use? Similarly, at Facebook, there was something called Data Camp, and that was a little bit more intensive. Instead of being, you know, a series of classes, maybe with a commitment of a few hours a week, which was the approach at Airbnb, at Facebook, I believe, it was a full week called Data Camp, and you would just almost kinda check out of your team for a whole week and then go sit through a bunch of classes.
And it was, like, classes and exercises too. Right? So they would ask some questions, give you little projects, and you'd play, you know, data analyst for a whole week, which was pretty exciting. And they made it pretty fun too, where you would learn about the datasets. You would go and answer some really kinda key, intricate questions of, like, hey, how does engagement work for different age groups? And are teens as engaged on Facebook as, you know, different groups of people? And then go and run your own analysis and learn about all the tooling that we had available internally. So there's this education gap. I think that's a big one. And then for the tooling to show more context, I think, is one way to help with that. I'll open up on, like, the topic of, call it data literacy or call it democratization of access to data. There are some bigger topics there, like, do we wanna democratize the entire analytics process? Right? Do we want to make it possible for more people to write pipelines, for more people to go and instrument more events in the application, or for more people to go and define, you know, business rules and things like that too? I think the answer is yes. And then the question is, like, what is the right set of guardrails and education we need to enable more people to do more of this?
[00:18:48] Unknown:
Yeah. The democratization is definitely something that's worth kind of enumerating, where it could just mean giving people access to read it. But as you said, maybe you wanna be able to give everybody access to write their own pipelines, to be able to build their own datasets that power their specific segment of the business. You know, one of the areas that's seeing a lot of attention most recently is the idea of the metrics layer, where you wanna bring the business users in to understand the definitions of what that metric is supposed to mean semantically and some of the ways that the data that we have can be used to actually formulate that metric, because the sales manager is more likely to know what the actual semantics around a conversion should be versus a data analyst or a data engineer. Because, you know, we're working at the layer of the data, so we can see, okay, these are all the numbers, these are the different events that tie together. But from the business perspective, what does it actually mean to be a conversion, and how is that being used in the data? So we wanna make sure that everybody's working together on that. From the pipeline perspective, you know, we have the core set of data pipelines that are kind of protected, and you don't just grant access to everyone. But from that base set of datasets that we're pulling in from the, you know, application databases, from Salesforce, whatever it might be, we then wanna be able to give people access to build their own downstream pipelines, downstream datasets.
But to the point of guardrails, maybe we say here is the kind of cookie cutter template of your dbt model to say, you know, these are the core datasets you're able to pull from. Here is the initial set of operations to build a new transformation or build a new table so that it's using maybe the agreed upon vocabulary as far as column names. But you can now go and build some other view on this dataset that you can consume in this dashboard. So you have kind of the templated out set of steps in the pipeline so that it all ties together with your kind of paved path. And then if they go afield of that, then they're kind of on their own, and you make no guarantees about the validity of the datasets that they're building.
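To make that "cookie cutter" idea a little more concrete, here is a minimal Python sketch of a paved-path model template. It is not tied to dbt's actual project structure, and the approved sources, column vocabulary, and generated SQL are all hypothetical; a real setup would live in a reviewed dbt project or macro library.

```python
# Minimal sketch of a "paved path" model template. The approved sources,
# column vocabulary, and generated SQL are illustrative only.
APPROVED_SOURCES = {
    "core.orders": ["order_id", "customer_id", "order_ts", "revenue_usd"],
    "core.customers": ["customer_id", "signup_ts", "region"],
}

def build_model(source: str, columns: list[str], where: str = "1 = 1") -> str:
    """Generate a downstream view that only selects approved columns."""
    if source not in APPROVED_SOURCES:
        raise ValueError(f"{source} is not an approved source dataset")
    unknown = set(columns) - set(APPROVED_SOURCES[source])
    if unknown:
        raise ValueError(f"columns not in the shared vocabulary: {unknown}")
    column_list = ",\n    ".join(columns)
    return f"""
CREATE OR REPLACE VIEW analytics.my_team_view AS
SELECT
    {column_list}
FROM {source}
WHERE {where}
""".strip()

print(build_model("core.orders", ["order_id", "revenue_usd"], "order_ts >= '2021-01-01'"))
```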
[00:20:55] Unknown:
Yeah. I mean, I'm interested to talk about, like, the metrics layer and, like, what it's after and what's novel about it and what's maybe not so novel about it too. Because, like, you know, if you go back to the original data warehousing books that are, you know, 25 years old now, the Ralph Kimball books, the Bill Inmon books, it was always about metrics and dimensions. It was about, you know, conformed dimensions, conformed metrics, conformed facts, getting consensus, defining these things very, very well, having these things be a reflection of the business.
So I think that these ideas are not new. There's even, like, metric centric data modeling, to say, like, oh, the metric is the most important thing, you know, at the heart of data modeling, or even from a data governance standpoint. You know? I think it does kinda make sense. But what really jumps out to me, you know, looking at these metrics layers and the different people and companies going to emerge in the space, is that they talk about different things when they say metrics layer, too. But I think there are some common things and themes that we see. One is, like, beyond the dbt world of templated SQL, we need higher level abstractions that maybe come with more constraints and guarantees than just, like, your raw templated SQL. So templated SQL is too free form. You know, anybody can do anything.
It's kind of the Wild West, so maybe the metrics layer is a little bit more prescriptive in what you can and cannot do, and how you have to define, say, ownership of things, or how things are derived, or how things like time windows, you know, are expressed more semantically instead of, like, writing these more complex, you know, unreadable mountains of SQL. So I think there's, like, this higher level abstraction with more constraints and guarantees. One thing that makes a lot of sense to me that people are not necessarily talking about too much is this idea of, like, more entity centric data modeling. So when you think about, like, metric centric data modeling, that means, like, hey, we're gonna make the metric the kind of unit, that really strong entity in the information architecture around how we manage data and metadata.
Right? I think that makes sense. Like, if you think at Airbnb about, like, bookings, you know, bookings is a really important thing. Let's define, like, who owns certain subsets of dimensions around how bookings are defined. And then we all need to align on a definition of this stuff. I like entity centric data modeling, which, when you think about it, like, the Kimball book is all about dimensional modeling, and a dimension is an entity. Right? So it is very much entity centric data modeling. I like to push this idea further and say, beyond dimensional modeling, look at, like, feature stores nowadays that are emerging in the field of, like, ML and feature engineering.
I think it's really interesting to bring a lot more metrics inside dimensional modeling, or call it entity centric data modeling, to bring things like, you know, 7 day visits and 7 day clicks and 7 day page views and 28 day equivalents, to pivot these metrics inside these entity centric datasets. So that's happening quite a bit in the field of ML and, you know, historically in dimensional modeling. Going back to, like, the metrics layer, I think to me it's a little bit of a misnomer because metrics are still not useful without dimensions. Right? So we still live in a world of, like, metrics and dimensions. I guess now we're just looking to add, like, more governance kinda constructs and ideas around the metric.
And then these higher levels are, like, less SQL and more YAML. There's, like, kind of this tweak of, like, oh, let's be more configuration driven and less, like, in-the-code, low level declarative transformation. Like, let's operate at a higher level a little bit, which I think is a great idea. Something we could talk a lot more about.
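As a rough illustration of the "less SQL, more configuration" idea, here is a small Python sketch of a config-driven metric definition being compiled into SQL. It is not modeled on any particular metrics-layer product; the metric, dimensions, and table names are made up.

```python
# Hypothetical, hand-rolled sketch of a metrics-layer entry: the metric is
# declared once (owner, expression, allowed dimensions) and queries are
# generated from it instead of everyone re-writing the SQL by hand.
METRICS = {
    "bookings": {
        "owner": "growth-team",
        "table": "fct_bookings",
        "expression": "COUNT(DISTINCT booking_id)",
        "dimensions": ["ds", "region", "device"],
    }
}

def metric_query(name: str, dimensions: list[str]) -> str:
    spec = METRICS[name]
    bad = set(dimensions) - set(spec["dimensions"])
    if bad:
        raise ValueError(f"dimensions not allowed for {name}: {bad}")
    dims = ", ".join(dimensions)
    return (
        f"SELECT {dims}, {spec['expression']} AS {name}\n"
        f"FROM {spec['table']}\n"
        f"GROUP BY {dims}"
    )

print(metric_query("bookings", ["ds", "region"]))
```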
[00:25:04] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Yeah. To your point about operating at the higher level, one of the other interesting things that's been happening lately is the reemergence of these visual pipeline builders and low code slash no code solutions, where, you know, maybe 10 or 15 years ago, it was the world of, you know, SQL Server Integration Services and Pentaho, and everything was a drag and drop builder for defining your pipelines.
And then with the advent of Airflow and the series of tools that followed it, we went back to everything is software. So it was software defined pipelines, so you needed to be able to write code and reason about the flows, along with the, you know, MapReduce world of Hadoop. And now we're starting to build back up to this higher level where you can, you know, take these visual elements and drag and drop them together, but then you have a way to actually drop down into the code layer. So I think it was Prophecy.io that is one of the interesting entrants in that space, where you have this visual mapper. But then when you want to actually dig in and maybe tweak things specifically, it actually generates the Spark code so that you can modify it yourself if you have sufficient knowledge. And so it's an interesting world where we have this kind of hybrid of low code visual builders, but also the ability to drop down into the software level.
[00:26:55] Unknown:
Yeah. It's really interesting to see these cycles too. I think both use cases are valid. I think, like, one realization, you know, as the person who originally created Airflow, is that the pipeline world is, like, too complicated to kinda express, version, diff, and review any other way. It's so complex that it has to be represented as code at a certain level. When you start getting into, like, those GUIs and you try to do, like, source control type things that now are kind of a given, right, like reviewing a pipeline, seeing what it looked like before and after, and forking and testing and CI/CD type things, that stuff to me feels like, as you go up the level of complexity, there's more need to be in that very kind of version controlled and as-code environment. We also see, like, infrastructure as code. Right? Like, it is also a movement and seems to be pretty well settled. There's tension there between, like, declarative and templated too. Right? Like, you express it as YAML.
And then you parameterize it, and, you know, I've seen, like, a lot of YAML with a lot of logic in it, you know, to a point where it doesn't feel like a static declaration of anything. It's very much more like code. So, yeah, there's that tension. You know? So to me, I like the idea of, like, being able to have it both ways. If you could have, you know, the drag and droppiness of, say, Informatica and the code orientation of something like Airflow, and have a bidirectional workflow and be able to, you know, pivot from one to the other and vice versa, maybe that's the best of all worlds. But if you can only have it one way, it probably should be code. Right? Like, I don't know. At a certain level of complexity, these GUIs just seem to break down.
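For readers who haven't seen it, this is roughly what "pipeline as code" looks like in practice: the whole definition lives in a Python file that can be diffed, reviewed, and versioned. This is a minimal sketch that assumes Airflow 2.x is installed; the DAG name, tasks, and commands are hypothetical.

```python
# Minimal Airflow-style DAG, illustrating the "as code" side of the tension
# described above: the pipeline definition is reviewable, diffable Python.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_bookings_rollup",        # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_bookings",
        bash_command="python extract_bookings.py --ds {{ ds }}",  # Jinja-templated run date
    )
    aggregate = BashOperator(
        task_id="aggregate_bookings",
        bash_command="python aggregate_bookings.py --ds {{ ds }}",
    )
    extract >> aggregate  # dependency expressed in code
```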
[00:28:46] Unknown:
Absolutely. I think that, as you said, if you need to go one direction or the other, it should be software, because at a certain point, you can't express the necessary logic in these constrained environments without having a very long iteration cycle of needing to say, okay, well, now I need to, you know, go in and define a completely new visual block with some different input types that will map to the specific use case that I have. And then, you know, you end up with a proliferation of blocks that are very similar to each other with slight tweaks. And so then it's just a different explosion of complexity, where you'd be better off, you know, having just defined a function that takes a few parameters and, you know, does these different conditional steps.
[00:29:26] Unknown:
I think the best one that I've seen, in terms of the GUI that I prefer, is Ab Initio. It's not well known because it was very kind of special purpose and, I believe, very, very expensive, but it was very good in the visual drag and drop realm. And you could pivot from code to visual and visual to code bidirectionally pretty well. And then the parallelism specification, like the way that you could monitor the pipeline as it executed, was pretty great. You could see the visual flow of rows through intricate, you know, transformation phases. So it felt a little bit like when you look at a query execution plan, you know, from a complex, like, parallel database, where you can kinda see all the blocks and the different phases of your query. They kinda just exposed that as an API. So it could be, I'm gonna have, like, a group by, you know, and I'm gonna have a, you know, parallelization phase with a round robin, so you could define all these things very, very well and visually. And for me, it helped me early in my career to think in parallel. Just the fact of seeing it, seeing the rows flow through, and declaring the parallel phases and the computation really helped me understand, like, data processing on distributed architectures early on, because the visuals were so great.
[00:30:44] Unknown:
Another element of the kind of guardrails, and you hinted at it earlier, is the idea of data governance. And that has also gone through a few different shifts. Earlier on, it was a very sort of process oriented, manual endeavor where you had to have the data dictionaries, and you said, you know, these are the different data owners, and, you know, maybe you had very coarse grained access layers to say you can only access this dataset if you have this role in the LDAP system. And now with more code driven and more flexible data systems and layers on top of that, we're thinking in terms of things like the introduction of tools like Immuta, which have more sort of attribute oriented access controls versus just role based access controls, and some more of these flexible metadata layers to be able to understand, as the data flows through the systems, that these permissions need to flow with it, and being able to do sort of just in time access control where somebody wants to query a given table, but it has maybe somebody's address in it, so you need to request access to it, and then that propagates to somebody else to say yes or no. Rather than having to, you know, go through a very manual process of trying to, you know, submit a request to the IT department, waiting for them to turn it around in a week or so before you could run the query, at which point you've forgotten what you were trying to figure out, you know, we can have these more flexible processes to manage data access so that people are constrained and they're not just gonna query everything if they don't have the necessary context or they don't have the necessary access. But in the cases where they do need to be able to run a query across a set of data, they can do so. Another interesting element of that is some of the more sophisticated sort of data privacy algorithms and cryptographic algorithms to be able to actually run queries on encrypted data without ever having to decrypt it in flight, to be able to do aggregations without ever actually seeing, you know, the individual values.
I'm curious to get your thoughts on some of the more modern sort of data governance aspects, how that plays into the data engineering role, and some of the ways that that also manifests in this sort of data democratization play.
[00:33:04] Unknown:
There are, like, many things to unpack here. One topic is kinda inheritance in the data schemas or data access policy. Right? So you mentioned data governance, and to me there are subfields there. One is data access policy, like who can access what, and then there's data governance more broadly, like who created what, who owns what, who's responsible for, say, the change management, the SLAs around a certain dataset. So let's get into the data access policy part. I think things are pointing in the direction that if the database is aware of the dataset graph, right, like, which column is coming from where, then you can apply a good inheritance kind of scheme to, like, data access policy, and there seems to be value in that. It's interesting to see, with ELT kind of winning pretty significantly, right, that the bulk of the batch processing nowadays is written in SQL and done by the database engine. So that means the database engine should be able to track the provenance of any given column and dataset and have some inheritance rules around that. Right? And then in databases like Dremio, for instance, the transformations and the derivatives are baked in, so the database is aware of where things are coming from, and I think there's a need for that. So that means maybe, you know, over time, what we see is, like, the database engine being very aware of the dataset kinda semantics.
Right? Like, so whatever is done in something like dbt inside the database, the database knows about and surfaces that information. There's a lot you can do with that beyond just data access policy. Right? There's, like, aggregate awareness. Right? I could ask a question to the database and it would be like, hey, I know I have an aggregate that's fresh here that will better serve your query. There's a dataset that I know about that you don't know about that I'm gonna use to answer your query more efficiently. So I think, like, we will see the rise of database engines that are aware of the ELT or the transformation semantics and leverage that for all sorts of considerations.
And Dremio is probably the only example I can think of for that. I mean, Vertica kinda does that with projections, you know, and Dremio with reflections, where it's aware of different, perhaps, like, projections or different shapes of the same dataset. That's one thought. I don't know if I wanna get... again, there were two aspects to your question. I'm kinda tempted to go into data governance. That's more the kinda ownership, validity, SLA, SLO type thing, but that's a very large unsolved problem right now, and I think, like, a lot of the interest in data mesh currently is around the fact that data mesh is talking about data governance. Like, who owns what, and what is private, and what is public in terms of datasets, and what's the API to the data warehouse, and what are the guarantees? Like, you know, treating data assets as, they call it, data products. Right? Like, treating your datasets like they're little products with a little kinda API, with binding contracts around them. So I think that's an interesting area where we, you know, haven't figured out as an industry, you know, the answers there. There are interesting parallels in this area with, like, microservices and the DevOps world, where, like, you know, a microservice is, like, a really kinda clear contract and service.
The question is, like, could we have, like, data marts or datasets, you know, with some similar kinda contracts and guarantees in the data world around sets of datasets.
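As a rough sketch of what a "dataset as a product" contract might capture, here is a small, self-contained Python example. It is not taken from data mesh tooling or any specific framework; the dataset name, SLA, and schema fields are illustrative only.

```python
# Hypothetical data contract for a published dataset: who owns it, what the
# freshness guarantee is, and what schema consumers can rely on. Checking rows
# against the contract stands in for the "binding contract" idea.
CONTRACT = {
    "dataset": "analytics.daily_bookings",
    "owner": "core-data-team",
    "freshness_sla_hours": 6,
    "schema": {"ds": str, "region": str, "bookings": int},
}

def validate_rows(rows: list[dict]) -> list[str]:
    """Return a list of violations of the contract's schema."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected_type in CONTRACT["schema"].items():
            if column not in row:
                problems.append(f"row {i}: missing column {column!r}")
            elif not isinstance(row[column], expected_type):
                problems.append(f"row {i}: {column!r} is not {expected_type.__name__}")
    return problems

sample = [{"ds": "2021-11-01", "region": "EMEA", "bookings": 1200}]
print(validate_rows(sample) or "contract satisfied")
```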
[00:36:39] Unknown:
In terms of the actual data engineering position, as the usage of data has become more widespread, more data sources have become available, it's easier to actually get a data infrastructure set up with all these different services. It has caused an increase in demand for data engineers, which has made it difficult for companies to be able to actually hire for it because there are so many opportunities out there. And I'm curious what your sense of the sort of order of dependency has been in the sort of rise of the modern data stack and the demand for data engineers as to which has driven the other more prominently?
[00:37:21] Unknown:
I think on one front, with the analytics engineer, you know, as this materializes, we can get enough people that have those skills: the analytical mindset and kind of the curiosity of the data analyst, for someone who's, like, vertically aligned, right, that sits in a product team and wants to solve problems with data while being able to write pipelines, check things into source control, and have decent kind of data engineering hygiene. And I think maybe that creates a new role that removes some of the load and the pressure to have so many data engineers. And the fact that they're vertically aligned, I think their odds of succeeding are probably better than a data engineer maybe trying to do that across verticals. So that removes some of the pressure there.
I think, like, as any discipline matures, it becomes more the essence of itself. Right? So everything that is automatable in the role becomes, you know, served by tooling and by practices. And what is left is the things that cannot, you know, be solved with a single solution or, like, a one size fits all type of solution. So what does that mean in the world of the data engineer? What's the essence of it once everything that is automatable is automated? I think there's, like, less and less left. There's probably a page to read from the DevOps movement there too. Like, you know, there's still very much a need for DevOps engineers even, like, kinda 10 years into the DevOps movement.
One question is, like, what are some of the things that every data engineer does that are gonna go away maybe in the next, you know, 5 years, and what services are gonna pop up? So there are, like, these common patterns in data pipelines. Right? And in the past, I've been talking about what I call, like, parametric pipelines, which is this idea of these higher level abstractions we were talking about a little bit earlier. So everyone does, like, sessionization, for instance, to provide answers around, you know, clickstream analysis and segmentation. Right? So we all do this stuff.
And then companies, as they mature, build their own A/B testing framework that computes, you know, p-values, confidence intervals, and does all sorts of magic and complex computation around, you know, subjects and experiments and metric sets and all this stuff. So that's another one. You know, I've seen people build and rebuild, you know, cohort analysis frameworks. And for all of these, I think we're gonna see companies or people or open source projects or abstractions that help people solve these problems without having to reinvent the wheel, instead of every single company essentially building a variation on the same theme. Like, I would love to see these abstractions come into existence so that next time I need to do sessionization, I can just, like, you know, download the package and solve that problem.
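Sessionization is a good example of one of these recurring primitives. Here is a minimal, self-contained Python sketch (not from any existing package) that assigns session IDs to click events using a 30-minute inactivity gap, the kind of logic every company tends to rewrite.

```python
# Toy sessionization: events within 30 minutes of the previous event for the
# same user share a session. Real pipelines usually do this with SQL window
# functions or Spark, but the core rule is this simple.
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(events):
    """events: list of (user_id, datetime) tuples; returns (user_id, ts, session_id)."""
    out = []
    last_seen = {}  # user_id -> (last timestamp, current session number)
    for user_id, ts in sorted(events, key=lambda e: (e[0], e[1])):
        prev = last_seen.get(user_id)
        if prev is None or ts - prev[0] > SESSION_GAP:
            session = (prev[1] + 1) if prev else 0  # start a new session
        else:
            session = prev[1]                       # continue current session
        last_seen[user_id] = (ts, session)
        out.append((user_id, ts, f"{user_id}-{session}"))
    return out

events = [
    ("u1", datetime(2021, 11, 1, 9, 0)),
    ("u1", datetime(2021, 11, 1, 9, 10)),
    ("u1", datetime(2021, 11, 1, 11, 0)),   # > 30 min gap, new session
]
print(sessionize(events))
```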
We're not quite there yet. Right? Like, I think we might see, like, you know, Airflow DAGs or Airflow libraries or, like, dbt projects as reference implementations. But we're in a phase where it's even hard to find good reference implementations for the things I talked about. Right? If I go today and I'm like, I wanna write my A/B testing framework or I wanna do some sessionization, like, what are the resources? You'll probably find some reference implementation that, if you're lucky, you might be able to reuse tiny portions of it
[00:40:57] Unknown:
and kind of bend it into submission to get to where you need to be. Right? Yeah. And to your point of the selection of these different, you know, prebuilt packages, but also at the level of the different services that are being built and composed together, you know, that has definitely become one of the responsibilities of the data engineer to say, okay, you know, do I wanna use Fivetran? Do I wanna use Stitch? Do I wanna use Meltano? You know, which data warehouse do I need? There are multiple offerings for each of these different layers of the stack, and so a big part of it is just tool selection and integration of those systems. And I'm wondering what you have found to be some of the useful strategies for approaching that selection process and being able to understand how well each of those different layers integrates together, some of the potential edge cases that might come about where, you know, maybe I want to use Fivetran, but it doesn't work with Firebolt yet kind of a thing, and being able to discover some of those edge cases before you get too far down the road of trying to get it integrated and find out that it actually doesn't work yet. Part of the beauty of the modern data stack,
[00:42:04] Unknown:
I think, you know, as we try to define it, like, one of the properties that we've seen is the pay as you go and try at will, or at least, like, try for cheap. So if it's pay as you go and you wanna do a proof of concept, you're able to self serve into that. Whereas in the past, you might have to, like, spin up some infrastructure to do a POC, or pay, or deal with a vendor process and have an official POC approach. And then the POC becomes an institution where, like, now you have to involve 3 or 4 vendors. If you wanna do a horizontal or vertical kinda integration through it too, you would have to involve multiple vendors for each layer and then align them, and then the combinations just become really heavy. So at least now, I think, you can go from having nothing to having a pretty decent proof of concept with a kind of full stack integration pretty quickly.
So if you wanna try Fivetran today, I think it's pretty easy to get started and to get some data starting to flow. And similarly, I think, with, like, reverse ETL or some of these things that used to be very non trivial. Then you probably want to, you know, talk to peers, similar companies, like, you know, tap into the collective wisdom of people that are kinda like you, and then make sure it works for you. And, hopefully, you can get going. I think our story with reverse ETL at Preset is we're like, hey, you know, we need to send data back to HubSpot, some product analytics back to HubSpot. You know, how are we gonna do this? I was like, oh, let me just try one. I'm just gonna try Hightouch, and within, you know, like, 25 minutes, I was connected to my database and sending data over, and everything was working pretty well. And we only needed one integration, which fits under the freemium plan, and we're like, okay, well, problem solved. You know?
So we built confidence very quickly, and I think that's where the more old school vendors need to worry a little bit. For this generation of people, you know, we wanna self serve. We wanna run our own POC, and we wanna get, like, time to value down to, like, sub one hour, and that's just not compatible with the more traditional sales cycle, where you gotta talk to someone, and they're gonna ask you a bunch of questions, they're gonna qualify you, and they're not gonna be interested in selling you anything unless, like, your contract value is gonna be above, you know, 20 or 50 thousand dollars. So I think the more traditional vendors are gonna miss out on these opportunities.
So you kinda sneak up on them.
[00:44:38] Unknown:
Another interesting element of wordplay is that the idea of the modern data stack has gained a lot of popularity, as well as the idea of building a data platform. And I'm wondering if you see those as disjoint concerns or something where you start with the modern data stack and then you have to build the platform on top of it, and some of the sort of skills and responsibilities that are implied in each of those phrases.
[00:45:06] Unknown:
I don't know exactly what the data platform is and what the modern data stack is. They're, like, both a little bit unanswered. But, like, one way I would paint a picture for me is, like, my data platform at the startup that I'm part of is the collection of building blocks that we selected from the modern data stack and made work together with our business logic. Right? So we picked a certain number of things and invested in making them work together. They're all modern data stack building blocks, I would say. And then our data platform is, like, the fabric or the mesh of services and the logic that we built on top of it.
[00:45:43] Unknown:
Going back to what you were saying earlier about the role of templated SQL and the current prominence that it has in the form of dbt and a few other systems. But as you were saying, we need some higher level constructs to be able to have appropriate guardrails and appropriate kind of proofs around the validity of the workflows that we're trying to build, where SQL is a little bit too free form because it's just text. You know, it's parsable, you can make some assumptions about it, but it's very easy to kinda shoot yourself in the foot without necessarily having some advance warning of that fact. And I'm curious what you see as the long term viability of tools like dbt and the idea of templated SQL
[00:46:29] Unknown:
as a core workflow, and maybe some of the ideas that might succeed it. Is it a more fitted abstraction, maybe? Right? And so there's a question as to whether, you know, dbt or Airflow or templated SQL can be the building block of these higher level constructs, and I'd like to shoot that down. I think it's not. So I think, like, dbt or templated SQL is a great way, I would say, to express ETL primitives. And by ETL primitives, I mean, like, you wanna source from a dataset, you wanna apply filters, you wanna group by, you wanna join. So ETL primitives, or data transformation primitives, are these simple things that are very, very well expressed in SQL.
And with a little bit of YAML in there or templating, a little bit of Jinja and YAML and parameterization, you can achieve a lot, and it's great. I think that's a lot of the rise of dbt. And by the way, I would say, like, Airflow has templating, like Jinja, baked into it very deeply too. So you can achieve very similar things with Airflow. Right? So Airflow, I would say, is a superset of what you can do with dbt in many ways. Right? So you can also have, like, all these other operators and your SQL operators and the Jinja templating. But I would say dbt does a better job at, you know, showing you exactly just the subset of what you need if all you care about is a single data warehouse and using just templated SQL.
I think, like, dbt is just very elegant in terms of, like, coordinating a lot of SQL very, very well. It solves that in a very good way. So now if you wanna build these higher level constructs, let's just take one, and I don't know which one is the best one, but we could take, like, the A/B testing framework. Right? So you can go and write an A/B testing framework in dbt today. Right? Like, with YAML, you could say, like, go and define your metrics, your metric groups, where you have, like, your subjects, right, your user IDs and all these metrics, and then what are your experiments and your exposure tables.
And you can go and build all of that. But then what you're building is really hard to reuse for a variety of reasons. One is that as you become kinda logic heavy, you have a lot more Jinja than SQL, and then that's just not very expressive. Like, SQL with a lot of Jinja in it, where every field list is a for loop on a collection of fields stored somewhere else, is just, like, very hard to read and reason about, and it's not expressive enough to do that well, to have, like, these very dynamic pipelines. And then there's the other core issue, which is, like, dbt doesn't really solve the fact that you're writing a certain dialect, so you're only solving the problem for people who use the same dialect as you. Or if you're trying to say, like, oh, you know, I'm gonna write something that works kinda across SQL dialects, then your template is gonna become even more overloaded with Jinja. Right? So you would not use something like LIMIT directly.
You would use some sort of, like, more intricately complex abstraction on top of it. So I think, like, dbt doesn't seem like the right place to build these higher level constructs. Right? You know, maybe it's a great place to do a reference implementation and say, like, I have this simple dbt project where I do sessionization. I'm gonna share this in a GitHub repo, and you can take it and reuse what you want and alter it to kinda fit your needs, which might be the first phase. Like, I think we need people doing that today so we can identify the patterns and share and talk about these things and have, like, a good library of reference implementations so people can compare and try things. So it's a good place to start. Spark, maybe? It seems like a better place to do some of these things: the way you can write these more dynamic pipelines, it can be more DRY, it's, like, more expressed as code. It seems more like a natural place for some of these frameworks, these higher level constructs, to be expressed.
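To illustrate the readability problem with logic-heavy templates mentioned above, here is a small sketch that uses the Jinja library directly (it assumes the `jinja2` package is installed; the metric list and table name are made up). Once every field list is a for loop over a collection defined somewhere else, the actual SQL is only visible after rendering.

```python
# Hypothetical example of Jinja-heavy SQL: the template is mostly control flow,
# and the query itself only appears once it is rendered. Requires jinja2.
from jinja2 import Template

METRICS = [
    {"name": "visits_7d", "expr": "SUM(visits_7d)"},
    {"name": "clicks_7d", "expr": "SUM(clicks_7d)"},
]

TEMPLATE = Template(
    """
SELECT
    user_id
    {%- for m in metrics %},
    {{ m.expr }} AS {{ m.name }}
    {%- endfor %}
FROM fct_user_activity
GROUP BY user_id
""".strip()
)

print(TEMPLATE.render(metrics=METRICS))
```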
It seems more like a natural place for some of these frameworks to be in a higher level construct to be expressed. I don't know. There's a real question there of, like, if you're trying to build these abstractions today, right, reusable kinda high level transformations, And I called them parametric pipelines or competition frameworks in the past. It's like kinda this idea of these higher level construct that solves certain, like, data engineering high level challenges. Like, what's the right tool set if you're trying to build, like, a 1 size fits all solution or reusable component that all companies can use to solve these problems?
I don't know. I think I'd use Spark as probably what I would try to use if I was to work on that. Does that work for everyone? Like, does everyone has a Spark cluster or is able to run a Spark workload? Does it make sense for people to get data out their warehouse to compute it somewhere else and send it back in the ELT heavy world? Maybe. I don't know. It's unclear.
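To ground the "parametric pipeline" idea, here is a minimal sketch of what a reusable, parameterized transformation could look like in PySpark: a sessionization step expressed as code, with the subject column, timestamp column, and session gap as parameters. This is an assumption-laden illustration, not a framework referenced in the conversation; all column names and defaults are hypothetical.

```python
# A sketch of a parameterized, reusable transformation ("parametric pipeline")
# expressed as PySpark code rather than templated SQL. Column names, the
# 30-minute gap default, and the session_id scheme are illustrative only.
from pyspark.sql import DataFrame, SparkSession, Window
from pyspark.sql import functions as F


def sessionize(events: DataFrame, user_col: str, ts_col: str,
               gap_seconds: int = 1800) -> DataFrame:
    """Assign a session_id per event: a new session starts whenever the gap
    since the same user's previous event exceeds gap_seconds."""
    w = Window.partitionBy(user_col).orderBy(ts_col)
    prev_ts = F.lag(F.col(ts_col).cast("long")).over(w)
    gap = F.col(ts_col).cast("long") - prev_ts
    is_new_session = (gap.isNull() | (gap > gap_seconds)).cast("int")
    return events.withColumn(
        "session_id",
        F.concat_ws("-", F.col(user_col),
                    F.sum(is_new_session).over(w).cast("string")),
    )


if __name__ == "__main__":
    spark = SparkSession.builder.appName("sessionize-demo").getOrCreate()
    events = spark.createDataFrame(
        [("u1", "2021-09-01 10:00:00"),
         ("u1", "2021-09-01 10:10:00"),
         ("u1", "2021-09-01 12:00:00")],
        ["user_id", "event_ts"],
    ).withColumn("event_ts", F.to_timestamp("event_ts"))
    sessionize(events, "user_id", "event_ts").show()
```

Because the logic lives in a function rather than a template, the same building block could in principle be tested, composed, and shared without rewriting SQL for each dialect, which is the trade-off being weighed here.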
[00:51:37] Unknown:
Just put everything into Delta Lake, and then problem solved.
[00:51:42] Unknown:
Put it into a lake and let people write MapReduce to solve it, and we're done. That's the way we used to do it a long time ago, like 20 years ago. Yeah. But, yeah, I mean, I would love to see a lot more of these reference implementations, people sharing, like, hey, this is how this team at this company solved it, built a core analysis framework on top of Airflow, and here's what you might wanna try to reuse, right, and alter and make sense of. And you could have more people sharing more of these things.
I think it would become clearer what the different variations on that topic are, and it would make it easier for someone to eventually solve that problem once and for all, for everyone.
[00:52:28] Unknown:
There are a number of other sort of hot take topics that it would be fun to dig into. Maybe we'll have to slate those for another interview to go a little deeper on them. So I guess just quickly, in your work participating and contributing to the data ecosystem, what have been some of the interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:50] Unknown:
So one thing that's interesting is to see how there are these cycles. If you've been around long enough in any given discipline, you'll see new people come in and take a fresh look at old problems without necessarily having the context of some of the failures of the past. I think there's a beauty in that. Right? The innocence of giving an old problem a completely new shot with a new environment and a new set of tools and solutions. The world has changed, so you don't have to think about the problem the same way, and you can get really creative, fresh ideas.
On the other side, there's the stupidity of missing out, kind of a teenage innocence of not being able to leverage previous experience, missing out on learning from previous achievements and struggles. So it's interesting to be that person, the one who points out technologies or methodologies that were born or existed 10, 15, 20 years ago and went pretty far. In some cases, those efforts are very notable; they didn't solve exactly the same set of problems in the same way, but sometimes they optimized for a different kind of outcome or a different facet of the problem, and did much better on that facet than what we're doing now. So it's been really interesting to watch everyone rebuild everything on new premises.
Like, you know, everything's gotta be in the cloud, everything is as a service, everything is pay as you go, and everything is distributed first. But in terms of user experience and some of the expressivity of how we solve the problem, there are shortcomings on that side as we optimize for new kinds of constraints.
[00:54:43] Unknown:
To close out the show, as the final question: in episode 3, when we took a crack at defining data engineering, we closed out with some predictions for the following years of what would come for the data engineering role. Many of those have actually been proven out pretty well, so you were very prescient in that. So now that we're recapping some of those ideas and the definition of data engineering, I'm interested in what your next set of predictions are for the upcoming years. I think there's a question around, like, how are data engineers gonna
[00:55:16] Unknown:
work with analytics engineers. And that's a similar question, I think, to how a DevOps specialist works with developers or engineers elsewhere in the company. You know, it's kind of the trade-off of vertically aligned versus horizontally aligned. But I think in the short term, we're gonna see a little bit of a struggle and tension in identifying the border between the two roles. And maybe the data engineer is gonna feel like they're herding slightly more reckless analytics engineers who wanna solve business problems first, are maybe oriented a little more short term, and don't care as much about performance and cost and naming conventions and best practices and hygiene. Right? So we're gonna see some tension form there until we can create all of the tooling and the rules and the guidelines and the best practices that are required to keep these people in check and make sure that they're not accumulating debt as they solve problems in their respective verticals.
[00:56:17] Unknown:
Alright. Well, thank you very much for taking the time again today to talk through the current definition of data engineering. I appreciate all the time and energy that you have put into contributing to the data ecosystem and your continued thought leadership, if you will. It's always a pleasure to have you on the show, and we'll definitely have to have you back again sometime. So thank you again for all of that, and I hope you enjoy the rest of your day.
It's been a pleasure, and I know there were a lot more questions on your list that we did not cover, so I'm happy to come back on the show at some point and push the conversation forward.
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Maxime Beauchemin and His Work
The Evolution of Data Engineering
Impact of Data Infrastructure Services
Democratization of Data
The Metrics Layer and Data Modeling
Visual Pipeline Builders and Low Code Solutions
Modern Data Governance
Demand for Data Engineers
Tool Selection and Integration
Modern Data Stack vs. Data Platform
Future of Templated SQL and Higher-Level Constructs
Lessons Learned in the Data Ecosystem
Predictions for the Future of Data Engineering