Summary
What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.
Transcript provided by CastSource
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin
Questions
- Introduction
- How did you get involved in the field of data engineering?
- How do you define data engineering and how has that changed in recent years?
- Do you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen?
- For someone who wants to get started in the field of data engineering what are some of the necessary skills?
- What do you see as the biggest challenges facing data engineers currently?
- At what scale does it become necessary to differentiate between someone who does data engineering vs data infrastructure and what are the differences in terms of skill set and problem domain?
- How much analytical knowledge is necessary for a typical data engineer?
- What are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality?
- You have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management and what are some of the most effective patterns for implementing that practice?
- How has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices?
- How do you see the role of data engineers evolving in the next few years?
Keep In Touch
- @mistercrunch on Twitter
- mistercrunch on GitHub
- Medium
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:13]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Go to www.dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. And to help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Maxime Beauchemin about what it means to be a data engineer. So Maxime, could you please introduce yourself?
[00:00:41] Maxime Beauchemin:
Yeah. So my name is Maxime Beauchemin. I think you did a fairly good job of pronouncing my name, which is pretty good. I'm a data engineer at Airbnb. I'm the main maintainer of Apache Airflow, which is a distributed batch-processing workflow engine, as well as Superset, which is a data visualization and exploration platform. Before Airbnb, I worked at Facebook, as well as Yahoo and Ubisoft. So I've been working with data for a very long time; I've been doing data engineering since way before the name existed.
[00:01:21] Tobias Macey:
And what is it about data management that first attracted you to the field?
[00:01:25] Maxime Beauchemin:
So I started working with data a very long time ago, sometime around 2001 or 2002. I was working at Ubisoft, and they started looking into analytics and building a data warehouse to organize all of their company information, mostly financial information: accounts receivable, accounts payable, cash flow. So there was a financial aspect to their data warehouse, and supply chain too. I got dragged into the project because there was a need for it and I knew the right people at the right time, so I started picking up a few books about data warehousing and started building data warehouses.
There was a different technology stack at the time: we were using Microsoft SQL Server and something called Hyperion Essbase, which is an OLAP database. I was writing a lot of SQL, a lot of stored procedures, and using some ETL tools of the era.
[00:02:25] Tobias Macey:
And how would you define the term data engineering, and how has your definition of that term evolved in recent years as it becomes more of a recognized discipline?
[00:02:37] Maxime Beauchemin:
Yeah. I feel like we almost coined the term data engineer while I was at Facebook; I believe that was around 2011, 2012. As I wrote in my blog post, I started at Facebook as a business intelligence engineer, that was my title. Then we started realizing that we were using very different tools, moving away from the traditional tools and the ETL tools, because there were just no tools out there that could manage the volume of data that Facebook had at the time. So we started building our own tools and developing new processes, and at that point we renamed ourselves, changed the name of the team, and opened positions to actually recruit for data engineers, which was a fairly new term at the time.
As to the definition of what a data engineer is or might be, I feel like we're a special breed of software engineers who are specialized in data infrastructure, data warehousing, data modeling, data crunching, and metadata management. That's a fairly wide description, but what's important to note is that we're basically software engineers with a deep focus on data.
[00:04:02] Tobias Macey:
So it's kind of analogous to the idea of a full stack engineer, but applied to the particular realm of analytics and data.
[00:04:09] Maxime Beauchemin:
Exactly.
[00:04:10] Tobias Macey:
Yeah. I know that's another term that has come into common use in recent years, as the idea of what makes a software engineer a software engineer has continued to evolve and the complexity of our systems has continued to grow. So the problem domain that one person needs to be capable of handling is growing larger as the expected delivery times are becoming shorter.
[00:04:31] Maxime Beauchemin:
Yes. On one side there is some sort of specialization, but we also expect people to be wider and to cover more and more related fields and specialties. So it's really hard for someone who is very specialized and knows only one aspect of software engineering or data engineering to be useful at a company. It's always a balance of going deep in some areas, but staying wide and being able to do full stack as much as possible too. One other fact is the idea that, in data engineering as in software engineering, the only constant is change. Things are changing very fast. The Hadoop ecosystem and the data landscape are definitely still diverging; we don't see a lot of convergence. There are more databases now than there ever were.
There are more pieces of technology, platforms, frameworks, and libraries, so there's this explosion of knowledge and code that we're essentially trying to stay on top of as data engineers.
[00:05:35] Tobias Macey:
And do you think that the DevOps movement that has come to prominence over the past few years has had any impact on the discipline and the concept of data engineering as a whole? And I'm wondering if there are any particular kinds of crossover that you've seen, whether in philosophy or in the tooling that's available.
[00:05:53] Maxime Beauchemin:
Right. So I think in terms of tooling, companies share all these pieces of infrastructure. When you talk about DevOps, I'm not sure exactly which part of infrastructure you'd be talking about, but let's say one component might be things like continuous integration, unit testing, and automating the work of developers; data engineers are definitely just as interested in that as people in DevOps or any other type of software engineering. One place where we see a historic divergence, where things are starting to change a little, is the data that we work with. A DevOps or ops group will typically have an ODS, an operational data store, with time-series databases: at Airbnb we use something called Datadog, and at Facebook there's something called ODS that was all about real-time metrics of machines and performance metrics of anything related to development.
On the other side, the data engineering side, we've typically been very focused on the data warehouse itself: in a lot of cases slower systems, 90-day analytics, batch processes. Historically there have been different database technologies on both sides. Now we're starting to see more and more data engineers get outside of batch processing and into real time, with technologies like Spark Streaming and Druid.
And slowly we see DevOps people using the same databases as data engineers. One example of that at Airbnb would be the use of the Druid.io database, which is a real-time, distributed column store that works very well for real-time analytics, or just fast analytics: fast-paced, big scans, being able to crunch a lot of data. So both the people in DevOps and the people doing more classic analytics and data science use this database.
[00:08:01] Tobias Macey:
And for somebody who is working in data engineering, how much interface do they have with the actual management of the underlying infrastructure, versus having an operational team who provides those servers and deploys the services, while the data management team is responsible for building the pipelines and tuning the actual services running on those instances?
[00:08:22] Maxime Beauchemin:
Right. So I think it's a factor of how big the company or the data team actually is. We definitely see, at some point in time, some sort of specialization, where at the very beginning, as maybe the first data engineer or the first software engineer with a data focus at a company, your first task is going to be around setting up infrastructure: getting things like Kafka and Hadoop and Spark up and running. If there are only a few data engineers on the team, then there's probably going to be a distribution of tasks where people do a little bit of both, and over time things will shake out in a certain way and people will specialize.
As to whether deploying, managing, and maintaining the data infrastructure is the role of the data engineer: I would argue that it's not the case in a bigger company, where other people who are perhaps a little more DevOps-y, or a little more focused on infrastructure, would take on those tasks. It's a very different skill set in a lot of cases. So, yeah, I think over time data engineers will tend to focus more on things like the data warehouse itself, all the data plumbing, the data structures, and getting consensus as to how the data should be organized for the company.
[00:09:52] Tobias Macey:
And for somebody who's interested in getting started in the field of data engineering, what are some of the necessary skills that they should possess, and what are some of the most common backgrounds that you see those people coming from?
[00:10:06] Maxime Beauchemin:
So there are different types of people. There are the dinosaurs like me, who came from the business intelligence and more traditional, classic data warehousing field. If you're coming from that background, you're going to need to be able to code; you'll need to develop the skills around source version control and writing code, and start changing your processes so that your daily tasks look a little more like a classic software engineer's. So those people definitely have to recycle their skills. The people coming from there are usually very strong at things like data modeling, ETL, and performance tuning, and they typically know very well how databases work; perhaps they were DBAs in their previous life. And then there are all the people coming from other places, like new grads out of computer science, and people coming from the field of data science who realize that they're perhaps more interested in building persistent structures.
Right? People who were in data science but realize they're more interested in doing engineering-type work. I would say patience is also probably very important, because a lot of batch-processing data pipelines can be pretty cumbersome, and there's definitely a part of data engineering that is about plumbing: managing the pipes, and sometimes all hell breaks loose and you've got to get in there and fix things. That's one aspect of the job that's a little less glamorous, the plumber aspect. But there are all sorts of other aspects. In data engineering, as in any type of software engineering, there are always tons of opportunities to create abstractions and tooling to automate your own work, and then build services or systems or frameworks that do the things you would have done manually as a data engineer before. So for the data engineers out there who are getting bored with the job, or who think data engineering is a lot of data pipeline writing and not very exciting: there's always an opportunity to take the things that are redundant and build services and systems around them. That's what I've been doing over the past few years with Airflow, and it's been super rewarding.
[00:12:35] Tobias Macey:
Yeah. In my other podcast, we had a nice long conversation about the work that you're doing on Airflow, and since then I know that it's become an Apache-incubated project. I'm wondering if you can touch briefly on how it has evolved since the last time we spoke. I know that it had been using Celery as the execution layer; I don't know if that's still the case.
[00:12:59] Maxime Beauchemin:
It is still the case. And yeah, I can talk a little bit about the Apache Software Foundation, how it helps a project grow, but also how it really changes the dynamics of the project. I've been through that for the first time recently. Airflow started as an Airbnb project; we had full control over it and would release whenever we were ready. At some point we became interested in joining the Apache Software Foundation, because a lot of companies consider it a risk to start running software that they don't have any guaranteed control over. What the Apache Software Foundation does as you start the incubation process is, first, you have to donate your code, your trademarks, and your intellectual property to the foundation, which makes sure that people are not going to change their mind about sharing. From that point on, the Apache Software Foundation owns the code and the trademarks, and has some responsibility around the legal aspects of the project. If someone were to try to sue us because, say, the name Airflow had been used by some other company, the Apache Software Foundation would probably manage that litigation. This is as much as I know about the Apache Software Foundation, so maybe some of the things I say are not 100 percent right, but hopefully they mostly are. Part of the process is also that the foundation makes sure there are different people from different companies involved jointly, and they create what they call a PMC, a project management committee, which I believe has to be made up of five or six or perhaps more committers who have the same level of involvement with the project.
That makes sure the project survives its original sponsor. For example, I don't know the whole story with Apache Storm, but at some point Twitter had developed Storm and a lot of people were using it. Then at some point they said: okay, we're done with Storm; now we're going to build this other thing called Heron, and we're not interested in contributing any more code to Storm. At that point they gave the software to the Apache Software Foundation, and people from other companies created a little committee and managed the project jointly. From a developer standpoint it's also pretty interesting, because if Airflow were owned by Airbnb and something changed about my employment status with Airbnb, I might not be able to work on Airflow anymore and might have to fork the project, or something of that nature. But with the Apache Software Foundation, I know that I'll always be an Airflow committer and have the same relationship with the project regardless of my employment, which is cool. There's also a brand associated with it: companies have a lot more trust in installing or running an Apache project than they would have in running a piece of software developed at some other company. A few more things about the Apache Software Foundation: joining Apache has definitely slowed down the pace of change for the project, and I think that's part of a project maturing, to slow down and focus on things like stability and making sure you have a solid product that works for everyone. So the pace of development slows down a little, but the quality of the releases goes up. It's been challenging for the Airflow project to come up with a release; I believe the last release was a little more than six months ago, and people in the community are working on the next release, but it's taking a while, for different reasons.
Now we have all sorts of code coming from different companies, all in that same melting pot, and we have to make it work for everyone. That can be challenging, especially for the first release, but I'm hoping that with time and with the community we're going to arrive at a steady release process pretty soon.
[00:17:15] Tobias Macey:
Yeah, it definitely seems that having the stamp of the Apache Software Foundation is a boon to a project. And like you said, it does connote a certain amount of maturity and stability. It's interesting how many of the widely used projects in the big data and data engineering space have ended up under the umbrella of the Apache organization.
[00:17:30] Maxime Beauchemin:
Right, especially considering all the restrictions that come with it and the bureaucratic process that goes along with going Apache. We have to use Jira, for instance, and there were some limitations on how we could use GitHub, and there are these old mailing lists that we need to use for legal reasons. So we're unable to use GitHub issues; we have to use Jira instead. At the beginning of the project, it was a little frustrating to have to step back and use some of the Apache infrastructure, which is really a decade old at this point. Sometimes you do a Google search and end up on some Apache mailing-list website where it looks like the CSS is from the 1990s. But all of that said, I think it was a really good choice for Airflow, and it really contributed to the project's growth, because it attracted very talented committers and developers who thought, hey, I can be an Apache committer. Some of them were Apache committers before on other projects, and they can really get involved and trust that things are going somewhere. If I wanted to get involved with, say, Luigi, which is a similar product to Airflow in some ways, I would have to negotiate with Spotify and somehow try to get them to let me join the project, and there might not be a process for that, or any guarantees. With an Apache project, there's always a way for a company to get involved and become decision makers on the project.
[00:19:05] Tobias Macey:
What do you see as some of the biggest challenges facing data engineers currently in the discipline of data engineering?
[00:19:12] Maxime Beauchemin:
One of the new things is all the real-time data: Spark Streaming, Samza, databases like Druid. That's a new stack for a lot of us, a new type of technology and a new set of constraints, so people are adapting to things like Apache Beam, which is based on the Google Dataflow paper. Not only do you have to know the batch-processing stack, which is fairly complex, with things like MapReduce and Hive and schedulers like Airflow; now on top of that you have to go and learn the real-time stack. Or perhaps, in a bigger organization, different people will focus on those different aspects of data engineering. What else? Working with data science has always been a challenge, and there's always this push and pull as to whether a data engineer should be building horizontal infrastructure or going to work closely with a product team. That would mean someone at Airbnb might be the data engineer assigned to a specific product team, say the one working on the new trips feature we have, and build the data structures for them. The data engineer is always going to sit at that pivotal point between horizontal and vertical work, and sometimes it's hard to please everyone from that position, with all these forces pulling you in different directions. What else is challenging? Just keeping up with the diverging data landscape and ecosystem has always been a challenge. There's a new flavor of database every day, and so many packages and libraries and frameworks to stay on top of. That's always been exciting, but challenging too.
If you stay too long in the same place, or on the same stack, you can get pigeonholed into a certain role or a certain stack. So those are some of the challenges I see.
The main challenge for a data engineer, I think, is operational creep. It's true of a lot of careers in information management and software engineering in general, but it's really easy over time to build more and more things that you have to maintain, and to go from a place where you start out spending perhaps 90% of your time building new things. Then over time, as you stay with the company for one year, two years, three years, it's really easy to realize at some point that you're spending 80 or 90% of your time just fixing stuff and fighting fires, and not adding a lot of value, because you're just the person who keeps things running. So it's important for people who identify this to go and invest in tooling, in refactoring, and in paying off that technical debt, so that they can stay challenged and keep creating, as opposed to just fixing stuff they've built in the past.
Tobias Macey:
And you touched briefly on this, but how much analytical knowledge do you think is necessary for somebody who's working as a data engineer?
Maxime Beauchemin:
So I think you definitely need to have this analytical instinct, which is vital for data scientists to have. People have called that product sense too. I'm not sure if that's what you're talking about here, but as an analyst or a data scientist, you absolutely need to have that product sense and that intuition: being really curious and prone to digging in until you find the answer you're looking for. So I would say that is critical for analysts and data scientists, and really important for data engineers too. Though for a data engineer there's that other urge, the urge to build infrastructure and to build things that are there to stay, which you have to balance against the more ephemeral analytical urges you might have.
[00:23:29] Tobias Macey:
And how much statistical analysis, or understanding of the different mathematical and probabilistic principles, is necessary for being able to properly manipulate the data so that it's in a shape that's accessible to the people doing the end analysis on it?
[00:23:47] Maxime Beauchemin:
I think we'd call this part data modeling, which is very different from statistical analysis or knowledge about stats. So let me talk a little bit about data modeling. Data modeling is not a perfect science; all sorts of books have been written on it. There are the concepts of star schemas and snowflake schemas, and Ralph Kimball and Bill Inmon are people who wrote about that stuff in the nineties. Some aspects of those books are still very relevant today; some aspects might be a little less relevant than they used to be. But data modeling is a core skill for a data engineer: how are you going to structure your tables and your partitions, and where are you going to normalize and denormalize data in the warehouse?
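One classic device from the Kimball-style modeling discussed here is the type 2 slowly changing dimension: each dimension row carries a validity window, so facts can be joined to attributes as they were on the transaction date, not just the latest values. A minimal sketch (the table and column names are hypothetical, not from the interview):

```python
import sqlite3

# Type 2 slowly changing dimension: dim_customer keeps one row per
# attribute "version", bounded by a [valid_from, valid_to] window.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER, tier TEXT,
    valid_from TEXT, valid_to TEXT   -- validity window of this row
);
CREATE TABLE fact_transaction (
    txn_id INTEGER, customer_id INTEGER, txn_date TEXT, amount REAL
);
INSERT INTO dim_customer VALUES
    (1, 'basic',   '2016-01-01', '2016-06-30'),
    (1, 'premium', '2016-07-01', '9999-12-31');
INSERT INTO fact_transaction VALUES
    (100, 1, '2016-03-15', 20.0),   -- while the customer was 'basic'
    (101, 1, '2016-09-01', 50.0);   -- after upgrading to 'premium'
""")

# Join each fact to the dimension row whose window contains the txn date,
# answering "what were the customer's attributes at transaction time?"
rows = conn.execute("""
SELECT f.txn_id, f.amount, d.tier
FROM fact_transaction f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND f.txn_date BETWEEN d.valid_from AND d.valid_to
ORDER BY f.txn_id
""").fetchall()
print(rows)
```

The point-in-time join is what distinguishes this from an OLTP-style lookup, which would only ever return the customer's current tier.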
Are you going to be able to retrieve, say, the attributes of the customer at the time of the transaction, versus the most recent attributes of the customer? How do you model your data so you can ask all these questions? Data modeling for analytical purposes is very different from data modeling for OLTP-type applications; OLTP is the classical way of structuring databases for building simple software. So data modeling, I would say, is very important. Now, as to knowledge about stats and statistical analysis, I would say not very much is needed. I would even argue, and I'll probably outrage people with this, that it's not as important as one might think for data science either, because a lot of what we do in analytics is just counting things: trying to figure out whether we're doing better than we used to, what the growth rate is. The stats that I worked on more recently that are very important, but that we have abstracted out, are in the experimentation frameworks that pretty much all companies are working on. Most modern companies do experimentation with their users, and usually people call that an A/B testing framework. It's really important for a company, especially a web company, to be able to run a lot of experiments.
In those experiments, usually you'll have different treatments and a control, and you want to see whether the changes in behavior are statistically significant. At Airbnb, I was involved in building some of the data pipelines for our A/B testing framework, named ERF. I believe there are papers out there, and probably some videos and presentations that some of my colleagues have done over the past year or two. That part of the stats has been commoditized at Airbnb: now you can go and create an experiment, deploy it, and consume the results. Of course, you need to know what a p-value is and what confidence intervals are, from a consumption perspective.
But you don't necessarily need to know and understand all the mathematics behind it. So I would say the stats work in analytics is, in a lot of cases, a little overrated. In some cases it's really important for data science, but I think it's often overrated, and it's also often very well abstracted out, so you can easily do some very complex things using very simple libraries and functions with clear APIs.
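The "well abstracted out" point can be made concrete: the significance test behind a typical A/B readout reduces to a few lines behind a simple function. A stdlib-only sketch of a two-proportion z-test (the experiment numbers are invented, and real frameworks like the one described add much more, such as variance reduction and multiple-comparison handling):

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, p_value) for a two-sided test comparing two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Control: 1,000 users, 100 conversions; treatment: 1,000 users, 130 conversions
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z={z:.2f}, p={p:.4f}")  # p < 0.05, so the lift is statistically significant
```

A consumer of an experimentation framework only needs to read the p-value and confidence interval this kind of function reports, which is exactly the division of labor described above.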
[00:27:29] Tobias Macey:
And to make sure that the data that you're working with is of sufficient quality, what are some of the considerations that you need to be aware of when you're establishing new data sources?
[00:27:38] Maxime Beauchemin:
So data is kind of a jungle. There's all sorts of data out there, and as a data engineer, I didn't talk before about that workload, or that burden, which is data integration. Pretty often you need to go and fetch data coming from different partners and different companies, and you need to integrate your referentials, like your list of users or your list of accounts and transactions, with external service providers. I think that's as challenging as it's ever been. We thought at some point that the B2B data flow would get fixed over time, that it would become easy, that you could have simple APIs to sync your data with, say, Salesforce or one of these service providers. But data integration is as challenging as it's ever been. Going back to your question about data quality: in some cases we want to get data from other systems, and we're concerned about data quality because we don't have control over the instrumentation, over how the data is generated. The first thing is probably to make sure you get things right where you're actually generating the data. Most web companies will have some sort of instrumentation framework, meaning that engineers who want to track certain actions on a website or on mobile will have a way to put tokens and little trackers on the site: when someone hits the booking button at Airbnb, make sure we track that. And I think it's really important to have what we call schema enforcement as far upstream as possible. That means that as you generate the data, you want it to come to life, to be originated, with as much quality as possible, with all the dimensions and all the metrics that you need, from that moment on.
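Upstream schema enforcement can be as simple as rejecting events that don't match a declared schema before they ever enter the pipeline. A minimal sketch; the event shape, field names, and `booking_click` event are hypothetical, not Airbnb's actual instrumentation:

```python
# Declared schema: every instrumentation event must carry these fields,
# with these types, at the moment it is generated.
REQUIRED_FIELDS = {
    "event_name": str,
    "user_id": int,
    "timestamp": float,
}

class SchemaViolation(ValueError):
    """Raised when an event fails validation at generation time."""

def validate_event(event: dict) -> dict:
    """Reject malformed events before they enter the pipeline."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            raise SchemaViolation(f"missing required field: {field}")
        if not isinstance(event[field], expected_type):
            raise SchemaViolation(
                f"{field} should be {expected_type.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    return event

good = validate_event(
    {"event_name": "booking_click", "user_id": 42, "timestamp": 1.7e9}
)
try:
    validate_event({"event_name": "booking_click"})  # missing fields: rejected
except SchemaViolation as exc:
    print(exc)
```

Enforcing this at the producer means downstream pipelines can assume the dimensions and metrics are present, which is the point made above about data being "originated with quality."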
And that can be really challenging to do when you integrate with external sources.
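The idea of enforcing a schema at the point of origin can be sketched as a minimal validator. This is an illustrative example, not Airbnb's actual framework; the event fields and schema shape are assumptions:

```python
# Minimal sketch of upstream schema enforcement: reject an event at the
# moment it is generated rather than discovering bad data downstream.

REQUIRED_SCHEMA = {      # hypothetical schema for a booking event
    "event": str,
    "user_id": int,
    "listing_id": int,
    "ts": float,
}

def validate_event(event: dict) -> list:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    for field, expected_type in REQUIRED_SCHEMA.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(
                f"bad type for {field}: {type(event[field]).__name__}"
            )
    return errors

good = {"event": "booking", "user_id": 42, "listing_id": 7, "ts": 1.5}
bad = {"event": "booking", "user_id": "42"}   # wrong type, missing fields
```

The point is that `bad` gets rejected (or quarantined) before it ever lands in the warehouse, rather than being discovered weeks later in a dashboard.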
If you're fetching data from the Salesforce API, or, in our case, we use Zendesk, which is a ticketing system provider, sometimes you don't know what to expect. They might change their API overnight without letting you know, or a certain referential that you have some very specific rules on might change over time. So in batch processing and in stream processing, you can embed data quality checks. That means as you write your pipeline, you might have an idea of problems that may occur in the future and set up alerts and breakpoints: if certain variables go over a threshold, don't load that data into production, and if there's a value I don't recognize for a certain field, send an alert. That's very much case by case, and in a lot of cases it's about developing an immune system over time. At first you work with the API, you take for granted that it's not going to change, you build your pipeline, and one day something happens: there's a bug of some sort, or the data is not what it should be. Then you go and add more checks and develop some mechanism to alert, or to prevent that data from making it to production before someone looks into it and fixes the issues.
One pattern we use a lot at Airbnb on Airflow is that, as we do batch processing, very often we'll stage the data. That means we bring the data into a staging area. You can think of it like a warehouse: the staging area is where all the trucks drop things off before they get sorted out and brought into the warehouse. So we have this persistent staging area where we load the data, and then we'll run a set of checks on it. So: stage, check, and then exchange. The idea of exchanging is to bring the data into production, into a validated space. With Airflow we made a little SubDAG for this, a construct that we reuse everywhere, and it's exactly this stage, check, and exchange pattern. Over time, the check step of that pipeline grows more complex as we identify patterns of data quality issues. We might say: if there's a week-over-week change of more than 30%, fail the pipeline and send an email to the owner of the pipeline. And we have tons of these checks all over the place at Airbnb.
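The stage, check, exchange flow with a week-over-week threshold can be sketched in a few lines. This is a toy illustration of the pattern, not the Airbnb SubDAG itself; the function names, table name, and 30% threshold are placeholders:

```python
# Sketch of the stage / check / exchange pattern: land data in a staging
# area, run validations, and only promote it to production if they pass.

def week_over_week_check(current_count: int, previous_count: int,
                         max_change: float = 0.30) -> bool:
    """Fail if row counts swing more than max_change week over week."""
    if previous_count == 0:
        return current_count == 0
    change = abs(current_count - previous_count) / previous_count
    return change <= max_change

def stage_check_exchange(rows, previous_count: int, table: str) -> dict:
    staging = list(rows)                       # 1. stage: land raw data
    if not week_over_week_check(len(staging), previous_count):
        # 2. check failed: keep bad data out of production, alert a human
        raise ValueError(f"{table}: week-over-week change over threshold")
    return {"table": table, "rows": staging}   # 3. exchange into production
```

In a real Airflow DAG each step would be its own task, so a failed check halts the pipeline and the staged data stays quarantined until someone investigates.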
[00:32:33] Unknown:
So would you say that's fairly analogous to the idea of unit testing and continuous integration for standard software development, but applied to data?
[00:32:41] Unknown:
I think it's fairly different, because it's reactive and it's production code. If you look at continuous integration, you're trying to prevent bad logic or changes from making it into production. If you do your job well with your unit tests and your continuous integration, you're not going to deploy code that would result in problems. Data quality checks in data warehousing are more of a reactive thing: we have the logic, we don't control the ingredients coming into the recipe, but if we detect certain problems, we won't bring that data into production. So instead of code that does or doesn't make it into production, it's data that does or doesn't make it into production. In that sense it is analogous in some ways. I can talk a little more about unit testing, and that part of continuous integration, for data warehousing too. The state of that is not really great, and a lot of people ask what the best practice for Airflow is: how do you run your unit tests, and how do you make sure that when you launch a pipeline it's not going to break production? The answer is that there are probably as many ways to test and validate a pipeline as there are engineers and pipelines out there. We have different ways of doing it. But one thing that's clear is that you cannot really have an at-scale development environment. When I was working at Facebook, we were close to an exabyte in the data warehouse, and you could not just say we're going to have a dev instance of the warehouse that's also going to be an exabyte. So very often, you need to create some sort of microcosm of your data warehouse.
Sometimes that might just be pushing some sample data through and making sure there are no fatal errors, and making sure that, say, all the rows that made it into the pipeline actually made it through or got summarized in the end. The reality is, data engineers don't systematically test a lot of what they do. They'll probably test the little piece of pipeline they work on. That means they might run that piece of the pipeline and divert it into an alternate temporary table that they can check. But once it's checked, they'll pipe it right back in and push it into production, just because doing things right, or by the book, would be super prohibitive in terms of infrastructure cost.
And when you weigh how much more confidence you're going to get against the amount of work that's required, sometimes it's not worth it. So it's really case by case, and there are all sorts of techniques out there. I should probably write a blog post on the subject one day. But the reality is that data engineers are often less thorough than software developers, just because of the costs and the challenges involved.
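The sample-data smoke test described above, pushing a small sample through a transformation and reconciling row counts, can be sketched as follows. The aggregation and the function names here are hypothetical stand-ins for a real pipeline step:

```python
# Sketch: run a small sample through a transformation and reconcile the
# totals, a cheap smoke test where a full dev copy of the warehouse
# would be prohibitively expensive.

def summarize_by_key(rows):
    """Toy aggregation standing in for a real pipeline step."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

def totals_reconcile(input_rows, summary) -> bool:
    """Check that every input row is reflected in the summarized output."""
    return sum(v for _, v in input_rows) == sum(summary.values())

sample = [("a", 1), ("b", 2), ("a", 3)]   # tiny microcosm of the real feed
summary = summarize_by_key(sample)
```

In practice this is the "divert into a temporary table and eyeball it" workflow he describes, made slightly more systematic: if the sample's totals don't reconcile, the change never gets piped back into production.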
[00:35:46] Unknown:
And have you seen any points where the work done by data engineers and managers of data infrastructure has bled back into more mainstream software development and systems engineering?
[00:35:57] Unknown:
One aspect of this is that, the same way data engineers are doing more software engineering, it's also true that software engineers do a fair amount of data-oriented work. In places like Airbnb and Facebook, or any modern web company, a software engineer needs to be able to set up an A/B test, an experiment, so they can measure how a change they're making to a website is going to influence the different metrics they're trying to move. From that perspective, software engineers are doing some of the things that data engineers do: writing pipelines, or instrumenting a metric.
Building dashboards and their own little pipelines is not uncommon for a software engineer working on a product. So how do you see the role of data engineers evolving in the next few years? I see the role becoming a little more formal and balanced in regard to data science. A lot of companies have recognized that they need data scientists in order to compete in their business, but I don't think all companies have identified that they need a data engineer to organize the data structures and pipelines that the data scientists are going to source from. And one fact is that data scientists are pretty bad at building infrastructure and data pipelines. They'll build pipelines that are very brittle, that will fail over time, that are not manageable.
Typically the data modeling is not done right, and there's a lot of throwaway and redundant work, where every data scientist goes back to the raw data, does their own analysis, applies their own transformations, and then you end up with big problems: the same metric, or a metric with the same name, having very different values depending on who computed it. Companies that have skipped this phase of data warehousing and data engineering, of getting serious about their data structures and how they organize their data and metadata, are suffering, because the data scientists are doing that work, they're not that great at it, and there's no consistency in the way they look at their numbers. You get into this situation where no one can trust any number, where the CEO says: I don't even want to see the dashboard, because I know it's wrong, because it's different from that other dashboard. I also see data engineers starting to do more abstracted work. Beyond what a data engineer does today, which is building pipelines and building the warehouse, I see data engineers starting to build services and frameworks that other people within the company can use. An obvious example is the A/B testing framework. Originally, for every experiment, a data scientist would go and do all the stats work necessary to figure out whether that experiment was successful, or how it moved different metrics. Now the data engineer, along with software engineers and data scientists, will together build a reusable framework that can be used for all experiments.
But that's just one application of the reusable components that data engineers can build. There are other ideas, like cohort analysis frameworks, aggregation frameworks, all sorts of computation frameworks that abstract out the common data work done at different companies. As for the forces I was talking about earlier: over the next few years, are data engineers going to align more with their verticals, or are they going to align closer to their infrastructure peers and work on more horizontal products? That's still unclear to me. My draw and my personal interest is to go toward more horizontal than vertical, but I'm not sure which way the industry is going to go. And one thing I truly hope we start to see is some sort of convergence in the data ecosystem. Can we all agree that we should all use Spark Streaming and not Samza? Can we get what happened with Kafka, where people converged and said Kafka is the tool we're going to use, so you won't have to choose between five or six different tools or frameworks? Hopefully there's convergence in that area, where people agree on how they should build and maintain their infrastructure.
And when you think about the data infrastructure work, which is installing and maintaining Hadoop clusters, Druid clusters, a Spark Streaming cluster, I'm really hoping that some cloud providers or service providers commoditize that, so that every company does not need to go and install and maintain all of that infrastructure. It's just inefficient. So that might be a trend in the future: we see some really good service providers where you can stand up a Druid cluster in 10 minutes and get value from it right away.
[00:41:18] Unknown:
One of the things that you alluded to briefly is the idea of data scientists writing some analytical code that then needs to be operationalized. So how much of that responsibility falls on data engineering, versus a more full-fledged software engineer who needs to take that analytics code and write it as a scalable piece of software?
[00:41:39] Unknown:
Yeah, that's a very good question: whose responsibility is it to make the data scientists obsolete? Is it the data scientists' job to automate their own work? I don't think it's their natural draw, since they're not necessarily engineers and their skills don't necessarily align that way. At both Airbnb and Facebook, my two most recent companies, there is some form of a machine learning infrastructure type team. At Facebook it was called data science infrastructure, and at Airbnb I believe it's called either machine learning infrastructure or data science infrastructure as well. And that's a slightly different role. Are data engineers the right people to do this? We're getting into one of these gray zones where we're looking for a unicorn type of person who has all those skills, in the middle of that Venn diagram of data science, software engineering, and data engineering, and those people are extremely rare. If you find one, make sure you keep them for a long time, because it's almost impossible to find these people. And it's hard otherwise: if you put one data scientist, one data engineer, and one software engineer in a room and try to make them work together, the results may vary. But I think this idea of data science infrastructure is super interesting, probably one of the most exciting areas right now, where we can commoditize data science. And I believe in a lot of cases, as I said earlier, data science's value is not necessarily one really intelligent machine learning model to solve one specific problem. I believe the real value is doing some basic machine learning across the board.
It's not necessarily about doing something very complex in terms of the math or the libraries, or very complex data science work. Really often it's just about having a little bit of machine learning applied in the right place, across the board, and commoditizing that. So hopefully we're going to see some of that. Right now I think it's only a handful of companies doing it, and they're kind of struggling. There aren't really companies, services, or libraries provided to do that sort of thing.
[00:44:07] Unknown:
So are there any other topics that you think that we should cover before we close out the show?
[00:44:11] Unknown:
No, I think we covered a lot of things. So, yeah, I think we're pretty good.
[00:44:17] Unknown:
Yeah. It's definitely a large space with a lot of different concerns, but I think it's important to lay the groundwork of: when you say data engineer, what is it that you're really talking about? And I think that we did a pretty good job of at least coming up with a good approximation of
[00:44:31] Unknown:
that. Right. At least finding the middle point, the core of what a data engineer is, and then from that point you decide how far out you want to allow people to go. But the core, to me, is right around the data warehouse and the pipelines that organize the company's data and metadata.
[00:44:52] Unknown:
Alright. Well, for anybody who wants to keep in touch with you or follow what you're up to, I'll add your preferred contact information to the show notes. I just want to say I really appreciate you taking the time out of your day to share your thoughts about data engineering and your experiences working with it, and I hope you enjoy the rest of your evening. Thank you very much. That was an honor and a pleasure to be on the show.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Go to www.dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. And to help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Maxime Beauchemin about what it means to be a data engineer. So, Maxime, could you please introduce yourself?
[00:00:41] Unknown:
Yeah. So, my name is Maxime Beauchemin. I think you did a fairly good job at pronouncing my name, which is pretty good. I'm a data engineer at Airbnb. I'm the main maintainer of Apache Airflow, which is a distributed batch processing workflow engine, as well as Superset, which is a data visualization and exploration platform. Before Airbnb, I worked at Facebook, as well as Yahoo and Ubisoft. So I've been working with data for a very long time; I've been doing data engineering since way before the name existed.
[00:01:21] Unknown:
And what is it about data management that first attracted you?
[00:01:25] Unknown:
So I started working with data a very long time ago, sometime around 2001 or 2002. I was working at Ubisoft, and they started looking into analytics and building a data warehouse to organize all of the company's information, mostly financial information: accounts receivable, accounts payable, cash flow. So there was really a financial aspect to their data warehouse, and supply chain as well. I got dragged into the project because there was a need for it and I knew the right people at the right time, so I started picking up a few books about data warehousing and started building data warehouses.
The technology stack was different at the time: we were using Microsoft SQL Server and something called Hyperion Essbase, which is an OLAP database. I was writing a lot of SQL, a lot of stored procedures,
[00:02:25] Unknown:
and some ETL tools at the time. And how would you define the term data engineering, and how has your definition of that term evolved in recent years as it becomes more of a recognized discipline?
[00:02:37] Unknown:
Yeah, I feel like we almost coined the term data engineer while I was at Facebook; I believe that was around 2011 or 2012. As I wrote in my blog post, I started at Facebook as a business intelligence engineer; that was my title. Then we started realizing that we were using very different tools, moving away from traditional ETL tools, because there were just no tools out there that could manage the volume of data that Facebook had at the time. So we started building our own tools and developing new processes, and at that point we renamed ourselves, changed the name of the team, and opened positions to recruit for data engineers, which was a fairly new term at the time.
As to the definition of what a data engineer is or might be, I feel like we're a special breed of software engineers specialized in data infrastructure, data warehousing, data modeling, data crunching, and metadata management. That's a fairly wide description. But what's important to note is that we're basically software engineers with a deep focus on data.
[00:04:02] Unknown:
So it's kind of analogous to the idea of a full stack engineer, but applied to the particular realm of analytics and data.
[00:04:09] Unknown:
Exactly.
[00:04:10] Unknown:
Yeah. I know that that's another term that has come into common use in recent years as the idea of what makes a software engineer a software engineer has continued to evolve. And the complexity of our systems has continued to grow. So the problem domain that 1 person needs to be capable of handling is growing larger as
[00:04:31] Unknown:
the expected delivery times are becoming shorter. Yes. On one side there is some sort of specialization, but we also expect people to be wider and cover more and more related specialties. So it's really hard for someone who is very specialized and knows only one aspect of software engineering or data engineering to be useful at a company. It's always a balance of going deep in some areas while staying wide and being able to go full stack as much as possible. One other fact is that in data engineering, as in software engineering, the only constant is change. Things are changing very fast. The Hadoop ecosystem and the data landscape are definitely still diverging; we don't see a lot of convergence. There are more databases now than there ever were.
There are more pieces of technology, platforms, frameworks, and libraries. So there's this explosion of knowledge and code that we're essentially trying to stay on top of as data engineers.
[00:05:35] Unknown:
And do you think that the DevOps movement that has come to prominence over the past few years has had any impact on the discipline and the concept of data engineering as a whole? I'm wondering if there are any particular kinds of crossover that you've seen, whether in philosophy or in the tooling that's available.
[00:05:53] Unknown:
Right. So, in terms of tooling, companies share all these pieces of infrastructure. When you talk about DevOps, I'm not sure exactly which part of the infrastructure you mean, but one component might be things like continuous integration, unit testing, automating the work of developers, and data engineers are definitely just as interested in that as people in DevOps or any type of software engineering. One place where we see a historic divergence, where things are starting to change a little bit, is the data that we work with. A DevOps or ops group will typically have an ODS, an operational data store, with time series databases. At Airbnb we use something called Datadog; at Facebook there was something called ODS that was all about real-time metrics of machines and performance metrics of anything related to development.
On the other side, the data engineering side, we've typically been very focused on the data warehouse itself: in a lot of cases slower systems, 90-day analytics, batch processes. And historically it's been different database technologies on both sides. We're starting to see more and more data engineers get outside of batch processing and into real time, with technologies like Spark Streaming and Druid.
And slowly we see DevOps people using the same databases as data engineers. One example of that at Airbnb would be the use of the Druid.io database, which is a real-time, distributed column store that works very well for real-time analytics, or just fast analytics: fast-paced, big scans, being able to crunch a lot of data. So both the people in DevOps and the people doing more classic analytics and data science use this database.
[00:08:01] Unknown:
And for somebody who is working in data engineering, how much interface do they have with managing the underlying infrastructure, versus having an operational team that provides those servers and deploys the services, while the data management team is responsible for building the pipelines and tuning the actual services running on those instances?
[00:08:22] Unknown:
Right. I think it's a factor of how big the company or the data team actually is. But we definitely see, at some point in time, some sort of specialization. At the very beginning, maybe as the first data engineer, or the first software engineer with a data focus at a company, your first task is going to be around setting up some infrastructure: getting things like Kafka, Hadoop, and Spark up and running. If there are only a few data engineers on the team, there's probably going to be a distribution of tasks where people do a little bit of both. And over time, things will shake out in a certain way and people will specialize.
As to whether deploying, managing, and maintaining the data infrastructure is the role of the data engineer, I would argue that it's not the case in a bigger company. Other people who are perhaps a little more DevOps-y, a little more focused on infrastructure, would take on those tasks, and it's a very different skill set in a lot of cases. So over time, data engineers will tend to focus more on things like the data warehouse itself, all the data plumbing and data structures, and getting consensus on how the data should be organized for the company.
[00:09:52] Unknown:
And for somebody who's interested in getting started in the field of data engineering, what are some of the necessary skills that they should possess, and what are some of the most common backgrounds that you see those people coming from? So there are different types of people. There are the people,
[00:10:06] Unknown:
the dinosaurs like me, who came more from business intelligence and the more traditional, classic data warehousing field. If you're coming from that background, you're going to need to be able to code; you're going to need to develop the skills around source version control and writing code, and start changing your processes so that your daily tasks look a little bit more like those of a classic software engineer. So those people definitely have to recycle their skills. The people coming from there are usually very strong at things like data modeling, ETL, and performance tuning, and they typically know very well how databases work; perhaps they were DBAs in a previous life. And then there are the people coming from other places: new grads out of computer science, and people coming from the field of data science who realize that they're more interested in building persistent structures.
People who were in data science but realize that they're more interested in doing engineering-type work. I would say patience is also probably very important, because a lot of the batch processing data pipelines can be pretty cumbersome. And there's definitely a part of data engineering that is about plumbing: managing the pipes, and sometimes all hell breaks loose and you've got to get in there and fix things. So that's one aspect of the job that's a little less glamorous, the plumber aspect. But there are all sorts of other aspects. If you want to automate your work, then in data engineering, as in any type of software engineering, there are always tons of opportunities to create abstractions and tooling, where you can automate your own work as a data engineer and build services or systems or frameworks that do the things you would have done manually before. So for the data engineers out there who are getting bored with the job, or who are thinking that data engineering is a lot of data pipeline writing and that's not very exciting: there's always an opportunity to take the things that are redundant and build services and systems around them. And that's what I've been doing over the past few years with Airflow, and it's been super rewarding.
[00:12:35] Unknown:
Yeah. In my other podcast, we had a nice long conversation about the work that you're doing on Airflow, and since then I know that it's become an Apache incubated project. I'm wondering if you can touch briefly on how it has evolved since the last time we spoke. I know that it had been using Celery as the execution layer; I don't know if that's still the case. It is still the case. And yeah, I can talk a little bit about the Apache
[00:12:59] Unknown:
Software Foundation and how it helps a project grow, but also how it really changes the dynamics of the project. I've been through that for the first time recently. Airflow started as an Airbnb project; we had full control over the project and would release whenever we were ready. Then at some point we were interested in joining the Apache Software Foundation, because a lot of companies consider it a risk to start running software that they don't have any guaranteed control over. What the Apache Software Foundation does as you start the incubation process is, first, you have to donate your code, your trademarks, and your intellectual property to the Apache Software Foundation, which makes sure that people are not going to change their mind about sharing. From that point on, the Apache Software Foundation owns your code and your trademarks, and has some responsibility around the legal aspects of the project. If someone were to try to sue us because, say, the name Airflow had been used by some other company, the Apache Software Foundation would probably manage that litigation. This is as much as I know about the Apache Software Foundation, so maybe some of the things I say are not 100 percent right, but hopefully they mostly are. Part of the process is also that the Apache Software Foundation makes sure there are different people from different companies involved jointly, and they create what they call a PMC, a project management committee, which has to be made up of five or six or perhaps more committers who have the same level of involvement with the project.
That makes sure that, for example... I don't know the whole story with Apache Storm, but at some point Twitter had developed Storm, and a lot of people were using it.
And at some point they said: okay, we're done with Storm, we're going to build this other thing called Heron, and we're not interested in contributing any more code to Storm. So at that point they gave the software to the Apache Software Foundation, and people from other companies created a little committee and now manage the project jointly. From a developer standpoint, it's also a pretty interesting thing. Right now, if Airflow were owned by Airbnb and something changed in my employment status with Airbnb, I might not be able to work on Airflow anymore, and I might have to fork the project or something of that nature. But with the Apache Software Foundation, I know that I'll always be an Airflow committer and have the same relationship with the project regardless of my employment, which is cool. I think there's also a brand associated with it. Companies have a lot more trust in installing or running an Apache project than they would have in running a piece of software developed at some other company. Maybe a few more things about the Apache Software Foundation: joining Apache has definitely slowed down the pace of change for the project. And I think that's part of a maturing project: slowing down and focusing on things like stability, and just making sure you have a solid product that works for everyone. So the pace of development slows down a little bit, but the quality of the releases goes up. It's been challenging for the Airflow project to come up with a release. I believe the last release was a little more than six months ago, and people in the community are working on the next release, but it's been delayed for different reasons.
Now it's like we have all sorts of code that comes from different companies, and that's all and that's that same, melting pot. And we have to make it work for everyone, and that can be challenging, especially the first release. But I'm hoping that, you know, with Thyme and with the community, we're gonna be able to, to come with, like, a steady release process pretty, pretty soon. Yeah. It definitely seems that having the stamp of the Apache Software
[00:17:15] Unknown:
Foundation is a boon to a project. And like you said, it connotes a certain amount of maturity and stability. It's interesting how many of the widely used projects in the big data and data engineering space have ended up under the umbrella of the Apache organization.
[00:17:30] Unknown:
Right. Especially considering all the restrictions that come with it and the bureaucratic process that goes along with going Apache. We have to use Jira, for instance, and there were some limitations on how we could use GitHub. And there are these old mailing lists that we need to use for legal reasons. So we're unable to use GitHub issues; we have to use Jira instead. At the beginning of the project, it was a little frustrating to have to step back and use some of the Apache infrastructure that is really a decade old at this point. Sometimes you'll do a Google search and end up on some Apache mailing list website where it looks like the CSS is from the 1990s. But all of that said, I think it was a really good choice for Airflow, and I think it really contributed to the project's growth, because it attracted very talented committers and developers who thought, hey, I can be an Apache committer. Some of them were Apache committers before on other projects, so they can really get involved and trust that things are going somewhere. If I wanted to get involved with, say, Luigi, which is a similar product to Airflow in some ways, I would have to negotiate with Spotify and somehow try to get them to let me join the project, and there might not be a process for that. There might not be any guarantees. With an Apache project, there's always a way for a company to get involved and become a decision maker on the project.
[00:19:05] Unknown:
What do you see as some of the biggest challenges facing data engineers currently in the discipline of data engineering?
[00:19:12] Unknown:
1 of the new things is all the real-time data. So, you know, Spark Streaming, Samza, databases like Druid. That's a new stack for a lot of us, a new type of technology with a new set of constraints, with people adapting to things like Apache Beam, which is based on the Google Dataflow paper. So not only do you have to know the whole batch processing stack, which is fairly complex, with things like MapReduce and Hive and schedulers like Airflow; on top of that, you have to go and learn the real-time stack. Or perhaps in bigger organizations, different people will focus on those different aspects of data engineering. What else? Working with data science has always been a challenge, and there's always this push and pull as to whether a data engineer should be building horizontal infrastructure or should go and work closely with a product team. That would mean someone at Airbnb might be the data engineer assigned to a specific product team, say, working on the new Trips feature, and build the data structures for them. The data engineer is always going to be at that pivot point between horizontal and vertical work, and it's hard to please everyone when all these forces are pulling you in different directions. What else is challenging? Just keeping up with the diverging data landscape and ecosystem has always been a challenge. There's a new flavor of database every day, and there are so many packages and libraries and frameworks that you have to stay on top of. That's always been exciting, but challenging too.
If you stay too long at the same place, or on the same stack, you can get pigeonholed in a certain role or on a certain stack. So those are some of the challenges I can see.
The main challenge for a data engineer, I think, is operational creep. I think that's true of a lot of careers in information management in general, and in software engineering. It's really easy over time to build more and more things that you have to maintain, and to go from a place where you start out spending perhaps 90% of your time building new things. Then, as you stay with a company for 1, 2, 3 years, it's really easy to look up and realize at some point that you're spending 80 or 90% of your time just fixing stuff and fighting fires, and not adding a lot of value because you're just the person who keeps things running. So it's important for people who identify this to go and invest in tooling and refactoring, and to pay off that technical debt, so that they can stay challenged and keep creating, as opposed to just fixing stuff they've built in the past.
And you touched briefly on this, but how much analytical knowledge do you think is necessary for somebody who's working as a data engineer?
So I think you definitely need to have this analytical instinct, which is vital for data scientists to have. People have called that product sense too. I'm not sure if that's what you're talking about here, but as an analyst or a data scientist, you absolutely need to have that product sense and that intuition: being really curious and prone to digging in until you find the answer you're looking for. So I would say that is critical for analysts and data scientists, and really important for data engineers too. Though for a data engineer, there's that other urge, which is the urge to build infrastructure and build things that are there to stay, which you have to balance against the sometimes more ephemeral analytical urges you might have.
[00:23:29] Unknown:
And also, how much statistical analysis, or understanding of the different mathematical and probabilistic principles, is necessary for properly shaping the data so that it's accessible to the people doing the end analysis on it?
[00:23:47] Unknown:
I think this part is what we call data modeling, which is very different from statistical analysis or knowledge about stats. So let me talk a little bit about data modeling. Data modeling is not a perfect science. There have been all sorts of books written on it. There are these concepts of star schemas and snowflake schemas, and Ralph Kimball and Bill Inmon are people who wrote about that stuff in the nineties. Some aspects of those books are still very relevant today; some aspects might be a little less relevant than they used to be. But data modeling for a data engineer is a core skill: how are you going to structure your tables and your partitions, and where are you going to normalize and denormalize data in the warehouse?
Are you going to be able to, say, retrieve the attributes of the customer at the time of the transaction, versus the most recent attributes of the customer? How do you model your data so you can ask all of these questions? And data modeling for analytical purposes is very different from data modeling for OLTP types of applications; OLTP is the classical database structure for building simple software. So data modeling, I would say, is very important. Now, talking about knowledge of stats and statistical analysis, I would say not very much is needed. And I would even argue, and I'll probably get people outraged at this, that it's also not as important as 1 might think for data science. A lot of what we do in analytics is just counting things, right? Trying to figure out, are we doing better than we used to? What is the growth rate? The stats that we use, and the stats that I worked on more recently that are very important, have been abstracted out, as pretty much all companies working on an experimentation framework have done. Most modern companies do experimentation with their users, and usually people will call that an A/B testing framework. And it's really important for a company, especially a web company, to be able to run a lot of experiments.
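The point-in-time question above, retrieving a customer's attributes as they were at transaction time rather than their latest values, is the classic slowly changing dimension problem from the Kimball literature. A minimal sketch in plain Python, where the table contents and field names are invented purely for illustration:

```python
import bisect
from datetime import date

# Hypothetical slowly-changing-dimension history: for each customer,
# a sorted list of (effective_date, attributes) rows, the way a
# dimension table in a warehouse might version them.
customer_history = {
    "c1": [
        (date(2016, 1, 1), {"country": "FR"}),
        (date(2016, 6, 1), {"country": "US"}),
    ],
}

def attributes_as_of(customer_id, as_of):
    """Return the attribute row in effect on `as_of` (point-in-time join)."""
    history = customer_history[customer_id]
    dates = [eff for eff, _ in history]
    # Rightmost row whose effective_date <= as_of.
    i = bisect.bisect_right(dates, as_of) - 1
    if i < 0:
        raise KeyError("no attributes in effect at that date")
    return history[i][1]

def latest_attributes(customer_id):
    """Most recent attributes: what a naive join on the current row gives."""
    return customer_history[customer_id][-1][1]

# A transaction from March 2016 sees the customer as they were then:
print(attributes_as_of("c1", date(2016, 3, 15)))  # {'country': 'FR'}
print(latest_attributes("c1"))                    # {'country': 'US'}
```

In a real warehouse this lookup is expressed as a join against a versioned dimension table rather than an in-memory dict, but the modeling decision is the same: you can only ask the point-in-time question if you stored the history.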
In those experiments, you'll usually have different treatments and a control, and you want to see whether the changes in behavior are statistically significant. At Airbnb, I was involved in building some of the data pipelines for our A/B testing framework, named ERF. I believe there are papers out there, and probably some videos and presentations that some of my colleagues have done over the past year or 2. But that part of the stats has been commoditized at Airbnb, where now you can go and create an experiment, deploy it, and consume the results. Of course, you need to know what a p-value is and what confidence intervals are, from a consumption perspective.
But you don't necessarily need to know and understand all of the mathematics behind it. So I would say the stats work in a lot of analytics cases is a little bit overrated. In some cases it's really important for data science, but I think it's often overrated, and it's also often very well abstracted out. You can easily do some very complex things by using very simple libraries and functions with clear APIs.
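To make the "consumption perspective" concrete: once the heavy lifting is abstracted away, checking whether a treatment beat a control often reduces to something as small as a two-proportion z-test. This is a generic textbook sketch, not Airbnb's ERF, and the numbers are made up:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns (lift, p_value) for a two-sided two-proportion z-test.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se if se else 0.0
    # Two-sided p-value from the standard normal CDF (via math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, p_value

# 2.0% control conversion vs 2.6% treatment, 10,000 users per bucket:
lift, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"lift={lift:.4f} p={p:.4f}")
```

An experimentation framework wraps exactly this kind of function (plus confidence intervals, multiple-metric handling, and so on) so that the consumer only has to read the lift and the p-value.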
[00:27:29] Unknown:
And to make sure that the data that you're working with is of sufficient quality, what are some of the considerations that you need to be aware of when you're establishing new data sources?
[00:27:38] Unknown:
So data is kind of a jungle. There's all sorts of data out there, and as a data engineer, I didn't talk about that workload before, that burden, which is data integration. You pretty often need to go and fetch data coming from different partners and different companies, and you need to integrate your referentials, like your list of users or your list of accounts and transactions, with external service providers. And I think that's as challenging as it's ever been. We thought at some point that the B2B data flow would get fixed over time, that it would become easy, that you could have some simple APIs that would sync up your data with, say, Salesforce or some of these service providers. But data integration is as challenging as it's ever been. Going back a little more to your question about data quality: in some cases we want to get data from other systems, and we're concerned about data quality because we don't have control over the instrumentation, over how this data is generated. The first thing is probably to make sure you get things right where you're actually generating the data. Most web companies will have some sort of what we call an instrumentation framework, meaning that engineers who want to track certain actions on a website or on mobile have a way to put tokens and little trackers on the site: when someone hits the booking button at Airbnb, make sure we track that. And I think it's really important to have what we call schema enforcement as upstream as possible. That means as you generate that data, you want that data to come to life, to originate, with as much quality as possible, and to have all the dimensions and all the metrics that you need from that moment on.
And that can be really challenging to do. Now consider when you integrate with external sources.
So if you're fetching data from the Salesforce API, or, in our case, from Zendesk, which is a ticketing system provider, sometimes you don't know what to expect. They might change their API overnight and not let you know, or a certain referential that you have some very specific rules on might change over time. So in batch processing and in stream processing, you can embed some data quality checks. That means that as you write your pipeline, you might have an idea of problems that could occur in the future, and you set up alerts and breakpoints saying: hey, if certain variables go over a threshold, don't load that data into production; if there's a value I don't recognize for a certain field, send an alert. That's very much case by case. In a lot of cases, it's about developing an immune system over time. At first, you work with the API, you take for granted that it's not going to change, and you build your pipeline. Then 1 day something happens, there's a bug of some sort or the data is not what it should be, and you go and add more checks, and develop some mechanism to alert, or to prevent this data from making it to production before someone looks into the issues and fixes them.
1 pattern that we use a lot at Airbnb on Airflow is that as we do batch processing, very often we'll stage the data. That means we bring the data into a staging area. You can think of it like a warehouse: the staging area is where all the trucks drop things off before they get sorted out and brought inside. So we have this persistent staging area where we load the data, and then we'll run a set of checks on it. So: stage, check, and then exchange. The idea of exchanging is to bring that data into production, into a validated space. With Airflow, we made this little SubDAG, a construct that we reuse everywhere, and it follows that pattern of stage, check, and exchange. Over time, the check step of the pipeline grows more complex as we identify patterns of data quality issues. For example, we might say: if there's a week-over-week change of more than 30%, fail the pipeline and send an email to the owner of the pipeline. We have tons of these checks all over the place at Airbnb.
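The stage-check-exchange pattern described here can be sketched without any Airflow machinery; in practice each step would be an Airflow task inside that reusable SubDAG. Everything below, the table name, the row counts, the 30% threshold, is illustrative only:

```python
# A minimal, framework-free sketch of the stage-check-exchange pattern.
staging = {}     # staging area: data lands here first
production = {}  # validated space: data only arrives via exchange()

def stage(table, rows):
    """Load raw data into the persistent staging area."""
    staging[table] = rows

def check(table, last_week_count, max_wow_change=0.30):
    """Fail the pipeline if the row count moved more than 30% week over week."""
    count = len(staging[table])
    change = abs(count - last_week_count) / last_week_count
    if change > max_wow_change:
        # In a real pipeline this would fail the task and alert the owner.
        raise ValueError(
            f"{table}: week-over-week change {change:.0%} exceeds threshold"
        )

def exchange(table):
    """Promote checked data from staging into the validated/production space."""
    production[table] = staging.pop(table)

stage("bookings", rows=[{"id": i} for i in range(105)])
check("bookings", last_week_count=100)   # 5% change: passes
exchange("bookings")
print(len(production["bookings"]))  # 105
```

The important property is that nothing reaches `production` except through `exchange`, so a failed check leaves the previously validated data untouched while someone investigates.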
[00:32:33] Unknown:
So would you say that that's fairly analogous to the idea of unit testing and continuous integration for standard software development but applied to data?
[00:32:41] Unknown:
I think it's fairly different, because it's reactive and it's production code. If you look at continuous integration, you're trying to prevent bad logic or changes from making it into production. So if you do your job well with your unit tests and your continuous integration, you're not going to deploy code that would result in problems. Data quality checks in data warehousing are more of a reactive thing: we have the logic, we don't control the ingredients coming into the recipe, but if we detect certain anomalies, we won't bring that data into production. So I guess, instead of code that does or doesn't make it into production, it's data that does or doesn't make it into production. So it is analogous in some ways. I can talk a little more about unit testing too, and that part of continuous integration for data warehousing. The state of that is not really great, and a lot of people ask: what is the best practice for Airflow? How do you run your unit tests? How do you make sure that when you launch a pipeline, it's not going to break production? And the answer is that there are probably as many ways to test and validate a pipeline as there are engineers and pipelines out there. We have different ways of doing it. But 1 thing that's clear is that you cannot really have an at-scale development environment. When I was working at Facebook, we were close to an exabyte in the data warehouse, and you could not just say, oh, we're going to have a dev instance of the warehouse that's also going to be an exabyte. So really often, you need to create some sort of microcosm of your data warehouse.
Sometimes that might just be pushing some sample data through and making sure that no fatal errors happen, and that all the rows that entered the pipeline actually made it through or got summarized in the end. The reality is, data engineers don't systematically test a lot of what they do. They'll probably test the little piece of the pipeline that they work on. That means they might run that piece of the pipeline and divert it into an alternate temporary table that they can check. But once it's checked, they'll pipe it right back in and push it into production. Doing things right, or by the book, would be prohibitive in terms of infrastructure cost.
And given how much more confidence you'd gain for the amount of work required, sometimes it's not worth it. So it's really case by case, and there are all sorts of techniques out there. I should probably write a blog post on the subject 1 day. But the reality is that data engineers are often less thorough than software developers, just because of the costs and challenges involved.
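The "microcosm" approach, pushing a small sample through the pipeline and checking that every row is accounted for, might look like this in plain Python. The daily aggregation here is a made-up stand-in for a real pipeline step:

```python
from collections import Counter

# A tiny sample of the data a real pipeline would process at scale.
sample_rows = [
    {"day": "2017-01-01", "amount": 10},
    {"day": "2017-01-01", "amount": 5},
    {"day": "2017-01-02", "amount": 7},
]

def summarize_by_day(rows):
    """Illustrative transform: aggregate amounts and row counts per day."""
    totals, counts = Counter(), Counter()
    for r in rows:
        totals[r["day"]] += r["amount"]
        counts[r["day"]] += 1
    return [
        {"day": d, "total": totals[d], "row_count": counts[d]}
        for d in sorted(totals)
    ]

summary = summarize_by_day(sample_rows)

# The sanity check described above: every row that entered the pipeline
# either made it through or got summarized.
assert sum(s["row_count"] for s in summary) == len(sample_rows)
print(summary)
```

Running the transform over a few hand-picked rows like this catches fatal errors and silently dropped rows without needing a dev warehouse anywhere near production scale.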
[00:35:46] Unknown:
And have you seen any points where the work done by data engineers and managers of data infrastructure has bled back into more mainstream software development and systems engineering?
[00:35:57] Unknown:
1 aspect of this is that the same way data engineers are doing more software engineering, it's also true that software engineers do a fair amount of data-oriented work. In places like Airbnb and Facebook, or any modern web company, a software engineer needs to be able to set up an A/B test, an experiment, so that they can measure exactly how a change they're making to the website is going to influence the different metrics they're trying to move. So from that perspective, software engineers are doing some of the things that data engineers do: writing pipelines, or instrumenting a metric.
Building dashboards, and building their own little pipeline and dashboard, is not uncommon for a software engineer working on a product.
So how do you see the role of data engineers evolving in the next few years?
I see the role becoming a little more formal and balanced with respect to data science. I think a lot of companies have recognized that they need data scientists in order to compete in their business, but I don't think all companies have identified that they need a data engineer to organize the data structures and pipelines that the data scientists are going to source from. And 1 fact is that data scientists are pretty horrible at building infrastructure and data pipelines. They'll build pipelines that are very brittle, that fail over time, and that are not manageable.
Typically the data modeling isn't done right, and there's a lot of throwaway and redundant work, where every data scientist goes back to the raw data, does their own analysis, and applies their own transformations. Then you end up with big problems, like the same metric, or a metric with the same name, that has very different values depending on who computed it. So companies that have skipped this part, data warehousing, data engineering, getting serious about their data structures and how they organize their data and metadata, are suffering, because the data scientists are doing that work, they're not that great at it, and there's no consistency in the way they look at their numbers. You get into this situation where no 1 can trust any number, where the CEO says: I don't even want to see the dashboard, because I know it's wrong, because it's different from that other dashboard. I also see data engineers starting to do more abstracted work. Beyond what a data engineer does today, which is building pipelines and building the warehouse, I see data engineers starting to build services and frameworks that other people within the company can use. An obvious example is the A/B testing framework: maybe originally, for every experiment, some data scientist would go and do all the stats work necessary to figure out whether that experiment was successful, or how it moved different metrics. Now the data engineer, along with software engineers and data scientists, will together build a reusable framework that can be used for all experiments.
But that's just 1 application of the reusable components that data engineers can build. There are other ideas, like cohort analysis frameworks, aggregation frameworks, all sorts of computation frameworks that abstract out the common data work done at different companies. Then there are the forces I was talking about earlier: over the next few years, are data engineers going to align more with their verticals, or are they going to align closer to their infrastructure peers and work on more horizontal products? That's still unclear to me, and we'll have to see which way it goes. My draw, my personal interest, is to go more horizontal than vertical, but I'm not sure which way the industry will go. And 1 thing I truly hope we start to see over the next few years is some sort of convergence in the data ecosystem. Can we all agree that we should use Spark Streaming and not Samza, say? Or can what happened with Kafka happen elsewhere? People converged and said Kafka is the tool we're going to use, so you don't have to choose between 5 or 6 different tools or frameworks. Hopefully there's convergence in that area, where people agree on how they should build and maintain their infrastructure.
I'm also hoping that when you think about the data infrastructure work, which is installing and maintaining Hadoop clusters, Druid clusters, Spark Streaming clusters, some cloud providers or service providers commoditize that, so that every company does not need to go and install and maintain all of that infrastructure. It's just inefficient. So that might be a trend in the future: some really good service providers where you can stand up a Druid cluster in 10 minutes and get value from it right away.
[00:41:18] Unknown:
1 of the things that you alluded to briefly is the idea of data scientists writing analytical code that then needs to be operationalized. So how much of that responsibility falls on data engineering, versus a more full-fledged software engineer who needs to take that analytics code and write it as a scalable piece of software?
[00:41:39] Unknown:
Yeah, that's a very good question: whose responsibility is it to make the data scientists obsolete? Is it the data scientists' job to automate their own work? I don't think it's their natural draw, since they're not necessarily engineers, and their skills don't necessarily align that way. At both Airbnb and Facebook, my 2 most recent companies, there is some form of machine learning infrastructure team. I believe at Facebook it was called data science infrastructure, and at Airbnb it's called either machine learning infrastructure or data science infrastructure as well. And that's a slightly different role. Are data engineers the right people to do this? I guess we're getting into 1 of these gray zones, where we're looking for a unicorn type of person who has all those skills, in the middle of that Venn diagram of data science, software engineering, and data engineering, and those people are extremely rare. If you find 1, make sure you keep them for a long time, because it's almost impossible to find these people. And otherwise it's hard: if you put 1 data scientist, 1 data engineer, and 1 software engineer in a room and try to make them work together, the results may vary. But I think this idea of data science infrastructure is super interesting, probably 1 of the most exciting areas right now, where we can commoditize data science. And I believe in a lot of cases, as I said earlier, that data science's value is not necessarily 1 really intelligent machine learning model solving 1 simple problem; I believe the real value is doing some basic machine learning across the board.
It's not necessarily about doing something very complex in terms of the math or the libraries. Really often, it's just about having a little bit of machine learning applied in the right place, across the board, and commoditizing that. Hopefully we're going to see some of that. Right now, I think it's only a handful of companies doing it, and they're kind of struggling; there aren't really companies, services, or libraries out there that provide that sort of thing.
[00:44:07] Unknown:
So are there any other topics that you think that we should cover before we close out the show?
[00:44:11] Unknown:
No, I think we covered a lot of things. So, yeah, I think we're pretty good.
[00:44:17] Unknown:
Yeah. It's definitely a large space with a lot of different concerns, but I think it's important to lay the groundwork of what it is you're really talking about when you say "data engineer." And I think that we did a pretty good job of at least doing a good approximation of
[00:44:31] Unknown:
that. Right. At least finding the center point, the core of what a data engineer is, and then from that point deciding how far from it you want to allow people to go. But the core, to me, is right around the data warehouse and the pipelines that organize the company's data and metadata.
[00:44:52] Unknown:
Alright. Well, for anybody who wants to keep in touch with you or follow what you're up to, I'll add your preferred contact information to the show notes. I just want to say I really appreciate you taking the time out of your day to share your thoughts about data engineering and your experiences working with it, and I hope you enjoy the rest of your evening.
Thank you very much. That was an honor and a pleasure to be on the show.
Introduction and Guest Introduction
Maxime's Journey into Data Engineering
Defining Data Engineering
The Evolution of Data Engineering
Impact of DevOps on Data Engineering
Skills and Backgrounds for Data Engineers
Apache Airflow and the Apache Software Foundation
Challenges in Data Engineering
Data Quality and Integration
Data Engineering and Software Development
Future of Data Engineering
Closing Remarks