Writing The Book That Offers A Single Reference For The Fundamentals Of Data Engineering

Atlone is the metadata hub for your data ecosystem.

Instead of locking your metadata into a new silo, unleash its transformative potential with Atlant's active metadata capabilities.

Push information about data freshness and quality to your business intelligence,

automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans could focus on delivering real value.

Go to data engineering podcast.com/atlin

today, that's a t lan, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.

With their new managed database service, you can launch a production ready MySQL,

Postgres or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs.

Go to data engineering podcast.com/

lunode today and get a $100

credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.

Your host is Tobias Macy. And today, I'm interviewing Joe Reiss and Matt Housley about their new book on the fundamentals of data engineering. So, Joe, can you start by introducing yourself?

Yeah. Hi. I'm Joe Reiss, recovering data scientist, co founder and CEO of Ternary Data, author of Fundamentals of Data Engineering.

And, Matt, how about yourself? I'm Matt Housley, CTO of Ternary Data and coauthor of Fundamentals of Data Engineering, also a recovering data scientist. And going back to you, Joe, do you remember I, hey, first guy started working in data? Yeah. I mean, I've always worked in data in some capacity or in others for over 20 years now. So I started by, I guess, pursuing the actuarial route, then moved into

analytics,

data science, and then data engineering.

And, Matt, how did you get started in data? You know, I would trace it back to my undergraduate degree in physics and and then actually getting a master's degree in physics. So I was doing a lot of experimental work back then. And so there was always a data component even back then. Even though I wasn't using anything fancy like a database, it tended to be a lot of Excel and such. And then I took kind of a break and got a PhD

in pure math, and then I came back as a data scientist later.

And like Joe said, you know, recovering data scientist meaning

kind of recognized the need for data engineering to get my job done and then just ended up becoming a data engineer that way.

In terms of the topic at hand, can you just start by explaining what possessed you both to write such an ambitious book? When we surveyed the

books on data engineering that were out there, we thought there were a lot of very useful

titles,

but they're typically geared towards very specific technologies.

So maybe machine learning with, you know, a certain cloud platform or a certain language,

or the books were, you know, along the lines of Design Data Intensive Applications, which is a fantastic book, 1 of my favorite books of all time.

But we really felt like the books

didn't provide a comprehensive

ground level view of data engineering as it was practiced in the early 2020s.

And so we really wanted to take a step back and understand,

you know, what are the things that data engineers would need to know

to

get hired, to do their jobs, and to succeed,

as a data engineer today?

I think we got a lot of inspiration

from the data engineering community as well.

I feel like the world of podcasting and, like, medium and substack posts is kind of way ahead of what exists in books in terms of defining what data engineering is all about.

But at the same time, we felt like maybe it was time for someone to codify that

in more of a formal way and just to gather those ideas and that information.

And so, for example, we took a lot of inspiration from Google Cloud Platform has their notion of the data life cycle. And instead of just thinking about tools, you're thinking about how data flows and why you care about data and what you're gonna do with it. So I think that almost became a starting point for our ideas about the data engineering life cycle,

plus, again, a lot of ideas just from the data community. Plus it was also to answer a lot of questions that we'd get, like, how do I become a data engineer? What do I need to know to be a data engineer? And so forth. And so you wrote the book because everybody got angry when you said read Data Designing Data Intensive Applications, and they said, I don't have 3 months to do that.

That was definitely a response that we would get. Yeah. That book, you know, was still 1 of my favorite books in data, but it's definitely not the most approachable book. Yeah. It's a bit of a slog. Problem too, and I think Martin himself basically said this, is that you can read the whole book and you'll know a lot about design and distributed systems, but you're not necessarily

going to know what it is to be a practicing data engineer,

especially where we emphasize a lot of managed services. Like, where you can use something off the shelf, don't reinvent the wheel.

Go read Martin's book so you can understand how these systems work, but you're probably not gonna be in most jobs designing really complex systems day to day

in terms of, like, you know, 1, 000 nodes and managing all these nodes and that kind of thing. Absolutely.

So as far as the objectives that you had, you said that you were tired of answering the same questions over and over again. But what are the overall goals that you had going into writing this book, and how do you feel about your level of success in actually addressing those goals?

The goals were really

what

is likely to be

at least somewhat immutable in the next 5 to 10 years. To actually invert the question, what were some of the goals we did not want with this book? We didn't wanna write a book that was very transient,

very

ephemeral.

So that would have been 1 of those, you know, data engineering with technology x, y, or z, or cloud platform a, b, c. Right? So

we really wanted to take a step back and assess what would be the things where

Matt and I, being incredibly lazy people, wouldn't need to rewrite the book in a year or 2 years.

What were the things that if you were to pick up the book,

say, you know, again, 5, 10 years from now, most of it would still be relevant or still be useful.

And so I'm sure we have all those types of technology books, you know, in our libraries. And

when I pick up those kinds of books and I reread them, there's

a sense of gratitude that the author took the time to kind of take a step back and really understand

the bigger picture that was happening in the field at the time that they wrote it. And I think as well there's this sense that

again, we intend this as a compliment to all the technical books,

but it takes a long, long time to accumulate the knowledge

to of what it is to actually be a data engineer, especially if you start with 1 or 2 technologies. Like, wait. What's the big picture here? What are the goals? How do I actually be successful in this profession?

And so that's the laziness idea. We hope that we provided a shortcut to people that we've achieved that goal. It's kind of a scaffolding that they can plug technologies into and kind of accelerate their careers in that way.

Yeah. That's exactly the framing that I was gonna say is that this book kind of sets the groundwork for you to figure out what to plug into which of the pieces and what those slots are for you to even know what to start thinking about and what technologies to evaluate beyond just, okay, I need to get data from a to b. I guess I'll write a bunch of Bash scripts in Cron and hope that it works.

I think in general, the problem with any really hot field that's trendy

is that people tend to focus on the wrong things. And I think we've seen this happen in data science and then, you know, kind of the next iteration machine learning. And I think we saw it happen with the big data movement and then subsequently with the modern notion of data engineering.

Everyone wants to focus on, like, the shiny objects as opposed to the blue collar aspect of the job. And the blue collar aspect of the job is what gets things done. Right? It's not just grabbing the latest version of Spark that's actually gonna deliver quality data. It's understanding the whole life cycle, what it means to serve data, and who your customers are, and those kinds of questions. As far as the overall process of figuring out what are those evergreen topics, what are the

foundational components that are necessary for somebody who is interested in working as a data engineer or at least understanding what it is the data engineers do. How did you go about figuring out what those pieces were and then boiling them down to their essence?

This is 1 of the harder parts of writing the book. It took a lot of, I think, soul searching on our part to understand, okay, so in our day to day jobs working as data engineers

and, you know, knowing a lot of other data engineers, what were the subjects that they'd most care about?

And not in a pandering sense where we wanted to write a book that would only appeal to certain nuances and areas and not others, but,

you know, what were going to be the things that intersected with everybody's experiences as a data engineer? And so

at the same time, we're looking at lots of data lifecycle diagrams for you know, just to see, okay, so how is data life cycle described by various organizations? If you Google data life cycle and look at the number of images, there's

countless of them. It's it's like you looked at the James Webb and saw how many galaxies there are in the universe. It's about as many, data life cycle diagrams exist. So, but there are certain commonalities to these. Right? So, and the commonalities that we found were applicable to data engineering, which we'll talk more about in a bit. Those are the ones we kept that represented the subject areas that

were probably going to be most relevant for this book. And I think a lot of it too was just, again, reading blog posts, listening to podcast episodes by people we respected.

And what we found at some point is there was this cultural shift happening in data engineering,

where people started talking a lot more about data ops and data management and data quality and data observability and, like, you know, other problems.

And so those were the topics where, like, okay. That's the stuff that really needs to go into this book because

it's not only trendy, but it's actually trendy for a very compelling reason, which is this is the stuff that helps us to get the job done, and that's what should go in the book.

And there's also the interesting

moving target of what the data engineer job role actually is because

as data becomes more of a core consideration for more organizations,

it starts to bleed into, okay, well, software engineers do data engineering, or data engineers do some software engineering, or data engineers are doing some of the analytical engineering or data science work. Curious how you thought about what those boundary conditions are for what do I

The

The way we thought about it was, you know, the data engineer's role

done,

I suppose, either properly or in a theoretical vacuum would be getting data from source systems, whether those are, you know, databases that reside in application,

APIs, or whatever other source system.

Making that data useful for downstream consumption by data scientists,

analysts,

maybe other processes like reverse ETL.

And so really the data engineer,

as we describe it today,

sits in between those source systems and the downstream use and application of data.

It's interesting seeing how this definition of data engineering

as a practice has shifted over the years.

This data engineering really wasn't a title until

ETL

you know, working with data or a BI engineer or

a ETL developer or any number of other titles that, you know, are now sort of, you know, kind of hodgepodge in with data engineer.

So, I mean, this practice has existed for quite a while, actually,

but data engineering as a practice, I think, could really be boiled down to the steps that you just described where you're getting data from source systems and making it useful for downstream use. I I think in terms of defining

our boundaries and our edges too, it was very helpful for us to think about roles that we were targeting.

Basically say, this is the persona

of our intended audience. Hopefully, other people will find this book useful as well, but this is who we're really targeting.

And I think a lot of data engineering books have tended to focus maybe on the FAANG audience to some extent,

like people who work on very sophisticated systems at companies with a lot of resources.

And the problem with that is that that type of expertise is actually not necessarily very transferable out of that thing world. So for example, if you work at at Google,

you may just work on their Colossus storage system, and that may be your job. You just fine tune that system for different applications, including big data, data engineering applications.

So we defined our target audience to be more the, let's say, blue collar data engineer who's not as interested in the low level fine grain details of tools and much more interested in just getting the job done.

And I think in terms of longevity, we hope that that role is here to stay even if some of the details evolve over

time. I think it's the first chapter. You address sort of what that evolution of data engineering and its

responsibilities has gone through over the past few decades, and you say that you're focusing on what you term the data life cycle engineer, which is a term that you mentioned earlier. And I'm curious if you can talk to

what that means for somebody who's actually doing that work. And

for somebody who is maybe working at those lower levels, how much of the topics that you're addressing in this book are really going to be relevant to them

or are worth knowing about as somebody who might be consuming the tools that you're building. So if you are the person who's responsible for keeping Colossus up to date or adding new features to that, How much of the broader use case of data life cycle engineering is necessary

to have to be able to do your job effectively?

I think there is utility there, and the utility is that it does help you to understand the big picture.

And, you know, as you evolve in your career and maybe move into leadership in other areas, hopefully, this will be very useful to say, you know what? I'm no longer working on Colossus. I'm now a lead engineer,

and I'm supposed to define the direction of Google. How do I do that? And let's think about the big picture of how Google uses data and how they consume data and how they ingest data as opposed to just 1 little slice of things.

Yeah. I think there's applicability

across companies of different sizes and data maturity.

Say that you're working at a non tech company, quote unquote. Right? Bigger companies are gonna have teams dedicated to certain parts of the data life cycle. Right? So

whether that's a team that's, you know, in charge of managing the storage systems where all the data is stored or some aspect of it, other teams are responsible for data pipelines.

And we hit on this in the book a lot. It's incumbent to know what your upstream and downhole stakeholders want, right? This isn't just a technology discussion. It's very much about team dynamics, about communication as well.

Typically, when we see data projects fail, it's very seldom due to the technology.

90 plus percent of the time, it has to do with the people and how they interact and communicate or probably don't more to the point.

So

these are the things I would say was just as important as the technology to get across in the book, which were,

I guess, empathy

and ways to communicate and understanding the needs of, again, your upstream and downstream stakeholders, whether those are who work on source systems, whether that's, you know, the team that works on ingestion, storage,

transformations,

you know, BI, machine learning, and so forth. You really should understand

what their job entails,

what they want in order for you to to do

your job most effectively.

Kind of back to your earlier question,

you know, kind of about how the definitions

of data engineering and sort of who this book is for, There's this old trope that goes around about data scientists spending 80, 90% of the time getting data, cleaning data,

all the non, quote, data science y stuff. The job of a data engineer, at the end of the day, really should be to,

I guess, invert that percentage. Right? So maybe, you know, the data scientist is spending, you know, 10% of their time getting and cleaning data, 90% of their time doing the kinds of work that they were professionally trained to do, whether it's machine learning, analytics, and so forth. The data engineers should enable,

you know, their stakeholders to do, you know, the best version of their jobs that they're hired to do. And so again, to bring it back to the question you just asked, you know, data life cycle engineer really does mean

understanding, you know, the the tools, practices,

as well as the stakeholders involved in the data life cycle and helping them do their job the best they can. Yeah. I mean, because of our backgrounds,

maybe the stakeholders are actually your primary focus because

you've done that job, and so you want people to be able to do that job as a data scientist. In fact, I tell people, you know, if you wanna be a really good at data engineer, you should go work as an analyst or a data scientist for a bit. Because it's only when you understand the the outcomes and the outputs. So I think you really can work backwards and understand how best to get to those results.

The other interesting part

of working in the data engineering space is, you know,

similar to the software engineer versus sysadmin space of the balance of rigor and strictness and repeatability

with flexibility

and iteration speed and ability to just get things done. And as you said, it's definitely necessary to have worked in both sides to figure out how do I, you know, manage that balance and how far in which direction to push things depending on what type of operating environment I'm in.

Oh, for sure. Yeah. I mean, especially when you're doing stuff like sprint planning, for example, if your company happens to do

that, knowing what's involved in the outcome, I cannot begin to tell you how important that is. Whereas if you just have a sort of a passing notion, or if you've ever talked to your stakeholders, like, how are you really going to know, like, if you're best serving them or,

you know, or if they're getting the, you know, the best outcomes that that they should be getting. So

As to the specific audience of the book, you said that it is oriented towards people who are looking to get into data engineering, but, obviously, there's a broader audience that would benefit from the lessons here. And to your point of understanding

what are the things that go into all these pieces of work, I can definitely see the emerging role of the data product manager

as being somebody who should really read something like this book, if not this book specifically.

And I'm curious how you thought about

being able to balance focusing on the audience of somebody who wants to be a data engineer

while making it approachable for people who just want to understand the context of data engineering.

We tried to write it in such a way that while there is a very specific target persona,

the applicability would be much broader. And even going beyond data product manager, we hope that people

who are trying to undertake some kind

of digital transformation, you know, that's a huge cliche. I think we know what we mean by it to some extent.

We'll be able to read the book and make decisions about, like, hiring and steps and sequencing in order to get from a to b. And so we hope it's written in such a way that it does have this much broader audience. Oh, we've gotten a lot of requests for that. Data engineers are very excited about this book. We get a lot of questions from them. Data scientists, analysts, product managers, the 1 we were most surprised by. We got a lot of feedback from product managers that, you know, wanna pick up our book just to understand what's going on. You know, I I talk to data engineers every day. I have no idea what they do. Maybe your book will help me at least understand what they

are doing or maybe should be doing. So and that's the other feedback we've gotten too is the book is really much a rubric, not just from data engineers, especially. They're like, well, I've been working as a data engineer for a while, but I've never had, you know, kind of a holistic context of the title I have or the job I was hired to do. And so this is something we've been talking to, you know, team leads and execs about increasingly is

leveling up the knowledge on data teams for data engineering in particular. Because as of now,

I think without exception, there aren't really, like, standardizations

of practices or skills

for data engineering teams. It's kinda like, woah. You know about databases? You can type on a keyboard? Cool. Join our team. You know about,

AWS? You'll be a great data engineer in our team. I mean, can you imagine, like, trying to pick a sports team that way? So your goal is to go play the Super Bowl.

Right now, it's pretty random. It's

kind of a after school, like, flag football team, more than it's a professional sports team. And I think, hopefully, you know, stuff like this, you know, like this book, and I think just standardization of practices and expectations

goes a long way towards, you know, making data teams,

I think, just a lot more focused, specialized, and at least playing on the same playbook.

Digging into the contents of the book specifically, I'm wondering if you can talk about how you thought about the overall structure and which concepts to address in which sections and just sort of the the overall

layout and organization

of the book and how you thought about bringing people through that journey of, okay, this is what data engineers do and why you might be interested in reading this book through to,

okay, now you've read this book, go out and, you know, be effective and, you know, these are the next steps for you to actually

put this into practice.

I think we decided pretty early on to structure it around this idea of the data engineering life cycle. And again, that was inspired to some extent by Google and other sources where we're like, okay. If you boil data engineering down to its fundamentals, what are you really doing? Like, let's get down to the essentials.

Once we had those sections,

we realized that you probably need a discussion of actually how to choose technologies because that's actually something that isn't

we don't do a good job at as a profession. We talk a lot about why certain technologies are cool. We don't talk about how to make technology decisions.

And because so often that falls on data engineers, we ended up adding a section for that. And then we thought further about it and we're like, okay, is architecture really the same thing as choosing technologies? It seems like it's a bit different actually because you're making

decisions about team structure and organization, about responsibilities,

about data flows, and eventually about technologies too. And so we decided we needed to add a chapter on architecture for that reason,

Especially because in small organizations,

data engineers

often end up being kind of the

de facto architects.

And then originally we thought we would have a fairly strict hierarchy. Like we'd have topics that would sit under ingestion, topics that would sit under

generation, but then we realized that there were these ideas in data engineering that don't fit into a stage of the life cycle. So for example, security.

Security is everywhere. Right? If you're not thinking about security at every stage, then you're just asking for a breach.

Same thing with orchestration,

you know, like Apache Airflow

or many other technologies.

Orchestration cuts across all the stages because you're really managing data flows across all stages.

That's this concept of the undercurrents where they cut across each chapter and we keep revisiting them, and we hope that that's, like, very

concrete pragmatic stuff that will help people to become

data engineers. Yeah. When it came to stuff like data governance and data management, for example. Right? Like, I'm sure if you've heard the DM book of knowledge,

that thing's

huge. I think certain areas of that book that applied more than others to data engineers and data engineering life cycles. So we just said, well, fine, we'll choose some of these as undercurrents. We'll highlight these as being what you need to know. But

for further reading, please read these other books, like Dima and

many other books that we,

you know, listen in our book. So there's no shortage of great resources out there. But at the end of the day, we really thought that, you know, the engineering life cycle, the undercurrents,

architecture,

choosing tools again, in the end, we highlight security. That's a huge 1. We just wanted, like, reemphasize that at the end. Like, don't forget about security, because if you do that, bad things happen

to good people like you. And, you know, we finished it off with, you know, the future of data engineering, sort of what is our take on where things are going.

That was a fun 1 to write because you could be totally wrong and still have a lot of fun writing it. So the other the other chapters, you had to be pretty correct, or very correct, actually. So that was fun in itself. I would say the the part 2 of the book where we go through the, data engineering

life cycle in particular, it was great. Our panel of, tech reviewers was, I would say, world class

and just really beat the crap out of us and the drafts. So the end product is very happy with it. In that process of

saying, okay. We're going to write this book. These are the topics we're going to cover. This is how we're going to lay it out. Then there's also the hard work of saying, okay. Well, how do I

apply this, you know, level of rigor to make sure that I'm getting things right and I'm not leading people astray and not just writing this just for the sake of writing something because, you know, because obviously, it's going to make you rich. Right?

Yeah. That's why we did it. It's all for the royalties. Yeah. Yeah. Yeah. Matt has new Bentley on the way. Yeah. That's right. Yeah. That's gold standard for Yeah. But just curious how you actually went through that process of saying, okay.

You know, we've created this framework, but now we actually have to go and do the research and distill it and make it addressable

for our end users because,

obviously, nobody's going to be expert in everything. So I'm just curious, what were some of the pieces that you had to go and actually do research and do your own independent learning on to be able to reflect that to your to your readers?

I know personally, it was, like, 90%

research and reading and, like, 10% writing. And the book took us about a year and a half to write. So you can kind of understand how much research was involved in this. It was not trivial.

If you look at our further reading section, for example, you can kind of get an idea of how much we read.

So I'd say, you know, definitely going out podcasting helped a ton too. So we have podcasts,

you know, we talked to guests. And I think by doing that, got a lot of opinions and insights that we just wouldn't have got if we were just reading, right? And watching videos

on YouTube helps. But at the end of the day, you know, it's

sort of the Lollapalooza effect, as, you know, Charlie Munger calls it, where you just get a hodgepodge of ideas, synthesize them, and something new comes out at the end of the day. You know, the outline of it, I think, was the easy part. And we came up with that because we had to do the book outline

and the table of contents, but then researching stuff. It's like data is such a huge field. Data engineering is huge. And

every time you think you got a subject and you start reading about something else, you're like, Well, crap, I gotta add that in too.

So it just

grew and grew and grew and grew. That's what happened.

I think,

to some extent, some things you know that you don't know, basically. So you go do a bunch of research, and then sometimes you would start writing, and you're like, who's this area here actually? I'm gonna pull on this thread and you're like, ah, I actually don't know enough about this topic. And then you would run off and do some research on that. That happened many times.

And then just us questioning each other on stuff was very helpful. Yeah. Joe questioning me, me questioning his writing, saying, hey,

is this really complete or is this a correct view of this? Or should we take into account other points of view? And then our our reviewers were just invaluable in this respect too in terms of pushing back on certain things or just suggesting like, hey. Have you taken this into account?

That was invaluable too. And then I would say also

part of what we did, which was kind of painful. So Joe mentioned, like, d m bach and some of these, like,

really old school, enterprise y, like, practice books that frankly

peep people in the data sharing space don't particularly like in many respects. We kinda realized that we probably need to give credit where credit's due and also just borrow the best ideas from those books by reading them and then integrating them where possible and kinda saying, take these as enterprise practices that you might view as stodgy and then make them your own, like, modernize them. They actually have a lot of really good ideas here about data management. There's a lot of nuggets. I think we realize is there's nothing really new out there per se.

Even the innovations that you see from big data tools all the way to now, they really do stand on the shoulders of, you know, papers and

journals and people that have come up with these ideas decades ago, right? And I would say the best thing to do is just read a lot of history and a lot of the original writings. And I think that gives you the context to write a book like this. When we talked to O'Reilly about this book, they thought we were insane.

So I remember Jess Haberman, our acquisitions editor, was like, guys, do you really wanna have this as your first book? And we're like, yeah. Why? And she's like, because it's hard. Nobody's done this for a reason. There's so much to pull on here. Like, this is not your typical

book. You know? But she,

you know, apparently, didn't persuade us not to write it. You know, it got green lit really fast once we had the draft you know, the book proposal done. So she's like, okay. You guys think you wanna do this? Go do it.

So

she did everything in her power, I think, to

discourage us from writing this. She's like, this on a scale of like, you know, 1 to 10, this is probably a 9 or a 10 in terms of difficulty to write. This is not a trivial book. Not at all. Because think about what you're trying to do. You're trying to come up with a definition of a field that I think has been sort of defined, but pretty vaguely, and then come up with, you know, the practices

of a field that has been not really defined.

We all do it, but it hasn't been comprehensively defined in this way before. And so that was very challenging. But, you know, along the way, what was pretty cool was calling up people like Bill Inman and asking him, does this section on data warehousing look correct to you?

And he'd say,

here's we fixed a couple of things here, but it looks pretty good. So just, know, getting, you know, the expertise of people who, you know, were truly originators in this field, I think was, you know, 1 of the great experiences of writing this. So so everyone wanted to pitch into, I think, because they, at least, quite a few people were rooting for us towards especially towards the end where they're like, this is pretty cool. Yeah. It's definitely not trivial to be trying to write a definitive

guide in an evergreen format for such a moving target.

No.

So because of the fact that you are

trying to treat this broad subject area in an evergreen fashion

and trying to remain technology agnostic,

I'm curious how you balanced that with the need to actually provide

useful examples for people to be able to

crystallize and conceptualize

the different topics that are being covered and just how you thought about

where and when and how to introduce

specific technologies or references to technologies

as ways to

illustrate the examples or the subjects that you're trying to cover? I think, ultimately, we had to choose

a collection of what we consider to be best in class

technologies

to use as examples

across the board. And so we always try to, like, cite several examples of technologies where possible

or maybe illustrate

very narrow examples with a single technology. We also try to keep it pretty broad

and to come up with applications where, like, here's where you would use Apache Druid, for example. Here's why it's

useful. Dig into details there. Because, yeah, otherwise, you risk getting just pulled in to the direction of a single technology and sort of obsoleting yourself and obsoleting your book within a couple of years. I don't know. What's your Yeah. I mean, you try and find the technologies that will be around, hopefully, by the time the book's in print. It was kinda funny, though. Originally, we did mention quite a few modern data startups. And I would say, overwhelmingly, a lot of those startups, by the time we were about to finish the book, they either got acquired or something happened, and they weren't around anymore. That was interesting. So, you know, we tried to go with, you know, at least either blue chip

or abstracting it away and saying, well, here's an example

of,

you know, x, y, and z technology.

And so no. Caveating it as much as possible saying, you know, maybe when you're talking about technology, we say, maybe, for example,

such and such technology does this and so forth. But we are cautioned by, you know, authors, you know, other authors of technology books that definitely keep the dimensions of specific technologies down to a minimum.

It's gonna date your book automatically.

The other thing I'd mentioned, and we definitely took some inspiration

from designing data intensive applications

for this, is that

we tried to to slice things into some big ideas about where data engineering was going. And so for example, we have a pretty extensive discussion about separating compute and storage. And then I also go on to talk about how in reality, most of these systems are hybridizing, you know, separation and colocation in some ways.

And then we just cite, you know, a whole kind of litany of examples of how different systems do this. You can get the principles of how you can both

separate compute and storage and yet improve performance through things like caching by just talking about how a lot of different technologies do this. And, ideally, you know, if you as new technologies come out, some of them will do new things. In many cases, they'll do old things better. And so, hopefully, a lot of those approaches and ideas are already on that list cited.

Yeah. And to your point about the way that the designing data intensive applications book treats things, 1 of the things that's standing out in my mind is that a specific technology that is used as an example to illustrate a particular

database internals

concept is the React database, which is no longer a thing. It has gone defunct, and it's not available anymore. But the example of what you're trying to illustrate is still valid, Like, regardless of the fact that React isn't a technology that's in use,

it still does the job of showing you this is what this is doing and how. I think that that's definitely a

kind of useful

sort of firewall that you have where you're not trying to be instructive about a technology. You're just using the technology for illustrative purposes. So even if it does go out, you know, cease to be maintained or the company goes out of business, the fact that it was there as a illustrative example

is still valid even 5, 10 years from now or, you know, once that company gets acquired and goes in a different direction. Or

Oh, for sure. I mean, you can pick up designing data intensive applications today, and it's still a great, great book. Really fresh. Yeah. It does. It reads really well. I believe Martin's working on a new edition of it. I can't confirm this,

but but even he's, you know, I think recognizing that, you know, perhaps the book needs, you know, some updating. So we'll see how that turns out. I'm very curious. To cite another example, I mean, we realized from our consulting practice that it's very important to explain to people how columnar databases work. And it's actually shockingly often we see people just abusing columnar databases in various ways, and then they get poor performance and high bills.

But at the same time, to your point, you need to illustrate that in concrete ways. Right? So you need to talk about, like, Parquet files and Snowflake and BigQuery and other technologies, Redshift that are columnar,

so that people can get feel like they're getting a bit more hands on. Because if you just talk about columnar databases abstractly, it's it becomes very hard to understand. Oh, that sounds boring. Yeah.

And

because of the fact that you were using some of these technologies to illustrate different points and you had to go through the exercises of being

sufficiently knowledgeable about them yourselves. I'm curious how you approached that process of

selecting which tools to use as an example, which examples to actually work through, and just validating your assumptions and your instructions

yourselves and how you managed the sort of iterative process of saying, okay. This is what I'm trying to cover. I'm going to have to actually set up this environment and do these operations to make sure that I'm not saying something that's not exactly true or that this isn't behaving the way that I'm saying it is. Yeah. It's definitely a mix of experience and just, I would say, ruthlessly

reading documentation

over and over again just to make sure you're not misstating something. Right? But, again, I think it comes from a lot of experience with these tools as well. And so and observations too, I would say, both great patterns and antipatterns.

So it's a mix of everything. I think observations, experience, and what is it? RTFM.

So

but but, yeah, all these things combined, I think, was how we approached a lot of these examples. I mean, you know, Matt's got a lot of experience with, you know, certain systems. I have a lot of experience with different systems. We often have the same experiences with the same systems. And so, you know, just really drawing on the body of knowledge and talking to engineers too. I think that's another big thing where

we felt like talking with data engineers about their experiences and practices practices was invaluable. As Matt and I say too, like, if 1 of us misses something, the other misses it too. We hang out so often that we can finish each other's sentences, basically. So

the danger in writing a book like this with a co author who's basically your clone is that, you know, again, it's great, but it also forces a certain myopia that you can only break out of by, you know, talking to a lot of other people.

Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale.

Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations

and supported by over 20, 000 community members, Prefect powers over 100, 000, 000 business critical tasks a month. For more information on Prefect, go to dataengineeringpodcast.com/prefect

today.

That's prefect.

In terms of the final results and the shifts that you saw just in the process of writing the book over that year and a half, what are some of the things that you anticipate needing to revisit in a v 2 over the next 2 to 5 years? And what are the pieces that you are confident will remain evergreen and you're not actually going to have to touch?

The basic ideas are going to stay in place. In other words, these stages of the data engineering life cycle, you

data gets generated in source systems, you ingest it, you store it, and so on. I think those core ideas are not going away.

I think what's likely to happen is that the next edition is going to emphasize streaming even more than we already do.

The reason I believe that is that we're already seeing in managed services

is that managed services have made stream processing and real time processing

much easier than it was 5 years ago.

The uptake of that is still kind of a slow process. Like, your heavy hitter tech companies are doing a lot of real time processing.

I think outside of that, it's certainly there, but not as much as it's going to be in 5 years. I think in the next 5 years, what we'll see is a huge increase

in, well, just massive simplification of streaming services

even more than has already happened and then really big uptake.

And so just the amount of ink that we spill on streaming

would need to increase quite a bit and then, you know, talk about some things that really haven't even been defined yet. Like there's kind of this active debate in the data engineering community about how you model streaming data. And so hopefully we, as a community, we'll kind of figure some of those things out and then we'll be able to write about that in the next edition someday.

Sort of the periphery. Right? So on 1 end, you have software engineering. On the other, you have machine learning. I think the intersection between software engineering, data engineering, and machine learning is gonna be very fascinating to watch.

As Matt points out, as real time data becomes more ubiquitous,

what I see happening in the space is

streaming is gonna become,

commodity, just like data warehouses and data lakes were commodity in the 2010s.

Prior to that, those were very expensive multimillion dollar on prem installations. Now it's like, who buys that right now? That same simplification and accessibility

and democratization is gonna happen in streaming,

machine learning. I think a lot of stuff's gonna move back to the application layer for software engineers, which means they may in fact become the data engineers and not the data engineers themselves. They'll still

manage whatever the data engineering life cycle happens to be. I don't see that as a principle going away, but maybe the life cycle shortened dramatically. The feedback loops become a lot shorter too. I think this seems like somewhat of an inevitability. Although every time I say that word,

I should punch myself because

nothing's inevitable. But

at least from where I sit, that seems to be where things are going. So, you know, the things that we probably need to revisit is probably exactly these assumptions. You know, we were talking to Jordan Tagani a couple of weeks ago, 1 of the co creators of BigQuery, and he's talking about small data now and, you know, using DuckDV for everything.

So every time you think you have, you know, a sense of where things are going, you know, I think it pays to have an open mind and, you know, maybe it goes the opposite way. But there's nuggets of truth to everything too. It's not like there's sort of binary outcomes

either. Right? So I think it's

it could involve all the above and to some extent.

Yeah. It's interesting because the data ecosystem is very fractal, and in some ways, to your point of, you know, oh, big data. Oh, no. Now, small data. I I, you know, both are true, and it's somewhat sort of the the redshift of the universe applied to the data ecosystem where everything is expanding and it's moving away at an accelerating space.

So if you started at the point of, you know, inception of, you know, database systems in the 19 seventies and you have remained in the industry till today, you know, you have to pick a direction to move in because you're never gonna be able to encompass the entirety of the ecosystem. And so for somebody who is coming into it today, you know, you have to figure out, you know, what area of that universe you wanna actually live in because you're never gonna be able to go from 1 end to the other within your lifespan.

Oh, yeah. For sure. And the other big thing I think is gonna happen is data modeling is I wouldn't say it ever went away, but it is coming back into vogue to some extent.

But the interesting thing is,

like, a holistic view of data modeling from, like, a streaming,

you know, incorporating things like graph databases, machine learning, and stuff like that. Something that I'm hugely interested in right now. And whether or not, like, it's, you know, revised in the next edition of the book or whether that's a standalone book, I think is still TBD,

but that's an area that I personally feel is that's gonna get a lot more attention as well pretty soon. So because a lot of the techniques we've been talking about, as you point, you know, talking about old stuff, it's like it relates to relationships

within things like structured data

for the most part. Right? But what about all the other data out there, images,

audio, text, everything else? So that's that's fascinating.

I think Joe and I both believe that a new conversation on modeling needs to happen to your point, which is we have these kind of established versions of modeling like Kimpel and Inman, which are really fantastic for what they cover.

But we need to figure out ways to update those for newer technologies like columnar databases where you don't necessarily want to normalize everything. It actually doesn't make sense to do so, and also to cover more types of data and somehow be more holistic about your modeling and extend it beyond the boundaries of just the data warehouse.

In your work of writing this book and doing other research and working with your reviewers and the community, what are some of the most interesting

or innovative or surprising elements of the space that you encountered or that you learned about? Doctor. There's so much

to learn and to know. It was interesting. I was talking to, you know, 1 of my friends. He was, like, the 4th user of Hadoop

in the world. So, like, pretty old school. And I would say he's like the data engineer's data engineer. He reviewed our book and he said it was kinda crazy the amount of stuff you need to know as a data engineer. And so I think that that was

actually a very interesting

you know, some interesting feedback because for Matt and I, it's like, are we covering enough? So maybe we covered more than thought we would. But I think that as you just kept unraveling the strings, so to speak, it was like there's just more and more to it. And I don't even think that we covered, you know, to a level of detail, what is possible out there or what's available.

There's a lot. Well, I'll say too, I mean, that yeah. The conversation with Jordan Tigani kinda genuinely surprised me because

after so many conversations,

we saw a lot of iterations on existing ideas and various incremental improvements.

But this idea of, like, having a hybrid sort of traditional database with an edge type web app that queries data kind of like that's the standard for mobile apps now, for example. Right? Back end that lives in the cloud with a lot of local processing.

I'm like, okay. That's a genuinely

cool new idea. I mean, you're borrowing from the domain of web development, but that's a really cool new thing. It's all I'll be curious to see what comes out of that of that in terms of evolution.

Yeah. But once in a while, like, you talk to a lot of people and hear a lot of cool stuff, and then once in a while, something just, like, really pops for you. And you're like, I hadn't thought of that before. The fun part is just talking to people. And we talked to hundreds of people, you know, experts in this space, people that you know of. And I think that was probably the most fun, was just getting to know these people and their, you know, their accomplishments and their thoughts in the field, and also just their backgrounds, you know, and kinda what makes them tick. Like, I think data engineering is as much about the technologies as about the the players and the the personalities in the space. I think that's what makes it pretty cool. Like, people like yourself, right? I've listened to your podcast for ages, and now we're talking on your podcasting. That's really cool. You know, and last week, you know, us doing our Flip the interview and understanding what makes you tick. I think that was a very, very awesome experience I remember for a long time. And just, you know, understanding, like, what is it that drives the space forward? That's the most surprising things too is talking to people who have invented a lot of the technologies that we take for granted these days. Right? But understanding, okay. So, like, what what's the motivation to create this? Like, why do this? Why not use something off the shelf? Or why not take a different approach? But that was the approach that they took and, obviously, it worked really well. So I think that's a cool thing because it highlights just the important principles of technology in general. The field is never static. People are always trying to solve new problems. And to me, that was, I think, the most interesting and inspiring part of writing this book, which is, again, getting to know the personalities that we all know and, you know, have helped drive this space forward. So it was cool.

Yeah. And to the point of always solving new problems and its intersection with the law of conservation of complexity is, you know, how many of the problems that you're solving are ones that you created by the previous solution?

Conservation complexity. That's awesome. Yeah.

That's interesting. Yeah. Because I feel like

a lot was accomplished in the Hadoop era

in terms of suddenly being able to scale systems much larger without spending, you know, $100, 000, 000.

But we seem to build a lot of complexity in that era too, and some of that wasn't necessary and just ended up being a lot of tech debt in the long term. And so I guess the question is, what are we doing right now that's gonna cause, like, massive pain in 2 years? It's just Well, I mean, you see people aren't complaining about the modern data set causing that kind of pain. So I think people just like to complain too.

Yeah. We love complaining. It's fun. It's like it rolling. Things are good enough. You know? You always have to complain about it. If you complain about it loud enough, it'll justify all the time you spend on trying to build the next new thing to solve the thing you're complaining

about. It's how you raise your next round of funding. Exactly.

And so in the process of writing the book and actually doing the authoring and the research, what are the most interesting or unexpected or challenging lessons that you each learned personally in the process?

We're talking with Justin Borgman today from Starburst on 1 of our shows about this, and I think it was

writing a book and running a business was probably a really dumb idea. So and I think that, you know, that was just a lot of work, but we managed to get it done. And I think just how much you learn when writing a book. This is our first book. But it definitely pushes you to boundaries that I would say you will never get

otherwise. And so by writing, it's really a good way of refining your thinking. And I say refining your research as well around ideas.

And then finally, I think it was also interesting because as we were finishing the book, we started getting a lot of good feedback on it. And then we started noticing on places like Amazon, it was already the number 1 new release, and this is, like, while we were writing it. And so you wanna talk about having pressure to finish

strong.

Like, it's hard enough finishing on its own, but when you get these external validations of your book when it's not even done, I think that keeping focused

got a lot harder for me personally just to say, like, it has to be really good. Like, you can't screw this up. So hopefully, the audience likes it.

We'll see. I would say too, I mean, it turned out to be more fun than I thought it would be to write this book. Mhmm. Certainly, it was challenging, and, you know, Joe and I had occasional disagreements and such. Just under a lot of pressure in trying to get things done and hit deadlines. Odd couple.

But it turned out to be surprisingly fun, and I think

Joe has been tempted into maybe starting another project sometimes. And we'll see if that happens, and I'm not quite ready for that yet. I need to let this percolate for a bit before I start Yeah. Reminds me of, like, when we,

you know, when I had my first kid, it was kinda like, yeah. We'll

probably not do that for a while. And, you know, time heals all then you're like, oh, yeah, let's have another kid. And I feel like that's what books are like, where you're just like, you get out of it and you're just

thinking, oh, I'm never gonna do that again. That's that's terrible.

But what happens is that you get into book mode, right, and you get into thinking mode. I think you start seeing things differently too. Like, the person you are before the book is not the person you are after. Like,

you see problems differently. You kind of see the industry differently, and then you wanna probably write more about it if you're a masochist.

So that's who we are.

And so digging now into

the predictions that you have for the future of data engineering, You wrote a bit about that in the book, and I'm wondering if you can talk to some of the

themes that came out there and maybe some of the ways that you're thinking about it has evolved since you've hit save on that last version of that section of the book? Yeah. I mean, we kind of alluded to this with what we call the live data stack. So that's, you know, sort of the shortening of the feedback loop between,

you know, applications,

you know, and machine learning, you know, embedded analytics and so forth. I think that's gonna be a real thing, Some variation of that. Maybe it's not called the live data stack, but something resembling that I think is

is gonna happen. We also talk about data engineering is becoming more, quote, enterprise y. And what we mean by this is

as tooling becomes more abstract, data engineers' focus is gonna shift from maintaining tools

to maintaining processes,

management,

you know, all the stuff that was once considered sort of, you know, reserved for large enterprises and so forth. That's trickling down. You can see it. Observability, that wasn't talked about a few years ago. DataOps,

quality,

all these, like, quote, boring subjects are now, like, you know, the hottest thing that's going on in data engineering. And I think finally you know, it's kinda weird saying this, but I think there's gonna be a lot of attention put on spreadsheets. Like, spreadsheets are

sort of the dark matter of the data universe.

It's amazing that we spend all this time talking about tooling around data

warehouses

and all this other stuff. When you just go out into the world, like people are using spreadsheets

way, way, way, way more

than these, you know, fancy data systems that we all use on the show.

To me, that's 1 of the uncharted areas for data is actually

tooling and better practices around spreadsheets and incorporating those probably back into some sort of infrastructure. But,

well, there's a 2, 000, 000, 000, 2 point something 1000000000 spreadsheet users out in the world. I'll add. I mean, we talk about that specific theme in the book, and it may not actually be spreadsheets that are the future of this area, but what I'll call it is like a new way for people to interface with data. Yeah. If you think about it, when did the spreadsheet revolution happen? Well, it happened with Visa Calc on the Apple 2, and that alone was enough to boost the Apple 2 into success. Like, that alone became a huge driver of the Apple 2 as a business platform.

And spreadsheets have evolved a lot since then, but I think there needs to be a next step of some sort.

Maybe we completely rethink the way that people interface

with data, but it feels like with modern spreadsheets and with dashboarding tools, we're not quite there yet. There's another potential step we can take

as, you know, data engineers, analytics engineers

to make data much more accessible

to people who actually need to use it like CFOs and, like, business managers. Because there's huge disconnect. Every time we see exec teams, they're all using spreadsheets. Yep. That may have been populated from the data warehouse. I think that's the entire point is there's just a huge data divide right now between

where data and decisions are getting made and, you know, the type of work that data professionals do.

Oftentimes, these are not the same thing. So that's the other, you know, quote prediction for the future of, data engineering is it becomes less exciting, I suppose, for some data engineers. It's but, you know, spreadsheets are kinda like the business mullet of the world. So Going back to what Joe was saying about the live data stack, I think, again, that's gonna be driven by improvements in managed services. I think what we've seen right now is that if you're on, like, AWS or Google Cloud Platform,

they have really, really nice tooling for managing real time data that was not mature like 5 to 10 years ago. So for example, you just go in and you turn on Amazon Kinesis data streams,

and then you can connect that to what is it Kinesis analytics, and there you go. You have, like, a real time data analytics platform. But there's still this last mile problem where getting the data from your application database streaming into Kinesis

is still tough, and the same is true on Google Cloud Platform. And so I think

in that managed service domain, something is gonna happen where in the future it will be almost turnkey, where I can turn on Postgres, and then I have streaming change data capture

right there. And it's just that means that I, as a product manager, whether I'm an application product manager or specifically a data product manager,

suddenly I can start thinking from the get go about what my analytics are gonna look like for this application. Like, hey. I have an application, but also I can have embedded analytics for my SaaS platform users from the get go.

And maybe I can have, you know, 1 or 2 data engineers on my team to get that up and running rather than having to have, like, 10 to run Kafka,

you know, spin it up and manage it myself. So it's gonna be really interesting to watch that evolution.

And another evolution I'll say is that what we've seen in the data engineering space, even though we complain a lot about people getting centered on technologies and products,

certain technologies have just, like, revolutionized the space very quickly. And that's hard to predict, but, like, something new will come down the pipe in a year or 2. And it's gonna be the hot new thing, and it's gonna change the way that people look at data in a way that we predict. So curious to see what that is. Excel. It it's it's just Excel 3 60 1.

Yeah. To your point about spreadsheets, we're we're already starting to see that. So I had

the founder of the company Canvas on a little while ago where that's their whole product is spreadsheets for the, quote, unquote, modern data stack where it does hook into your DBT workflow, it hooks into your data warehouse, and it's just this spreadsheet interface for

product owners and, you know, business users to be able to actually explore the data that you're working with in an interface that is familiar to them,

but still having that escape hatch for data engineers and analytics engineers to be able to actually provide some useful guardrails so that you don't end up with this, you know, gross

spreadsheet with 5, 000 formulas that don't necessarily tell you what you think they're telling you. Oh, yeah. For sure. And I was writing in our newsletter last Thursday, predicting the future of technology is more about analyzing where the pendulum is

sort of along the continuum.

And it seems to go from 1 extreme to another. Right? And so, whereas I would say several years ago, the pendulum was

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Links