Summary
Data engineering is a large and growing subject, with new technologies, specializations, and "best practices" emerging at an accelerating pace. This podcast does its best to explore this fractal ecosystem, and has been at it for the past 5+ years. In this episode Joe Reis, founder of Ternary Data and co-author of "Fundamentals of Data Engineering", turns the tables and interviews the host, Tobias Macey, about his journey into podcasting, how he runs the show behind the scenes, and the other things that occupy his time.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today we’re flipping the script. Joe Reis of Ternary Data will be interviewing me about my time as the host of this show and my perspectives on the data ecosystem
Interview
- Introduction
- How did you get involved in the area of data management?
- Now I’ll hand it off to Joe…
Joe’s Notes
- You do a lot of podcasts. Why? Podcast.__init__ started in 2015, and your first episode of the Data Engineering Podcast was published January 14, 2017. Walk us through the start of these podcasts.
- Why not a data science podcast? Why DE?
- You’ve published 306 episodes of the Data Engineering Podcast, plus 370 for Podcast.__init__, and now you’ve got a new ML podcast. How have you kept the motivation over the years?
- What’s the process for the show (finding guests, topics, etc….recording, publishing)? It’s a lot of work. Walk us through this process.
- You’ve done a ton of shows and have a lot of context with what’s going on in the field of both data engineering and Python. What have been some of the major evolutions of topics you’ve covered?
- What’s been the most counterintuitive show or interesting thing you’ve learned while producing the show?
- How do you keep current with the data engineering landscape?
- You’ve got a very unique perspective on data engineering, having interviewed countless top people in the field. What are the big trends you see in data engineering over the next 3 years?
- What do you do besides podcasting? Is this your only gig, or do you do other work?
- What’s next?
Contact Info
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Podcast.__init__
- The Machine Learning Podcast
- Ternary Data
- Fundamentals of Data Engineering book (affiliate link)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey, and today we're flipping the script. I've got Joe Reis of Ternary Data who's gonna be interviewing me about my time as the host of this show and my perspectives on the data ecosystem. So, Joe, can you start by introducing yourself? Hey. I'm Joe Reis, recovering data scientist,
[00:02:08] Unknown:
co-founder and CEO of Ternary Data, and author of Fundamentals of Data Engineering. And do you remember how you first got started working in data? I've only worked in data for over 20 years now. Yeah. So
[00:02:20] Unknown:
in some capacity or another, data's always been something I've done. So this time around, you're actually gonna be interviewing me, so I will hand the reins off to you and take it away.
[00:02:30] Unknown:
Cool. Well, your host is Joe Reis, and I'm here to interview Tobias Macey today. So welcome, Tobias, to your own show. So like we kinda joked earlier, you know, long time listener, first time caller. So thanks for having me on to, I guess, flip the script. You know, to kinda get things kicked off here, I mean, you do a lot of podcasts. I think Podcast.__init__ started back in 2015. The Data Engineering Podcast was, you know, January 2017, and now you got a new machine learning podcast. I think you're 2 episodes into that. Walk us through kind of the origins of each of these podcasts. Like, why did you get started in podcasting?
[00:03:02] Unknown:
It started really because I was waiting for somebody else to do it, and nobody did. So I said, okay. Fine. But going back a little further, I had been listening to podcasts for a number of years. I actually got started working in the tech field around the same time as I was going to school for computer engineering. So while I was getting my degree, I was starting to listen to podcasts about different areas in tech. Started working in the field as a sysadmin back in, I wanna say, the 2009, 2010 time frame. And so podcasts were a great way for me to learn more about the space, gain new skills, understand context about the things that I was working on. And I don't remember exactly when I really started getting involved with the Python language, but when I did start using it kind of in anger for my day to day work, it really kind of stuck with me. So my actual first experience was deploying and managing Ruby on Rails applications in my sysadmin job.
And so I had actually gotten exposed to a lot of Ruby work. I had done some work in Ruby. And then there was 1 occasion where I was just doing a simple admin interface on top of a database, and there was a project called, blanking on the name, but it was some sort of Ruby admin library. And I did some work with it. It didn't really do what I wanted it to do. And so then I just said, oh, well, I hear this Django thing has a good admin interface. I'll give it a shot. Set up a simple Django app, started using the admin there. And a lot of what I had learned about Rails translated really easily to Django because they're both MVC apps, and so that's when I really started getting into Python.
The quote that a lot of people say is that it fit my brain, and so I really got involved in that. That became my default tool chain. And so I listened through all of the backlog of all of the different Python podcasts that I could find, and none of them were producing any new episodes. And I think at the time that I finished listening through all the back catalogs, the newest episode on any of them was maybe a year or 2 old. So I was waiting for somebody to start a show focused on Python. In the meantime, I had been listening a lot to another Ruby language podcast just because it covered a lot of general software engineering topics, and I was hoping that there would be something to fill that gap for Python. And eventually, I gave up and said, okay. Well, nobody else is doing it. I guess it has to be me.
And so I started doing the groundwork of getting set up to run the podcast. Actually started it up with a cohost because I wanted to do more of a panel style show. So my cohost was with the show for maybe the first 2 years or so of the podcast. But the funny thing is that 2 days before I published my first episode of Podcast.__init__, Mike Kennedy published his first episode of Talk Python To Me. So the show that I had been waiting for for so long that I eventually gave up and said I have to start my own show actually came at the same time that mine did. So we've actually been running our podcasts in parallel for the same amount of time. Got started in podcasting just because nobody else would do it for me. That's interesting you mentioned that too. I've been listening to podcasts since the 2000s as well, and there weren't a lot of podcasts in general,
[00:06:14] Unknown:
I think, let alone very tech related podcasts, you know. And so yeah. I mean, good on you for, I guess, taking the reins of it because that's about what you had to do if you wanted to get a podcast. Apparently, Michael Kennedy also felt the same way. But, I mean, they're both great podcasts. So what is it? Talk Python To Me? Yep. Yeah. I mean, I listened to both of them. Fantastic. So so thanks. And then data engineering, you know, that was a couple years later. What happened between 2015 and 2017?
[00:06:38] Unknown:
Yeah. So I had been running the podcast for a couple of years. It was seeing some success. I managed to pick up my first sponsor and then a few other sponsors. So it was worth the time and effort that I was putting into it because of that fact. And my wife actually has been helping me all the way along, and she had been doing a lot of editing on Podcast.__init__. And she actually suggested, because I was doing well with Podcast.__init__ and I seemed to be enjoying it, why don't you try another show? And so I thought about it for a while and said, okay. Well, what else is there that isn't really being addressed?
And in the 2017 time frame, I had been working for a while for a company where I was actually doing a decent amount of data engineering work. My actual job title, I think, was maybe DevOps or back end engineer or something like that. But I was effectively building an event stream application for loading data into BigQuery, which, that's a whole rant aside right there about why that was the wrong tool for the job, but it came down as a mandate. The project was already started when I joined, and they said, okay. Build this thing. So I made it work. But the point is, I was doing a lot of work in data.
Data science was gaining a lot of attention and mindshare. It was, you know, the hottest job of the 21st century, and so there were maybe a dozen or so podcasts at least that I knew about that focused on data science, but nothing that I knew about or could find that touched on data engineering. And so similar to the situation with Python where there weren't any shows addressing it, I said, okay. Well, I guess I'll start a show about data engineering so that somebody's talking about it and so that I can learn more about it, which, really, that's 1 of the big motivators for me across all of my podcasts is that I treat myself as the proxy for the audience where it's something that I say, if I wanna learn about something, then I go out and I find somebody who is an expert in the field or who built the project that I wanna use, and I try to get them on the show so I can learn more about it. And so that's a big part of how I've shaped the overall kind of tenor of all of my shows is using myself as a proxy for the audience to say, is this something that I find interesting that as an engineer, is this something that I find useful as an engineer? Like, is this a show that I would wanna listen to if I weren't the host? And so when I started the data engineering podcast, like I said, there weren't really any shows focused on data engineering. There were tons about data science, but none that talked about the actual, you know, the grunt work and the technologies and all of the complexity that goes into getting the data ready for the data scientists because that was around the time that there was really starting to be that tipping point of companies hiring data scientists and realizing that just throwing a data scientist at the problem isn't gonna fix anything for them. And, actually, the data scientists were getting burnt out because they had to do all the data engineering work. 
And I think it was maybe around the same time that Maxime Beauchemin was publishing his articles about the rise and fall of the data engineer, and he was actually 1 of the first guests that I had on the podcast to talk about that. And so it was just a good time where data engineering was nascent. It was emerging as an actual job description.
It was something that people were starting to focus on and, you know, time has kind of proven me right in that it is an actual worthwhile subject material, where the past 5 years has seen a massive explosion in the amount of attention and investment that has gone into data engineering as a job description and as an industry category.
[00:10:00] Unknown:
Those are really good insights. I kinda wanna dive into this a bit because it seems like you have really good instincts on timing as well as to what's potentially gonna be hot. Right? So Python back in 2015, for example. For the audience, you gotta remember, Python wasn't always the most widely used language. In fact, in the early days, it was considered kind of a toy. Right? So in the 2000s and 2010s, you know, if you were in data, the language was definitely not gonna be Python. But, you know, I think it was around 2015 that it really started kinda hitting an apex. I started a Python meetup here in Salt Lake where I live. And it felt like around 2014, 2015 is when things really started to take off with Python.
But then fast forward a couple years to data engineering. At that time, you know, it was coming onto everyone's radar, but I think it seemed like a very contrarian bet at that time. The rational thing to do or the popular thing to do would have been to start a data science podcast, certainly not a data engineering podcast. Do you feel like you have good instincts, are able to keep a pulse on sort of when trends are about to surface
[00:11:07] Unknown:
in technology and data? I don't think it's really even just a matter of trying to focus on the trends. It's really just going back to using myself as that proxy about what is it that I find interesting? What do I wanna learn about? I'm not really aiming for something that says, okay. What is the best way that I can get to, you know, the front page of Hacker News or the top of the Apple Podcasts list or whatever it is. It's what are the ways that I can further my own understanding, learn in ways that are useful to me in my career, and how can I benefit other people in the process? So I'm not necessarily looking for what's the splashiest, what's the fastest way to gain popularity.
I'm looking for what is the thing that needs to be understood.
[00:11:47] Unknown:
Yeah. I guess you would have started a web 3 or 4 podcast at this point. So
[00:11:51] Unknown:
Yeah. Bitcoin podcast. Yeah. And going back to the Python podcast too, I didn't start it because of anything having to do with the rise in popularity of it as a data science language. At that time, my area of focus was actually still more in the web application space. I was building Django and Flask applications. That was where I was kind of working at the time. And so my focus was just on how do I learn more about this language and the community that I'm investing in, and how do I understand more about some of the, you know, trends that are happening there? How do I learn about the tools that I am using for my own day job? And then it was just kind of after that, by happy coincidence, that Python continued to be, you know, on the rise and kind of catapulted to the forefront of data science and the data engineering ecosystem. They're kind of intertwined in a lot of ways too. Absolutely. Whereas Python for web dev, I mean, it's still going strong, but, you know, the attraction's definitely all data, at least where I sit. Yeah. It's funny because it used to be, you know, when I first started the show, all of the newsletters were about, you know, oh, here's the latest Django plug in. Here's the latest, you know, Celery or whatever.
Now every time you look at a Python newsletter, it's here's another, you know, plugin for PyTorch or a TensorFlow library or a new machine learning framework or something along those lines.
[00:13:04] Unknown:
That's awesome. So so you published, like, 306 episodes of the Data Engineering Podcast, 370 episodes of Podcast.__init__, then you got a new machine learning podcast. How do you find the time to do all this? And then walk us through your process of creating shows.
[00:13:20] Unknown:
So as far as finding the time, a little bit of it is just maybe an overactive work ethic. But I'm fortunate to have a job that supports me in the work that I do with the podcast. So I don't have to do everything at late hours of the night. I'm able to schedule both my primary work and the podcasts to be able to work together, so I don't have to kind of fight that scheduling aspect. And, also, I've just gotten a lot of systems in place that make it reasonable. So, actually, recently, as I was getting ready to launch the machine learning podcast and trying to make that final decision of, is this something that I want to do? Is this something that I have time to do? I started tracking my time about how much time am I actually spending each week on getting the podcasts out. And with 1 episode a week of Podcast.__init__ and 2 episodes a week of data engineering, I was spending about 10 hours a week on podcasts.
And so that's a manageable amount of time. And that was also actually while I was doing some of the prep work to get the machine learning podcast up and running. So 10 hours a week across all 3 podcasts. So it's a manageable amount of time, and it's a lot less than I used to spend when I was doing consulting. So when I was early in my career, I was working as a software engineer and then a DevOps engineer, and I was also doing consulting work on the side because I needed to make some extra income. It was a way that I was able to also improve my skill set, learn more about different areas that I might not work on in my day job. And so there was maybe an entire year actually where I was spending 30 plus hours a week doing consulting on top of working full time. And also going back to when I was getting my degree and starting to work in tech, that was another situation where I was working full time at a job and going to school full time. So as long as I can think of, really, I've been doing kind of at least 2 things at once. So it just seems kind of natural that I have something that I'm doing in addition to my day job that's not just sitting back and relaxing. I don't really know that I would know what to do with my time if that was all I was doing. You can make a podcast about relaxing, I guess.
[00:15:29] Unknown:
So scratch your own itch, I guess. I do podcasts as well. And I know that when I started out, I made a ton of mistakes, whether it's, like, just crappy audio for whatever reason. You know, in my case, the shows are sometimes live. So cameras, you know, getting all screwed up, everything in between. Walk me through some of the, I guess, maybe the early regrets that you ran into doing podcasts. And especially back in the day when this definitely wasn't, like, a popular medium like it is now?
[00:15:55] Unknown:
Yeah. So definitely lots of mistakes. So, fortunately, I made most of them in the early days of Podcast.__init__. So by the time I got around to doing the data engineering podcast and now the machine learning podcast, I had worked through a lot of the kinks, built up the processes. But, yeah, I mean, early on, I had gotten, you know, 1 of the cheapest mics I could get that was reasonable, but I didn't actually set it up properly. So I had it plugged in. I thought I was recording with the mic instead of my laptop mic, but for the first few episodes, it was still just my laptop mic, and so the audio was tinny. I actually went back maybe a year or 2 ago to try and listen to some of the audio of the very first interview I did on Podcast.__init__, and just the delivery was so forced and awkward.
It was very painful to try and listen to it. And I apologize to Thomas Hatch, who was the first guest on that show. He was very gracious to take his time to help me with that. But yeah. Was it like a Between Two Ferns episode, or, like, what was it? No. I was too scripted in the layout. So, you know, I had all the questions laid out, and I was just very mechanical. I hadn't really found my voice, found my cadence, you know, figured out how to actually be an interviewer. I was an engineer. I was just trying to talk to another engineer, but I didn't really know how to actually make it sound or seem natural or figure out the flow. So that's definitely 1 of the things that, beyond just helping with the success of the podcast, has been beneficial for my own purposes in my own life: figuring out how do I actually manage a conversation.
[00:17:20] Unknown:
It's really subtle. Right? I think I ran into the same things too when I started podcasting, where you have this idea of what a podcast should sound like, but then when you sit down to record with another person, it's a much different story, and you just gotta do it. Absolutely. So you mentioned audio, kind of forcing a script.
[00:17:36] Unknown:
What are some other areas that you found that you could improve your podcasting game? 1 of the things that I definitely built up is figuring out, like, what are some of the rough edges that you can smooth over in the guest experience? Like, how can you make it easy for the guests to say yes? How do you make it easy for the guests to participate and not have to invest a lot of their own time? Just kind of show up, talk for an hour, and be done with it. So just that overall process of helping to set the expectations. So at the beginning, it was just a lot of, you know, typing the same email or variations of the same email over and over again. And so then saying, oh, well, actually, I can just copy and paste this template and then just fill in these couple of fields, and I'm good to go. So how do I kind of scale my own time without having to really think through a lot of the processes? How do I make it automatic? How do I help the guests?
So things like scheduling tools, I've gone through, at this point, probably 5 or 6 different scheduling tools. Couple of them, I switched over because the 1 I was using actually got acquired or shut down. A couple of them, I switched over because it just was clunky and not really easy to use. So the tool I use now is actually something called SavvyCal, and it's been great. So it actually lets me say, you know, here's a link. I just sent it to somebody, and they can pick a time that fits their schedule. I can set a limit on the number of times that somebody can set a particular event type in my calendar within a given week or on a given day. So I can say, for data engineering, I'm only gonna allow up to 3 interviews per week, only 1 per day. So I don't have to worry about my schedule getting completely overloaded with the podcasts. And so then I can just say, okay. I need to get something scheduled. Here's a link. You pick your time. I don't have to worry about it. It just shows up on my calendar, and I show up and get it done.
Another thing that I've really kind of figured out is, you know, what are the interesting questions to ask? How do I figure out the flow of the podcast? So that's something that's definitely just kind of come about through the typical kind of programmer exercise of do something enough times until you figure out what is the common abstraction. So there's just a pattern that has kind of fallen out of the podcast that works well where, you know, I start off with, you know, who are you? What is this thing we're talking about? Why does it matter? Why does anybody care? How does it work? How do I use it? And, you know, what are some of the things that you learned in the process? That's kind of the general story arc that I've fallen into, and it fits pretty much everything that, you know, I've talked about on the show. And it's easy to adapt to specific topics or specific tools or specific guests. It's just, yeah, just building up the systems, both mentally and in terms of the actual, you know, processes and tools that I've got. You know, definitely, constantly room for improvement. You know, I'm actually in the process of starting to tinker with and design a specific web application that will help with automating more of those pieces and manage some of the hosting aspects. But,
[00:20:28] Unknown:
again, you know, too many spinning plates, and maybe I'll get to it. Maybe I won't. If you ever need a beta tester, let me know. So, yeah, it is 1 of these things where I think the perception of podcasting is that, you know, you just get a couple people together and you hit record. But as you say, there's a lot that goes on behind the scenes: scheduling, logistics, you know, booking, rebooking, all this stuff. Right? The thing I like about the data engineering podcast in particular is that when I listen to episodes, there's definitely a formula to it, and it's predictable. But the thing is you do a really good job of balancing kind of the guardrails you put in place with the technical acumen. Like, your questions, you know, we'll get into that a bit, but, you know, the questions are very, you know, technically astute. So it's not like you're just saying, hey, tell me about x, y, and z. Like, why is that important? It seems like you really have a knack for diving into very technically detailed types of questions that wouldn't seem obvious, I suppose, for someone listening to that stuff for the first time. What I mean by that is, like, you know, I was listening to, actually, today's podcast. Maintain Your Data Engineers' Sanity By Embracing Automation was the 1 I listened to here. But the types of questions you're asking, I mean, it had both historical and technical context that I think was pretty impressive given the amount of interviews that you do. And so I guess, you know, kinda given that notion, how do you stay on top of the developments in data engineering?
[00:21:50] Unknown:
There are kind of a few different aspects to it. So 1 of the things that helps is, you know, I've been working in the engineering space for, what, 11 years now? 12 years? Something like that. And the combination of working in the field, getting my degree in the field, and doing a lot of consulting work as well gives me a very wide breadth of understanding, of getting deep into the actual nitty gritty and bugs and challenges of doing the engineering work, and then just also reading about and learning about what are the, I guess, fundamental lessons that hold true across different contexts and technologies and use cases, and not spending as much of my focus on what is the latest shiny tool. You know, what are the actual things that are true no matter which tool you're using is really the most important thing for any engineer to actually gain understanding of as they progress in their career from junior dev to mid level. I think that that's definitely 1 of the hallmarks of being able to call yourself a senior engineer is understanding those fundamentals. And so both in my actual day job and my engineering work as somebody who's working in the field and also in my work of trying to understand the context and the technologies of what I'm interviewing people about.
I look to, you know, what are the foundational aspects of these technologies? So that way, I don't have to try and relearn or learn at a high level everything that I'm talking about because, you know, whether I'm talking about the latest data quality tool or distributed systems technologies or a database engine or a front end framework, there are elements of those technologies that are going to be the same no matter what, and there are elements of working with those systems that are gonna be the same no matter what. And so because I've gained that context and gained that understanding both through a lot of conversations, a lot of self study, a lot of work, you know, hands on keyboard work, I've just gained a useful intuition into, you know, how do things work. And so going back to the data engineering podcast, and when I started it, I used myself as a proxy for the audience of this is something I wanna learn about. So I had been doing work with data, but even today, I still don't really call myself a data engineer. I mean, I know a lot about the space, but there's always also that aspect of impostor syndrome of, you know, yes, I know a lot about data and data engineering, but I'm not the person who first built Kafka or whatever it is. But when I first started the data engineering podcast, I was mostly coming from the background of a sysadmin and software engineer and starting to figure out what is data engineering, how does it work, how does this play into the broader technological ecosystem.
And so I just learned a lot by asking people questions and trying to, you know, prepare for shows. So another thing that goes into kind of the behind the scenes aspect of running the podcast is whenever I run an interview, I actually prepare a list of questions beforehand. So I don't just come into it blind and say, okay. Here's a topic. Let's see what happens. I actually say, okay. Here's the list of questions. These are the things I wanna talk about and understand. I send that to the guest so that they have an opportunity to say, okay. Well, that question actually doesn't make sense to talk about, but here's something over here that might be more interesting. These days, there's not as much of that because I've gained that intuition of, you know, what are the interesting things to discuss.
Even to this day, I'm still a little surprised when guests come on and say, oh, I'm, you know, really impressed with the questions that you asked, you know, the level of detail of these questions, because, again, you know, imposter syndrome. I always feel like I'm still a neophyte, but just through a lot of work in the trenches, study, and repetition, I've gained that kind of understanding of how things work, you know, at the foundational levels, to be able to go from whatever we're talking about at the time and then say, okay. Well, let's play that back to either some of my own experience or a past episode or a specific
[00:25:48] Unknown:
area of curiosity that I have to say, okay. Well, what about this aspect of whatever it is that you're building? That's really interesting. And, you know, kind of going back to 2017, it wasn't like there was a lot of information on data engineering at the time. Right? It was still, I think, a new term and a discipline that was still forming. You talk a lot about building what I'll call mental models. Right? Mental models around data engineering. I think it'd be beneficial for the audience to really understand what are the key components of the data engineering mental model that you've constructed.
[00:26:20] Unknown:
Oh, man. That is a deep and broad topic. So for the next 5 hours, yeah, we'll be discussing. Right. Let me give my PhD dissertation. No. I think the critical things to understand, really, the things that everything kind of boils down to, are: something needs to be stored somewhere. There needs to be some information about what you're storing. You need to understand why you're storing it and how, you need to understand what has been done to it or what do I need to do to it, and then who is going to use it at the end of the day. Just kind of thinking on my feet here, I think those are kind of the core principles.
And then from there, you can dig into, you know, more detailed aspects of, okay, well, today I'm talking about Spark. Well, okay, well, when do I wanna use Spark Structured Streaming? When do I want to use Spark in batch mode? When do I wanna use Spark SQL? Why do I wanna use Spark SQL instead of Snowflake? Well, what about Trino? How do things like the Iceberg table format play between Spark systems and Trino systems? So there's the fundamental aspect of what are the building blocks, and then the high level aspect of what are all the pieces that are in the ecosystem and, you know, what are the roles that they play? And that piece has really just come about from talking to people about it for the past 5 years, and for the past year plus, it's been at least twice a week. So just a lot of conversations.
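As an aside for readers, the core facets listed in this answer (something is stored somewhere, there is information about what it is, a reason and method for storing it, a record of what has been done to it, and a set of consumers) can be sketched as a simple record type. This is purely illustrative; the class and field names below are invented for this sketch and are not from any tool discussed in the episode:

```python
from dataclasses import dataclass, field

# Hypothetical record capturing the five facets named in the conversation:
# where the data lives, what it is, why it's kept, what's been done to it,
# and who uses it at the end of the day.
@dataclass
class DatasetRecord:
    location: str                                  # something stored somewhere
    schema: dict                                   # information about what you're storing
    purpose: str                                   # why you're storing it, and how
    lineage: list = field(default_factory=list)    # what has been done to it
    consumers: list = field(default_factory=list)  # who is going to use it

orders = DatasetRecord(
    location="s3://warehouse/orders/",
    schema={"order_id": "string", "amount": "decimal"},
    purpose="daily revenue reporting",
    lineage=["ingested from app DB", "deduplicated", "currency-normalized"],
    consumers=["analytics team", "finance dashboard"],
)
print(orders.consumers[0])  # -> analytics team
```

Real metadata catalogs, like the ones discussed later in the episode, track far more than this, but the five facets make a reasonable mental checklist.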
[00:27:48] Unknown:
I find it interesting. It's kind of standing on the shoulders of giants, to use a cliche, but there is a lot to be said for, you know, talking to experts often. I know my personal experience podcasting has meant, like, it's been an accelerated learning curve. There's no way you could get these insights, I would say, you know, just reading articles or blogs. You might. But, you know, just talking to the people themselves who have built a lot of these systems, it's hard to replace that knowledge. And it's also hard, you know, without doing that, to ask questions of these people as they come up. So I can definitely empathize with that, kind of learning through osmosis in a weird way. Yeah.
[00:28:21] Unknown:
And in some ways too, I'm kind of starting it all over again with the machine learning podcast because I don't have a strong mathematical background. I mean, I understand mathematics. I understand the sort of mental models that go along with it. I understand conceptually what machine learning is doing. But if you asked me to say, okay. Well, you know, write out the equation for doing a logistic regression on a, you know, set of data. I wouldn't even know where to start. So the Machine Learning podcast is another opportunity for me to kind of jump in head first, dive into the deep end of an area that I know precious little about but find interesting and valuable. And using myself as the proxy for the audience, figure out how do I explore and find my way through this space. And so that's, I think, a big part of why I've been successful, especially in the data engineering podcast and hopefully in the machine learning podcast going forward is that I'm not starting from that space of, I am an absolute expert. I built all these systems from the ground up. I have a PhD in distributed systems theory. It's just, I'm an engineer.
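For readers who, like the host, wouldn't know where to start: the logistic regression he alludes to is compact enough to write down. The model estimates P(y = 1 | x) as sigmoid(w·x + b), and the weights are fit by gradient descent on the log loss. Here is a toy, dependency-free sketch (an illustration added for this write-up, not something from the episode):

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """P(y=1 | x) = sigmoid(w . x + b) for one feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def gradient_step(w, b, X, y, lr=0.1):
    """One batch gradient-descent step on the log loss."""
    n = len(X)
    errors = [predict(w, b, x) - yi for x, yi in zip(X, y)]
    w = [wi - lr * sum(e * x[i] for e, x in zip(errors, X)) / n
         for i, wi in enumerate(w)]
    b = b - lr * sum(errors) / n
    return w, b

# Toy data: the label is 1 whenever the single feature is positive.
X, y = [[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1]
w, b = [0.0], 0.0
for _ in range(1000):
    w, b = gradient_step(w, b, X, y)
print(predict(w, b, [2.0]) > 0.9)  # the fitted model is confident on class 1
```

The point of the sketch is that the equation itself is small; the hard-won expertise is in knowing when this model is the right tool, which is exactly the pragmatic angle described next.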
[00:29:23] Unknown:
I wanna figure this out, and I'm taking everybody else along for the ride. It's gonna be really interesting. I think you're approaching it from, I think, a really mindful place too. On the topic of the machine learning podcast. Like, what are gonna be some of the major differences between that podcast and other machine learning podcasts out there? I think the main thing that I'm trying to bring to the table is
[00:29:43] Unknown:
pragmatism and practicality, which I think is a big hallmark of what I've been able to bring to the data engineering podcast. It's just, you know, again, I don't have a PhD. I'm not an expert in everything that I'm talking about. I'm just somebody with a lot of curiosity and enough understanding to be able to ask the questions. And so in the machine learning space, you know, a lot of the podcasts that are out there are very focused on theoreticians or data scientists or people who are already experts in a lot of the mathematical concepts that go into it. I'm just coming at it as an engineer saying, okay, well, how do I make sense of this thing? What is machine learning useful for? When is it the wrong choice? How do I make use of it? How do I support teams who are building machine learning models? How do I take machine learning from, I have this great idea, to, I've delivered this thing, and I'm running it in production, and now I need to make sure that I can keep it running in production? And then figuring out, how do I even understand what is a machine learning problem? You know? Because a lot of times people will say, oh, machine learning. Well, okay, I'll just throw machine learning at it, and everything will be great. Or I'll throw deep learning at it, and I've got this pile of data and magic.
And just trying to bring a kind of grounded approach to it, to say, machine learning is a tool just like every other tool. It has its benefits. It has its drawbacks. Here's how we can take a journey together to understand what are the benefits and limitations of this and what are the things that are still being figured out. You know? So with the data engineering podcast, at the outset, what I said is, this is a podcast focused on data engineering and data engineers, and I am not trying to address the data scientist audience. So I've had a lot of people come to me saying, oh, hey, I wanna talk to you about this machine learning topic, or I wanna talk to you about this data science topic. And I say, this isn't the right avenue for that. So my litmus test for, is this a topic for the data engineering podcast, is, you know, is this something that a data engineer would do, or is it something a data scientist would do? So I wanna talk about everything up to the point where I say, okay, this data is ready for the data scientist to use.
And if it's beyond that point, then it goes on to my other show. So till now, a lot of that has gone to Podcast.__init__. Going forward, a lot of that will go to the machine learning podcast, which is even part of the reason I started it, is that I was getting a lot of inbound traffic to say, hey, I wanna talk about this, you know, and it's more ML focused or MLOps focused. And for a little while, I was trying to push the Python podcast in that direction, to say, okay, well, I'll just make it the Python machine learning podcast, but it was a little awkward. There was too much history where that wasn't the primary focus, and a lot of the audience was starting to say, well, what are you doing here? So I split it out into its own new show. That way, it's easier for people to say, yes, this is something that I'm interested in. I wanna learn more about the machine learning aspect. Really trying to focus on machine learning as a practical and useful tool in the toolbox and not spend all of my time focused on, you know, what is the latest theoretical research?
You know, I'll probably dig into some aspects of theory or some of the research, but more from the perspective of, okay, well, how is this going to help me next year or 5 years from now, and not just, you know, this is useful just for the sake of pure research.
[00:33:01] Unknown:
It's interesting timing too. You know, you talked about why you didn't start a data science podcast back in 2017. Right? And data engineering really came up as a response to, you know, data scientists not having the proper foundations to do their job. Fast forward 5 years, 2022, I could argue that's no longer really the case. Right? And I would say in large part because data engineers helped facilitate the success of data scientists, and practices matured along the way and so forth. Now, the type of podcast that you would have done in 2017 on data science is not the same type of podcast you'd be doing now. Right? People are actually doing machine learning in production, you know, I think much more successfully than they were back then. And so it seems like really good timing yet again for a podcast from Tobias Macey. With that said, given your experience between data science topics covered in Podcast.__init__ and then the data engineering podcast, where are you starting to see sort of the intersections of data engineering and machine learning?
[00:33:58] Unknown:
I think that the most obvious place is in this emerging category and topic of MLOps, or how do I take this machine learning model and operationalize it and make it a reliable piece of my business, a reliable piece of my applications. Back in 2017, data science was, I think, where analytics engineering is today from the perspective of what the output was. You know, you would hire a data scientist because you wanna understand what is happening in this information that I have. And a lot of that work has been pushed into the space of analytics engineering through tools like dbt and the data engineering systems and platforms that we've built out to make that a possible kind of career choice and career direction.
And a lot of the, you know, mathematical modeling in data science has moved into the machine learning aspect. And also the biggest difference that has happened in that time frame is deep learning has had a huge impact on the industry, where it used to be, you know, data science was, okay, I need to figure out the difference between, you know, gradient descent or a random forest or k nearest neighbors and which 1 do I use when, to, okay, I'm going to take a deep learning model, but, you know, am I going to use BERT, a language transformer, or am I going to use YOLO for an image recognition use case? So there's been a lot of investment in, you know, the big tech firms and in research and academia into making machine learning and deep learning a more tractable problem and 1 that isn't solely restricted to people who have those PhDs. So it is a problem now where I can take a tool like Ludwig and feed it a bunch of data and get something useful out of it without having to have that PhD. You know, I can just be a data engineer. I've got access to the data. I've got a couple of hours on my hands. I can see, is this something useful? And I can say, hey, this looks like it might be worth investing more in, and hand it off to the data science team, for instance.
I think it's just become a much more accessible and approachable area of effort, whereas before, it was only really viable for businesses who had the money and staffing to be able to support putting a lot of time and money into that effort.
[00:36:12] Unknown:
Oh, for sure. I guess from your perspective being the podcast producer, how are you going to manage context switching now between 3 somewhat related, but obviously different topics?
[00:36:23] Unknown:
It's interesting that you bring up that question, because there have been a couple of times where I was recording an interview for the machine learning podcast and started to find myself, you know, asking questions in a certain direction, or thinking about asking a question in a certain direction, that would be where I would push it if I was on the data engineering podcast, of, like, oh, well, okay, well, how does this infrastructure work? How does this work at the bits and bytes level? And then remembering, oh, wait, that's not the audience I'm trying to talk to right now. I'm trying to talk to the person who's the machine learning engineer. They don't care necessarily about all the infrastructure and the automation that's happening under the covers. They wanna know, how does this help me do my job? And that's 1 of the things that I try to use as my kind of weather vane of, you know, am I delivering on what I aim to deliver with this episode or with this show: you know, who am I trying to help?
What are the problems and the challenges that the person who is likely to listen to this podcast is trying to tackle right now? So for data engineers, it's, I need to figure out what this data is, why it's useful, who's using it, how to make it reliable, how to make it maintainable, how to test it. So it's, you know, a lot of the very mechanical and automatable aspects, and not so much at the statistical level. You know, there's still some of that too. And then for the machine learning audience, it's, you know, how do I take this framework or this tool to be able to build a machine learning model, or be able to run a machine learning model, or understand, is this a problem that's worth solving with machine learning, or what does production mean for this machine learning model? Because a lot of times you say, oh, running machine learning in production, and the automatic thing that you think about is, okay, it's running in a web server somewhere, you know, serving API requests in real time, but that's not always the case. Sometimes production for machine learning just means, I need to be able to understand the context of this data that I'm looking at, so I'm actually just going to manually trigger an execution of this model to see what the output is.
[00:38:10] Unknown:
And that's production for some people. It's not serving 10,000 requests per second for some high traffic use case. It's interesting. Kinda bringing it back to the data engineering podcast. I mean, you've been doing this for 5 years now. The industry's changed a ton. Right? What have been some of the major evolutions of the topics that you've covered over the years?
[00:38:30] Unknown:
So many. The things that come to mind first are, when I first started, the Hadoop ecosystem was still relevant, I guess, is a good way to put it. Might be a bit harsh, but, you know, Hadoop and a lot of the surrounding technologies have really, you know, faded out of the day to day conversations. There are definitely still companies using it. It's definitely still providing a lot of value, but there's not as much activity there. So, you know, in 2017, when I started, there were still new products and new projects being introduced that tied into the Hadoop ecosystem. And I think S3 really ate Hadoop's lunch.
And then MapReduce as a paradigm, as a, you know, engineering approach, has faded with things like Spark and with things like Trino and cloud data warehouses and just the availability of scalable, massively parallel compute that was harder to come by when Hadoop was really on the rise. So I came in with the data engineering podcast right as Hadoop was starting its descent, I'd say. So now there's not really much conversation happening around that ecosystem. We do still have a lot of legacies of the work that went into that. I think 1 of the things that comes to mind most prominently there is the Hive metastore is still the de facto way to understand what is the set of tables that I have in S3 or in object storage or what have you. But even that is starting to get superseded by some of the other products that are out there. So AWS Glue is, you know, for people running in Amazon, a very reasonable service to use in place of running your own metastore.
Dremio is trying to push it a bit farther with their Arctic product to say, we don't actually need the Hive metastore at all. We're going to have this metastore that works with Iceberg tables and has, you know, atomic commits across tables because of the work that they're doing with Project Nessie. So I think we're at another transition point right now. But over the past 5 years, data quality, data monitoring, data observability, you know, data reliability engineering, whatever name you wanna give it, has gained a lot of ground.
Now that people are able to build systems, you know, they don't have to be an expert in distributed systems to be able to get a distributed system up and running. They can use a cloud service provider, or there are frameworks available or Helm charts for Kubernetes environments where you can say, okay. I can get this up and running, and I don't have to work on it for the next 6 months. It's freed up a lot of cycles to be able to invest in some of these higher order considerations like data quality, data cataloging, metadata management. Those are huge topics right now. You know, to begin with, it was really just metadata management for a data catalog to understand what data do I even have, who's using it for what purposes.
So Amundsen was 1 of the front runners there. Now we've got, you know, next generation catalogs like DataHub and OpenMetadata and Atlan that are trying to push that even further to say, okay, I've got metadata about everything in my system. It's not just, what datasets do I have? It's, what datasets do I have? How did they get there? What was done to them? You know, what is the full lineage of them? How do I then take the signals about changes in this metadata to drive automation, to then do more things to that dataset? So moving into, you know, what people are trying to call active metadata. So being able to actually use metadata as a driving force for your data platform so that you, as a data engineer, can, you know, get network effects and, I hate to use the term, but synergies in your system, to be able to not have to do as many of the, you know, very rote tasks. You know, you don't have to engage in as much toil. You can actually start to treat your overall data platform as a holistic system and work at a higher level and start to figure out, okay, well, how do I actually put myself in a position where I'm working to drive the business value, not just, I'm fighting to put out this fire and make sure that these files go from over here to over there on time.
[00:42:32] Unknown:
Makes a lot of sense. It matches what I've seen over the years too, which is higher levels of abstraction, and data engineering is becoming more enterprisey. These topics of data management, they weren't happening the way they are now, you know, a few years ago. It felt like the conversation moved from, hey, I've got big data stuff, to, now it's just, I've got data. So it's been interesting to watch.
[00:42:58] Unknown:
It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today to drop the modern data stack and use a practical data engineering framework. I think 1 of the interesting aspects too is that you don't really hear the term big data as much. Before, it was a badge of honor. Like, everything had to be big data or you're not doing it right. Now it's just the data, because the data is what's actually valuable. And it used to be too, you know, when Hadoop was on the rise, it was just store everything. Maybe someday it'll be useful. And now, with more regulations and with the, you know, rising use of cloud and the pay-per-use that can actually get quite expensive, it's, okay, well, I'm actually only going to store the data because I know I need it for something, not just because it might be useful. And so starting to be more deliberate in understanding what data you're collecting and how you're using it and for what purposes.
We're definitely now in an era of starting to add more polish and focus on user experience for the different data tools and the systems that we're using. Whereas up till maybe 2 years ago, I'd even say, or maybe even last year, it was really just, how do we get all the tools in place to be able to do things with data, period. And now it's, okay, well, now how do I make that easier for data engineers, and how do I make that approachable for people who don't want to or don't need to understand all of the complexities that go into making this a reliable system?
[00:44:44] Unknown:
Zooming out, given your very unique perspective on data engineering, having interviewed countless people in the field, what are some of the big trends that you see over the next 3 years in data engineering?
[00:44:55] Unknown:
I think 1 trend is we're gonna start to see some consolidation of a lot of the explosion that happened, especially over the past 2 years. You know, the past 2 years saw massive amounts of investment from venture capital that has allowed a lot of people to be able to throw ideas out there, see what sticks. We're just now starting to see some of that consolidation. I think we're gonna see things in the metadata category start to coalesce, you know, whether it's data catalog tools or data lineage tools or data governance tools. Data governance is gaining a lot more attention. You know, it's always been a thing, but for a long time, I think it was relegated to the, quote, unquote, enterprise, and now everybody's realizing, oh, shoot, I really need to focus on this. This is important. I gotta get this right. So data governance; really focusing on user experience and making data accessible to nontechnical users, or people who don't want to have to invest in understanding everything about the system, people who just wanna get their job done; and really starting to bring application engineers and software engineers into the overall conversation about data in an organization, where there was the DevOps, quote, unquote, revolution over the past 10 to 15 years that brought software engineers and systems administrators more in alignment.
And now we're starting to go through that same process with, you know, software and systems engineers and data engineers and data scientists, and how do we actually smooth that transition from, I'm generating or collecting data in this line of business product, and now I actually need to be able to use that data and bring it full cycle back to that product. And so bringing those application and product engineers into the same space as the data engineers and having conversations about how do we actually smooth this transition. So, you know, for a long time, it was, the data engineer said, okay, well, there's a database somewhere. It has some tables. Now I need to go spend the next month figuring out what tables are there, how they're populated, why they're populated with this information, what does it mean, you know, who's using it, and then, you know, rip that out of the application database into some other context where now I can do a bunch of expensive transformations and try to recreate the semantic understanding that was originally created in this application, and just, you know, redo a lot of the work that was already done, and then hand it off to a data scientist to, you know, add a bit more polish to that and gain more semantic understanding from the business perspective. And starting to bring everybody together to say, okay, this application is generating data about customers who are using my service.
This is the type of data that I'm collecting, and, you know, starting to propagate that semantic context to the data engineers to say, okay, I don't have to rip it out of the database. I can now start to build this contract, whether it's an API or, you know, a Kafka stream or what have you, where the application engineers are part of that conversation and they say, okay, I am going to push this information to you. Here is the information about why that information is being generated. And then the data engineers can take that and say, okay, I'm going to populate my data catalog with that context, with this information, so that the data scientists can then be effective, so we don't have to spend these cycles, you know, doing the expensive work of recreating that knowledge that is being kind of discarded at each of these hand off points, and just starting to bring that in as a first class concern of the data as it's collected.
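The producer-owned contract described here can be sketched as a small record that an application team publishes alongside its event stream, so downstream engineers and the catalog inherit the semantics instead of reverse-engineering them. Every name and field below is hypothetical, invented purely for illustration:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical sketch of a producer-owned "data contract": the application
# team declares what it emits, over what channel, and why each field exists,
# so data engineers don't have to rip tables out of the source database.
@dataclass
class DataContract:
    producer: str   # the owning application team
    channel: str    # delivery mechanism, e.g. an API endpoint or Kafka topic
    schema: dict    # field name -> type
    semantics: dict # field name -> why it exists / what it means

signup_events = DataContract(
    producer="checkout-service",
    channel="kafka://events.customer_signup.v1",
    schema={"customer_id": "string", "plan": "string", "signed_up_at": "timestamp"},
    semantics={"plan": "tier chosen at signup; drives billing and churn models"},
)

# The contract serializes cleanly, so it could be registered in a data catalog.
print(json.dumps(asdict(signup_events), indent=2))
```

The design point is that the producer, not the data engineer, fills in the semantics, which is exactly the hand-off-without-knowledge-loss idea being described.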
[00:48:29] Unknown:
So kind of closing out, what do you do besides podcasting?
[00:48:33] Unknown:
So as I mentioned a couple of times throughout this conversation, this isn't my primary gig. Currently, well, for the past 6 and a half years now, I've actually been working at MIT, the Massachusetts Institute of Technology, in their Open Learning department. I run the platform and DevOps team for that group. So I do a lot of cloud automation. I write software. I debug software. Right now, I'm actually focused on building out our data platform, so using a lot of the lessons I've learned over the past 5 years of the data engineering podcast to try and get things right the first time without having to make the same mistakes that I've already learned about from people who have already been there. So trying to figure out how do I take all of that knowledge that I've gained over the past 5 years, condense it into something that is useful for our purposes, and then, you know, build a road map that'll get us to a useful state.
Obviously, we're gonna make our own mistakes along the way, but trying to prevent some of that because of the knowledge that I've gained through this show. That's pretty cool. It's like you got the list of cheat codes.
[00:49:40] Unknown:
So
[00:49:41] Unknown:
Yep. The Konami codes for data.
[00:49:44] Unknown:
That's so awesome.
[00:49:45] Unknown:
Well, Tobias, it's been a pleasure talking to you. Yeah. Definitely been great having you flip the script and interview me for my podcast. So, usually, I close out by asking, from your perspective, what do you see as the biggest gap in the tooling or technology for data management today? So maybe I'll ask you that question, and then I'll see if I can come up with something to answer that as well.
[00:50:05] Unknown:
I definitely feel like there's a weird pendulum point in data right now, in data engineering. I feel like there's not just a consolidation of companies and tools that's about to happen, but also practices. And I'll walk you through this. So something that's been on my mind a lot lately has been kind of, quote, old school techniques like data modeling, some governance pieces as well. What I'm most interested in right now is sort of, what's the next phase of data modeling? If you look back in history, tell me the last, I guess, great evolution of data modeling. I think it might have been Data Vault back in 2000 or something like that. And I think that, you know, Kimball, Inmon, and similar approaches are great, but they're very much suited for, you know, a batch oriented world. Data Vault could arguably work with streaming. But sort of, what's next, really? Right? How do you tie domains and business logic together, not just across structured data, but across unstructured data, across streams, graphs? You know, so what's happened, I think, along the way is, like, data modeling has been sort of this mainstay.
People still do it, but technology and approaches and architectures and systems have rapidly changed over the last 20, 30 years. And so to me, the biggest gap, at least the 1 that's top of mind for me, I'm sure there's, like, a million others, but the biggest gap that I see right now is really, you know, what comes next for data modeling. Some people would argue that everything's fine, you don't need to do anything. I would counter that argument.
[00:51:24] Unknown:
So what do you think? The data modeling question is definitely an interesting 1, and 1 that I come back to a number of times with different guests, trying to understand what is the state of data modeling. Like you said, we had Kimball and Inmon, where we had, you know, the star or snowflake schema, and we had the data vault. And those have been the ways you do it, or you just put everything in a big wide table, which is another way that people address it now. You know? Or you just say, I'm just going to put together a table using my dbt, and I don't care how expensive it is to recompute. People need to be sort of conscious, you know, as data becomes more democratized, of being very deliberate in that modeling aspect. You know, a lot of people will say, oh, I can, you know, have my entire data stack up in an afternoon with a credit card, so I'm just going to get started and see what happens. And then, you know, people will kind of work themselves into a corner where they've got this giant mountain of tech debt that they have to try and, you know, dig themselves out from underneath. And so just really pushing for being very deliberate and considered in the way that you approach building out data capabilities, and not just falling into the pit of saying, oh, well, I can just do it in an agile manner. You know, agile is definitely very useful for data, but you do need to have that initial planning phase to understand, where am I trying to get to, before I just kind of see where I end up. For sure. And it seems like these discussions around data modeling that I'm seeing recently,
[00:53:04] Unknown:
you know, it seems to be very much a reaction to what you just described where maybe these systems, they certainly allow you to get in, you know, very easily, but it's like a bear trap. Right? It's kinda hard to wiggle your way out once you're in it. So it's always about trade offs. Right? So when you're talking about data architectures, things that are, I think, easy to use, the trade off of that is, yeah, it's easy to use, which means you might overlook some pretty important things to your point, data modeling being 1 of them. I'm sure there's a lot of other things. But especially, you know, where it ties into machine learning is especially where I'm interested.
And so, you know, up to now, data modeling has very much been an analytical exercise. But tying your data from point A to Z within domains, across domains, whatever, is an area that I've personally been nerding out on quite a bit. Absolutely. Yeah. And on the machine learning aspect, you know, with data engineering and analytical approaches, there's always the problem of garbage in, garbage out.
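One way to picture the garbage-in, garbage-out guardrail being described is a validation gate that rejects bad records before they ever reach a model. This is a hedged sketch; the field names, allowed regions, and amount threshold are all invented for illustration:

```python
def validate_record(record):
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    # Guard against missing or out-of-range numeric values.
    if record.get("amount") is None:
        problems.append("missing amount")
    elif not (0 <= record["amount"] <= 10_000):
        problems.append("amount out of range")
    # Guard against unknown categorical values.
    if record.get("region") not in {"US", "EU", "APAC"}:
        problems.append("unknown region")
    return problems

records = [
    {"amount": 25.0, "region": "US"},    # clean
    {"amount": None, "region": "EU"},    # garbage: missing value
    {"amount": 99.0, "region": "Mars"},  # garbage: unknown category
]
# Only validated records flow downstream to training or inference.
clean = [r for r in records if not validate_record(r)]
print(len(clean))  # 1
```

The point is where the check sits: upstream of the model, so the force multiplier amplifies signal rather than the trash heap.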
[00:53:59] Unknown:
And because machine learning is such a force multiplier, it's garbage in, an entire trash heap out. And so you need to be very cautious about how you're managing the data, especially as you're feeding it into machine learning, because it can either be a huge force multiplier that benefits you, or it can be something that takes that mountain of technical debt that you've got and multiplies it by a thousand. Oh, for sure. And especially given that I think what you're also gonna be seeing is, like, tighter feedback loops between application,
[00:54:30] Unknown:
analytics, and machine learning. As real time comes more into vogue, and I think that's the next big progression too, is just the accessibility of real time. Just like cloud data warehouses democratized what were once multimillion dollar contracts for very good, but very expensive, hardware and brought that down to the masses, you're gonna see the same thing with streaming. What that means is things are gonna get faster, which means you can either do smart things faster or dumb things faster. So Absolutely.
[00:54:55] Unknown:
Hopefully, smart things, but Yeah. We'll see. And that's where a lot of the data quality work in the data engineering space, but also in the machine learning space, where people are starting to invest in testing and validating the models, becomes really important. Because as you speed up these feedback loops and iteration cycles, you need to know as you're doing it whether you're going in the right direction or not. Because if you're just going, and you wait until you've gotten to where you think you're supposed to be before you check, you might find that you thought you were in Albuquerque, but you're actually in Portland, Maine. That's awesome. Well, thank you. I definitely look forward to having further conversations with you. So thank you very much for taking the time today to join me and interview me for my podcast, and I hope you have a good rest of your day. Yeah. Thanks, Tobias.
[00:55:52] Unknown:
Thank you very much.
[00:55:54] Unknown:
Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Current Data Team Challenges
Guest Introduction: Joe Reis
Origins of the Podcasts
Instincts and Trends in Technology
Managing Multiple Podcasts
Early Mistakes and Lessons Learned
Improving the Podcasting Process
Staying Updated in Data Engineering
Key Components of Data Engineering
Machine Learning Podcast: Goals and Differences
Managing Context Switching Between Podcasts
Evolution of Data Engineering Topics
Future Trends in Data Engineering
Current Role and Work at MIT
Biggest Gaps in Data Management Tooling