Summary
Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals get back to basics through presentations at conferences and meetups, and with his most recent endeavor of building the Pipeline Data Engineering Academy. In this episode he shares advice on how to cut through the noise, which principles are foundational to building a successful career as a data engineer, and his approach to educating the next generation of data practitioners. This was a useful conversation for anyone working with data who has found themselves spending too much time chasing the latest trends and wishes to develop a more focused approach to their work.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Daniel Molnar about being a data janitor and how to cut through the hype to understand what to learn for the long run
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing your thoughts on the current state of the data management industry?
- What is your strategy for being effective in the face of so much complexity and conflicting needs for data?
- What are some of the common difficulties that you see data engineers contend with, whether technical or social/organizational?
- What are the core fundamentals that you think are necessary for data engineers to be effective?
- What are the gaps in knowledge or experience that you have seen data engineers contend with?
- You recently started down the path of building a bootcamp for training data engineers. What was your motivation for embarking on that journey?
- How would you characterize your particular approach?
- What are some of the reasons that your applicants have for wanting to become versed in data engineering?
- What is the baseline of capabilities that you expect of your target audience?
- What level of proficiency do you aim for when someone has completed your training program?
- Who do you think would not be a good fit for your academy?
- As a hiring manager, what are the core capabilities that you look for in a data engineering candidate?
- What are some of the methods that you use to assess competence?
- What are the overall trends in the data management space that you are worried by?
- Which ones are you happy about?
- What are your plans and overall goals for the pipeline academy?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Pipeline Data Engineering Academy
- Data Janitor 101
- The Data Janitor Returns
- Berlin, Germany
- Hungary
- Urchin (Google Analytics precursor)
- AWS Redshift
- Nassim Nicholas Taleb
- The Black Swan (affiliate link)
- KISS == Keep It Simple Stupid
- Dan McKinley
- Ralph Kimball Data Warehousing design
- Falsehoods Programmers Believe
- Apache Kafka
- AWS Kinesis
- ETL/ELT
- CI/CD
- Telemetry
- Depeche Mode
- Designing Data Intensive Applications (affiliate link)
- Stop Hiring DevOps Engineers and Start Growing Them
- T-Shaped Engineer
- Pipeline Data Engineering Academy Curriculum
- MPP == Massively Parallel Processing
- Apache Flink
- Flask web framework
- YAGNI == You Ain’t Gonna Need It
- Pair Programming
- Clojure
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's i-m-m-u-t-a, and get a 14 day free trial. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's l-i-n-o-d-e, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Daniel Molnar about being a data janitor and how to cut through the hype to understand what to learn for the long run. So, Daniel, can you start by introducing yourself? Thank you for having me here. My name is Daniel. I'm an old bloke. I'm doing, like, data for, like, 10 plus years these days. Startups, like, 20 ish. Calling in from Berlin, Europe. Originally coming from Hungary, but I've lived in Berlin for, like, 7 years now, trying to help various startups and occasionally,
[00:02:10] Unknown:
like, multinational corporations like Microsoft or Shopify here and there. Yeah. And I think that I got to realize that I'm actually, like, a data janitor as opposed to, like, you know, a data engineer or, like, a data scientist or data whatnot. Throughout all this, like, shoveling stuff from A to B, kind of understanding that doing data, in a lot of cases, involves, like, maintaining pipelines and making sure that you are invisible.
[00:02:35] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:38] Unknown:
10 years ago when, after, like, the longest stint I spent in the e-learning industry building a product, I kind of fully realized that, you know, this whole data hype is coming on us. And now you can do, like, analysis, what they call these days, like, data science, on your laptop, and, like, I was happy to dive back again to my Python and whatnot. But those were, like, different days. I mean, you could still buy, like, Google Analytics as an installable called Urchin. So things change pretty fast.
[00:03:09] Unknown:
And so as you mentioned, you've described yourself as being a data janitor as opposed to being a data engineer or a data scientist or machine learning engineer. Wondering if you can just start by sharing your thoughts on the overall state of the data management industry and the sort of ways that we tend to think about ourselves and describe ourselves and the segmentation of roles?
[00:03:32] Unknown:
Being old maybe gives me, like, a strange perspective because I remember, like, before Java. You know, Java was the first thing that had, like, I don't know, $500,000,000 as, like, a marketing budget. And I keep that saying in my presentations and keynotes I used to deliver at conferences. I'm not sure what will happen to conferences. You know, the hype is just getting, like, bigger and bigger. And, like, anybody who's working with data, I think, lives in, like, wonderful times because we have so many things to be using and so much computing power and so much understanding, but we should never forget about the incredible amount of marketing money that just, like, showered on us, kind of trying to distort our reality perception, may I say. So on one hand, I'm happy because I think we can do things that, you know, very few people could do, like, I don't know, 10, 20 years ago. I remember, like, the moment when the thing named, like, Redshift appeared on the market. I was like, suddenly, wow, a data warehouse that doesn't cost you, like, a car as a starter. So things are changing a lot and things are getting cheaper and more robust and powerful.
On the other hand, the noise is super crazy big, and it's very hard to figure out, you know, what's what. Some companies do allow you to make benchmarks, some companies do not allow you to make benchmarks. And I think if you're actually watching, you know, the Hacker News, like, front page, then you're even, like, having this fear of missing out feeling every day, like, oh, something new happened.
[00:05:01] Unknown:
And so in the face of all of the changes and sort of constant introduction of new products and new ways of doing things, what is your overall strategy for being effective in the face of so much complexity and the conflicting requirements for data access and control and the shifting needs within an organization for what information to collect and how to use it and what to think about and what to focus on for being able to actually deliver something valuable.
[00:05:31] Unknown:
I think I might sound like my own grandfather on this one. I think as people age, they get a bit more, like, conservative, and not necessarily because they think that, oh, everything was better in the good old past, but more like, I would even point to Mr. Taleb, like, Nassim Nicholas Taleb, who is not, like, a tech writer, but does write a lot about, like, you know, randomness and black swans and how the world works in his view. And even he points out the fact that technology does not really age. You know, if there's a technology that's with us for, like, 10 years or 20 years or even 40 years, it might make sense to look at why it stayed with us. So I would say that keep it simple and stupid. It's always, like, a good starter.
Again, you know, SQL, just, like, for instance, has been with us for, like, a long, long time. There might be a reason why we still use it. There might be a reason why I have not been taught SQL as a computer scientist when I was in school. Again, like, talking about, like, how tech and business relate to each other. But anyway, it's here with us. It's gonna stay. And just like if somebody, I don't know, publishes a new NoSQL store on the front page of Hacker News, like, let's wait how it will age and how it will perform. Yeah. I think that that's definitely a good heuristic for understanding
[00:06:48] Unknown:
what to look at and what to pay attention to is what is the staying power? How long are people talking about something? And there's definitely interesting things that come out of some of the new products, but it's also worth looking at what is it bringing that's actually new versus just a new marketing veneer to put a new spin on something that's been around for a few decades.
[00:07:08] Unknown:
Yeah. And, like, again, looking at, like, what they deliver already. Like, if you can solve something in bash, solve it in bash. If you can solve something with Cpop, solve it with Cpop. If you can solve something with Make, let's use Make. Like, let's try to do things stupid first, and then if, you know, it did hit the wall, then it might not be, like, premature optimization. So I'm also happy to be, like, the one in an organization who deleted the most code because, again, I think if I'm putting on my, maybe not even the engineering hat, but more like a janitor hat, it's like we're not paid to write software or, like, add more lines of code to the repository.
We are paid to solve problems. And as we are humans and everybody's a human, we are prone to error. It would be funny to take, like, the latest, I don't know, Apache product, pull the repo, and deploy something that has millions of lines of code and you don't know how it works. I'd rather be sleeping well. So I would say, like, it's better to have something stupid and simple in production that you actually understand
[00:08:11] Unknown:
opposed to, like, you know, what's shiny and new and happy. Yeah. One of the ways that I've seen this referred to is to choose boring technologies because they're the ones that have been around for a while. They're the ones that have been proven out in production contexts, and they're the ones that will let you actually sleep well at night knowing that it's not going to crash because of some undiscovered bug because you're using technology that was introduced 6 months ago. Absolutely. I'm a super big fan of Dan McKinley. I believe he coined this expression, and I even have, like, a sticker on my notebook coming from him saying, Avro is bad. And so in terms of the overall challenges facing data engineers, what are some of the common difficulties that you see them contend with, whether it's technical or social and organizational? And what are some of the ways that you have used to overcome some of those challenges or to help others overcome them? I would say again, we have to know
[00:09:07] Unknown:
our role in this whole organizational setup. And I believe still that, you know, like, a data engineer, like, a data janitor, does its job well when invisible, you know, kind of really making sure that everybody gets what they need. That also ends up being a bit of a problem here and there because nobody really, you know, sees your contribution, just when the whole thing fails. So it's pretty interesting how to balance this out, that, you know, you have to show that the value you give is something that's super important. On the other hand, the nature of your job is not something that you can, you know, necessarily present at a retro and have, like, clapping people standing up. Hey.
Finally, the production database is in sync with the analytical database. Sometimes it's not that sexy. I think it's one of the major challenges, and it kind of boils back to that issue again that it also doesn't make it sexy to become a data engineer necessarily. I think data scientists got a lot of this love, like, everybody wants to be that because that's the sexy thing to do. If you, like, you know, become a data scientist and after 1 or 2 years, you realize that you might need also, like, other skills, how to actually get your data, how to actually put your model into production, then you might start to value, like, data engineers a bit more, I would say. That's, like, more, like, I would say, the social organizational part. On the technological side, I think what really matters from a practical perspective is that, I don't know, if I look back at my previous jobs and I look at who I'm talking to these days, almost everybody seems to be, like, doing some kind of a cloud migration or some kind of a hybrid setup approach.
And in a lot of cases, just, like, production or, like, finance drives this change. In a lot of cases, data is just, like, kind of following and trying to, you know, stitch things here and there and make sure everything binds properly and well. I think that's something that can give you headaches in general. Giving this type of feedback, like, to data scientists on, like, what to do, how to do, what's possible, what's not possible. Just because it runs on your computer and you managed to dockerize it doesn't mean that we can serve it in production, and this is something where a lot of things come up. I mean, it's not hopeless. It's just a lot of effort one has to put in to make sure that everybody is on the same page. I think that that's definitely one of the things worth digging into is that
[00:11:31] Unknown:
to your point earlier about the data engineering role being invisible when it's done well, similar to when you're working as a systems administrator or a platform engineer. Unless something goes wrong, nobody knows that you exist, and your task is actually to try and automate yourself out of a job, which there are so many different things to do that it never actually comes to that point. But it's definitely difficult to raise awareness to the value that you're providing when you're doing things well because, as you said, you're invisible until things break. And then digging into the point of the sort of different use cases of data, what are your thoughts on the position of the data engineer as far as being able to understand what are the actual needs of the data and what are the highest priority items to focus on when there are so many different people who are asking for information or trying to answer questions, and it's not possible for you to do all of them at once, to be able to figure out what is the most impactful and most valuable thing that you can be doing. I think the most impactful thing is communication. Like, in my impression,
[00:12:39] Unknown:
data engineers are the ones who have to be the best at communication, like, compared to, like, I don't know, front end and back end. Maybe even, like, data platform architects, because I've seen these roles happening in kind of, you know, dark corners, like, you know, someone sits down and does the whole React thingy after, like, getting the tickets and just delivering. I think with data engineering, it's really important to understand what is the business value we're trying to get, because I think nobody really has the whole picture. Let's say you're the business analyst, the data scientist, and the data engineer, and they should be collaborating pretty closely to have a good grasp of what's feasible to deliver. And I have, like, war stories. Like, I mean, I did shoot myself in the foot, like, several times. That's why I try to bridge this. That, again, finally, I think the data engineer has to have, like, very, very great communication skills.
[00:13:32] Unknown:
And in some ways, in order to be able to know what to ask and sort of what to focus on, it's difficult, particularly as a junior data engineer, if you don't have somebody more senior to help you along to be able to push back on certain requests to say, you know, I know that you're asking for this specific thing, but that's not actually what you need. What you need is for me to be able to focus on the foundational elements of this platform because, otherwise, I might be able to get you that information, but I can't keep it up to date or I can't ensure its reliability, and being able to prioritize within the constantly shifting needs of how the business wants to be able to access data and the increasing number of sources that are being dealt with.
What are your pieces of advice or some of the ways that you have managed to level up yourself in terms of the organizational awareness and being able to identify what is the overall hierarchy of needs within the organization, to know where to start focusing and what you can deliver in the long run versus just right now? One is trying to be, like, super practical
[00:14:42] Unknown:
and maybe, like, it really depends on, like, different cultures, but, like, standing in front of the mirror and trying to pronounce, like, "no" with, like, a steady voice. That's an exercise that is very worthy, I think. I always tell, like, my juniors that, yes, it's totally fine to say no. Sometimes it's a bit rude, but I think it's a good starting point that you really have to understand the business rationale behind it, and I think it's always fine to ask what exactly do you want to achieve having this piece of data. Take, like, one more step further, which I call, like, the Friday, 5 o'clock rule, because we've been there, like, Friday 5 o'clock, the manager runs into the room and, like, yes, I want to have this piece of data now. I need it. Hopefully, you're in a position to ask back and say, like, fine.
Let's play out this thought experiment. I have this data for you now. What can you do about it? Can you make this product in production, like, with your people, with the team? Can we change the marketing campaign? Like, how is it gonna roll out? Because, again, I'm asking the simple question otherwise, like, do you need this data, like, fast, or do you need this data reliable? I might even play out, you know, the 3 options of which you can have just 2 of them. But, again, it's very important to understand the context here. I might even point out the fact that if you look around in the job market, that's what I've seen, like, as a hiring manager, in a lot of cases, that somehow there are some people who managed to get, like, to a senior level by themselves, but it's super hard to find junior data engineers. Like, they just don't grow. Like, here in Berlin, you shake a bush, 3 data scientists just pop up. But if you want to have someone who kind of understands the basics, like, you know, like SQL, Kimball, like, good old stuff that might be true or not true or correct or not correct with our current set of technologies, but, like, kind of has an understanding of what are the questions I can ask. That's super hard. And so in terms of the technical knowledge that you think is essential for data engineers
[00:16:47] Unknown:
and to your point too about the difficulty of finding or training a junior data engineer, what are the sort of foundational elements of the knowledge and capabilities and experience that you think are necessary for someone to be able to work as a data engineer, either within a team or, in particular, in isolation within an organization?
[00:17:13] Unknown:
I think the problem with this being a data engineer, like, a junior data engineer, is that you do not necessarily have to have, like, super deep knowledge of all things, but you have to have, like, a bit of a wider understanding of where data could come from, how we can handle it in a way that it doesn't hurt us, and how we can give it to the end user or, like, the business trusted parties. And this covers a lot of things. So if you start with, like, data acquisition, you have to have some kind of understanding of what logs look like in your back end. What type of databases might you have there? What kind of third party APIs do you have to talk to? Basic concepts of entities. There are these funny GitHub repos with titles of falsehoods programmers believe about time, place, names. And I might even say, like, a very special example, like time zones. Like, who would have known that there's a time zone, like, 5.75 and why? But it's there in the database. Data engineers should have a generic understanding of telemetry, like, what you can do with third parties, like, Google Analytics, what's happening with ad blocking these days.
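The fractional time zone Daniel mentions is real: Nepal, for example, runs at UTC+05:45, an offset of 5.75 hours. A minimal sketch with Python's standard zoneinfo module (Python 3.9+, assuming the system tz database is available) shows why "offsets are whole hours" belongs on the falsehoods list:

```python
# Fractional-offset time zones exist -- a classic "falsehood programmers
# believe about time". Asia/Kathmandu is UTC+05:45 year-round.
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

kathmandu = ZoneInfo("Asia/Kathmandu")
offset = datetime(2020, 6, 1, tzinfo=kathmandu).utcoffset()
print(offset)  # not a whole number of hours
assert offset == timedelta(hours=5, minutes=45)
```

Any pipeline that truncates offsets to whole hours will silently shift Nepali timestamps by 45 minutes, which is exactly the kind of bug the falsehoods lists warn about.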
A lot of open source tools are out there, more than what we need. You know, if we can put these things somewhere, they should most likely be queues. Do you really have to have your own Kafka? Are you fine with Kinesis? Like, a lot of questions and a lot of decision points. Then you get to the point of, like, ETL or ELT, whichever is more handy these days, and how you orchestrate these things, how many people are trying to put things in your ETL, because the tool landscape might look different for you. What are the main, like, programming interfaces people are trying to talk to your data with? Are you more, like, happy with, like, SQL declarative stuff? Or do they want to do, like, programmatic stuff with Python? Or are you even, like, into, like, functional?
All different setups. Then you come to, like, data stores. I might even say, like, data warehousing, because these days storage and computation are not necessarily tied together. But most likely, you know, your DevOps team or your generic IT gave you some kind of cloud to work with or has some kind of, you know, setup, which you might not be able to change. So you wanna have, like, an understanding of whether you wanna use, like, a software as a service thingy, or do you wanna host your own solution? And then it comes to BI tooling. You know, do you do, like, self-service data? If you do self-service data, who takes care of the data definitions? The KPI definitions, should they be, like, programmatic?
If you want to serve, like, any kind of data product, be it just, like, anything more complicated than a dashboard, like, how do you deploy it with your own setup? Can you deploy the stuff your data scientists produce you in R or Python? Then you have to have a bit of understanding of CI/CD, some kind of testing, and how to, you know, put all these matryoshkas inside of each other, like Docker and Kubernetes, VMs, containers, just a bit at least. And, yeah, if you want to serve, like, machine learning in production, that's also something. How do you precompute stuff? How do you cache things?
How are your feature factories working? And we're not even talking about, like, the actual product that your data scientists might wanna deliver, even just, like, I don't know, a recommendation engine or, like, something related to search. So you should have at least a grasp of these things, I think. So you're really this kind of connector or, like, glue between different stakeholders, different types of engineers, and this might be
[00:20:34] Unknown:
complicated. Yeah. I think that one of the defining characteristics of data engineers, similarly to people who are working in DevOps and platform engineering, is the capacity for systems thinking and being able to understand what are all of the components of this overarching system that I'm trying to work with, how do they interconnect, what are the cases where they don't work well together, and then being able to understand heuristically or sort of understand holistically how it's all going to play together
[00:21:04] Unknown:
and what are the different pieces that might potentially fall apart in that overall lattice. Yep. Yep. And I think in this one, being trigger happy as a data engineer is, like, always, like, a bad choice, because it might be, like, super unsexy to have files transferred between microservices as a separation of concerns. But maybe we shouldn't start with, like, a real time system where, you know, if you lose, like, 1 record, you have to roll back everything. So, again, I have an example for that. Like, I think one of the best DevOps engineers ever in my life was a guy who used to be, like, an auto mechanic. He didn't want to program.
He wanted to understand the things and solve them with the least amount of friction.
[00:21:48] Unknown:
Yeah. It's definitely a useful background because, as with software systems, there are the mechanical systems that have to be able to interact and play well with each other. And there's the similar paradigm of common interfaces between components so that you can swap things out and upgrade in isolation without having to worry about taking down the whole system as long as the contracts between the different elements are maintained. Yep. I agree. Totally. And I think that's also something
[00:22:18] Unknown:
very important that, again, I don't know, if you're, like, a front end engineer and you're dealing with your React app, maybe we might wanna interact with you when we talk about, like, telemetry and stuff. But you're kind of isolated. That's, like, your own turf, and you can call the decisions here and there. With data engineering, like, it all boils down to you. So if there's, like, a leaking pipe up there somewhere, you will see it in some time. So you are much more exposed to all the components of, like, a big system, even if, like, everybody is trying to do their job the best they can in, like, all these separate parts. You are relying on all of them. You are dependent on all of them in some ways.
[00:22:56] Unknown:
In your experience of working as a data engineer and with other data engineers, what are some of the common gaps in knowledge or experience that you've seen people contend with and some of the ways that they've tried to remedy that?
[00:23:09] Unknown:
I would say that going back to the past and going back to the roots. I wouldn't say that, you know, the big Kimball book is, like, a welcoming thing. Nobody wants to read that 500 pager, like, business oriented whatnot, just, like, because they think we're gonna have fun with it. But again, it might give you some ideas of how we solved these data related problems when we had, like, much, much slower and more limited resources. If your ETL is technically, like, a directed acyclic graph, like, there are things that depend on each other, and, I don't know, there should be retrying, and you should know what's the data quality, what failed, whatnot. Our open source ETL setups that we developed started as a joke. Like, you know, can we make it with Make? Yes, we could. Like, let's try again to look back at things that have been there for, like, 50 years.
Let's not try to reinvent the wheel. It is possible. And again, 1 practical thing is, like, lines of code. When we were at Microsoft with my team, we managed to open source our solutions in a way that we figured that anything that is less than 5,000 lines is a code sample for Microsoft. And we liked this kind of limitation because we could ship fully working, diagnostic, event tracking solutions, even with a test suite and whatnot, under 5,000 lines. And I think that it's kind of the feeling when I'm interviewing, like, junior data engineers and I ask them, like, a super simple question. We have some data in the app store and we want to have the data in our data warehouse. What do you do? And unfortunately, I would say, like, in 7 out of 10 cases, the answer comes like, start up a Spark cluster. And I say, like, thank you. No, thank you. Like, you don't need to start up a Spark cluster to talk to a third party API. Maybe you need it, but most likely not. I mean, I'm a big fan of the band called Depeche Mode, and I just recently realized that their name actually means the fashion of the day. And this is my problem. I think this is, like, the biggest gap in a lot of cases: everybody is, like, super up to date with the fashion of the day, but they cannot stitch a little black dress. So I think if you're, like, an engineer, even if you live in this fashion industry we are in, you should be able to deliver that piece of clothing that goes well on every occasion.
You can start experimenting with different, you know, colors and textures.
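The "no Spark cluster for a third party API" point fits in a short script. A hedged sketch, with the API call stubbed out so it runs offline; the endpoint, payload shape, and table are invented, and in real use `fetch_reviews` would wrap an HTTP request:

```python
import json
import sqlite3

def fetch_reviews():
    # Stand-in for e.g. urllib.request.urlopen(<app store endpoint>).
    payload = '[{"id": 1, "rating": 5}, {"id": 2, "rating": 3}]'
    return json.loads(payload)

def load_reviews(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews (id INTEGER PRIMARY KEY, rating INTEGER)"
    )
    # Idempotent upsert so reruns of the pipeline don't duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO reviews (id, rating) VALUES (:id, :rating)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
load_reviews(conn, fetch_reviews())
```

Thirty-odd lines, no cluster; if data volume later proves it wrong, that is the point at which Spark earns its place.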
[00:25:34] Unknown:
And so, in order to help bring up a new generation of data engineers, you recently started down the path of building a boot camp for helping to train them. I'm wondering if you can talk through some of your motivation and your particular approach to helping to bring people into this ecosystem and into this problem domain.
[00:25:58] Unknown:
The greedy part from my side is that I had such a hard time finding, like, a junior data engineer. I don't see them, like, coming from anywhere, and I see that there's, like, a lot of need for them. If I call myself, like, a data janitor, then, you know, they're, like, data janitor helpers. You don't necessarily have to be able to, you know, quote the Martin Kleppmann book on designing data intensive applications, so you don't need to build your own ZooKeeper. Okay? But you should be able to debug a complicated SQL query. You should be able to understand what the EXPLAIN tells you and what's going on. On the market side, I see, again, a need for these people. Like, it seems like this COVID situation showed us that some jobs are a bit more resilient than others.
I'm seeing, like, there was just a recent tweet that, you know, the open deep learning positions just, like, plummeted recently. I'm hearing about, like, 4 data science teams getting chopped by the finance team when they look at them and say, like, these are the costs and this is the uplift. Thank you, but no thank you. It might be the case that data engineering is not a sexy thing, but you need plumbing. You will need plumbing. Whatever you do, you have to have these systems humming. And again, I think that DevOps and data engineering are the 2 most interesting fields these days. I think DevOps demands from you a much, much deeper understanding of, like, the holistic system view.
And I can even imagine that if you're, like, a junior data engineer and you like this part of the game, that you have to understand these interconnected things, you might even want to be, like, a DevOps person later. Kind of the same way that, you know, if you look around, you see these ads for, like, junior DevOps people. For me, it's a bit of, like, a contradiction in terms. You have to spend some time in industry to understand and feel the pain of what DevOps means. So I don't think this really exists either.
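The "understand what the EXPLAIN tells you" skill mentioned above can be tried out in a few lines with SQLite's variant, EXPLAIN QUERY PLAN. The table, index, and query here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts TEXT)")
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

# The last column of each plan row is a human-readable step; the thing
# to look for is a SEARCH using an index versus a full SCAN of the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = ?", (42,)
).fetchall()
for row in plan:
    print(row[-1])
```

Other engines spell this differently (`EXPLAIN` / `EXPLAIN ANALYZE` in PostgreSQL, for example), but the reading skill is the same: find the scans, joins, and index use before reaching for a bigger machine.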
[00:27:53] Unknown:
Definitely a good point, too, about the fact that junior data engineer is not really something that people aspire to, or something that there's really a good pipeline for, similar to DevOps engineers, where, because of the breadth of knowledge that's necessary to really be effective, a lot of times people will come in from adjacent roles. They maybe started as a software engineer, and then they decide to start working more in the data space, and they aggregate and accrue the necessary knowledge and experience to eventually move fully into the data engineering space. And at that point, they're more of a senior level. And so I think it's definitely helpful to have a way for people to jump start their career in data engineering and go straight into being effective at the entry level with the breadth of knowledge that they require, versus just having to organically discover all of these different elements and then figure out later what the gaps in their knowledge are and try to fill them, either by themselves or by taking individual classes or reading specific books, to be able to understand what are the pieces that they don't fully grasp.
[00:29:01] Unknown:
And from, like, a very pragmatic, like, you know, money oriented perspective, what I kind of see in the market is that, you know, we had this great influx of, I don't know, PhDs who went to, like, a data science boot camp, paid some monies. They're super smart. They learn, like, data science. They get a job that's paid well, better than what they could earn as a PhD. They learn about the industry firsthand, you know, about their job, see how it's working, hearing about all this funny stuff like version control, tests, and things. And after 1 or 2 years, again, I believe that some of them might have this feeling that, I can be much more productive if I know a bit more data engineering, either on the data acquisition side or, like, putting my thing in production. I also had the experience that if you're, like, a business analyst, you might not think that you need, like, super complicated models. You don't want to be a data scientist. Still, you want to acquire data. Still, you want to ship your data products. I mean, most of the data products I've seen in my life in production were not, like, super complicated models.
Something that serves the company from the inside or from outside doesn't have to have, like, deep learning, but it has to be, like, put in production. It has to talk to the database properly, etcetera, etcetera. And even in my previous, like, job, I was super happy to retrain a front end engineer to become a data engineer. He wasn't that much into data, but he understood, like, the engineering practice, you know, again, what version control means, how to interact with people, how to talk about these things with them. I think it has to have some kind of ingrained, like, motivation. So, like, you feel the need for that and then you can add on top of that. I mean, it's not rocket surgery. It's, again, I think, more likely, what are the things that you really need to focus on? What are the things where you say, like, yeah, you might need a tool, but if you need a tool, you can really go, I don't know, on Udemy or Udacity, spend some bucks, and learn about a tool. But it's just, like, a tool. It's not, like, perspective. It's not, like, an understanding of what is the problem you're trying to solve. Yeah. There's a really good presentation
[00:31:11] Unknown:
from Jez Humble on the DevOps side of things talking about don't hire DevOps engineers from external sources. Start growing them internally because it's easier to start with somebody who has a base level of capability within engineering and then help them to acquire the necessary skills to be more broadly effective. And I think the same holds true with data engineering.
[00:31:32] Unknown:
Absolutely. I agree with you because, again, it's some kind of a T shaped understanding of your systems. If somebody is with you for, like, 1, 2, 3 years, they have a much better head start to become an effective data engineer, even as opposed to someone who just comes in and knows Kafka and whatnot. Again, they know a lot of the nitty gritty small details of how your company actually works. That helps.
[00:31:56] Unknown:
For people who are applying to your data engineering academy, what are some of the common reasons that they have for wanting to become versed in data engineering and move into this particular role?
[00:32:09] Unknown:
I think we have pretty nice feedback in terms of, like, we see this justified. There was a fellow who's working as a research engineer, like, in a scientific setup. He looked at our curriculum and said, like, yep, this is what I have to fill in, like, exactly these things. Like, I have to have, like, a solid practical understanding of data management, day to day operations. I know Apache projects come and go. Again, you can learn about them. Obviously, what we try to do, however, is save you time and say, no, don't go down that route. Then we also try to explain, like, why we think it's not something that really belongs to, like, the core values of a thing. It might be, like, just a manifestation for this given time. But again, if you understand, I don't know, how MPP data warehouses work, we might wanna skip Hadoop.
[00:32:52] Unknown:
And for people who are applying, what are some of the baseline capabilities that you expect or anticipate for your target audience, and what level of proficiency or capability do you aim for when they've completed your program?
[00:33:15] Unknown:
Prerequisites are really centered around SQL as a declarative language, kind of understanding how that works. And also, I really like Python, because Python is not the best for anything, but it's good enough for most things. And it serves as kind of a lingua franca between different teams. Like, even if you don't program Python, you can kind of understand what's happening. So some kind of background also in, like, procedural approaches. And also, if you've spent 1 or 2 years at a workplace, anywhere, and you've seen your company's or your team's specifics on how to handle tickets, how to handle version control, what your branching strategy is. So kind of have a basic understanding of the contemporary software development setup, even if you're not necessarily doing that, like, as a data scientist, maybe. This kind of context or awareness, I think, is really needed. And I think if you do this with us, we hope that you can walk away with practical knowledge, practical examples of what you can deliver.
And, I mean, it's super hard to turn somebody into anybody else in, like, 3 months' time. But I believe that giving you the self confidence that I can do this is the most important thing. Because even if you are faced with the latest, you know, oh, we use Flink and I haven't seen Flink before. Great. You might be able to understand where it sits on the landscape, what you can do about it, and most likely you can fake it with UNIX tools first. So that's always my promise, which I try to keep: you might not be able to deliver, like, the sexiest application or solution for a problem, but you can deliver 1. And that's a good starting point. I think that's the most important thing, what is in production and what is not in production. But if you ask me to ship a machine learning product, what I would do is I would try to ship a simple Python Flask application, according to your DevOps setup, I don't care where I should put it, that will roll a dice. And we know it's bad, what it gives you, but it's in production. We can improve it instead of trying to figure out what's the best way to cache or whatnot. No. No. No. Let's put it there. And then we have something that's working. I think that general approach of don't let the perfect be the enemy of the good is a good way to approach it. And it's definitely easy to be stuck in the analysis cycle of
[00:35:41] Unknown:
what are all of the different potential problems that I'm going to run into and how am I going to engineer around them before you actually even experience any of those cases. And that's where the sort of guiding principle of you ain't gonna need it comes in, where you just solve the problem that you have today, but do also make sure that the way that you're solving it is extensible so that you can solve those other problems in the future. But don't try to solve everything all at once, because then you're never actually gonna ship anything. I totally agree with you, and I think that's also part of being a data janitor, that it's kind of okay to ship something simple and half baked. But you will be the 1 who has to keep that thing alive.
[00:36:20] Unknown:
So we have some kind of interest in making that thing happen, really.
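The "ship the dice rolling baseline first" idea above can be sketched as a predictor with a stable interface that a real model can later replace without changing callers; the labels, feature shape, and seed here are invented for illustration:

```python
import random

LABELS = ["churn", "stay"]  # hypothetical classes for the example

def predict(features, rng=random.Random(0)):
    # Baseline "model": ignore the features and roll a dice. Knowingly
    # bad, but it can go to production today and be swapped for a real
    # model tomorrow behind the same interface.
    return rng.choice(LABELS)
```

Wrapping this in a Flask (or any WSGI) endpoint, as the episode suggests, gives the rest of the organization something real to integrate against while the model behind `predict` improves.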
[00:36:24] Unknown:
And for people who are applying to your academy, what are some of the cases where you think they wouldn't be a good fit and they might be better served by doing some self guided learning or a different type of educational approach?
[00:36:40] Unknown:
I believe, again, that if you need, like, an understanding of a tool, or if you have to do some very specific cloud related setup, there are, like, super, super good resources out there, even produced by the companies that actually make these products. What you will miss is kind of the camaraderie of this whole thing. So we had, like, a trial run in spring for, like, 1 week, and we started our school just when the whole COVID thing hit Europe. We were, like, pretty big believers, and still are pretty big believers, in the face to face, in being in the same room and sweating in the same room with these people, doing pair programming, rubber ducking, whatnot.
And this was the main feedback we got from the pretty mixed group of people who took part in this 1 week course. Like, think about it. We had, like, product managers. We had data scientists. We had, I would say, brain oil engineers. From Monday morning to Friday evening, we managed to get from nothing to a shipped, Dockerized data product that they can push to prod, and it works. It was successful. They liked it, but everybody came back with the feedback that this should happen in person, somehow, in the format of the bootcamp. You know, we actually think about maybe we should start, like, a quarantine bootcamp that you cannot leave. I mean, you get there and just focus on pushing through all these challenges and boundaries with your team members. And we're happy to give this feedback even to people who apply to us: you'd better spend $30 or $100 or, like, $400 at this course and that course and another course, because you're gonna solve your problem better. You will need a different self motivation, a different way of how you push yourself through this whole setup. I mean, if you have to solve that specific problem, those courses are gonna be able to help you. If you wanna learn Kafka, there are super resources out there. Go learn it. We can talk to you about queues.
[00:38:35] Unknown:
Going back to the broader level of your experience working in industry and working with other engineers: if you're in the context of hiring somebody for a data engineering position, what are the core capabilities that you look for, and what have you found to be useful methods for being able to assess their overall competence and fit for the particular role?
[00:38:47] Unknown:
I think I mentioned just recently this 1 specific example. Like, you know, I asked someone to get, like, data from a third party API, and they start to list their favorite tool, which is typically Spark these days. Then I know that we're not, like, a good fit in this 1. My idea of a feedback to such a question or such a request would be like, okay, what's the data volume we're talking about?
How often does this change? What do we know about the schema? What are the systems you already have? Again, I think we try not to add complexity if it's not necessary. We have enough complexity coming from the company and the organization all over the place. So if we can remove a component, then we'd rather do it. So this kind of reductive approach, I think, is very interesting. And again, kind of the car mechanic perspective: just because I have a hammer, not everything is a nail. I have to figure out whether I have to pull out my hammer at all to solve this problem. So I think, again, communication, understanding of context, problem solving, not coding skills. I mean, I'm super happy if you can deliver code in Clojure. I know you're super smart.
Actually, I think 1 of the best microservices we ever had in production was written in Clojure. But, you know, if you wanted to change something, it was like a whole ceremony, like, everybody got together, like, which bracket should we remove type of thingy. So I'd rather have a person who, again, solved everything in Python but can talk about it, why she's doing it, how she's doing it, because most likely this will happen talking to the other part of the organization, than have someone who, you know, is, like, a functional language guru and can deliver the best performing whatnot.
[00:40:44] Unknown:
In terms of your overall perspective on the data management and data engineering space, what are some of the current and emerging trends that you are worried by, and what are the ones that you are personally keeping an eye on and that you're happy and excited about?
[00:41:00] Unknown:
What is worrying me a bit is this repeating cycle of what I typically see happening with Apache solutions or products: you know, there is, like, an open source solution that should be working for everybody and should solve, like, a problem, no big deal. And then suddenly there emerges, like, a company that gives you, like, consulting services that gets funded by VCs with, like, hundreds of millions of dollars. While we know that consulting is not something that, you know, you can scale like building a product, then this whole open source thing is just, like, a marketing ploy. It's happening with everything, even with a thing like Airflow. You know, if you wanna put Airflow in a container, you'd better pay for it because, you know, it's super hard to do it on your own. And I find this worrisome.
And on the other hand, what I'm a bit worried about: again, I mentioned that I see a lot of, like, teams and companies migrating from cloud A to cloud B, or trying to do some kind of a hybrid setup. I never heard back anything good from that up until now. I mean, you can do it if you have to. But the big providers are working hard so that the solutions are not exactly comparable, not even in pricing. If you have, like, a properly working setup in 1 cloud, good luck finding all the corresponding components in the other cloud. This is, again, I mean, good news for data engineers, because we're gonna have a job, because these things have to be handled, but, you know, we could do better, I think. You asked me also about, like, what I'm keeping an eye on.
I see that SQL is coming back. I'm happy about that. I mean, it's always been, like, this kind of bastard child of a thing, and I think that was the reason why I was not taught SQL in computer science when I was studying, because it's just, like, the business thing. You know? We don't care about that. That's not a language. That's, like, something that business people do. And I'm seeing that bouncing back properly. Again, what I see is that we have a tremendous amount of computing power and crazy good solutions that we can use for, like, peanuts.
On this front, I'm, like, we are so much empowered from a technical perspective, wherever I look, wherever I go. Like, with every cloud provider, or, like, the Apache products or solutions I just mentioned, I can find something useful and great. So this is out there. It's just, again, another job to be able to keep up with all the news and whatnot.
[00:43:31] Unknown:
And as you continue to iterate on your work at the Pipeline Data Engineering Academy, work through the initial cohorts, and start to plan for successive ones, what are your overall goals for that program and some of your plans and personal aspirations for the work that you're doing? I would like
[00:43:52] Unknown:
to share the pain and the shortcuts. So I think what our personal aspirations are is that we believe that we can help people to get a decent job, not making a lot of the mistakes that we did because nobody told us that this is not something that you should do. I mean, you have to feel the pain to believe it. So, like, most likely some of these things have happened to you, but I would be super happy if data engineering becomes more of a respected thing, again, at least as respected as a plumber is. If we can help to fill the need for junior data science, sorry, data engineers, then I could pat myself on the shoulder.
[00:44:28] Unknown:
Are there any other aspects of your experience that you'd like to cover before we close out the show?
[00:44:43] Unknown:
I think I've kept on talking for a long time, but it might be time to close now, because we could get into a lot of details on how you actually, you know, build up these teams, how you figure out how you can transform, and, you know, what the ratio of skills and the number of people is, like, how many data scientists and how many data engineers and how they can work together. So, like, there's a lot more to, like, the organizational perspective, and it's very different in different sizes of companies, different maturities of your data culture. But I think that we need more
[00:45:20] Unknown:
junior data engineers, and we'll try to do our best. Well, I'm definitely happy to have you back on to dig deeper into some of those other topics. But for now, for anybody who's interested in getting in touch with you and following along with the work that you're doing, I'll have you add your preferred contact information to the show notes. As a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:45] Unknown:
We should have some kind of common understanding of metadata and data quality handling. I mean, there are some very great examples of how people try to address this. But, again, it's something that harks back to the good old Kimball data warehousing days. It's not super sexy, but more and more people start to realize that if you have all these real time queues here and there and everything is just up in the air, it's better to know how much you can trust 1 piece of information and how much you cannot, because we cannot achieve, like, perfection.
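The "know how much you can trust a piece of information" idea can be made concrete with even a trivial check that travels alongside the data as metadata; the fields, rows, and trust metric below are invented for illustration:

```python
def quality_report(rows, required_fields):
    """Count nulls per required field and a crude overall trust signal."""
    total = len(rows)
    null_counts = {
        f: sum(1 for r in rows if r.get(f) is None) for f in required_fields
    }
    return {
        "row_count": total,
        "null_counts": null_counts,
        # fraction of rows where every required field is set
        "complete_ratio": (
            sum(1 for r in rows if all(r.get(f) is not None for f in required_fields))
            / total if total else 0.0
        ),
    }

rows = [{"id": 1, "rating": 5}, {"id": 2, "rating": None}]
report = quality_report(rows, ["id", "rating"])
```

Tools like Great Expectations and dbt tests generalize exactly this pattern: declared expectations, counted violations, and a report stored next to the dataset so downstream consumers know what they are trusting.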
[00:46:18] Unknown:
Thank you very much for taking the time today to join me and discuss your experiences working in the space and your perspective on how we should all be thinking about and approaching it. It's definitely a very challenging ecosystem that we find ourselves in and 1 that we're all trying to find our own way through. So I appreciate some of your perspective and guidance to the up and comers in the data engineering space. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day. Thank you for having me. Keep up the good work. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Daniel's Journey into Data Management
State of the Data Management Industry
Strategies for Effective Data Management
Challenges in Data Engineering
Importance of Communication in Data Engineering
Foundational Knowledge for Data Engineers
Common Knowledge Gaps and Solutions
Training the Next Generation of Data Engineers
Prerequisites and Outcomes of the Data Engineering Academy
Hiring and Assessing Data Engineers
Current and Emerging Trends in Data Management
Goals and Aspirations for the Data Engineering Academy
Closing Remarks and Future Topics