Summary
This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years
Interview
- Introduction
- 6 years of running the Data Engineering Podcast
- Around the first time that data engineering was discussed as a role
- Followed on from hype about "data science"
- Hadoop era
- Streaming
- Lambda and Kappa architectures
- Not really referenced anymore
- "Big Data" era of capture everything has shifted to focusing on data that presents value
- Regulatory environment increases risk, better tools introduce more capability to understand what data is useful
- Data catalogs
- Amundsen and Alation
- Orchestration engine
- Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Flyte, etc.
- Orchestration is now a part of most vertical tools
- Cloud data warehouses
- Data lakes
- DataOps and MLOps
- Data quality to data observability
- Metadata for everything
- Data catalog -> data discovery -> active metadata
- Business intelligence
- Read only reports to metric/semantic layers
- Embedded analytics and data APIs
- Rise of ELT
- dbt
- Corresponding introduction of reverse ETL
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast?
- What do you have planned for the future of the podcast?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features. Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)
Truly leveraging and benefiting from streaming data is hard. The data stack is costly, difficult to use, and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database, not simply a database that connects to streaming systems. With a Postgres-compatible interface, you can now work with real-time data using ANSI SQL, including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring.
Your host is Tobias Macey, and today I'm reflecting on the past 6 years of the Data Engineering Podcast and some of the major trends that have been happening in the ecosystem over that time. So I started this podcast in January of 2017, so it's been just over 6 years now that I've been running it. And for most of that time, I've been releasing weekly. For a little while, I was actually releasing twice a week, so there have been a lot of different topics and interviews that have come on the show, and those reflect a lot of the major trends in the industry, as well as some very interesting examinations of some of the details there.
And so just to look back a little bit on the time that I started the show, that was around the same time that Maxime Beauchemin had published his very widely read posts about the rise of the data engineer and the downfall of the data engineer. And that was right around the time that the entire concept of data engineering as a specific role was starting to take shape. And reflecting back on that in some of the other episodes I've done and on my own, there have been a few thoughts about why that happened, when it did, and how it did. And one of the interesting things about this podcast is that I actually created it a little bit in answer to a large number of podcasts that had started focused on data science. So there were at least a half dozen, maybe a dozen, well-known, widely listened to data science focused podcasts, but there wasn't anything about data engineering. And this was largely because that was also around the same time that data science was a very hyped up job. Everybody said, oh, data science is going to do all these amazing things because we have data, and so we can find all kinds of useful insights. We can build machine learning, and machine learning was also still in its early days. This was before deep learning had really taken off.
And so lots of companies were hiring data scientists because of the supposed promise of, if I hire a data scientist, they'll be able to tell me all the things that I need to do to make my business run better or improve my customers' experiences or bring in more customers. And that had been happening for a few years at the time that I started this show in 2017. And I think that data engineering really came about in answer to all of those data scientists being hired and then coming to the realization that before they could even do the work that they were hired to do, they first had to do a whole bunch of data aggregation and data cleaning, and so it wasn't really possible for them to do their jobs. And so they actually became the first incidental data engineers before companies started hiring for that role explicitly. And so, because companies were investing in data science and data scientists and seeing that that investment wasn't paying off, they then started to hire specifically for people to do that initial work of gathering the data, cleaning it up, and making it available to data scientists to do the work that they were supposed to be doing. And also, 2017 was on what ended up being the tail end of the Hadoop era, where Hadoop came onto the scene in the early 2000s and was seen as this economical system that I can use to gather all the data that I want. And the term big data was really taking off. And so the general trend was that if we just collect all of the data about everything all the time, then eventually we'll be able to make some use of it. And so there were a lot of people who were dealing with scaling problems for those Hadoop clusters.
And in addition to the challenges of scaling the clusters, there were also a lot of complexities in dealing with the programming paradigm of MapReduce and being able to manage repeatability, figuring out how long it was going to take for given jobs to run, and sequencing jobs effectively. And a lot of different tools and platforms and add-on components grew up around the core of the Hadoop framework. And so there were things like Hive, and that also led to things like Presto and HBase. And there was a whole suite of tools, such as Oozie, that came out in response to Hadoop, trying to add simpler interfaces on top of it and being able to use it as a storage layer for SQL engines, and that went on for a number of years. And in 2017, there was still some momentum in the Hadoop ecosystem, but it had started to die down a little bit in favor of some of the next generation of tooling.
It was also around when Airflow started to gain popularity as an orchestration layer for being able to make sure that tasks got sequenced properly. And so it was an interesting time to start the podcast because of the fact that there was a little bit of a transition happening. People who were heavily invested in Hadoop were still trying to make it work and add on some of the new tools around it. There was also a lot of hype about streaming and different streaming engines and the fact that streaming was going to be superior to batch for a number of reasons, particularly because of the timeliness challenges that people were experiencing with Hadoop. And streaming is definitely still a very prevalent topic now, but it was a very popular aspect of conversation then because it was still new, and there were still a lot of engines that were being developed and had a lot of momentum behind them. So Storm and Flink and Spark were some of the major ones.
And this also gave rise to the different paradigms of how to address these data scaling issues. So the Lambda architecture was created as a way to try and reconcile these batch and streaming workflows, where you would effectively have to write your logic twice: you would use streaming for real time and a good enough approximation of what reality looked like, and then you would have your Hadoop batch jobs that would come in afterwards and catch up to a certain point in time with a more accurate view. The Kappa architecture was a little bit of a response of just stream everything so you don't need this batch layer.
And those conversations have largely died down as it has become more feasible to use the streaming engine as your only source of truth and still be able to use that same logic to replay all of history. And there's also been an interesting shift in that concept of just capture all of the data all of the time, because people realized, one, that storing all of that data is expensive, and you don't necessarily capture all of the value that you put into storing it and processing it. But also the regulatory environment has changed, where there's a lot of increased risk for storing all of the data that you might have, such as personally identifiable information or, with GDPR, the risk of having to delete data when a customer requests it and just being able to understand where is that data, what data do I have. And so companies have gotten a lot more judicious about what data they capture and making sure that it is going to have some value rather than just capturing it for the sake of capturing it. And that also brings in the era of data catalogs, where data cataloging had existed, but with the big data mentality of just throw everything into the data lake, you didn't always know what you had. Or if you did have it, you didn't know if it was useful or how it was being used.
And there were tools such as Alation and some of the other commercial data catalogs that were there, although they were largely manual, where people would enter in the different data that they had and what the schema was supposed to be. There wasn't necessarily validation of that. And then Amundsen was one of the first tools that gained a lot of popularity for automated cataloging, being able to integrate with things like Airflow or your other databases and orchestration engines to understand what data you have and how it's being used. And so the visibility of the data also made it easier to gain value from it, and so you didn't necessarily have to capture everything and then spend a lot of time exploring it to see what you had.
And the data catalog conversation over the past few years has really evolved into data discovery and the metadata layer, and I'll touch on that a little bit more in a little while. And orchestration engines have also been gaining a lot of momentum as a topic that is core to the overall data platform, where you have to have some orchestration engine as the means of understanding what gets executed when and how, rather than having somebody manually run a bash script or having a cron job set up. And the orchestration engines have also gone through some generational shifts, where they were initially just task based before coming to the realization that the orchestration engine should understand what the actual data is that it's processing.
Because even if a task says that it's completed, it's possible that it could have had a silent failure. Or even if it does complete, maybe there's something wrong with the data, but all the engine knows is that it finished: I got a successful exit code, so move on to the next thing. And so some of the next generation of orchestration engines have decided that being aware of what the data is and why and how it relates to subsequent downstream uses is a necessary fundamental abstraction to be able to actually build up scalable and successful data platforms.
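To make that shift concrete, here is a minimal sketch in plain Python of the difference between "the task exited cleanly" and "the task produced usable data". All of the names here (`AssetResult`, `run_asset`, `run_pipeline`) are invented for illustration; this is not the API of any particular orchestration engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class AssetResult:
    """What a data-aware orchestrator tracks: not just exit status,
    but facts about the data a task produced."""
    name: str
    row_count: int
    succeeded: bool

def run_asset(name: str, produce: Callable[[], list], min_rows: int = 1) -> AssetResult:
    """Run a task, then validate its output before declaring success.
    A clean exit alone is not enough; the data itself is checked."""
    rows = produce()
    # A silent failure often shows up as an unexpectedly empty output.
    ok = len(rows) >= min_rows
    return AssetResult(name=name, row_count=len(rows), succeeded=ok)

def run_pipeline(steps: List[Tuple[str, Callable[[], list]]]) -> List[AssetResult]:
    """Only run a downstream step if its upstream data checks passed."""
    results = []
    for name, produce in steps:
        result = run_asset(name, produce)
        results.append(result)
        if not result.succeeded:
            print(f"halting: {name} produced no usable data")
            break
    return results

# The second step "succeeds" as a process (no exception, would have exit
# code 0) but yields no rows, so the third step never runs.
steps = [
    ("extract", lambda: [1, 2, 3]),
    ("transform", lambda: []),   # silent failure: empty output
    ("load", lambda: [1, 2, 3]),
]
results = run_pipeline(steps)
```

A purely task-based scheduler would have marked all three steps green here; checking the data itself is what stops the bad run from propagating.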
The real catalyst in the past probably 5 years of data has really been the rise of cloud data warehouses, where Redshift was definitely the first notable one that came onto the scene and really made people start thinking differently about what data warehousing means, how it scales, and the cost benefit analysis of it, where it used to be a very expensive appliance that you would have to have in your data center, and now it's something that you can rent and, actually, it can be fairly economical. Obviously, there are different challenges of managing cost with a pay as you go model. But shortly after that, there were also BigQuery and Snowflake. And so Redshift, BigQuery, and Snowflake have really been the major motivators for the modern era of data engineering. There's been a lot of hype about the modern data stack. And regardless, this concept of a cloud data warehouse in whatever form it takes has really become the focal point of how companies work with their data.
So there have been whole businesses that have come about to build on top of the data warehouse for anomaly detection or data quality analysis, and other topics I'll touch on shortly. But the data warehouse has really become the juncture of everything. And putting everything into one location and having schema visibility of that data is, I think, the main thing that has really enabled the current approach to data engineering and some of the ways that we are able to continue to evolve, because there is a common understanding of how to work with that data, and you don't necessarily have to be a distributed systems engineer to be able to get anything out of it. The other thing that these cloud data warehouses have really done, because they're scalable and because the storage is more economical than it used to be, is shift the conversation from extract, transform, and load, where you have to make sure that the data is in a specific shape before you even store it in your data warehouse because doing it afterwards is too expensive or because storing all of it is too expensive, and really moved us into extract, load, transform. Obviously, there are different orderings of the transform and load steps, but it really has allowed for bringing in all of the data and then transforming it and iterating on it so that you can be more, I'll use the term, agile about it. Agile in the software sense of being able to build on top of successive iterations, being able to deliver value quickly without having to do a huge amount of upfront work before you can actually get anything done.
And another tool that has really been transformational, no pun intended, although I guess it should have been intended, in that space is dbt, which also gave rise to the concept of the analytics engineer, where the people who are doing the analysis are involved in the repeatability and robustness of the data that they're working with and are brought into those software principles. And so dbt and the cloud data warehouses really catalyzed us into where we are now with our capabilities, as well as allowing more businesses to actually get in on the game of using data to power their companies and improve their customers' experiences.
Data lakes have still been a conversation. Hadoop was probably the first major iteration of that. Larger organizations have maintained data lakes because of the scalability aspects and the fact that they're very flexible in terms of what you can do with them, but they've always posed a problem. And one of the interesting paradigms in the past year or two that has really taken off is the idea of a data lakehouse, where you can have the benefits and scalability of the data lake, but you can also have the organizational and user interface improvements of data warehouses, and be able to get the best of both worlds together.
And so you can still use that same approach of ELT with dbt and a SQL interface for working with the data, so that you can bring more people into the experience and into the work. With the data lake, it used to be that you had to write some very complicated code to be able to process the data and load it into anything, and then you probably had to put it into a data warehouse anyway to be able to query it. Now you can do all of that in one place. And because working with the data has a much lower barrier to entry now, people who are working in the space can focus on some of the higher order concerns, which brought in the concepts of DataOps and MLOps: being able to make sure that everything is repeatable and stable and robust and being able to know when things fail.
So data quality and data observability are some of the core aspects of that. Being able to monitor the entire data platform, both in terms of making sure that your data is actually getting loaded when it's supposed to, but also making sure that as it's getting loaded, you're checking, is this data conforming to these specific requirements? Is the schema the same, or did a new column get added or dropped? Is the distribution of data within the range that I expect? You know, if I'm dealing with financial transactions, do I always have decimal numbers, or do I somehow randomly have a float in there? Because that's definitely not going to work very well.
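Checks like the ones just described are often expressed as simple assertions over a batch of records. Here is a minimal sketch, where the table shape, column names, and allowed values are all invented for illustration rather than taken from any real platform:

```python
from decimal import Decimal

# A hypothetical batch of financial transaction records.
rows = [
    {"amount": Decimal("19.99"), "status": "settled"},
    {"amount": Decimal("5.00"), "status": "pending"},
]

ALLOWED_STATUSES = {"settled", "pending", "failed"}
EXPECTED_COLUMNS = {"amount", "status"}

def check_batch(rows):
    """Return a list of data quality violations; empty means the batch is clean."""
    violations = []
    for i, row in enumerate(rows):
        # Schema check: did a column get added or dropped?
        if set(row) != EXPECTED_COLUMNS:
            violations.append(f"row {i}: unexpected columns {sorted(row)}")
        # Type check: a float sneaking in where a Decimal is required.
        if not isinstance(row.get("amount"), Decimal):
            violations.append(f"row {i}: amount is not a Decimal")
        # Enumeration check: only the agreed-upon values are allowed.
        if row.get("status") not in ALLOWED_STATUSES:
            violations.append(f"row {i}: unknown status {row.get('status')!r}")
    return violations

assert check_batch(rows) == []                       # the clean batch passes
bad = rows + [{"amount": 3.5, "status": "???"}]      # float amount, bad status
print(check_batch(bad))                              # reports both violations
```

Data observability tools generalize this idea: instead of hand-written assertions, they learn or configure expectations over schemas, distributions, and freshness, and alert when a batch drifts outside them.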
Or if I have a value that is only supposed to go up, why did it suddenly go down? Or if it's an enumeration, you know, I should only have 5 possible values for the string field, but now all of a sudden I have 7. What went wrong here? So those are some of the other things that have really come about in recent years because of the fact that there is this core shared abstraction that everybody can build on top of. There are a lot more agreed upon interfaces that people can collaborate on so that they can build higher order tooling and higher order experiences, and everybody can benefit as a result. And then circling back on the metadata concept, this aspect of data observability and data discovery all ties together: being able to understand what is the shape of the data that I have, where is it coming from, where is it going to, how is it getting there, what is being done to it, and who is actually using the data. That's one of the really key pieces of understanding all of this work that I'm doing. Is it even worth the time that I'm putting into it? Because if I'm spending hours or weeks building all of this tooling to be able to get the data out of this database and out of this SaaS platform and load it into this report for somebody.
If nobody's looking at it, then why do I continue to maintain that? So that's another key aspect of being able to close the loop on: I've built this thing. Is it delivering the value that it's supposed to deliver? And that brings it back to that concept of the agile methodology of making sure that the work that you're doing is being done for a purpose and that that purpose is being fulfilled, because otherwise you're just spinning your wheels, and why are you even doing any of it? And so the fact that metadata as a broad concept is starting to coalesce into tooling that is able to encapsulate some of those different concerns of monitoring for quality (What is the distribution of these fields? What is the schema that I'm working in?), as well as lineage of where the data is coming from and going to and being able to understand how it's being used, really empowers much more useful and much more valuable engineering around how the information is being applied. And then one of the most recent trends in being able to actually have all of this metadata coalesced into a single layer is the idea of active metadata, where you can actually use that metadata to inform some of the automation routines. So maybe I know that this job happens at 5 AM every day, and this report gets looked at at 9 AM every day. So I'm going to make sure that I automatically scale up my Snowflake cluster to ensure that this report will complete in the time that it's allowed so that when somebody's looking at the report, they have the freshest data possible.
Or I can use this metadata to understand, okay, this job has, you know, 5 downstream dependencies, and this one step of it just failed. So I'm going to make sure that those other 5 things don't execute, because otherwise that bad data will propagate. And I'm going to raise an alert to let somebody know that this is going on. And the conversation around reports brings me around to business intelligence, which has also gone through a lot of shifts, where, in 2017, business intelligence had already gone through many generations, but it was still very much a build a report, build some visualizations, hand it off to a business user, and let them make their own assumptions about it.
And then once they do see the report, then, okay, well, what's the next action? And business intelligence was still the place where a lot of the semantic aspects of the data were built up. So being able to say, okay, in this organization, I'm going to decide that based on these different attributes of an event or of a product or of a user, this is what counts as a conversion from a potential customer to an actual customer. Because that can be a very complicated question to answer, where maybe if you're a large business, you have different types of customers, or different people within the business have their own concept of when somebody becomes a customer. And so the business intelligence layer might even have 5 different definitions of customer, but you don't even know that there are those 5 definitions because they were all built by different people. And recent iterations of business intelligence have really focused on this aspect of semantic modeling and being able to have that be a shared reference so that there aren't these complications of disagreeing reports, where you're all using the same data, you're looking at the same data, but you're looking at it in different ways, and so it creates skew in terms of the perspective.
And then that also brings in the idea of the semantic layer where maybe that needs to be pulled out into its own component and not live in the business intelligence tool. And the business intelligence tool needs to just reference that other system to understand, okay, what are these domain objects? There's also been a rise in the idea of embedded analytics or customer analytics where for a long time business intelligence was very internally facing where maybe a handful of people would look at the reports that you're building because they were core to the business and how the business was interacting with that information.
But there are also a lot of useful insights that you can surface to your customers from the data that you're collecting from their interactions and from other customers. Recommendation systems have always been an aspect of that, but there are a number of different ways that you can surface some of the users' buying patterns. Or if you're a financial institution, you can use some of the aggregate information about your customers to help give end users some perspective on their spending or their savings. And because we have these more scalable systems that are easier to operate, largely because they're built as a service, it enables us to actually build those analytical reports and expose them to a wider variety of people, so that data is not just an internally focused thing, but can also be externally focused and provide value to a broader audience.
Circling back on the concept of ETL and ELT, there's also been a rise in trying to complete the cycle of data, where it has largely been a very one-directional flow: you pull data out of a source system, you aggregate it, you analyze it, you put it into your business intelligence system, and then it just stops there. Except that it doesn't stop there, because somebody is going to take an action based on that report, but there's not any concrete way for you to see what that next action is. And so reverse ETL builds a pathway for data back out of your data warehouse or your data lake into the systems that it was extracted from: for instance, being able to update your HubSpot or Salesforce records from the information that you gathered from your application about customers' buying patterns, without it having to be a manual process.
And so that creates a data cycle instead of just a line of data. And that makes it possible to continue to iterate on and improve the overall value of that data as you enhance it. And maybe if you are loading data from your application into your warehouse, you're enriching it with data that you're capturing from your CRM or from your internal business systems and then propagating it back out into the user experience, so customers have the opportunity to help you correct that data, even if it's maybe updating their profile information or updating some of the aspects of their customer experience.
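That pathway back out of the warehouse boils down to reading an enriched table and upserting its rows into an operational system. Here is a hedged sketch of the pattern, where the warehouse query and the CRM client are stand-ins invented for illustration, not a real HubSpot or Salesforce SDK:

```python
def read_enriched_customers():
    """Stand-in for a warehouse query, e.g. the output of a dbt model.
    In practice this would be SQL run against the warehouse."""
    return [
        {"email": "a@example.com", "lifetime_value": 1200, "segment": "enterprise"},
        {"email": "b@example.com", "lifetime_value": 80, "segment": "self-serve"},
    ]

class FakeCRM:
    """Stand-in for a CRM API client; real CRMs have their own SDKs and auth."""
    def __init__(self):
        self.records = {}

    def upsert_contact(self, email, properties):
        # Merge new properties into the existing record, keyed by email,
        # so repeated syncs are idempotent rather than creating duplicates.
        self.records.setdefault(email, {}).update(properties)

def sync_to_crm(crm, rows):
    """Push each enriched warehouse row back into the CRM, closing the loop."""
    for row in rows:
        crm.upsert_contact(row["email"], {
            "lifetime_value": row["lifetime_value"],
            "segment": row["segment"],
        })
    return len(rows)

crm = FakeCRM()
synced = sync_to_crm(crm, read_enriched_customers())
```

The key design choice is the upsert keyed on a stable identifier: it makes the sync safe to re-run on a schedule, which is what turns the one-way pipeline into the cycle described above.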
And obviously, in all of this, there are myriad topics that dig very deep into some of the other aspects of the specific frameworks or the specific tooling or the specific applications of the data. But as I was sitting and thinking and reflecting back on the 6 years of doing this show, those are some of the main things that really stuck out as being indicative of their specific eras. And looking forward, there are a lot of new and interesting potential ways to apply and work with data. Machine learning, I think, is today where data engineering was 6 years ago when I started the show. The data engineers have built up these robust data pipelines and made the data reliable and trustworthy to the point that it's easier to work with, which has completed the cycle that started when the data scientists came in wanting to do all these interesting things with the data.
Now they're able to, because the past 6 years of data engineering have really leveled up their capabilities. In parallel with that, machine learning techniques have gotten much more sophisticated. There's been a lot of tooling built up around that to improve the user experience and make it easier for people to apply machine learning even if they're not an expert in the underlying theories and formulas around it. And so machine learning is starting along that new transformational path, where we've gone past just, can we do machine learning, to now, let's operationalize it. And a lot of the investment that went into data engineering is starting to pay dividends in that machine learning ecosystem, which is a big reason why I started the Machine Learning Podcast as a companion to this show: to help explore some of those new and transformative capabilities and try to understand and evolve with the ecosystem as it grows.
And it's interesting for me, because when I started this show 6 years ago about data engineering, I definitely had experience working with data and operationalizing data, but I was very naive as to the potential scope of it and all of the different ways that it's being applied. And so 6 years of running this show have been very informative and have helped me gain a lot of expertise and understanding of the industry. And I feel that I'm in the same space with machine learning that I was with data engineering, where I understand some of the principles of machine learning. I can grasp the foundational concepts and understand what people are talking about. But I'm just at the very beginning of my journey of understanding how machine learning really works fundamentally and some of the ways that the ecosystem is evolving.
And so I'm definitely excited to explore that. And there's also an interesting element, too, where, because machine learning has become more sophisticated and more accessible, it is also being applied to data engineering problems. I've had a few episodes on this show talking about some of that. So Anomalo is a company that is entirely focused on using machine learning to alert on data anomalies and data quality issues. There are also aspects of using machine learning to do entity extraction, to feed back into data engineering, or being able to feed it into your data warehouse.
So there's a lot of interesting back and forth and interplay between data engineering and machine learning, and I'm excited to explore that cross section. And in terms of the lessons that I've learned while running this show that have been really interesting or unexpected or challenging: well, the challenging part is just keeping up with it all and keeping a consistent schedule of running the podcast, understanding deeply enough what is being done so that I can ask useful questions, but also understanding from the audience perspective what is valuable. That, I think, has probably been the hardest part: really getting a good cross section and perspective on how the audience is engaging with the podcast and understanding what I'm doing right. What can I improve?
What are the topics that are really meaningful to people right now? I use my own interests as a gauge for a lot of that, but I'm always interested to hear people's feedback on the main things that you want to know about. Who are the people that I should be talking to? How can I make the show even better for you? And so going forward into the future of the podcast, obviously, I'm going to keep doing a lot of what I'm already doing, but I'm looking to bring more engagement with the audience and with the community. And so as part of that, I'm working through setting up some possible membership options, so stay tuned there. I'll probably send the first announcements of that to my mailing list. So if you're not already on it, you can sign up on the website at dataengineeringpodcast.com.
So I'm hoping to have something ready to go in the next week or two. And, yeah, I just really appreciate everybody who has helped make this show a success, both the guests and especially the audience, because of that validation of people listening to it and giving me some feedback. I've had a lot of people email me saying that they actually got into data engineering as a result of listening to this show. So I just really appreciate all of the value that I've been able to create and the fact that people are truly engaged with this show. And so now as my final closing question to myself, and I've answered this one a few times, some of them fairly recently: what is the biggest gap in the tooling or technology for data management today? I think the biggest gap is really just understanding what is even out there, and I think there's a lot of useful information about how to solve the macro issues. But I think that as you start to really dig into a particular problem or really start to try and integrate across these different systems, there are a lot of little sharp edges that crop up. And so just smoothing out that user experience, providing more information about what some of those roadblocks and sharp edges are. And so circling back on the question of the future of the podcast, if folks have experiences of trying to, you know, use their Airbyte or Fivetran data streams, get them into Snowflake, and then query them to build their business intelligence report, and there were some weird edge cases or problems they had to figure out, or people who are having challenges figuring out how to apply data validations or implement data quality, just any of those experiences, I'd be really happy to discuss that on the show and dig really deep into some of those edge cases and some of the interesting problems that everybody has.
And so it's really valuable to be able to get that firsthand perspective of, okay, I got all of these tools because they're supposed to make my life easier, but there's this one thing that I really had a hard time with, or I had to engineer around, or I had to build my own thing to be able to do it the way I wanted to. So I definitely appreciate everybody's time and interest and energy in helping to grow this show. I'm definitely looking forward to continuing that and taking it further and trying to build up some membership around the show so that I can be more engaged with the audience.
So thank you again. Hope you enjoyed it, and I hope you enjoy the rest of your day.
[00:31:40] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and just tell your friends and coworkers.
Reflecting on 6 Years of the Data Engineering Podcast
The Rise of the Data Engineer
The Hadoop Era and Its Challenges
The Catalyst: Cloud Data Warehouses
Data Lakes and the Emergence of Data Lakehouses
Evolution of Business Intelligence
The Data Cycle: From ETL to ELT and Back
Reflections on the Podcast's Journey and Future Directions