Summary
This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years
Interview
- Introduction
- 6 years of running the Data Engineering Podcast
- Around the first time that data engineering was discussed as a role
- Followed on from hype about "data science"
- Hadoop era
- Streaming
- Lambda and Kappa architectures
- Not really referenced anymore
- "Big Data" era of capture everything has shifted to focusing on data that presents value
- Regulatory environment increases risk, better tools introduce more capability to understand what data is useful
- Data catalogs
- Amundsen and Alation
- Orchestration engine
- Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Flyte, etc.
- Orchestration is now a part of most vertical tools
- Cloud data warehouses
- Data lakes
- DataOps and MLOps
- Data quality to data observability
- Metadata for everything
- Data catalog -> data discovery -> active metadata
- Business intelligence
- Read only reports to metric/semantic layers
- Embedded analytics and data APIs
- Rise of ELT
- dbt
- Corresponding introduction of reverse ETL
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast?
- What do you have planned for the future of the podcast?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features. Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)
Truly leveraging and benefiting from streaming data is hard. The data stack is costly, difficult to use, and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database, not simply a database that connects to streaming systems. With a Postgres-compatible interface, you can now work with real-time data using ANSI SQL, including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring.
Your host is Tobias Macey, and today I'm reflecting on the past 6 years of the Data Engineering Podcast and some of the major trends that have been happening in the ecosystem over that time. So I started this podcast in January of 2017, so it's been just over 6 years now that I've been running it. And for most of that time, I've been releasing weekly. For a little while, I was actually releasing twice a week, so there have been a lot of different topics and interviews that have come on the show, and those reflect a lot of the major trends in the industry, as well as some very interesting examinations of some of the details there.
And so just to look back a little bit on the time that I started the show, that was around the same time that Maxime Beauchemin had published his very widely read posts about the rise of the data engineer and the downfall of the data engineer. And that was right around the time that the entire concept of data engineering as a specific role was starting to take shape. And reflecting back on that in some of the other episodes I've done and on my own, there have been a few thoughts about why that happened, when it did, and how it did. And one of the interesting things about this podcast is that I actually created it a little bit in answer to a large number of podcasts that had started focused on data science. So there were at least a half dozen, maybe a dozen, well-known, widely listened to data science focused podcasts, but there wasn't anything about data engineering. And this was largely because that was also around the same time that data science was a very hyped up job. Everybody said, oh, data science is going to do all these amazing things because we have data, and so we can find all kinds of useful insights. We can build machine learning, and machine learning was also still in its early days. This was before deep learning had really taken off.
And so lots of companies were hiring data scientists because of the supposed promise of, if I hire a data scientist, they'll be able to tell me all the things that I need to do to make my business run better or improve my customers' experiences or bring in more customers. And that had been happening for a few years at the time that I started this show in 2017. And I think that data engineering really came about in answer to all of those data scientists being hired and then coming to the realization that before they could even do the work that they were hired to do, they first had to do a whole bunch of data aggregation and data cleaning, and so it wasn't really possible for them to do their jobs. And so they actually became the first incidental data engineers before companies started hiring for that role explicitly. And so, because companies were investing in data science and data scientists and seeing that that investment wasn't paying off, they then started to hire specifically for people to do that initial work of gathering the data, cleaning it up, and making it available to data scientists to do the work that they were supposed to be doing. And also, 2017 was on what ended up being the tail end of the Hadoop era, where Hadoop came onto the scene in the early 2000s and was seen as this economical system that I can use to gather all the data that I want. And the term big data was really taking off. And so the general trend was that if we just collect all of the data about everything all the time, then eventually we'll be able to make some use of it. And so there were a lot of people who were dealing with scaling problems for those Hadoop clusters.
And in addition to the challenges of scaling the clusters, there were also a lot of complexities in dealing with the programming paradigm of MapReduce and being able to manage repeatability, figuring out how long it was going to take for given jobs to run, and sequencing jobs effectively. And a lot of different tools and platforms and add-on components grew up around the core of the Hadoop framework. And so there were things like Hive, and that also led to things like Presto and HBase. And there was a whole suite of tools, such as Oozie, that came out in response to Hadoop, trying to add simpler interfaces on top of it and being able to use it as a storage layer for SQL engines, and that went on for a number of years. And in 2017, there was still some momentum in the Hadoop ecosystem, but it had started to die down a little bit in favor of some of the next generation of tooling.
It was also around when Airflow started to gain popularity as an orchestration layer for being able to make sure that tasks got sequenced properly. And so it was an interesting time to start the podcast because of the fact that there was a little bit of a transition happening. People who were heavily invested in Hadoop were still trying to make it work and add on some of the new tools around it. There was also a lot of hype about streaming and different streaming engines and the fact that streaming was going to be superior to batch for a number of reasons, particularly because of the timeliness challenges that people were experiencing with Hadoop. And streaming is definitely still a very prevalent topic now, but it was a very popular aspect of conversation then because it was still new, and there were still a lot of engines that were being developed and had a lot of momentum behind them. So Storm and Flink and Spark were some of the major ones.
And this also gave rise to the different paradigms of how to address these data scaling issues. So the Lambda architecture was created as a way to try and reconcile these batch and streaming workflows, where you would effectively have to write your logic twice: you would use streaming for real time and a good enough approximation of what reality looked like, and then you would have your Hadoop batch jobs that would come in afterwards and catch up to a certain point in time with a more accurate view. The Kappa architecture was a little bit of a response of just stream everything so you don't need this batch layer.
And those conversations have largely died down as it has become more feasible to use the streaming engine as your only source of truth and still be able to use that same logic to replay all of history. And there's also been an interesting shift in that concept of just capture all of the data all of the time, because people realized, one, that storing all of that data is expensive, and you don't necessarily capture all of the value that you put into storing it and processing it. But also the regulatory environment has changed, where there's a lot of increased risk for storing all of the data that you might have, such as personally identifiable information or, with GDPR, the risk of having to delete data when a customer requests it and just being able to understand where is that data, what data do I have. And so companies have gotten a lot more judicious about what data they capture and making sure that it is going to have some value rather than just capturing it for the sake of capturing it. And that also brings in the era of data catalogs, where data cataloging had existed, but with the big data mentality of just throw everything into the data lake, you didn't always know what you had. Or if you did have it, you didn't know if it was useful or how it was being used.
And there were tools such as Alation and some of the other commercial data catalogs that were there, although they were largely manual, where people would enter in the different data that they had and what the schema was supposed to be. There wasn't necessarily validation of that. And then Amundsen was one of the first tools that gained a lot of popularity for automated cataloging, being able to integrate with things like Airflow or your other databases and orchestration engines to understand what data you have and how it's being used. And so the visibility of the data also made it easier to gain value from it, and so you didn't necessarily have to capture everything and then spend a lot of time exploring it to see what you had.
And the data catalog conversation over the past few years has really evolved into data discovery and the metadata layer, and I'll touch on that a little bit more in a little while. And orchestration engines have also been gaining a lot of momentum as a topic that is core to the overall data platform, where you have to have some orchestration engine as the means of understanding what gets executed when and how, rather than having somebody manually run a bash script or having a cron job set up. And the orchestration engines have also gone through some generational shifts, where they were initially just task based before coming to the realization that the orchestration engine should understand what the actual data is that it's processing.
Because even if a task says that it's completed, it's possible that it could have had a silent failure. Or even if it does complete, maybe there's something wrong with the data, but all the engine knows is that it finished: I got a successful exit code, so move on to the next thing. And so some of the next generation of orchestration engines have decided that being aware of what the data is and why and how it relates to subsequent downstream uses is a necessary fundamental abstraction to be able to actually build up scalable and successful data platforms.
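To make that shift concrete, here is a minimal sketch in plain Python of the difference between "the task exited cleanly" and "the task produced usable data". All of the names here (`AssetResult`, `run_asset`, `run_pipeline`) are invented for illustration; this is not the API of any particular orchestration engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class AssetResult:
    """What a data-aware orchestrator tracks: not just exit status,
    but facts about the data a task produced."""
    name: str
    row_count: int
    succeeded: bool

def run_asset(name: str, produce: Callable[[], list], min_rows: int = 1) -> AssetResult:
    """Run a task, then validate its output before declaring success.
    A clean exit alone is not enough; the data itself is checked."""
    rows = produce()
    # A silent failure often shows up as an unexpectedly empty output.
    ok = len(rows) >= min_rows
    return AssetResult(name=name, row_count=len(rows), succeeded=ok)

def run_pipeline(steps: List[Tuple[str, Callable[[], list]]]) -> List[AssetResult]:
    """Only run a downstream step if its upstream data checks passed."""
    results = []
    for name, produce in steps:
        result = run_asset(name, produce)
        results.append(result)
        if not result.succeeded:
            print(f"halting: {name} produced no usable data")
            break
    return results

# The second step "succeeds" as a process (no exception, would have exit
# code 0) but yields no rows, so the third step never runs.
steps = [
    ("extract", lambda: [1, 2, 3]),
    ("transform", lambda: []),   # silent failure: empty output
    ("load", lambda: [1, 2, 3]),
]
results = run_pipeline(steps)
```

A purely task-based scheduler would have marked all three steps green here; checking the data itself is what stops the bad run from propagating.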
The real catalyst in the past probably 5 years of data has really been the rise of cloud data warehouses, where Redshift was definitely the first notable one that came onto the scene and really made people start thinking differently about what data warehousing means, how it scales, and the cost benefit analysis of it, where it used to be a very expensive appliance that you would have to have in your data center, and now it's something that you can rent and, actually, it can be fairly economical. Obviously, there are different challenges of managing cost with a pay as you go model. But shortly after that, there were also BigQuery and Snowflake. And so Redshift, BigQuery, and Snowflake have really been the major motivators for the modern era of data engineering. There's been a lot of hype about the modern data stack. And regardless, this concept of a cloud data warehouse in whatever form it takes has really become the focal point of how companies work with their data.
So there have been whole businesses that have come about to build on top of the data warehouse for anomaly detection or data quality analysis, and other topics I'll touch on shortly. But the data warehouse has really become the juncture of everything. And putting everything into one location and having schema visibility of that data is, I think, the main thing that has really enabled the current approach to data engineering and some of the ways that we are able to continue to evolve, because there is a common understanding of how to work with that data, and you don't necessarily have to be a distributed systems engineer to be able to get anything out of it. The other thing that these cloud data warehouses have really done, because they're scalable and because the storage is more economical than it used to be, is shift the conversation from extract, transform, and load, where you have to make sure that the data is in a specific shape before you even store it in your data warehouse because doing it afterwards is too expensive or because storing all of it is too expensive, and really moved us into extract, load, transform. Obviously, there are different orderings of the transform and load steps, but it really has allowed for bringing in all of the data and then transforming it and iterating on it so that you can be more, I'll use the term, agile about it. Agile in the software sense of being able to build on top of successive iterations, being able to deliver value quickly without having to do a huge amount of upfront work before you can actually get anything done.
And another tool that has really been transformational, no pun intended, although I guess it should have been intended, in that space is dbt, which also gave rise to the concept of the analytics engineer, where the people who are doing the analysis are involved in the repeatability and robustness of the data that they're working with and are brought into those software principles. And so dbt and the cloud data warehouses really catalyzed us into where we are now with our capabilities, as well as allowing more businesses to actually get in on the game of using data to power their companies and improve their customers' experiences.
Data lakes have still been a conversation. Hadoop was probably the first major iteration of that. Larger organizations have maintained data lakes because of the scalability aspects and the fact that they're very flexible in terms of what you can do with them, but they've always posed a problem. And one of the interesting paradigms in the past year or two that has really taken off is the idea of a data lakehouse, where you can have the benefits and scalability of the data lake, but you can also have the organizational and user interface improvements of data warehouses, and be able to get the best of both worlds together.
And so you can still use that same approach of ELT with dbt and a SQL interface for working with the data, so that you can bring more people into the experience and into the work. With the data lake, it used to be that you had to write some very complicated code to be able to process the data and load it into anything, and then you probably had to put it into a data warehouse anyway to be able to query it. Now you can do all of that in one place. And because working with the data has a much lower barrier to entry now, people who are working in the space can focus on some of the higher order concerns, which brought in the concepts of DataOps and MLOps: being able to make sure that everything is repeatable and stable and robust and being able to know when things fail.
So data quality and data observability are some of the core aspects of that. Being able to monitor the entire data platform, both in terms of making sure that your data is actually getting loaded when it's supposed to, but also making sure that as it's getting loaded, you're checking, is this data conforming to these specific requirements? Is the schema the same, or did a new column get added or dropped? Is the distribution of data within the range that I expect? You know, if I'm dealing with financial transactions, do I always have decimal numbers, or do I somehow randomly have a float in there? Because that's definitely not going to work very well.
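Checks like the ones just described are often expressed as simple assertions over a batch of records. Here is a minimal sketch, where the table shape, column names, and allowed values are all invented for illustration rather than taken from any real platform:

```python
from decimal import Decimal

# A hypothetical batch of financial transaction records.
rows = [
    {"amount": Decimal("19.99"), "status": "settled"},
    {"amount": Decimal("5.00"), "status": "pending"},
]

ALLOWED_STATUSES = {"settled", "pending", "failed"}
EXPECTED_COLUMNS = {"amount", "status"}

def check_batch(rows):
    """Return a list of data quality violations; empty means the batch is clean."""
    violations = []
    for i, row in enumerate(rows):
        # Schema check: did a column get added or dropped?
        if set(row) != EXPECTED_COLUMNS:
            violations.append(f"row {i}: unexpected columns {sorted(row)}")
        # Type check: a float sneaking in where a Decimal is required.
        if not isinstance(row.get("amount"), Decimal):
            violations.append(f"row {i}: amount is not a Decimal")
        # Enumeration check: only the agreed-upon values are allowed.
        if row.get("status") not in ALLOWED_STATUSES:
            violations.append(f"row {i}: unknown status {row.get('status')!r}")
    return violations

assert check_batch(rows) == []                       # the clean batch passes
bad = rows + [{"amount": 3.5, "status": "???"}]      # float amount, bad status
print(check_batch(bad))                              # reports both violations
```

Data observability tools generalize this idea: instead of hand-written assertions, they learn or configure expectations over schemas, distributions, and freshness, and alert when a batch drifts outside them.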
Or if I have a value that is only supposed to go up, why did it suddenly go down? Or if it's an enumeration, you know, I should only have 5 possible values for the string field, but now all of a sudden I have 7. What went wrong here? So those are some of the other things that have really come about in recent years because of the fact that there is this core shared abstraction that everybody can build on top of. There are a lot more agreed upon interfaces that people can collaborate on so that they can build higher order tooling and higher order experiences, and everybody can benefit as a result. And then circling back on the metadata concept, this aspect of data observability and data discovery all ties together: being able to understand what is the shape of the data that I have, where is it coming from, where is it going to, how is it getting there, what is being done to it, and who is actually using the data. That's one of the really key pieces of understanding all of this work that I'm doing. Is it even worth the time that I'm putting into it? Because if I'm spending hours or weeks building all of this tooling to be able to get the data out of this database and out of this SaaS platform and load it into this report for somebody.
If nobody's looking at it, then why do I continue to maintain that? So that's another key aspect of being able to close the loop on: I've built this thing. Is it delivering the value that it's supposed to deliver? And that brings it back to that concept of the agile methodology of making sure that the work that you're doing is being done for a purpose and that that purpose is being fulfilled, because otherwise you're just spinning your wheels, and why are you even doing any of it? And so the fact that metadata as a broad concept is starting to coalesce into tooling that is able to encapsulate some of those different concerns of monitoring for quality (What is the distribution of these fields? What is the schema that I'm working in?), as well as lineage of where the data is coming from and going to and being able to understand how it's being used, really empowers much more useful and much more valuable engineering around how the information is being applied. And then one of the most recent trends in being able to actually have all of this metadata coalesced into a single layer is the idea of active metadata, where you can actually use that metadata to inform some of the automation routines. So maybe I know that this job happens at 5 AM every day, and this report gets looked at at 9 AM every day. So I'm going to make sure that I automatically scale up my Snowflake cluster to ensure that this report will complete in the time that it's allowed so that when somebody's looking at the report, they have the freshest data possible.
Or I can use this metadata to understand, okay, this job has, you know, 5 downstream dependencies, and this one step of it just failed. So I'm going to make sure that those other 5 things don't execute, because otherwise that bad data will propagate. And I'm going to raise an alert to let somebody know that this is going on. And the conversation around reports brings me around to business intelligence, which has also gone through a lot of shifts, where, in 2017, business intelligence had already gone through many generations, but it was still very much a build a report, build some visualizations, hand it off to a business user, and let them make their own assumptions about it.
And then once they do see the report, then, okay, well, what's the next action? And business intelligence was still the place where a lot of the semantic aspects of the data were built up. So being able to say, okay, in this organization, I'm going to decide that based on these different attributes of an event or of a product or of a user, this is what counts as a conversion from a potential customer to an actual customer. Because that can be a very complicated question to answer, where maybe if you're a large business, you have different types of customers, or different people within the business have their own concept of when somebody becomes a customer. And so the business intelligence layer might even have 5 different definitions of customer, but you don't even know that there are those 5 definitions because they were all built by different people. And recent iterations of business intelligence have really focused on this aspect of semantic modeling and being able to have that be a shared reference so that there aren't these complications of disagreeing reports, where you're all using the same data, you're looking at the same data, but you're looking at it in different ways, and so it creates skew in terms of the perspective.
And then that also brings in the idea of the semantic layer where maybe that needs to be pulled out into its own component and not live in the business intelligence tool. And the business intelligence tool needs to just reference that other system to understand, okay, what are these domain objects? There's also been a rise in the idea of embedded analytics or customer analytics where for a long time business intelligence was very internally facing where maybe a handful of people would look at the reports that you're building because they were core to the business and how the business was interacting with that information.
But there are also a lot of useful insights that you can surface to your customers from the data that you're collecting from their interactions and from other customers. Recommendation systems have always been an aspect of that, but there are a number of different ways that you can surface some of the users' buying patterns. Or if you're a financial institution, you can use some of the aggregate information about your customers to help give end users some perspective on their spending or their savings. And because we have these more scalable systems that are easier to operate, largely because they're built as a service, it enables us to actually build those analytical reports and expose them to a wider variety of people, so that data is not just an internally focused thing, but can also be externally focused and provide value to a broader audience.
Circling back on the concept of ETL and ELT, there's also been a rise in trying to complete the cycle of data, where it has largely been a very one-directional flow: you pull data out of a source system, you aggregate it, you analyze it, you put it into your business intelligence system, and then it just stops there. Except that it doesn't stop there, because somebody is going to take an action based on that report, but there's not any concrete way for you to see what that next action is. And so reverse ETL builds a pathway for data back out of your data warehouse or your data lake into the systems that it was extracted from: for instance, being able to update your HubSpot or Salesforce records from the information that you gathered from your application about customers' buying patterns, without it having to be a manual process.
And so that creates a data cycle instead of just a line of data. And that makes it possible to continue to iterate on and improve the overall value of that data as you enhance it. And maybe if you are loading data from your application into your warehouse, you're enriching it with data that you're capturing from your CRM or from your internal business systems and then propagating it back out into the user experience, so customers have the opportunity to help you correct that data, even if it's maybe updating their profile information or updating some of the aspects of their customer experience.
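That pathway back out of the warehouse boils down to reading an enriched table and upserting its rows into an operational system. Here is a hedged sketch of the pattern, where the warehouse query and the CRM client are stand-ins invented for illustration, not a real HubSpot or Salesforce SDK:

```python
def read_enriched_customers():
    """Stand-in for a warehouse query, e.g. the output of a dbt model.
    In practice this would be SQL run against the warehouse."""
    return [
        {"email": "a@example.com", "lifetime_value": 1200, "segment": "enterprise"},
        {"email": "b@example.com", "lifetime_value": 80, "segment": "self-serve"},
    ]

class FakeCRM:
    """Stand-in for a CRM API client; real CRMs have their own SDKs and auth."""
    def __init__(self):
        self.records = {}

    def upsert_contact(self, email, properties):
        # Merge new properties into the existing record, keyed by email,
        # so repeated syncs are idempotent rather than creating duplicates.
        self.records.setdefault(email, {}).update(properties)

def sync_to_crm(crm, rows):
    """Push each enriched warehouse row back into the CRM, closing the loop."""
    for row in rows:
        crm.upsert_contact(row["email"], {
            "lifetime_value": row["lifetime_value"],
            "segment": row["segment"],
        })
    return len(rows)

crm = FakeCRM()
synced = sync_to_crm(crm, read_enriched_customers())
```

The key design choice is the upsert keyed on a stable identifier: it makes the sync safe to re-run on a schedule, which is what turns the one-way pipeline into the cycle described above.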
And obviously, in all of this, there are myriad topics that dig very deep into some of the other aspects of the specific frameworks or the specific tooling or the specific applications of the data. But as I was sitting and thinking and reflecting back on the 6 years of doing this show, those are some of the main things that really stuck out as being indicative of their specific eras. And looking forward, there are a lot of new and interesting potential ways to apply and work with data. Machine learning, I think, is today where data engineering was 6 years ago when I started the show. The data engineers have built up these robust data pipelines and made the data reliable and trustworthy to the point that it's easier to work with, which has completed the cycle that started when the data scientists came in wanting to do all these interesting things with the data.
Now they're able to, because the past 6 years of data engineering have really leveled up their capabilities. In parallel with that, machine learning techniques have gotten much more sophisticated. There's been a lot of tooling built up around that to improve the user experience and make it easier for people to apply machine learning even if they're not an expert in the underlying theories and formulas around it. And so machine learning is starting along that new transformational path, where we've gone past just, can we do machine learning, to now, let's operationalize it. And a lot of the investment that went into data engineering is starting to pay dividends in that machine learning ecosystem, which is a big reason why I started the Machine Learning Podcast as a companion to this show: to help explore some of those new and transformative capabilities and try to understand and evolve with the ecosystem as it grows.
And it's interesting for me, because when I started this show 6 years ago about data engineering, I definitely had experience working with data and operationalizing data, but I was very naive as to the potential scope of it and all of the different ways that it's being applied. And so 6 years of running this show have been very informative and have helped me gain a lot of expertise and understanding of the industry. And I feel that I'm in the same space with machine learning that I was with data engineering, where I understand some of the principles of machine learning. I can grasp the foundational concepts and understand what people are talking about. But I'm just at the very beginning of my journey of understanding how machine learning really works fundamentally and some of the ways that the ecosystem is evolving.
And so I'm definitely excited to explore that. And there's also an interesting element, too, where, because machine learning has become more sophisticated and more accessible, it is also being applied to data engineering problems. I've had a few episodes on this show talking about some of that. So Anomalo is a company that is entirely focused on using machine learning to alert on data anomalies and data quality issues. There are also aspects of using machine learning to do entity extraction, to feed back into data engineering, or being able to feed it into your data warehouse.
So there's a lot of interesting back and forth and interplay between data engineering and machine learning, and I'm excited to explore that cross section. And in terms of the lessons that I've learned while running this show that have been really interesting or unexpected or challenging: well, the challenging part is just keeping up with it all and keeping a consistent schedule of running the podcast, understanding deeply enough what is being done so that I can ask useful questions, but also understanding from the audience perspective what is valuable. That, I think, has probably been the hardest part: really getting a good cross section and perspective on how the audience is engaging with the podcast and understanding what I'm doing right. What can I improve?
What are the topics that are really meaningful to people right now? I use my own interests as a gauge for a lot of that, but I'm always interested to hear people's feedback on the main things that you want to know about. Who are the people that I should be talking to? How can I make the show even better for you? And so going forward into the future of the podcast, obviously, I'm going to keep doing a lot of what I'm already doing, but I'm looking to bring more engagement with the audience and with the community. And so as part of that, I'm working through setting up some possible membership options, so stay tuned there. I'll probably send the first announcements of that to my mailing list. So if you're not already on it, you can sign up on the website at dataengineeringpodcast.com.
So I'm hoping to have something ready to go in the next week or two. And, yeah, I just really appreciate everybody who has helped make this show a success, both the guests and especially the audience, because of that validation of people listening to it and giving me some feedback. I've had a lot of people email me saying that they actually got into data engineering as a result of listening to this show. So I just really appreciate all of the value that I've been able to create and the fact that people are truly engaged with this show. And so now as my final closing question to myself, and I've answered this one a few times, some of them fairly recently: what is the biggest gap in the tooling or technology for data management today? I think the biggest gap is really just understanding what is even out there, and I think there's a lot of useful information about how to solve the macro issues. But I think that as you start to really dig into a particular problem or really start to try and integrate across these different systems, there are a lot of little sharp edges that crop up. And so just smoothing out that user experience, providing more information about what some of those roadblocks and sharp edges are. And so circling back on the question of the future of the podcast, if folks have experiences of trying to, you know, use their Airbyte or Fivetran data streams, get them into Snowflake, and then query them to build their business intelligence report, and there were some weird edge cases or problems they had to figure out, or people who are having challenges figuring out how to apply data validations or implement data quality, just any of those experiences, I'd be really happy to discuss that on the show and dig really deep into some of those edge cases and some of the interesting problems that everybody has.
And so it's really valuable to be able to get that firsthand perspective of, okay, I got all of these tools because they're supposed to make my life easier, but there's this one thing that I really had a hard time with, or I had to engineer around, or I had to build my own thing to be able to do it the way I wanted to. So I definitely appreciate everybody's time and interest and energy in helping to grow this show. I'm definitely looking forward to continuing that and taking it further and trying to build up some membership around the show so that I can be more engaged with the audience.
So thank you again. Hope you enjoyed it, and I hope you enjoy the rest of your day.
[00:31:40] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and just tell your friends and coworkers.
Reflecting on 6 Years of the Data Engineering Podcast
The Rise of the Data Engineer
The Hadoop Era and Its Challenges
The Catalyst: Cloud Data Warehouses
Data Lakes and the Emergence of Data Lakehouses
Evolution of Business Intelligence
The Data Cycle: From ETL to ELT and Back
Reflections on the Podcast's Journey and Future Directions