Summary
The dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Sean Knapp about the role of data automation in building maintainable systems
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you mean by the term "data automation" and the assumptions that it includes?
- One of the perennial challenges of automation is that there are always steps that are resistant to being performed without human involvement. What are some of the tasks that you have found to be common problems in that sense?
- What are the different concerns that need to be included in a stack that supports fully automated data workflows?
- There was recently an interesting article suggesting that the "left-to-right" approach to data workflows is backwards. In your experience, what would be required to allow for triggering data processes based on the needs of the data consumers? (e.g. "make sure that this BI dashboard is up to date every 6 hours")
- What are the tasks that are most complex to build automation for?
- What are some companies or tools/platforms that you consider to be exemplars of "data automation done right"?
- What are the common themes/patterns that they build from?
- How have you approached the need for data automation in the implementation of the Ascend product?
- How have the requirements for data automation changed as data plays a more prominent role in a growing number of businesses?
- What are the foundational elements that are unchanging?
- What are the most interesting, innovative, or unexpected ways that you have seen data automation implemented?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data automation at Ascend?
- What are some of the ways that data automation can go wrong?
- What are you keeping an eye on across the data ecosystem?
Contact Info
- @seanknapp on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Ascend
- Google Sawzall
- CI/CD
- Airflow
- Kubernetes
- Ascend FlexCode
- MongoDB
- SHA == Secure Hash Algorithm
- dbt
- Materialized View
- Great Expectations
- Monte Carlo
- OpenLineage
- Open Metadata
- Egeria
- OOM == Out Of Memory
- Five Whys
- Data Mesh
- Data Fabric
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Bigeye: ![Bigeye](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/qaHgbHoq.png) Bigeye is an industry-leading data observability platform that gives data engineering and science teams the tools they need to ensure their data is always fresh, accurate and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye’s automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business. Go to [dataengineeringpodcast.com/bigeye](https://www.dataengineeringpodcast.com/bigeye) today and start trusting your data.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Sean Knapp to talk about the role of data automation in building maintainable systems. So, Sean, can you start by introducing yourself?
[00:01:37] Unknown:
Yeah. Absolutely. Thanks for having me. Sean Knapp. I'm the founder and CEO at Ascend.io. And despite the CEO title, I have had a long career of doing data engineering for the last
[00:01:51] Unknown:
18 plus years now in software engineering and data engineering. So really excited to chat about some of that. And for folks who haven't listened to your prior appearance on the show, I'll add a link. But just as a quick recap, do you remember how you first got started working in data?
[00:02:05] Unknown:
Yeah. I do. Way, way, way back when. You know, for those of us who are now seeing lots of gray in our beards and hair. I actually started as a software engineer at Google back in 2004, and my remit was working on the front end for web search. Really exciting time to be there. You start pushing around pixels, experimenting with different user experiences. And we did this a ton for many years. And the thing was, when you push pixels and you wanna know whether or not they did something for user engagement, you end up writing a lot of data pipelines.
And so one of the first languages I learned inside of Google was our internal language, Sawzall, that allowed me to write data pipelines on MapReduce to analyze Google session logs and figure out the efficacy of our various UI experiments. And so I accidentally started doing data engineering all the way back in 2004 as a kid fresh out of college.
[00:03:02] Unknown:
So the focus for this conversation is around this idea of data automation. And I'm wondering if you can give your definition of what that means and some of the assumptions that are embedded in that phrase.
[00:03:15] Unknown:
You know, I think there's a lot to data automation, and it can encapsulate different things. For some folks, it starts very simply. And for some folks, I think it really does expand out to a much broader space. Oftentimes, people do start with that baseline notion of automation being just simple orchestration, for example. But I think the industry is really starting to evolve into a broader expectation and understanding of automation, similar to what we're seeing in other fields, that really does gravitate towards solving for more and more of the things that we either have to manually do or have to write code to do, with automation increasingly doing those at increasingly high levels of sophistication.
Oftentimes, I try and shorthand this as automation usually equals some combination of orchestration plus metadata plus AI to do far more advanced things for us.
[00:04:14] Unknown:
I like that you called out specifically the assumption that automation just means writing enough of the logic that the thing that I want it to do gets done without me having to push a button. And I'm wondering: what are some of the ways that automation in maybe traditional software development and deployment workflows ceases to be sufficient as you get into this space of automating for data, and the types of logic that you're not able to write preemptively and anticipate, and that pushes you into needing to actually rely more on that metadata plus AI to be able to achieve a similar outcome?
[00:04:53] Unknown:
Yeah. Absolutely. I think in engineering, we've increasingly become more and more sophisticated with automated tools. For example, in the DevOps domain, as we look at things like CI/CD, we've continually benefited from more and more levels of automation and tooling to accomplish tasks. I think when we get into more advanced levels of automation, we tend to actually see a couple of parallel tracks. You know, one of these parallel tracks that we see is this notion of imperative versus declarative systems.
Airflow, for example, is an imperative orchestration system. There are declarative models out there for orchestration as well. Terraform, at the infrastructure layer, is a declarative model for how we define infrastructure. So we see those models. But when we connect those into automation and continuously running systems, I would say the sort of, you know, reigning champion example of high-end automation in the field of engineering these days really is Kubernetes. And it's because it's been able to marry not just this declarative model, but actually morph that into a continuously running control plane that is always on, always running.
And it is that sort of beautiful balance of declarative, metadata backed, and continuously running that really takes us to the highest levels we've seen in engineering for high-end automation, automation that solves profound levels of pain and agony and takes that away from engineers.
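The continuously running, declarative control plane Sean describes here is essentially a reconciliation loop: compare the declared desired state against the observed state and issue corrective actions until they converge. The following is a toy sketch of that pattern, not Kubernetes' actual code; all names and the flat dict state model are illustrative assumptions:

```python
# Minimal sketch of a declarative reconciliation loop, in the spirit of
# Kubernetes controllers: diff desired vs. observed state and apply
# corrective actions until the two converge. All names are hypothetical.

def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name, spec))
        elif observed[name] != spec:
            actions.append(("update", name, spec))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions

def control_loop(desired: dict, observed: dict, apply_action) -> None:
    """A real control plane runs forever; here we loop until converged."""
    while True:
        actions = reconcile(desired, observed)
        if not actions:
            break
        for action in actions:
            apply_action(observed, action)

def apply_action(observed: dict, action: tuple) -> None:
    """Mutate observed state to carry out one corrective action."""
    kind, name, *rest = action
    if kind == "delete":
        observed.pop(name)
    else:  # create or update
        observed[name] = rest[0]

# Example: converge a toy "cluster" toward the declared specs.
desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
observed = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}
control_loop(desired, observed, apply_action)
print(observed)  # observed now matches desired
```

The key property is that the loop is level-triggered rather than edge-triggered: it does not matter how the state drifted, only that the next pass of the loop notices and corrects it.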
[00:06:32] Unknown:
Implicit in the framing that I just gave about not being able to preempt the types of workflows that you need to be able to code for is the question of automation in most experiences having a certain stopping point, where you can automate for a set of known knowns and maybe even known unknowns. But there always ends up being some edge case where you have to get a human involved, and they have to click the buttons or, you know, write custom code or do some operation that only the human is able to do, because of the fact that it is a new space that needs to be understood and solved for, and there's not necessarily enough trust or enough capability or understanding in the computer system to be able to actually solve for that problem. And I'm wondering what you see as some of the tasks, particularly in the data ecosystem, that are common problems that are resistant to automation?
[00:07:32] Unknown:
Yeah. Great question. First, it's funny, I'll tell you a couple of fun anecdotes. You know, we hear this a lot from folks, which is, hey, automation may make 95% of my job easier, like, profoundly easier. But if you make that last 5% impossible, it's still a nonstarter. And so you do need highly automated systems to still afford users that ability to do things, the more advanced things, the more custom things. So I think that's really important. One of our old solution architects actually had the saying: you can't give people a Tesla, have it be fully self-driving, and not give them a steering wheel, because at some point they will still wanna take over, or, as you mentioned, you do still have to earn trust. And so you do need the ability to build that trust and that comfort and give people the escape hatches or the controls back when they need it.
And so the category of things where we've generally seen people really need those escape hatches is usually around things that require more imperative logic or higher, more customized needs than what the system currently can grok and understand. And that's where this is a forever keeping-up game: some new capabilities, some new understanding that folks have, and you're trying to make sure the system keeps up as much as it can with that. And the approach that we've taken to this inside of Ascend has been what we actually call a flex code model, which is a fancy term, as we think about low-code and no-code systems, for a model that allows you to flex deeper into the stack and write plugins and modules and adapters that extend the capabilities, that allow you to implement imperative logic while still supporting declarative constructs.
Not too dissimilar, again, from what we see with the ability to write your own operators inside of Kubernetes, for example. A similar kind of notion of, hey, when somebody wants to go more advanced, let's actually allow them to extend the platform itself and add to its capabilities without throwing the baby out with the bathwater and having to make them flip all the way back to, you know, more primitive versions of imperative-based models, or systems that don't have the benefit of a continuously running control plane attached to them. Another aspect
[00:09:49] Unknown:
of automation resistance that comes up particularly in the data space is the fact that there is a boundary to the platform that you're using to manage the automated flows. And as you move outside of the boundaries of that platform and what it owns, it becomes increasingly harder to automate different pieces of interaction. So, for instance, in the Ascend use case, you're able to ingest data from a source system, but at some point in that source system, you have to have some logic to be able to generate that data, or be able to create the credentials that Ascend is able to use to reach into that system.
In the destination systems, you can maybe push data into it, but then you have to either have some way of reaching into that system to then trigger additional flows, like landing data in Snowflake and then triggering a dbt run. But if you're pushing that data into some black box that maybe has a defined interface for inserting the data but no additional controls to be able to own any downstream workflows, then you can't automate that without having some way to maybe, you know, build a different system that lives in that black box space and hopefully try to, you know, take some baling wire and twine and duct tape to hook them together. And I'm wondering, going back to your analogy of Kubernetes, you know, it has these core abstractions that allow you to write these customized plugins to build upon them and be able to use the existing APIs to add your own specific use cases.
Bringing that all together, I'm wondering what you see as some of the patterns in the data ecosystem that lead to some of these challenges of, you know, black boxes or lack of access to be able to hook into some of these processes to fully automate the end-to-end flows, and also some of the standards, either existing or evolving, that allow for this interoperability, to let something like an Ascend or, you know, an orchestrator of choice or other automation platform reach into those other systems to be able to, you know, serve as the puppet master, regardless of what the boundaries of that core platform happen to be?
[00:12:06] Unknown:
Fantastic question. Like, super, super cool. And even before you used the term, which you used 3 times, I love that term. Well, I think it's a bad thing, but I love the use of the term in this context, which is the black box. And I think it's so on point that oftentimes the highly automated systems that we see out there in the wild are also very much closed systems and black boxes, and it's hard to tap into them, which I think is a travesty. Kubernetes has certainly set, I think, the gold standard there by exposing so many controls and so much extensibility and so much of the data, because, ultimately, any automated system is built on an abundance of metadata and has incredibly large volumes of metadata that are of extreme value to that automated system, but also of extreme value to everybody else who may be interested in using that system. For example, in the data engineering context, in a highly automated system, in the Ascend world, the metadata we collect includes the profile of every partition of data that moves through, whether it's semantically partitioned or just a free-floating fragment. We track what code operated on it, what code generated it. We track the SHAs of the code. We track the input partitions of that data. We track who accesses that data.
And we use all of this, and we traverse up and down DAGs to figure out what the system should do and what processing should happen. So you have a huge amount of metadata, and we're really big believers in opening up access to that metadata to other systems. Now, you know, popping off the stack a little bit further into the question set here, around what you're talking to, of how do we make it easy enough to integrate into other systems, there are a couple of abstraction layers that we really believe in and that are really important.
The first one we look at is the abstraction layer and the plug-in architecture for where you run. From a raw infrastructure perspective, is it Amazon, Google, or Azure, and how can you run on top of that environment? That's a pretty easy one. Of course, we're big fans of K8s, so we run everything on Kubernetes. That's pretty easy. But then you pop up another level in the architecture, and it is also what we call a data plane. Where does your data sit? What is your primary processing engine, or better yet, even a set of primary processing engines? We've created a very clean abstraction around this where, whether you want to run on Snowflake or Databricks or BigQuery, and a couple of others coming soon too, you have the ability to very easily specify how you want to interact with that environment and that ecosystem.
And so there's a very limited number of what we'll call the core fundamental calls you have to be able to implement. So that makes it very pluggable into a data plane architecture. The other part is that we've also created abstraction layers and a plug-in architecture for how you connect into read and write systems: connectors. Whether it's Salesforce, a different data plane, MongoDB, you name it. Those also follow really elegant abstraction and design patterns, and by creating a clean architecture for this, the implementation of new connectors is very, very simple and easy to do. And so we also create these architectures so you can plug into everything.
The next abstraction layer that we also created inside of Ascend is the ability to control your entire graphs and your data flows, and actually even download them as executable Python that you can then reapply back up to the APIs. It was one of the first times we've seen this in the industry, where not only can you download definitions as JSON objects or YAML objects like you can in Kubernetes, but you can download an actual executable Python set of files that are the definition of your data flow, that you can check into Git, that you can programmatically extend and modify, and that wrap the SDK to go back and recreate these data flows. And so it is a bidirectional code-level sync into the system as well. We put in all these various abstraction layers and integrations to make sure that it doesn't really matter if it's at the connector level, at the data plane level, or at the data access level: you have access to everything inside of the system itself.
[00:16:39] Unknown:
One of the things that came to mind as you were describing the extensibility of the Ascend platform is that, in my initial framing of the question, I was focused on moving from the automation engine out into the peripheral systems. But there's also the question of, if the peripheral system is designed to be opaque to, you know, an automation engine, usually it's because they want to own some of those different workflows, so they might have some capabilities of being able to call back into, you know, the automation engine or whatever other systems. And so I'm also interested in exploring that question of making that core automation engine accessible to those external systems, so that you can maybe flip the direction of calling or triggering, so that you don't get hamstrung by saying, okay, I can automate up to this point, but then it's off in this other system, and now I have to do something totally different.
But being able to have a system that's extensible, where you say, okay, I can automate up to this point, now I have to hand it off to that system, but I'm going to provide a way for that system to be able to control the flows into itself, makes for a more seamless transition. Maybe the boundary exists because you're handing off from data engineers to, you know, machine learning engineers, or from data engineers to business operations. And so maybe the business operations people have their black box system because that's what they're comfortable with, that's what they want to work with, but they need to be able to feed the data in. And by having hooks back into the automation engine, they have a way to say, I want to trigger a refresh of the data that I'm working with here, but the rest of that flow can be owned by data engineering. And so it prevents having the capabilities locked in one place and allows for a more natural flow and seamless transition between these different system boundaries.
[00:18:33] Unknown:
Yeah. I totally agree with that. And I think a lot of this, at least in our world and what we've seen with our customers, boils down to, again, a lot of API interaction and connectivity, where it usually comes down to both sides of the DAG, if you will. You know, one is, can you trigger and force behavior for somebody who's upstream from you, and actually have them manually trigger a data refresh. In a declarative model, there's less of a "run this pipeline" and much more of a, hey, you have an established data flow, but now check for new data and run whatever has to be run. And so it's a, hey, go check for new data and refresh. Go do your magic.
And then on the other side of the pipeline, on the right-hand side: as there's residual effect, or you want to trigger residual effects, or trigger another pipeline in another system or another event, there's also that same notification system, and how do you go trigger some downstream behavior. And we absolutely see that. I think it's not unique to Ascend or to our customers, but it's pretty dominant within the space: I need to talk to another system. You know, I have Airflow orchestrating something upstream, or I need to trigger something in Airflow downstream. And I think that's the really important part, having that extensibility on either side. Another interesting
[00:19:54] Unknown:
aspect of, you know, sort of what triggers what: I recently read an article by Benn Stancil talking about the idea that the existing model of ETL pipelines is backwards, where we're very focused on these push-based flows where something in the source system changes, so now I need to push that into the pipeline, then from the transformations it propagates down to the downstream system. And so you say, okay, I know that there is this rate of change in the source system, so I will run this process on x frequency, and that update might end up impacting 5 completely different business units just because of the way that the data propagates.
And so as somebody who is a consumer of that data asset, you say, I just care if my dashboard is up to date because I need to know the answer to my question based on whatever the latest information is. So now I need to figure out who actually owns that upstream flow to tell them I need you to kick this off. And now that's also gonna have, you know, a ripple effect to all the other consumers of these different data sources. So he's suggesting that the more ideal flow is: I, as a data consumer, know that I need this data updated at this frequency. So I'm just gonna tell the system, I want this at this time, and you go ahead and figure out what all needs to happen. Then, you know, bringing in that question of the ripple effect too, as a data consumer, I say, I just care if my dashboard's up to date. I don't understand all the ramifications of kicking off whatever jobs are gonna fill that in.
You know, how do you then maybe do some kind of tree pruning of that DAG to say, okay, this user cares about their data updates, but that other user might not wanna have something new come in because they're not ready for it. How do you account for those kinds of differing needs of all the different stakeholders, and the fact that in a lot of cases, they're not even gonna know about each other?
[00:21:42] Unknown:
Absolutely. Completely aligned, like a thousand percent, with the notion that pipelines shouldn't be designed and run left to right. They should be run right to left. And I think this is really heavily rooted in that difference of declarative versus imperative. The history of software engineering and technology in general has demonstrated time and time again the shift from imperative to declarative systems over time: the move towards systems where we define the outcome and lean on the underlying technology to determine how to deliver that outcome.
It has also demonstrated time and time again that that results in less code, more stable systems, and happier engineers working with those systems. And the right-to-left model, which is oriented towards the end result and the outcome, tends to fit very nicely into that. It also then does lean pretty heavily on that notion of automation, because you have to then figure out, how do I ensure that I can deliver on that end result to the user or to the business. And it generally does require more complex traversal back through to all the upstream systems. This is one of those things that I think does become really exciting when we think about the amount of metadata required, not just at a business level, but then at the code level and at an operations level, to determine, you know, frankly, how pipelines can work that way. Just to geek out for a quick second on, for example, how we have solved this internally on our side: the way to do it is to go declarative and lean on automation for pipelines.
We evaluate pipelines, and we actually do run them right to left. And the way that we do this is we actually run checksums on all code and all data as it moves through pipelines. We traverse it. We track the lineage of partitions. We run it through code, and we actually do what's really a recursive SHA on all code up to the originating data, traversing partitions, on the assumption that all operations are idempotent and you have immutable fragments. As you traverse data all the way through a DAG, if you store all of the SHAs of all the work you've done before and you're reevaluating DAGs continuously, as an automated system does, you're essentially looking for SHA mismatches.
And so how the traversal works, even in Ascend as an automation system for DAGs, is we actually start at the far right, and we look at the end destinations and recurse back to the left, traversing and doing essentially SHA checks to determine that we have the right data. Once you get into a model and an architecture like that, things are really powerful, and it's really easy to solve for different challenges: things like broken pipelines with automated resume, automated backfills, mid-pipeline branching off of existing logic and components. All of those things are solved. Even the classic late-arriving data: a new log file shows up and 2% of it actually happened yesterday.
All of those, when you're evaluating pipelines right to left, becomes really easy and a natural extension of how the system works. So very much a fan of this methodology and belief structure.
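The recursive SHA traversal Sean walks through can be sketched in a few lines. This is a toy illustration of the general technique, not Ascend's actual implementation; the node structure, field names, and the simple dict-based "last run" store are all assumptions made for the example:

```python
# Sketch of right-to-left DAG evaluation via recursive content hashes:
# a node's fingerprint combines the hash of its code with the fingerprints
# of its inputs, so any upstream change surfaces as a SHA mismatch when you
# recurse back from the sinks. Assumes idempotent operations, so matching
# fingerprints mean the stored result can be reused as-is.
import hashlib

def fingerprint(node: str, nodes: dict, cache: dict) -> str:
    """Recursively hash a node's code plus its upstream fingerprints."""
    if node in cache:
        return cache[node]
    h = hashlib.sha256()
    h.update(nodes[node]["code"].encode())
    for upstream in sorted(nodes[node].get("inputs", [])):
        h.update(fingerprint(upstream, nodes, cache).encode())
    cache[node] = h.hexdigest()
    return cache[node]

def stale_nodes(sinks: list, nodes: dict, last_run: dict) -> set:
    """Start at the sinks (far right) and recurse left, collecting mismatches.
    Recursion stops as soon as a subtree's fingerprint matches the last run."""
    cache, stale = {}, set()
    def visit(n: str) -> None:
        if fingerprint(n, nodes, cache) != last_run.get(n):
            stale.add(n)
            for up in nodes[n].get("inputs", []):
                visit(up)
    for sink in sinks:
        visit(sink)
    return stale

# Example DAG: raw -> clean -> report. Record fingerprints from a "previous
# run", then change the code of the middle node.
nodes = {
    "raw": {"code": "read()"},
    "clean": {"code": "dedupe()", "inputs": ["raw"]},
    "report": {"code": "aggregate()", "inputs": ["clean"]},
}
last_run = {n: fingerprint(n, nodes, {}) for n in nodes}
nodes["clean"]["code"] = "dedupe_v2()"
stale = stale_nodes(["report"], nodes, last_run)
# raw is unchanged, so only clean and report need to be recomputed
```

Note how the late-arriving-data case falls out naturally: a new or changed input partition would alter the fingerprint at the origin, which propagates through every downstream fingerprint, so exactly the affected subgraph shows up as stale.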
[00:25:08] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. To your point of automated backfills, that's something that I wanna dig into a little bit, but also the question of, you know, as an end consumer, I wanna refresh my data at this frequency, understanding what the, you know, other downstream impacts are of some of the changes that happened 3 stages before the piece I'm looking at. In both of those cases, I'm interested in some of the design questions about how you structure the granularity and the logical elements of how you build each of these discrete stages of the data flow, so that you can do things like automate backfills and understand that the logic that you're writing is aware of the time component of it. You know, I'm not just running a bare SELECT INTO without having that WHERE clause that gives me the time window, or, as I'm building this, you know, maybe dbt model, I'm not building it in such a way that it is encapsulating too many concerns, such that all of the downstream pieces are going to be impacted whenever I run this one model. So just figuring out: what are some of those useful system design questions of how to approach that granularity aspect, and the understanding of what are the higher-order concerns that I need to be aware of as I'm structuring this logic, to be able to account for, for instance, that timeliness aspect?
[00:27:07] Unknown:
I'd say, as a general rule, when designing these systems it's good to first put a couple of limitations on yourself, to basically rule a couple of pieces out, and all of a sudden your frame of reference will start to morph quite nicely. The first one that I usually suggest is: remove the availability of wall time. Assume that there's literally nothing in your system that will give you a current timestamp. When you start to do that, it gets you to the next step, which is: write your logic on the assumption that the data just exists, and break out of the pipeline construct entirely.
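That constraint, no wall clock, logic as a pure function of the data, can be sketched roughly like this (the names here are illustrative, not Ascend's actual API):

```python
from datetime import datetime

# Illustrative sketch only: a transform written against an explicit data
# window rather than the wall clock. Because it is a pure function of its
# inputs, the platform is free to decide when (and whether) to re-run it,
# and re-running it over the same data always yields the same result.
def daily_revenue(orders: list[dict], window_start: datetime, window_end: datetime) -> float:
    return sum(
        order["amount"]
        for order in orders
        if window_start <= order["created_at"] < window_end
    )
```

Contrast this with a job that calls `datetime.now()` internally: that job can only be run "at the right time," which forces the scheduler back into an imperative, trigger-based model.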
What we found inside of Ascend to do this pretty nicely is to construct the world into read connectors, which are really entire datasets that are reflective of that source; then transforms, which are really materialized views in many ways; followed by write connectors that will replicate the entire dataset out somewhere else if you need to replicate it outside of your existing data plane, Snowflake, Databricks, etcetera. And the reason why we constructed it this way is that once you remove the current timestamp, the current wall time, you can restructure that frame of mind as: assume everything is a SQL query or a Python query on the data, but think of it as a query.
You then think of the world in a little bit more of a static context. New data may be coming in, but the query itself can still run and give you an updated result. You then actually free up the underlying automation system to start to do smarts: okay, let me figure out how to not reprocess all of the data for all time, because that would be very slow and also tends to be very expensive. Then you start to get people out of that imperative construct, and that's when you can do the first cool pieces of getting into a more declarative, more automated model. From there, the system can start to do smarter things around looking at the code, taking different enrichments or hints from the developer to figure out: is this a map-style operation? Is it a reduction? Both of those are fairly straightforward to automate and optimize.
Or is it, for example, a partial reduction? And what are the inputs into the partial reduction? That's usually where it gets more interesting to figure out what you have to wait for upstream, what you can pull through more quickly, and what the dependency is from upstream partitions onto downstream partitions. But again, if we go back to that: just assume the whole world is a dataset, and the datasets are constantly changing, and you have eventual consistency as data moves through, and the system can automate that with priorities and optimization as it moves through. You end up with new datasets that are your new data products that other teams can build on. It actually pulls you, from an organizational perspective, out of the pipeline construct and into the data product construct, similar to what you were talking about before, where we see really cool benefits because you're now providing products to others in your organization.
And the code should actually fade more into the background, and the data product should move into the foreground.
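The partition-dependency idea described above can be made concrete with a toy sketch: recompute a downstream partition only when the fingerprint of its upstream input changes. This is a simplification added purely for illustration; real systems like the one described track this in a metadata store and handle partial reductions across partitions.

```python
import hashlib

def fingerprint(rows) -> str:
    """Cheap content hash of an upstream partition."""
    return hashlib.sha256(repr(sorted(map(repr, rows))).encode()).hexdigest()

def refresh(upstream: dict, downstream: dict, seen: dict, transform) -> list:
    """Recompute only the downstream partitions whose inputs changed.

    Returns the list of partition keys that were actually reprocessed,
    so you can watch the automation skip untouched data.
    """
    reprocessed = []
    for key, rows in upstream.items():
        fp = fingerprint(rows)
        if seen.get(key) != fp:        # new or changed input partition
            downstream[key] = transform(rows)
            seen[key] = fp             # remember what we built from
            reprocessed.append(key)
    return reprocessed
```

Run it twice with one changed partition and only that key is reprocessed: the "don't reprocess all the data for all time" behavior the declarative model enables.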
[00:30:24] Unknown:
Looking back, the last time we talked about what you're building at Ascend was about 3 years ago, not quite; it was November of 2019 that the episode was published. And while that's a seemingly short amount of time, in dataland it's a lifetime ago. I'm curious what you have seen as the broader understanding and adoption of these automation principles, and the evolution from the very imperative workflows of things like MapReduce to where we are now, where everything is SQL, and some of the ways that has maybe made your job easier, both in terms of development and integration of your product, but also in terms of the framing of the ways that people are thinking about problems?
[00:31:07] Unknown:
I think we're all, as an industry, kind of leapfrogging our way through it right now, to be really honest. And I think we're in the pretty early stages. The reason why I would describe it that way is that when we spoke last, a lot of the focus was the shift from imperative to declarative: how do we remove so much of the code that we're writing, and how do we automate more? And then, probably a year or two ago, a lot of the focus, especially among early adopters, started to shift into even more low-code, no-code kinds of systems. And a lot of the focus that we had was around flex code and this notion of being able to flex across.
And I think both of these are early inputs into what is now really going to become automation. The reason why I would highlight this: I think the battles between imperative and declarative are hitting even more mainstream. I can tell you, at least based off of my completely unbiased set of observations and conversations with people (I usually use conferences as a random sampling of conversations), a lot more of the focus and interest has shifted away from imperative systems to declarative. I think we're hitting far more mainstream on that. I also think we're hitting far more mainstream on this notion of: it isn't just no code, it isn't just low code, I need the ability to flex. Whether or not flex code actually takes hold as a term, we'll see, but the core premise of it, I think, holds very much. At the end of the day, most of us, at least most of us listening to this, are probably engineers.
And we don't wanna get pinned down, and we don't wanna get trapped into not being able to accomplish the thing we want to go accomplish. So having that freedom, something that's flexible, I think is very compelling. The part about data automation is that I think we are very much in the early days, like the very, very early days. The reason why I believe that is that we ran a survey a little while back, in April. It's a completely independent, blind, third-party survey that we run, and we intentionally do it that way because, obviously, we don't want to bias anybody with our customer base, for example.
We surveyed a little over 500 folks: data scientists, data engineers, analysts, architects, and chief data officers. Usually when you run these surveys, you have all your standard questions: hey, how loaded is your team? Do you have capacity? Etcetera. No surprise to anybody here, I'm sure: 95% of data teams are at or over capacity and don't really have the bandwidth for new work. And that was even before all the headcount freezes, etcetera, had kicked in. But the thing that was really interesting is that we added one new question this year. This is literally a question that you toss in for one year, at least when I asked to add it in, just to get a baseline. All you want to do is collect some numbers so at least you can plot a trend next year when you ask it again. And the question was: are you investing in automation to increase team capacity and throughput?
In short, I'm paraphrasing, but basically that. And what was shocking was that only 3.5% of respondents said that they already invest in automation, which tells me something very interesting: automation, in people's minds nowadays, certainly does not equate to orchestration, for example. Otherwise it would be, like, 95%. So there is something in people's minds that is automation. There is something there, and it is new. We may not even know what it is yet, but it sounds exciting and we want it. And the reason why I say we want it is that while only 3.5% of respondents said they had it, 88.5% said they intend on investing in automation in the next 12 months. What was so shocking for me, as somebody who literally just tossed this question in on the hopes of maybe getting a baseline number for future years, is that I've never seen such a stark contrast in the same question: the gap between the people who believe they have it today and the people who want to have it a year from now was remarkable.
And so that's why I believe data automation is going to surge in demand as a way of solving what is an increasingly complex and painful ecosystem. I predict that next year, when we run the same independent survey, more than 3.5% will have it. Significantly less than 88.5% will have it, but we will continue to see pervasive desire for higher and higher levels of automation and increasing levels of sophistication.
[00:35:54] Unknown:
On that point of the availability of automation: obviously, what you're building at Ascend is focused on that capability. What do you see as other tools and products in the ecosystem that offer what you consider to be data automation? And what are some of the conversations and areas of investment, in terms of standards and foundational tooling being built out, that will help people build on top of these ideas and realize their own capabilities for automation?
[00:36:28] Unknown:
I think there's a lot of interesting things happening right now. I will say, because we're in the very early days, I don't think the industry has very sophisticated automation yet, to be really honest. I think it's going to get more confusing for folks in the next couple of years, as automation surges in popularity and there's a lot of marketing spend around the notion of automation, and, I think, a lot around what will be low-code tools that orchestrate.
I think that's gonna make it very confusing for folks. And for me, as a software engineer who still spends every free hour I have cutting code if I can, I understand the general level of cynicism in the space as people try to weave through the marketing messages into what somebody is actually doing and how sophisticated these technologies are. I think we're early, and I think there's gonna be a lot of noise in this space. As for the technologies that I think are going to be increasingly interesting: I think we're seeing really cool technologies come out of some of the underlying data planes.
Like some of the stuff that Snowflake announced at their conference earlier this year around the underlying data plane: how quickly it can move data through and how it can help optimize cascading transforms. Super, super interesting. I also think that as these underlying data planes expose more metadata, that gives us a lot of really interesting things to do with higher levels of automation. And then I would say I'm also seeing some really cool stuff come out in the space of data observability, from folks like Great Expectations and Monte Carlo.
It's actually pulling more metadata in and also generating more metadata that, again, fuels the automation engines. I think we're still a few years away from pervasive and ubiquitous adoption of technologies that are at the level of sophistication of what Kubernetes is. Clearly, at Ascend we're taking a really big run at solving that level of complexity. I think we're still just very early.
[00:38:40] Unknown:
On the metadata front, that's a conversation that has been gaining a lot of volume and frequency, and there have been notable investments in things like OpenLineage, OpenMetadata, and Egeria. I'm wondering what your opinion is of the potential for those kinds of efforts to actually take hold and be adopted, or whether there are enough entrenched interests who have already built their own metadata layers that there will be enough resistance that those efforts remain niche, and not as widely adopted as they need to be to have a meaningful impact on the industry?
[00:39:25] Unknown:
Good question. I think over time, all things eventually normalize. It usually requires it no longer being fun or cool to solve for, and that usually happens over the course of a small number of years, where n is, you know, less than 5. I think, ultimately, we're all solving for the same problems, which is basically: a ton of stuff is going on inside of the system, who's doing what, what code is doing what, where is it going. And over time, innovating there becomes very non-differentiating. One of the core values that we have culturally inside of Ascend is this notion of evolve with intent.
Because, ultimately, at the end of the day, innovation is very expensive. It's very expensive from a time perspective, and consequently very expensive from a money perspective. As a result, we should be very cautious and intentional around where we choose to innovate, and I generally encourage most companies to apply that uniformly across their technology stack too. So over time, as things stabilize, how you collect metadata is going to become less differentiating for your business, and you should choose not to innovate there and instead adopt whatever standards are out there, and put your innovation horsepower into the things that layer on top of that. And so over time, I, as an engineer, believe there should be standardization.
If for no other reason than that it's one less thing most of our teams have to worry about, and we can get on to the other cool, new, impactful stuff.
[00:41:09] Unknown:
On that note of evolving with intention: as I said, the last time we talked was about 3 years ago, and I know that Spark was a very core component of your infrastructure and the capabilities that you were offering. I'm wondering if you can summarize some of the notable evolutions that you have gone through over that 3-year period.
[00:41:27] Unknown:
Oh my gosh, so much changes in 3 years. The startup years are like dog years, so a lot happens. I'm not sure I had as much gray in my beard 3 years ago; that's definitely changed. I would say a few things that we've done over the last few years. One was that we rolled out our entire new modern data ingest and data delivery capabilities on top of this flex code foundation, which greatly expanded the number of systems we could connect to. I think that was really important to move beyond just data lake and Spark architectures. That dovetailed into this new wave of not just data connectors for where you read and write, but also for where you store and process data.
And so we've now expanded those connectors beyond just Spark: a bunch of the new capabilities coming out of Databricks with Databricks SQL, and a bunch on Snowflake. We now run on Snowflake, supporting their native SQL as well as Snowpark and Python, which is really important for us. I should also add, if I remember the stats correctly, that I believe 65% of data transformation logic in Ascend is in SQL, 32% in Python, and 3% in Scala slash Java. I don't think that's quite indicative of the broader market (Scala would be higher in the broader market), but I think it's very indicative of where folks are going. So that ability to support multiple languages interchangeably is very, very powerful, and we can support that on Snowflake. And then we run on BigQuery as well. We've added all of these various capabilities because we see both rapid innovation happening at that data plane level, because the market is so large at the data plane level, and rapid consolidation from a feature-set and capability perspective.
For an automation platform that sits on top of these technologies, this is the most exciting time. We're watching all these incredible new features and capabilities come out and then very quickly seeing similar manifestations of those capabilities in other data planes. It's a really exciting time to make sure that we can automate and leverage all those underlying cool new features. Over 3 years, a lot has really happened.
[00:43:42] Unknown:
Your comment about this being the most exciting time reminds me of a bumper sticker I saw the other day that said "it's never been later than it is right now," or something to that effect. And on that point, going back to your comment about the survey you sent out, about people's self-assessment of whether or not they have achieved data automation and whether they want to invest in it: I'm curious what you have seen as organizations do start to adopt data automation and build on top of it. How does their understanding of what data automation is and what it can do shift, and how do their requirements evolve from "I just need to get things from point A to point B without getting woken up at 3 in the morning" to "oh, that part was easy, now I can do x, y, and z"? How does data automation continue to be a moving target, and what are the pieces that remain the same in that equation?
[00:44:37] Unknown:
We definitely see people at various points in their journey; let's describe it that way. Going to full-fledged automation is usually a larger CTO-, CDO-, enterprise-architect-level effort: hey, we're gonna take a big step back and think about what the core drivers are. And it's usually around team productivity, team velocity, team happiness. We oftentimes see this happening because, ultimately, at the end of the day (and I hope most of your listeners are forward-leaning in this sense), we want to drive tons of innovation, and we want to make sure of that with these really high-caliber, high-horsepower teams we're assembling to drive our data future, which for most businesses is your future.
We're amplifying their impact and giving them the greatest amount of leverage. And because of that, what we oftentimes see is these new initiatives coming out: look, these are incredibly high-impact individuals. We don't have enough of them, we can't find enough of them, but so much of our future innovation leans on data, so how do we actually get more out of folks? And it takes that big step back to say: we need to fundamentally change our approach. The reason why I think that's interesting is that oftentimes we're just building out a new data pipeline for a new feature or a new ML model, and it's hard to see the forest for the trees. It's like: look, man, all I know is this thing's blowing up on me for some dumbass reason, and I just need it to not do that. Preferably not at 3 AM, when I get woken up. Right?
And sometimes it takes that step back to ask: what are the actual core problems, and where are we going? I think this is where there's that shifting awareness from "automation is just: run this thing on a schedule, take my code and run it on a schedule or a trigger for me" to a bigger step back and an awareness that there really is something more impactful that we can lean on. So we like engaging with people throughout that journey, and we've had the benefit of really getting to see people at different stages. I'll plant a seed for later, too, which is the challenge I oftentimes like to give folks: what is your biggest cost when it comes to data? Because for the last 5, 10 years, a lot of companies have heavily focused on the cost of their data infrastructure or their storage or their processing, and a lot of decisions were made based off of that.
But more often than not, the biggest cost is actually your team. They are your most valuable resource, and their productivity, in theory, should supersede even how much you're spending on infrastructure. What we're finding now is a shift towards how we actually maximize the impact of our team: not just how we tune or optimize a pipeline, but how we get more impact from our people and enable them to have greater leverage. And I think that's a really good thing, especially as we head into hard market conditions.
Your teams are your most expensive resource. They're also your most valuable resource. How do we now help them get more done and do more differentiated work than ever before?
[00:47:57] Unknown:
Bigeye is an industry-leading data observability platform that gives data engineering and data science teams the tools they need to ensure their data is always fresh, accurate, and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye's automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business. Go to dataengineeringpodcast.com/bigeye today to learn more and keep an eye on your data. In your experience of working with companies as they go through this journey of understanding and adopting automation, what are some of the most interesting or innovative or unexpected ways that you have seen them approach that question of data automation, whether jerry-rigged systems, or how they understand the capabilities of automation and what it enables them to do, or just some of those different aspects of this broader question?
[00:48:55] Unknown:
I've seen a few different approaches, and I think the perfect is generally always somewhere in the shades of gray, in that middle ground. Everybody starts with some imperative system, whether that's literally cron and Python scripts or an Airflow DAG or something along those lines. On one rough end of the spectrum, I've seen folks do the "hey, we're just going to create an abstraction layer on top," and most folks end up doing that, for obvious reasons. But I think putting Band-Aids and bolts on top of traditional or legacy imperative-based models is generally hard. I've seen these models of, for example, "all we need is the marketing team to deliver us a jar, and our system will run it for them every day." They sort of missed the piece of: I don't think your marketing team is going to deliver a jar to you to run. This is why we saw the pendulum swing in other directions and move more towards low-code and no-code systems for people to try to self-serve. The other piece I've seen, where teams have struggled, is the "we're just going to rebuild everything from the ground up." I actually wrote an article about this a while back, on how to avoid the common pitfalls of engineering leadership.
It's the greenfield allure of building an entire new system from the ground up, which usually looks like a couple-of-years kind of effort. I always encourage people not to do that, as you're never going to get 2 years. You're always going to underestimate the level of effort, and at some point the pressure turns up, and you've gotta go deliver something, and you've gotta take some shortcuts and cut your corners. And then all of a sudden, this beautiful, newly envisioned gen-2 system that you thought you were going to have is a slightly better version of the Frankenstein, because you had to get it out and you had to keep moving forward. So I also generally encourage folks not to try a whole massive replatform rebuild all at once, because I think that's really dangerous and really, really risky.
What I usually encourage folks to do, and what we've seen be really successful, is to develop a new pattern, and not try to build the platform up but instead go solve a specific use case. Go solve a "hey, we have a new pipeline that we need to build for x," or "hey, we have an existing pipeline that is literally waking Sean up at 3 AM every other day; let's just fix that thing first and get our heads a little bit further above water." And then if the patterns we develop there, if the system we put together, works well, great: let's move more things on. But let's incrementally and iteratively solve for that, and I've seen that be successful. It's actually also tied to one of Ascend's core values: we have this notion of build for 10x, but plan for 100x. As engineers, we always love to build straight for the 100x or the 1000x. And we work really hard on the "hey, have an architecture and a design and a plan for how you'll get there, so you know you don't paint yourself into a corner, but solve an immediate problem today that creates value, which earns us the headroom and the breathing room to then go solve for the next incremental one." So finding that balance is something we really encourage for other teams, and we've seen it be really successful.
[00:52:21] Unknown:
On that question of not doing a huge replatforming effort: one of the notable aspects of what folks are terming the modern data stack is the question of what the concrete interfaces are between these different stages, so that we can take one piece and replace it with another utility, or iteratively evolve the system and the architecture. In this question of data automation, what are some of those useful seams or interfaces where you can say, "I'm going to define a new component and implement that piece so I can plug it into this space that is shaped like that," and then do the same with this other piece? Just some of those useful seams for how to think about iteratively building out that automation capacity.
[00:53:07] Unknown:
The integration points are where we see a lot of value. One is processing and storage systems: what is the interface for the definition of a job? What goes into a job? Is it just raw code, or are there insights into what the code is doing and the relation between the inputs and the outputs that really matter? Related to that is the definition of a system and a connector itself: where do you read data from, where do you write it to, and how do you push down data transformation logic? Really important. Then there's starting to abstract away what that registry looks like for all metadata: for jobs, for users, for access, for partitions of data. Many of these, I think, have been solved for in cataloging and in other domains, but moving away from a passive model into a really aggressively active model of metadata collection, I think, is important.
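One way to picture those seams is as small, swappable interfaces. Here is a hypothetical sketch in Python protocol form (not any real product's API; every name is invented for illustration) of the job and connector boundaries being described:

```python
from typing import Any, Iterable, Optional, Protocol

# Hypothetical interface sketch of the seams described above -- not a real
# product's API. The point is that each piece is replaceable on its own.
class ReadConnector(Protocol):
    def read(self) -> Iterable[Any]:
        """Where do you read data from?"""

class WriteConnector(Protocol):
    def write(self, rows: Iterable[Any]) -> None:
        """Where do you write it to?"""

class Job(Protocol):
    code: str            # the raw transformation logic
    inputs: list         # upstream datasets this job depends on
    outputs: list        # datasets this job produces

    def pushdown_sql(self) -> Optional[str]:
        """Logic the engine may push down into the storage system,
        if the connector supports it."""
```

Because these are structural (duck-typed) protocols, any object with a matching `read` or `write` method can be plugged into the space "shaped like that" without inheriting from anything.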
And then, as you start to work up from there, it's tied more towards things on the observability side. What do I expect of my data? What are the assumptions that I baked into my dataset? And what do I expect to happen if those assumptions are violated? This is where, from a very pragmatic perspective, we see a lot of users really looking to have well-defined expectations.
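That "what do I expect of my data" idea is what tools like Great Expectations formalize. A bare-bones version, written here purely as an illustration rather than any tool's actual API, looks something like:

```python
# Minimal, illustrative sketch of declared data expectations. Real tools
# (Great Expectations, Monte Carlo, etc.) add suites, profiling, and
# alerting on top of this basic idea.
def check_expectations(rows: list, expectations: dict) -> list:
    """Return the names of every expectation the dataset violates."""
    return [
        name
        for name, predicate in expectations.items()
        if not all(predicate(row) for row in rows)
    ]
```

A pipeline can then decide, per expectation, whether a violation halts the run, quarantines the partition, or just emits metadata that feeds the automation engine downstream.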
[00:54:24] Unknown:
In your experience of building Ascend and focusing your time and energy on this space of data automation, what are some of the most interesting or challenging lessons that you've learned?
[00:54:35] Unknown:
I think the most interesting one that I've seen is that, because there's such a big gap between the technologies that achieve what we want to achieve, the hows of what we do, and the end impact, there's a very lossy chain of communication as I watch how teams plan what they do. As a result, we tend to see, fairly systematically, a lot of folks work from an architecture perspective. And this is, I think, why both data engineering and even data infrastructure tend to really be waterfall-style development cycles, even if we sugarcoat it in more agile methodologies and in how we hold our meetings; they tend to be much more waterfall from an execution model.
There's such a long chain of requirements for what we're trying to achieve that you usually see things built in layers, and you go a really long window of time until things see the light of day. The teams that are the most successful that we've observed, both in software engineering and in data engineering, are the ones that are able to sprint faster from raw capabilities to business impact and then morph what that architecture looks like over time. There's this really cool saying a friend shared a while back: the definition of great code isn't its performance or its readability; its ability to adapt to change is the most important thing. That may encompass the other pieces, and they may be inputs into it. The reason why that matters so much, the mutability and adaptability of code, is that we're in such early stages from a data ecosystem perspective. Things are changing so fast.
Big, hard, fixed infrastructure: if we prioritize that over mutability of design, by the time something we envision sees the light of day, the rest of the ecosystem will have already changed. So I think this is where we're starting to see more and more teams really prioritize the speed of change as the measure of what a great system, what great code, can do: how fast can you adapt, and how safe is it to quickly respond to new requirements and new change? I think that's gonna push a lot of really great innovation. It's going to drive more investments in automation, and it's going to drive a lot more investments in DataOps, just as we saw it drive great innovation in DevOps. That speed of iteration is going to become increasingly important for data teams.
[00:57:14] Unknown:
As people invest in these automation capabilities and build out more of their data flows to not be imperative and not require as much human time and attention, what are some of the ways that those automation efforts can go wrong?
[00:57:28] Unknown:
Definitely the "hey, we're gonna go re-architect for the next year." Ain't nobody got time for that, generally speaking. I think the massive rearchitectures are just generally hard; they're generally slow, and they're fraught with risk, so I always recommend avoiding that approach. The other thing that I do think is important is that as teams go down their automation path, it's important to traverse the decision tree back up to some of the higher root nodes, the ones you came down through. If you pursue that five-whys methodology (well, why do we need to do that? But wait, why do we need to do that thing?) and traverse it far enough, you'll oftentimes realize: well, we did that really as a funky workaround for this other limitation that literally is no longer relevant, like eventual consistency of writes on S3. Oh, we actually have immutable and idempotent fragments, and S3 now has strong consistency, so all that other gnarly stuff that we did we probably don't need to do anymore. So it's really important to invest those cycles (that's where I would put time) to revisit the assumptions.
The foundational drivers behind those assumptions tend to change very quickly.
[00:58:56] Unknown:
In terms of the continued evolution of the ecosystem and the perspective on automation, what are some of the areas that you're spending particular time and focus on?
[00:59:06] Unknown:
I think there's a couple of areas that are really interesting. The most interesting one, in the nearer term, is going to be this notion of multi-data-plane, and how we do really advanced automation across planes. It's really been boiled down into data meshes or data fabrics, with nuances as to which one is appropriate for which business and what they mean, and there are a lot of opinions out there in the industry as to what they are, so I'll leave that to the experts. The thing that I think is relevant is that the backbone of achieving either of those tends to lean pretty heavily on the need to connect into many systems: embracing the fact that your data will sit in many systems, and in fact will probably even be processed in many systems.
But maintaining continuity around metadata, lineage, automation, and access where relevant is incredibly important. And I think that relates to our world and automation quite a lot. I think the most successful data strategies around mesh and fabric will very much be automation driven. So we're spending a lot of time looking at that, and looking at the data planes you can integrate into and the clouds that you can connect across. And we have a lot of really cool capabilities coming out in the next couple of quarters tied to that trend that we see today.
[01:00:29] Unknown:
Are there any other aspects of this question of data automation and the work that's involved in making it a reality that we didn't discuss yet that you'd like to cover before we close out the show?
[01:00:42] Unknown:
It'd be a very interesting question to see when we get to, you know, even 50% of companies embracing data automation. My guess is we may be a couple of years away from that, maybe even more. I think it'll be really fun — maybe we should start a pool, you know, with our DataAware pulse survey, on what the percent penetration of data automation will be next year. It's gonna be somewhere above that 3 and a half percent. It'll be greater than that, but it's definitely not gonna be the 88 and a half percent of people who want it or intend on having it by next year. It'll be somewhere in the middle. I think that'll be a really fun thing to see what shakes out.
[01:01:16] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Biggest gap today?
[01:01:33] Unknown:
I can't say automation. That's gonna be way too easy. I would say the I'll give a hard answer to this 1. The tooling and technology we see today, I think, has become very successful at hello world. I I think getting all the way into complex production with a lot of tools is a significant lift beyond that. And in our world of product led growth and SaaS business models, I think a lot of teams in technologies have really tuned that their killer at it. Having a really smooth glide path to increasingly complex, I would love to see our industry software. I think too many data engineers get a pass that's high and wide on that, and they go back up to their exec team, pitch something that can be delivered, they believe, quickly. And as you really try and push to get things into production, I think it's harder. And so I think that's the current gap.
[01:02:29] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on this space of data automation and the different challenges and benefits that are on offer there. I appreciate all the time and energy you're putting into solving certain portions of that with your work at Ascend, and I hope you enjoy the rest of your day. Thanks, Tobias. Really appreciate the time.
[01:02:51] Unknown:
Thank you for listening. Don't forget to check out our other shows, podcast.init, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Data Automation with Sean Knapp
Sean Knapp's Journey in Data Engineering
Defining Data Automation
Challenges in Automating Data Workflows
Interoperability and Extensibility in Data Systems
Declarative vs Imperative Systems
Automated Backfills and Data Granularity
Evolution of Data Automation
Current State and Future of Data Automation
Approaches to Data Automation
Interfaces and Integration Points
Lessons Learned in Data Automation
Future Trends in Data Automation
Closing Thoughts and Industry Gaps