Summary
The dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Sean Knapp about the role of data automation in building maintainable systems
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you mean by the term "data automation" and the assumptions that it includes?
- One of the perennial challenges of automation is that there are always steps that are resistant to being performed without human involvement. What are some of the tasks that you have found to be common problems in that sense?
- What are the different concerns that need to be included in a stack that supports fully automated data workflows?
- There was recently an interesting article suggesting that the "left-to-right" approach to data workflows is backwards. In your experience, what would be required to allow for triggering data processes based on the needs of the data consumers? (e.g. "make sure that this BI dashboard is up to date every 6 hours")
- What are the tasks that are most complex to build automation for?
- What are some companies or tools/platforms that you consider to be exemplars of "data automation done right"?
- What are the common themes/patterns that they build from?
- How have you approached the need for data automation in the implementation of the Ascend product?
- How have the requirements for data automation changed as data plays a more prominent role in a growing number of businesses?
- What are the foundational elements that are unchanging?
- What are the most interesting, innovative, or unexpected ways that you have seen data automation implemented?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data automation at Ascend?
- What are some of the ways that data automation can go wrong?
- What are you keeping an eye on across the data ecosystem?
Contact Info
- @seanknapp on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Ascend
- Google Sawzall
- CI/CD
- Airflow
- Kubernetes
- Ascend FlexCode
- MongoDB
- SHA == Secure Hash Algorithm
- dbt
- Materialized View
- Great Expectations
- Monte Carlo
- OpenLineage
- Open Metadata
- Egeria
- OOM == Out Of Memory
- Five Whys
- Data Mesh
- Data Fabric
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Bigeye: ![Bigeye](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/qaHgbHoq.png) Bigeye is an industry-leading data observability platform that gives data engineering and science teams the tools they need to ensure their data is always fresh, accurate and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye’s automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business. Go to [dataengineeringpodcast.com/bigeye](https://www.dataengineeringpodcast.com/bigeye) today and start trusting your data.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Sean Knapp to talk about the role of data automation in building maintainable systems. So, Sean, can you start by introducing yourself?
[00:01:37] Unknown:
Yeah. Absolutely. Thanks for having me. Sean Knapp. I'm the founder and CEO at Ascend.io. And despite the CEO title, I have had a long career of doing data engineering for the last
[00:01:51] Unknown:
18 plus years now in software engineering and data engineering. So really excited to chat about some of that. And for folks who haven't listened to your prior appearance on the show, I'll add a link. But just as a quick recap, do you remember how you first got started working in data?
[00:02:05] Unknown:
Yeah. I do. Way, way, way back when. You know, for those of us who are now seeing lots of gray in our beards and hair. I actually started as a software engineer at Google back in 2004, and my remit was working on the front end for web search. Really exciting time to be there. You start pushing around pixels, experimenting with different user experiences. And we did this a ton for many years. And the thing was, when you push pixels and you wanna know whether or not they did something for user engagement, you end up writing a lot of data pipelines.
And so one of the first languages I learned inside of Google was our internal language, Sawzall, that allowed me to write data pipelines on MapReduce to analyze Google session logs and figure out the efficacy of our various UI experiments. And so I accidentally started doing data engineering all the way back in 2004 as a kid fresh out of college.
[00:03:02] Unknown:
So the focus for this conversation is around this idea of data automation. And I'm wondering if you can give your definition of what that means and some of the assumptions that are embedded in that phrase.
[00:03:15] Unknown:
You know, I think there's a lot to data automation, and it can encapsulate different things. For some folks, it starts very simply. And for some folks, I think it really does expand out to a much broader space. Oftentimes, people do start with that baseline notion of automation being just simple orchestration, for example. But I think the industry is really starting to evolve into a broader expectation and understanding of automation, similar to what we're seeing in other fields, that really does gravitate towards solving for more and more of the things that we either have to manually do or have to write code to do, with automation increasingly doing those at increasingly high levels of sophistication.
Oftentimes, I try and shorthand this as automation usually equals some combination of orchestration plus metadata plus AI to do far more advanced things for us.
[00:04:14] Unknown:
I like that you called out specifically the assumption that automation just means writing enough of the logic that the thing that I want it to do gets done without me having to push a button. And I'm wondering: what are some of the ways that automation in maybe traditional software development and deployment workflows ceases to be sufficient as you get into this space of automating for data, and the types of logic that you're not able to write preemptively and anticipate, and that pushes you into needing to actually rely more on that metadata plus AI to be able to achieve a similar outcome?
[00:04:53] Unknown:
Yeah. Absolutely. I think in engineering, we've increasingly become more and more sophisticated with automated tools. For example, in the DevOps domain, as we look at things like CI/CD, we've continually benefited from more and more levels of automation and tooling to accomplish tasks. I think when we get into more advanced levels of automation, we tend to actually see a couple of parallel tracks. You know, one of these parallel tracks that we see is this notion of imperative versus declarative systems.
Airflow, for example, is an imperative orchestration system. There are declarative models out there for orchestration as well. Terraform, at the infrastructure layer, is a declarative model for how we define infrastructure. So we see those models. But when we connect those into automation and continuously running systems, I would say the sort of, you know, reigning champion example of high-end automation in the field of engineering these days really is Kubernetes. And it's because it's been able to marry not just this declarative model, but actually morph that into a continuously running control plane that is always on, always running.
And it is that sort of beautiful balance of declarative, metadata backed, and continuously running that really takes us to the highest levels we've seen in engineering for high-end automation, automation that solves profound levels of pain and agony and takes that away from engineers.
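The continuously running, declarative control plane Sean describes here is essentially a reconciliation loop: compare the declared desired state against the observed state and issue corrective actions until they converge. The following is a toy sketch of that pattern, not Kubernetes' actual code; all names and the flat dict state model are illustrative assumptions:

```python
# Minimal sketch of a declarative reconciliation loop, in the spirit of
# Kubernetes controllers: diff desired vs. observed state and apply
# corrective actions until the two converge. All names are hypothetical.

def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name, spec))
        elif observed[name] != spec:
            actions.append(("update", name, spec))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions

def control_loop(desired: dict, observed: dict, apply_action) -> None:
    """A real control plane runs forever; here we loop until converged."""
    while True:
        actions = reconcile(desired, observed)
        if not actions:
            break
        for action in actions:
            apply_action(observed, action)

def apply_action(observed: dict, action: tuple) -> None:
    """Mutate observed state to carry out one corrective action."""
    kind, name, *rest = action
    if kind == "delete":
        observed.pop(name)
    else:  # create or update
        observed[name] = rest[0]

# Example: converge a toy "cluster" toward the declared specs.
desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
observed = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}
control_loop(desired, observed, apply_action)
print(observed)  # observed now matches desired
```

The key property is that the loop is level-triggered rather than edge-triggered: it does not matter how the state drifted, only that the next pass of the loop notices and corrects it.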
[00:06:32] Unknown:
Implicit in the framing that I just gave about not being able to preempt the types of workflows that you need to be able to code for is the question of automation in most experiences having a certain stopping point, where you can automate for a set of known knowns and maybe even known unknowns. But there always ends up being some edge case where you have to get a human involved, and they have to click the buttons or, you know, write custom code or do some operation that only the human is able to do, because of the fact that it is a new space that needs to be understood and solved for, and there's not necessarily enough trust or enough capability or understanding in the computer system to be able to actually solve for that problem. And I'm wondering what you see as some of the tasks, particularly in the data ecosystem, that are common problems that are resistant to automation?
[00:07:32] Unknown:
Yeah. Great question. First, it's funny, I'll tell you a couple of fun anecdotes. You know, we hear this a lot from folks, which is, hey, automation may make 95% of my job easier, like, profoundly easier. But if you make that last 5% impossible, it's still a nonstarter. And so you do need highly automated systems to still afford users that ability to do things, the more advanced things, the more custom things. So I think that's really important. One of our old solution architects actually had the saying: you can't give people a Tesla, have it be fully self-driving, and not give them a steering wheel, because at some point they will still wanna take over, or, as you mentioned, you do still have to earn trust. And so you do need the ability to build that trust and that comfort and give people the escape hatches or the controls back when they need it.
And so the category of things where we've generally seen people really need those escape hatches is usually around things that require more imperative logic or higher, more customized needs than what the system currently can grok and understand. And that's where this is a forever keeping-up game: some new capabilities, some new understanding that folks have, and you're trying to make sure the system keeps up as much as it can with that. And the approach that we've taken to this inside of Ascend has been what we actually call a flex code model, which is a fancy term, as we think about low-code and no-code systems, for a model that allows you to flex deeper into the stack and write plugins and modules and adapters that extend the capabilities, that allow you to implement imperative logic while still supporting declarative constructs.
Not too dissimilar, again, from what we see with the ability to write your own operators inside of Kubernetes, for example. A similar kind of notion of, hey, when somebody wants to go more advanced, let's actually allow them to extend the platform itself and add to its capabilities without throwing the baby out with the bathwater and having to make them flip all the way back to, you know, more primitive versions of imperative-based models, or systems that don't have the benefit of a continuously running control plane attached to them. Another aspect
[00:09:49] Unknown:
of automation resistance that comes up particularly in the data space is the fact that there is a boundary to the platform that you're using to manage the automated flows. And as you move outside of the boundaries of that platform and what it owns, it becomes increasingly harder to automate different pieces of interaction. So, for instance, in the Ascend use case, you're able to ingest data from a source system, but at some point in that source system, you have to have some logic to be able to generate that data, or be able to create the credentials that Ascend is able to use to reach into that system.
In the destination systems, you can maybe push data into it, but then you have to either have some way of reaching into that system to then trigger additional flows, like landing data in Snowflake and then triggering a dbt run. But if you're pushing that data into some black box that maybe has a defined interface for inserting the data but no additional controls to be able to own any downstream workflows, then you can't automate that without having some way to maybe, you know, build a different system that lives in that black box space and hopefully try to, you know, take some baling wire and twine and duct tape to hook them together. And I'm wondering, going back to your analogy of Kubernetes, you know, it has these core abstractions that allow you to write these customized plugins to build upon them and be able to use the existing APIs to add your own specific use cases.
Bringing that all together, I'm wondering what you see as some of the patterns in the data ecosystem that lead to some of these challenges of, you know, black boxes or lack of access to be able to hook into some of these processes to fully automate the end-to-end flows, and also some of the standards, either existing or evolving, that allow for this interoperability, to let something like an Ascend or, you know, an orchestrator of choice or other automation platform reach into those other systems to be able to, you know, serve as the puppet master, regardless of what the boundaries of that core platform happen to be?
[00:12:06] Unknown:
Fantastic question. Like, super, super cool. And even before you used the term, which you used 3 times, I love that term. Well, I think it's a bad thing, but I love the use of the term in this context, which is the black box. And I think it's so on point that oftentimes the highly automated systems that we see out there in the wild are also very much closed systems and black boxes, and it's hard to tap into them, which I think is a travesty. Kubernetes has certainly set, I think, the gold standard there by exposing so many controls and so much extensibility and so much of the data, because, ultimately, any automated system is built on an abundance of metadata and has incredibly large volumes of metadata that are of extreme value to that automated system, but also of extreme value to everybody else who may be interested in using that system. For example, in the data engineering context, in a highly automated system, in the Ascend world, the metadata we collect includes the profile of every partition of data that moves through, whether it's semantically partitioned or just a free-floating fragment. We track what code operated on it, what code generated it. We track the SHAs of the code. We track the input partitions of that data. We track who accesses that data.
And we use all of this, and we traverse up and down DAGs to figure out what the system should do and what processing should happen. So you have a huge amount of metadata, and we're really big believers in opening up access to that metadata to other systems. Now, you know, popping off the stack a little bit further into the question set here, around what you're talking to, of how do we make it easy enough to integrate into other systems, there are a couple of abstraction layers that we really believe in and that are really important.
The first one we look at is the abstraction layer and the plug-in architecture for where you run. From a raw infrastructure perspective, is it Amazon, Google, or Azure, and how can you run on top of that environment? That's a pretty easy one. Of course, we're big fans of K8s, so we run everything on Kubernetes. That's pretty easy. But then you pop up another level in the architecture, and it is also what we call a data plane. Where does your data sit? What is your primary processing engine, or better yet, even a set of primary processing engines? We've created a very clean abstraction around this where, whether you want to run on Snowflake or Databricks or BigQuery, and a couple of others coming soon too, you have the ability to very easily specify how you want to interact with that environment and that ecosystem.
And so there's a very limited number of what we'll call the core fundamental calls you have to be able to implement. So that makes it very pluggable into a data plane architecture. The other part is that we've also created abstraction layers and a plug-in architecture for how you connect into read and write systems: connectors. Whether it's Salesforce, a different data plane, MongoDB, you name it. Those also follow really elegant abstraction and design patterns, and by creating a clean architecture for this, the implementation of new connectors is very, very simple and easy to do. And so we also create these architectures so you can plug into everything.
The next abstraction layer that we also created inside of Ascend is the ability to control your entire graphs and your data flows, and actually even download them as executable Python that you can then reapply back up to the APIs. It was one of the first times we've seen this in the industry, where not only can you download definitions as JSON objects or YAML objects like you can in Kubernetes, but you can download an actual executable Python set of files that are the definition of your data flow, that you can check into Git, that you can programmatically extend and modify, and that wrap the SDK to go back and recreate these data flows. And so it is a bidirectional code-level sync into the system as well. We put in all these various abstraction layers and integrations to make sure that it doesn't really matter if it's at the connector level, at the data plane level, or at the data access level: you have access to everything inside of the system itself.
[00:16:39] Unknown:
One of the things that came to mind as you were describing the extensibility of the Ascend platform is that, in my initial framing of the question, I was focused on moving from the automation engine out into the peripheral systems. But there's also the question of, if the peripheral system is designed to be opaque to, you know, an automation engine, usually it's because they want to own some of those different workflows, so they might have some capabilities of being able to call back into, you know, the automation engine or whatever other systems. And so I'm also interested in exploring that question of making that core automation engine accessible to those external systems, so that you can maybe flip the direction of calling or triggering, so that you don't get hamstrung by saying, okay, I can automate up to this point, but then it's off in this other system, and now I have to do something totally different.
But being able to have a system that's extensible, where you say, okay, I can automate up to this point, now I have to hand it off to that system, but I'm going to provide a way for that system to be able to control the flows into itself, makes for a more seamless transition. Maybe the boundary exists because you're handing off from data engineers to, you know, machine learning engineers, or from data engineers to business operations. And so maybe the business operations people have their black box system because that's what they're comfortable with, that's what they want to work with, but they need to be able to feed the data in. And by having hooks back into the automation engine, they have a way to say, I want to trigger a refresh of the data that I'm working with here, but the rest of that flow can be owned by data engineering. And so it prevents having the capabilities locked in one place and allows for a more natural flow and seamless transition between these different system boundaries.
[00:18:33] Unknown:
Yeah. I totally agree with that. And I think a lot of this, at least in our world and what we've seen with our customers, boils down to, again, a lot of API interaction and connectivity, where it usually comes down to both sides of the DAG, if you will. You know, one is, can you trigger and force behavior for somebody who's upstream from you, and actually have them manually trigger a data refresh. In a declarative model, there's less of a "run this pipeline" and much more of a, hey, you have an established data flow, but now check for new data and run whatever has to be run. And so it's a, hey, go check for new data and refresh. Go do your magic.
And then on the other side of the pipeline, on the right-hand side: as there's residual effect, or you want to trigger residual effects, or trigger another pipeline in another system or another event, there's also that same notification system, and how do you go trigger some downstream behavior. And we absolutely see that. I think it's not unique to Ascend or to our customers, but it's pretty dominant within the space: I need to talk to another system. You know, I have Airflow orchestrating something upstream, or I need to trigger something in Airflow downstream. And I think that's the really important part, having that extensibility on either side. Another interesting
[00:19:54] Unknown:
aspect of, you know, sort of what triggers what: I recently read an article by Benn Stancil talking about the idea that the existing model of ETL pipelines is backwards, where we're very focused on these push-based flows where something in the source system changes, so now I need to push that into the pipeline, then from the transformations it propagates down to the downstream system. And so you say, okay, I know that there is this rate of change in the source system, so I will run this process on x frequency, and that update might end up impacting 5 completely different business units just because of the way that the data propagates.
And so as somebody who is a consumer of that data asset, you say, I just care if my dashboard is up to date because I need to know the answer to my question based on whatever the latest information is. So now I need to figure out who actually owns that upstream flow to tell them I need you to kick this off. And now that's also gonna have, you know, a ripple effect to all the other consumers of these different data sources. So he's suggesting that the more ideal flow is: I, as a data consumer, know that I need this data updated at this frequency. So I'm just gonna tell the system, I want this at this time, and you go ahead and figure out what all needs to happen. Then, you know, bringing in that question of the ripple effect too, as a data consumer, I say, I just care if my dashboard's up to date. I don't understand all the ramifications of kicking off whatever jobs are gonna fill that in.
You know, how do you then maybe do some kind of tree pruning of that DAG to say, okay, this user cares about their data updates, but that other user might not wanna have something new come in because they're not ready for it. How do you account for those kinds of differing needs of all the different stakeholders, and the fact that in a lot of cases, they're not even gonna know about each other?
[00:21:42] Unknown:
Absolutely. Completely aligned, like a thousand percent, with the notion that pipelines shouldn't be designed and run left to right. They should be run right to left. And I think this is really heavily rooted in that difference of declarative versus imperative. The history of software engineering and technology in general has demonstrated time and time again the shift from imperative to declarative systems over time: the move towards systems where we define the outcome and lean on the underlying technology to determine how to deliver that outcome.
It has also demonstrated time and time again that that results in less code, more stable systems, and happier engineers working with those systems. And the right-to-left model, which is oriented towards the end result and the outcome, tends to fit very nicely into that. It also then does lean pretty heavily on that notion of automation, because you have to then figure out, how do I ensure that I can deliver on that end result to the user or to the business. And it generally does require more complex traversal back through to all the upstream systems. This is one of those things that I think does become really exciting when we think about the amount of metadata required, not just at a business level, but then at the code level and at an operations level, to determine, you know, frankly, how pipelines can work that way. Just to geek out for a quick second on, for example, how we have solved this internally on our side: the way to do it is to go declarative and lean on automation for pipelines.
We evaluate pipelines, and we actually do run them right to left. And the way that we do this is we actually run checksums on all code and all data as it moves through pipelines. We traverse it. We track the lineage of partitions. We run it through code, and we actually do what's really a recursive SHA on all code up to the originating data, traversing partitions, on the assumption that all operations are idempotent and you have immutable fragments. As you traverse data all the way through a DAG, if you store all of the SHAs of all the work you've done before and you're reevaluating DAGs continuously, as an automated system does, you're essentially looking for SHA mismatches.
And so how the traversal works, even in Ascend as an automation system for DAGs, is we actually start at the far right, and we look at the end destinations and recurse back to the left, traversing and doing essentially SHA checks to determine that we have the right data. Once you get into a model and an architecture like that, things are really powerful, and it's really easy to solve for different challenges: things like broken pipelines with automated resume, automated backfills, mid-pipeline branching off of existing logic and components. All of those things are solved. Even the classic late-arriving data: a new log file shows up and 2% of it actually happened yesterday.
All of those, when you're evaluating pipelines right to left, becomes really easy and a natural extension of how the system works. So very much a fan of this methodology and belief structure.
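The recursive SHA traversal Sean walks through can be sketched in a few lines. This is a toy illustration of the general technique, not Ascend's actual implementation; the node structure, field names, and the simple dict-based "last run" store are all assumptions made for the example:

```python
# Sketch of right-to-left DAG evaluation via recursive content hashes:
# a node's fingerprint combines the hash of its code with the fingerprints
# of its inputs, so any upstream change surfaces as a SHA mismatch when you
# recurse back from the sinks. Assumes idempotent operations, so matching
# fingerprints mean the stored result can be reused as-is.
import hashlib

def fingerprint(node: str, nodes: dict, cache: dict) -> str:
    """Recursively hash a node's code plus its upstream fingerprints."""
    if node in cache:
        return cache[node]
    h = hashlib.sha256()
    h.update(nodes[node]["code"].encode())
    for upstream in sorted(nodes[node].get("inputs", [])):
        h.update(fingerprint(upstream, nodes, cache).encode())
    cache[node] = h.hexdigest()
    return cache[node]

def stale_nodes(sinks: list, nodes: dict, last_run: dict) -> set:
    """Start at the sinks (far right) and recurse left, collecting mismatches.
    Recursion stops as soon as a subtree's fingerprint matches the last run."""
    cache, stale = {}, set()
    def visit(n: str) -> None:
        if fingerprint(n, nodes, cache) != last_run.get(n):
            stale.add(n)
            for up in nodes[n].get("inputs", []):
                visit(up)
    for sink in sinks:
        visit(sink)
    return stale

# Example DAG: raw -> clean -> report. Record fingerprints from a "previous
# run", then change the code of the middle node.
nodes = {
    "raw": {"code": "read()"},
    "clean": {"code": "dedupe()", "inputs": ["raw"]},
    "report": {"code": "aggregate()", "inputs": ["clean"]},
}
last_run = {n: fingerprint(n, nodes, {}) for n in nodes}
nodes["clean"]["code"] = "dedupe_v2()"
stale = stale_nodes(["report"], nodes, last_run)
# raw is unchanged, so only clean and report need to be recomputed
```

Note how the late-arriving-data case falls out naturally: a new or changed input partition would alter the fingerprint at the origin, which propagates through every downstream fingerprint, so exactly the affected subgraph shows up as stale.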
[00:25:08] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. To your point of automated backfills, that's something that I wanna dig into a little bit, but also the question of, you know, as an end consumer, I wanna refresh my data at this frequency, understanding what the, you know, other downstream impacts are of some of the changes that happened 3 stages before the piece I'm looking at. In both of those cases, I'm interested in some of the design questions about how you structure the granularity and the logical elements of how you build each of these discrete stages of the data flow, so that you can do things like automate backfills and understand that the logic that you're writing is aware of the time component of it. You know, I'm not just running a bare SELECT INTO without having that WHERE clause that gives me the time window, or, as I'm building this, you know, maybe dbt model, I'm not building it in such a way that it is encapsulating too many concerns, such that all of the downstream pieces are going to be impacted whenever I run this one model. So just figuring out: what are some of those useful system design questions of how to approach that granularity aspect, and the understanding of what are the higher-order concerns that I need to be aware of as I'm structuring this logic, to be able to account for, for instance, that timeliness aspect?
[00:27:07] Unknown:
I'd say, as a general rule, when designing these systems it's good to first put a couple of limitations on yourself, to basically rule a couple of pieces out, and all of a sudden your frame of reference will start to morph quite nicely. The first one that I usually suggest is: remove the availability of wall time. Assume that there's literally nothing in your system that will give you a current timestamp. When you start to do that, it gets you to the next step, which is: write your logic on the assumption that the data just exists, and break out of the pipeline construct entirely.
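That constraint, no wall clock, logic as a pure function of the data, can be sketched roughly like this (the names here are illustrative, not Ascend's actual API):

```python
from datetime import datetime

# Illustrative sketch only: a transform written against an explicit data
# window rather than the wall clock. Because it is a pure function of its
# inputs, the platform is free to decide when (and whether) to re-run it,
# and re-running it over the same data always yields the same result.
def daily_revenue(orders: list[dict], window_start: datetime, window_end: datetime) -> float:
    return sum(
        order["amount"]
        for order in orders
        if window_start <= order["created_at"] < window_end
    )
```

Contrast this with a job that calls `datetime.now()` internally: that job can only be run "at the right time," which forces the scheduler back into an imperative, trigger-based model.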
What we found inside of Ascend to do this pretty nicely is to construct the world into read connectors, which are really entire datasets that are reflective of that source; then transforms, which are really materialized views in many ways; followed by write connectors that will replicate the entire dataset out somewhere else if you need to replicate it outside of your existing data plane, Snowflake, Databricks, etcetera. And the reason why we constructed it this way is that once you remove the current timestamp, the current wall time, you can restructure that frame of mind as: assume everything is a SQL query or a Python query on the data, but think of it as a query.
You then think of the world in a little bit more of a static context. New data may be coming in, but the query itself can still run and give you an updated result. You then actually free up the underlying automation system to start to do smarts: okay, let me figure out how to not reprocess all of the data for all time, because that would be very slow and also tends to be very expensive. Then you start to get people out of that imperative construct, and that's when you can do the first cool pieces of getting into a more declarative, more automated model. From there, the system can start to do smarter things around looking at the code, taking different enrichments or hints from the developer to figure out: is this a map-style operation? Is it a reduction? Both of those are fairly straightforward to automate and optimize.
Or is it, for example, a partial reduction? And what are the inputs into the partial reduction? That's usually where it gets more interesting to figure out what you have to wait for upstream, what you can pull through more quickly, and what the dependency is from upstream partitions onto downstream partitions. But again, if we go back to that: just assume the whole world is a dataset, and the datasets are constantly changing, and you have eventual consistency as data moves through, and the system can automate that with priorities and optimization as it moves through. You end up with new datasets that are your new data products that other teams can build on. It actually pulls you, from an organizational perspective, out of the pipeline construct and into the data product construct, similar to what you were talking about before, where we see really cool benefits because you're now providing products to others in your organization.
And the code should actually fade more into the background, and the data product should move into the foreground.
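The partition-dependency idea described above can be made concrete with a toy sketch: recompute a downstream partition only when the fingerprint of its upstream input changes. This is a simplification added purely for illustration; real systems like the one described track this in a metadata store and handle partial reductions across partitions.

```python
import hashlib

def fingerprint(rows) -> str:
    """Cheap content hash of an upstream partition."""
    return hashlib.sha256(repr(sorted(map(repr, rows))).encode()).hexdigest()

def refresh(upstream: dict, downstream: dict, seen: dict, transform) -> list:
    """Recompute only the downstream partitions whose inputs changed.

    Returns the list of partition keys that were actually reprocessed,
    so you can watch the automation skip untouched data.
    """
    reprocessed = []
    for key, rows in upstream.items():
        fp = fingerprint(rows)
        if seen.get(key) != fp:        # new or changed input partition
            downstream[key] = transform(rows)
            seen[key] = fp             # remember what we built from
            reprocessed.append(key)
    return reprocessed
```

Run it twice with one changed partition and only that key is reprocessed: the "don't reprocess all the data for all time" behavior the declarative model enables.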
[00:30:24] Unknown:
Looking back, the last time we talked about what you're building at Ascend was about 3 years ago, not quite; it was November of 2019 that the episode was published. And while that's a seemingly short amount of time, in dataland it's a lifetime ago. I'm curious what you have seen as the broader understanding and adoption of these automation principles, and the evolution from the very imperative workflows of things like MapReduce to where we are now, where everything is SQL, and some of the ways that has maybe made your job easier, both in terms of development and integration of your product, but also in terms of the framing of the ways that people are thinking about problems?
[00:31:07] Unknown:
I think we're all, as an industry, kind of leapfrogging our way through it right now, to be really honest. And I think we're in the pretty early stages. The reason why I would describe it that way is that when we spoke last, a lot of the focus was the shift from imperative to declarative: how do we remove so much of the code that we're writing, and how do we automate more? And then, probably a year or two ago, a lot of the focus, especially among early adopters, started to shift into even more low-code, no-code kinds of systems. And a lot of the focus that we had was around flex code and this notion of being able to flex across.
And I think both of these are early inputs into what is now really going to become automation. The reason why I would highlight this: I think the battles between imperative and declarative are hitting even more mainstream. I can tell you, at least based off of my completely unbiased set of observations and conversations with people (I usually use conferences as a random sampling of conversations), a lot more of the focus and interest has shifted away from imperative systems to declarative. I think we're hitting far more mainstream on that. I also think we're hitting far more mainstream on this notion of: it isn't just no code, it isn't just low code, I need the ability to flex. Whether or not flex code actually takes hold as a term, we'll see, but the core premise of it, I think, holds very much. At the end of the day, most of us, at least most of us listening to this, are probably engineers.
And we don't wanna get pinned down, and we don't wanna get trapped into not being able to accomplish the thing we want to go accomplish. So having that freedom, something that's flexible, I think is very compelling. The part about data automation is that I think we are very much in the early days, like the very, very early days. The reason why I believe that is that we ran a survey a little while back, in April. It's a completely independent, blind, third-party survey that we run, and we intentionally do it that way because, obviously, we don't want to bias anybody with our customer base, for example.
We surveyed a little over 500 folks: data scientists, data engineers, analysts, architects, and chief data officers. Usually when you run these surveys, you have all your standard questions: hey, how loaded is your team? Do you have capacity? Etcetera. No surprise to anybody here, I'm sure: 95% of data teams are at or over capacity and don't really have the bandwidth for new work. And that was even before all the headcount freezes, etcetera, had kicked in. But the thing that was really interesting is that we added one new question this year. This is literally a question that you toss in for one year, at least when I asked to add it in, just to get a baseline. All you want to do is collect some numbers so at least you can plot a trend next year when you ask it again. And the question was: are you investing in automation to increase team capacity and throughput?
In short, I'm paraphrasing, but basically that. And what was shocking was that only 3.5% of respondents said that they already invest in automation, which tells me something very interesting: automation, in people's minds nowadays, certainly does not equate to orchestration, for example. Otherwise it would be, like, 95%. So there is something in people's minds that is automation. There is something there, and it is new. We may not even know what it is yet, but it sounds exciting and we want it. And the reason why I say we want it is that while only 3.5% of respondents said they had it, 88.5% said they intend on investing in automation in the next 12 months. What was so shocking for me, as somebody who literally just tossed this question in on the hopes of maybe getting a baseline number for future years, is that I've never seen such a stark contrast in the same question: the gap between the people who believe they have it today and the people who want to have it a year from now was remarkable.
And so that's why I believe data automation is going to surge in demand as a way of solving what is an increasingly complex and painful ecosystem. I predict that next year, when we run the same independent survey, more than 3.5% will have it. Significantly less than 88.5% will have it, but we will continue to see pervasive desire for higher and higher levels of automation and increasing levels of sophistication.
[00:35:54] Unknown:
On that point of the availability of automation: obviously, what you're building at Ascend is focused on that capability. What do you see as other tools and products in the ecosystem that offer what you consider to be data automation? And what are some of the conversations and areas of investment, in terms of standards and foundational tooling being built out, that will help people build on top of these ideas and realize their own capabilities for automation?
[00:36:28] Unknown:
I think there's a lot of interesting things happening right now. I will say, because we're in the very early days, I don't think the industry has very sophisticated automation yet, to be really honest. I think it's going to get more confusing for folks in the next couple of years, as automation surges in popularity and there's a lot of marketing spend around the notion of automation, and, I think, a lot around what will be low-code tools that orchestrate.
I think that's gonna make it very confusing for folks. And for me, as a software engineer who still spends every free hour I have cutting code if I can, I understand the general level of cynicism in the space as people try to weave through the marketing messages into what somebody is actually doing and how sophisticated these technologies are. I think we're early, and I think there's gonna be a lot of noise in this space. As for the technologies that I think are going to be increasingly interesting: I think we're seeing really cool technologies come out of some of the underlying data planes.
Like some of the stuff that Snowflake announced at their conference earlier this year around the underlying data plane: how quickly it can move data through and how it can help optimize cascading transforms. Super, super interesting. I also think that as these underlying data planes expose more metadata, that gives us a lot of really interesting things to do with higher levels of automation. And then I would say I'm also seeing some really cool stuff come out in the space of data observability, from folks like Great Expectations and Monte Carlo.
It's actually pulling more metadata in and also generating more metadata that, again, fuels the automation engines. I think we're still a few years away from pervasive and ubiquitous adoption of technologies that are at the level of sophistication of what Kubernetes is. Clearly, at Ascend we're taking a really big run at solving that level of complexity. I think we're still just very early.
[00:38:40] Unknown:
On the metadata front, that's a conversation that has been gaining a lot of volume and frequency, and there have been notable investments in things like OpenLineage, OpenMetadata, and Egeria. I'm wondering what your opinion is of the potential for those kinds of efforts to actually take hold and be adopted, or whether there are enough entrenched interests who have already built their own metadata layers that there will be enough resistance that those efforts remain niche, and not as widely adopted as they need to be to have a meaningful impact on the industry?
[00:39:25] Unknown:
Good question. I think over time, all things eventually normalize. It usually requires it no longer being fun or cool to solve for, and that usually happens over the course of a small number of years, where n is, you know, less than 5. I think, ultimately, we're all solving for the same problems, which is basically: a ton of stuff is going on inside of the system, who's doing what, what code is doing what, where is it going. And over time, innovating there becomes very non-differentiating. One of the core values that we have culturally inside of Ascend is this notion of evolve with intent.
Because, ultimately, at the end of the day, innovation is very expensive. It's very expensive from a time perspective, and consequently very expensive from a money perspective. As a result, we should be very cautious and intentional around where we choose to innovate, and I generally encourage most companies to apply that uniformly across their technology stack too. So over time, as things stabilize, how you collect metadata is going to become less differentiating for your business, and you should choose not to innovate there and instead adopt whatever standards are out there, and put your innovation horsepower into the things that layer on top of that. And so over time, I, as an engineer, believe there should be standardization.
If for no other reason than that it's one less thing most of our teams have to worry about, and we can get on to the other cool, new, impactful stuff.
[00:41:09] Unknown:
On that note of evolving with intention: as I said, the last time we talked was about 3 years ago, and I know that Spark was a very core component of your infrastructure and the capabilities that you were offering. I'm wondering if you can summarize some of the notable evolutions that you have gone through over that 3-year period.
[00:41:27] Unknown:
Oh my gosh, so much changes in 3 years. The startup years are like dog years, so a lot happens. I'm not sure I had as much gray in my beard 3 years ago; that's definitely changed. I would say a few things that we've done over the last few years. One was that we rolled out our entire new modern data ingest and data delivery capabilities on top of this flex code foundation, which greatly expanded the number of systems we could connect to. I think that was really important to move beyond just data lake and Spark architectures. That dovetailed into this new wave of not just data connectors for where you read and write, but also for where you store and process data.
And so we've now expanded those connectors beyond just Spark: a bunch of the new capabilities coming out of Databricks with Databricks SQL, and a bunch on Snowflake. We now run on Snowflake, supporting their native SQL as well as Snowpark and Python, which is really important for us. I should also add, if I remember the stats correctly, that I believe 65% of data transformation logic in Ascend is in SQL, 32% in Python, and 3% in Scala slash Java. I don't think that's quite indicative of the broader market (Scala would be higher in the broader market), but I think it's very indicative of where folks are going. So that ability to support multiple languages interchangeably is very, very powerful, and we can support that on Snowflake. And then we run on BigQuery as well. We've added all of these various capabilities because we see both rapid innovation happening at that data plane level, because the market is so large at the data plane level, and rapid consolidation from a feature-set and capability perspective.
For an automation platform that sits on top of these technologies, this is the most exciting time. We're watching all these incredible new features and capabilities come out and then very quickly seeing similar manifestations of those capabilities in other data planes. It's a really exciting time to make sure that we can automate and leverage all those underlying cool new features. Over 3 years, a lot has really happened.
[00:43:42] Unknown:
Your comment about this being the most exciting time reminds me of a bumper sticker I saw the other day that said "it's never been later than it is right now," or something to that effect. And on that point, going back to your comment about the survey you sent out, about people's self-assessment of whether or not they have achieved data automation and whether they want to invest in it: I'm curious what you have seen as organizations do start to adopt data automation and build on top of it. How does their understanding of what data automation is and what it can do shift, and how do their requirements evolve from "I just need to get things from point A to point B without getting woken up at 3 in the morning" to "oh, that part was easy, now I can do x, y, and z"? How does data automation continue to be a moving target, and what are the pieces that remain the same in that equation?
[00:44:37] Unknown:
We definitely see people at various points in their journey; let's describe it that way. Going to full-fledged automation is usually a larger CTO-, CDO-, enterprise-architect-level effort: hey, we're gonna take a big step back and think about what the core drivers are. And it's usually around team productivity, team velocity, team happiness. We oftentimes see this happening because, ultimately, at the end of the day (and I hope most of your listeners are forward-leaning in this sense), we want to drive tons of innovation, and we want to make sure of that with these really high-caliber, high-horsepower teams we're assembling to drive our data future, which for most businesses is your future.
We're amplifying their impact and giving them the greatest amount of leverage. And because of that, what we oftentimes see is these new initiatives coming out: look, these are incredibly high-impact individuals. We don't have enough of them, we can't find enough of them, but so much of our future innovation leans on data, so how do we actually get more out of folks? And it takes that big step back to say: we need to fundamentally change our approach. The reason why I think that's interesting is that oftentimes we're just building out a new data pipeline for a new feature or a new ML model, and it's hard to see the forest for the trees. It's like: look, man, all I know is this thing's blowing up on me for some dumbass reason, and I just need it to not do that. Preferably not at 3 AM, when I get woken up. Right?
And sometimes it takes that step back to ask: what are the actual core problems, and where are we going? I think this is where there's that shifting awareness from "automation is just: run this thing on a schedule, take my code and run it on a schedule or a trigger for me" to a bigger step back and an awareness that there really is something more impactful that we can lean on. So we like engaging with people throughout that journey, and we've had the benefit of really getting to see people at different stages. I'll plant a seed for later, too, which is the challenge I oftentimes like to give folks: what is your biggest cost when it comes to data? Because for the last 5, 10 years, a lot of companies have heavily focused on the cost of their data infrastructure or their storage or their processing, and a lot of decisions were made based off of that.
But more often than not, the biggest cost is actually your team. They are your most valuable resource, and their productivity, in theory, should supersede even how much you're spending on infrastructure. What we're finding now is a shift towards how we actually maximize the impact of our team: not just how we tune or optimize a pipeline, but how we get more impact from our people and enable them to have greater leverage. And I think that's a really good thing, especially as we head into hard market conditions.
Your teams are your most expensive resource. They're also your most valuable resource. How do we now help them get more done and do more differentiated work than ever before?
[00:47:57] Unknown:
Bigeye is an industry-leading data observability platform that gives data engineering and data science teams the tools they need to ensure their data is always fresh, accurate, and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye's automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business. Go to dataengineeringpodcast.com/bigeye today to learn more and keep an eye on your data. In your experience of working with companies as they go through this journey of understanding and adopting automation, what are some of the most interesting or innovative or unexpected ways that you have seen them approach that question of data automation, whether jerry-rigged systems, or how they understand the capabilities of automation and what it enables them to do, or just some of those different aspects of this broader question?
[00:48:55] Unknown:
I've seen a few different approaches, and I think the perfect is generally always somewhere in the shades of gray, in that middle ground. Everybody starts with some imperative system, whether that's literally cron and Python scripts or an Airflow DAG or something along those lines. On one rough end of the spectrum, I've seen folks do the "hey, we're just going to create an abstraction layer on top," and most folks end up doing that, for obvious reasons. But I think putting Band-Aids and bolts on top of traditional or legacy imperative-based models is generally hard. I've seen these models of, for example, "all we need is the marketing team to deliver us a jar, and our system will run it for them every day." They sort of missed the piece of: I don't think your marketing team is going to deliver a jar to you to run. This is why we saw the pendulum swing in other directions and move more towards low-code and no-code systems for people to try to self-serve. The other piece I've seen, where teams have struggled, is the "we're just going to rebuild everything from the ground up." I actually wrote an article about this a while back, on how to avoid the common pitfalls of engineering leadership.
It's the greenfield allure of building an entire new system from the ground up, which usually looks like a couple-of-years kind of effort. I always encourage people not to do that, as you're never going to get 2 years. You're always going to underestimate the level of effort, and at some point the pressure turns up, and you've gotta go deliver something, and you've gotta take some shortcuts and cut your corners. And then all of a sudden, this beautiful, newly envisioned gen-2 system that you thought you were going to have is a slightly better version of the Frankenstein, because you had to get it out and you had to keep moving forward. So I also generally encourage folks not to try a whole massive replatform rebuild all at once, because I think that's really dangerous and really, really risky.
What I usually encourage folks to do, and what we've seen be really successful, is to develop a new pattern, and not try to build the platform up but instead go solve a specific use case. Go solve a "hey, we have a new pipeline that we need to build for x," or "hey, we have an existing pipeline that is literally waking Sean up at 3 AM every other day; let's just fix that thing first and get our heads a little bit further above water." And then if the patterns we develop there, if the system we put together, works well, great: let's move more things on. But let's incrementally and iteratively solve for that, and I've seen that be successful. It's actually also tied to one of Ascend's core values: we have this notion of build for 10x, but plan for 100x. As engineers, we always love to build straight for the 100x or the 1000x. And we work really hard on the "hey, have an architecture and a design and a plan for how you'll get there, so you know you don't paint yourself into a corner, but solve an immediate problem today that creates value, which earns us the headroom and the breathing room to then go solve for the next incremental one." So finding that balance is something we really encourage for other teams, and we've seen it be really successful.
[00:52:21] Unknown:
On that question of not doing a huge replatforming effort: one of the notable aspects of what folks are terming the modern data stack is the question of what the concrete interfaces are between these different stages, so that we can take one piece and replace it with another utility, or iteratively evolve the system and the architecture. In this question of data automation, what are some of those useful seams or interfaces where you can say, "I'm going to define a new component and implement that piece so I can plug it into this space that is shaped like that," and then do the same with this other piece? Just some of those useful seams for how to think about iteratively building out that automation capacity.
[00:53:07] Unknown:
The integration points are where we see a lot of value. One is processing and storage systems: what is the interface for the definition of a job? What goes into a job? Is it just raw code, or are there insights into what the code is doing and the relation between the inputs and the outputs that really matter? Related to that is the definition of a system and a connector itself: where do you read data from, where do you write it to, and how do you push down data transformation logic? Really important. Then there's starting to abstract away what that registry looks like for all metadata: for jobs, for users, for access, for partitions of data. Many of these, I think, have been solved for in cataloging and in other domains, but moving away from a passive model into a really aggressively active model of metadata collection, I think, is important.
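One way to picture those seams is as small, swappable interfaces. Here is a hypothetical sketch in Python protocol form (not any real product's API; every name is invented for illustration) of the job and connector boundaries being described:

```python
from typing import Any, Iterable, Optional, Protocol

# Hypothetical interface sketch of the seams described above -- not a real
# product's API. The point is that each piece is replaceable on its own.
class ReadConnector(Protocol):
    def read(self) -> Iterable[Any]:
        """Where do you read data from?"""

class WriteConnector(Protocol):
    def write(self, rows: Iterable[Any]) -> None:
        """Where do you write it to?"""

class Job(Protocol):
    code: str            # the raw transformation logic
    inputs: list         # upstream datasets this job depends on
    outputs: list        # datasets this job produces

    def pushdown_sql(self) -> Optional[str]:
        """Logic the engine may push down into the storage system,
        if the connector supports it."""
```

Because these are structural (duck-typed) protocols, any object with a matching `read` or `write` method can be plugged into the space "shaped like that" without inheriting from anything.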
And then, as you start to work up from there, it's tied more towards things on the observability side. What do I expect of my data? What are the assumptions that I baked into my dataset? And what do I expect to happen if those assumptions are violated? This is where, from a very pragmatic perspective, we see a lot of users really looking to have well-defined expectations.
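That "what do I expect of my data" idea is what tools like Great Expectations formalize. A bare-bones version, written here purely as an illustration rather than any tool's actual API, looks something like:

```python
# Minimal, illustrative sketch of declared data expectations. Real tools
# (Great Expectations, Monte Carlo, etc.) add suites, profiling, and
# alerting on top of this basic idea.
def check_expectations(rows: list, expectations: dict) -> list:
    """Return the names of every expectation the dataset violates."""
    return [
        name
        for name, predicate in expectations.items()
        if not all(predicate(row) for row in rows)
    ]
```

A pipeline can then decide, per expectation, whether a violation halts the run, quarantines the partition, or just emits metadata that feeds the automation engine downstream.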
[00:54:24] Unknown:
In your experience of building Ascend and focusing your time and energy on this space of data automation, what are some of the most interesting or challenging lessons that you've learned?
[00:54:35] Unknown:
I think the most interesting one that I've seen is that, because there's such a big gap between the technologies that achieve what we want to achieve, the hows of what we do, and the end impact, there's a very lossy chain of communication as I watch how teams plan what they do. As a result, we tend to see, fairly systematically, a lot of folks work from an architecture perspective. And this is, I think, why both data engineering and even data infrastructure tend to really be waterfall-style development cycles, even if we sugarcoat it in more agile methodologies and in how we hold our meetings; they tend to be much more waterfall from an execution model.
There's such a long chain of requirements for what we're trying to achieve that you usually see things built in layers, and you go a really long window of time until things see the light of day. The teams that are the most successful that we've observed, both in software engineering and in data engineering, are the ones that are able to sprint faster from raw capabilities to business impact and then morph what that architecture looks like over time. There's this really cool saying a friend shared a while back: the definition of great code isn't its performance or its readability; its ability to adapt to change is the most important thing. That may encompass the other pieces, and they may be inputs into it. The reason why that matters so much, the mutability and adaptability of code, is that we're in such early stages from a data ecosystem perspective. Things are changing so fast.
Big, hard, fixed infrastructure: if we prioritize that over mutability of design, by the time something we envision sees the light of day, the rest of the ecosystem will have already changed. So I think this is where we're starting to see more and more teams really prioritize the speed of change as the measure of what a great system, what great code, can do: how fast can you adapt, and how safe is it to quickly respond to new requirements and new change? I think that's gonna push a lot of really great innovation. It's going to drive more investments in automation, and it's going to drive a lot more investments in DataOps, just as we saw it drive great innovation in DevOps. That speed of iteration is going to become increasingly important for data teams.
[00:57:14] Unknown:
As people invest in these automation capabilities and build out more of their data flows to not be imperative and not require as much human time and attention, what are some of the ways that those automation efforts can go wrong?
[00:57:28] Unknown:
Definitely the "hey, we're gonna go re-architect for the next year." Ain't nobody got time for that, generally speaking. I think the massive rearchitectures are just generally hard; they're generally slow, and they're fraught with risk, so I always recommend avoiding that approach. The other thing that I do think is important is that as teams go down their automation path, it's important to traverse the decision tree back up to some of the higher root nodes, the ones you came down through. If you pursue that five-whys methodology (well, why do we need to do that? But wait, why do we need to do that thing?) and traverse it far enough, you'll oftentimes realize: well, we did that really as a funky workaround for this other limitation that literally is no longer relevant, like eventual consistency of writes on S3. Oh, we actually have immutable and idempotent fragments, and S3 now has strong consistency, so all that other gnarly stuff that we did we probably don't need to do anymore. So it's really important to invest those cycles (that's where I would put time) to revisit the assumptions.
The foundational drivers behind those assumptions tend to change very quickly.
[00:58:56] Unknown:
In terms of the continued evolution of the ecosystem and the perspective on automation, what are some of the areas that you're spending particular time and focus on?
[00:59:06] Unknown:
I think there's a couple of areas that are really interesting. The most interesting one, in the nearer term, is going to be this notion of multi-data-plane, and how we do really advanced automation across planes. It's really been boiled down into data meshes or data fabrics, with nuances as to which one is appropriate for which business and what they mean, and there are a lot of opinions out there in the industry as to what they are, so I'll leave that to the experts. The thing that I think is relevant is that the backbone of achieving either of those tends to lean pretty heavily on the need to connect into many systems: embracing the fact that your data will sit in many systems, and in fact will probably even be processed in many systems.
But maintaining continuity around metadata, lineage, automation, and access where relevant is incredibly important. And I think that relates to our world and automation quite a lot. I think the most successful data strategies around mesh and fabric will very much be automation driven. So we're spending a lot of time looking at that, and looking at the data planes you can integrate into and the clouds that you can connect across. And we have a lot of really cool capabilities coming out in the next couple of quarters tied to that trend that we see today.
[01:00:29] Unknown:
Are there any other aspects of this question of data automation and the work that's involved in making it a reality that we didn't discuss yet that you'd like to cover before we close out the show?
[01:00:42] Unknown:
It'd be a very interesting question to see when we get to, you know, even 50% of companies embracing data automation. My guess is we may be a couple of years away from that, maybe even more. I think it'll be really fun — maybe we should start a pool, you know, with our DataAware pulse survey, on what the percent penetration of data automation will be next year. It's gonna be somewhere above that 3 and a half percent. It'll be greater than that, but it's definitely not gonna be the 88 and a half percent of people who want it or intend on having it by next year. It'll be somewhere in the middle. I think that'll be a really fun thing to see what shakes out.
[01:01:16] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Biggest gap today?
[01:01:33] Unknown:
I can't say automation. That's gonna be way too easy. I would say the I'll give a hard answer to this 1. The tooling and technology we see today, I think, has become very successful at hello world. I I think getting all the way into complex production with a lot of tools is a significant lift beyond that. And in our world of product led growth and SaaS business models, I think a lot of teams in technologies have really tuned that their killer at it. Having a really smooth glide path to increasingly complex, I would love to see our industry software. I think too many data engineers get a pass that's high and wide on that, and they go back up to their exec team, pitch something that can be delivered, they believe, quickly. And as you really try and push to get things into production, I think it's harder. And so I think that's the current gap.
[01:02:29] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on this space of data automation and the different challenges and benefits that are on offer there. I appreciate all the time and energy you're putting into solving certain portions of that with your work at Ascend, and I hope you enjoy the rest of your day. Thanks, Tobias. Really appreciate the time.
[01:02:51] Unknown:
Thank you for listening. Don't forget to check out our other shows, podcast.init, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Data Automation with Sean Knapp
Sean Knapp's Journey in Data Engineering
Defining Data Automation
Challenges in Automating Data Workflows
Interoperability and Extensibility in Data Systems
Declarative vs Imperative Systems
Automated Backfills and Data Granularity
Evolution of Data Automation
Current State and Future of Data Automation
Approaches to Data Automation
Interfaces and Integration Points
Lessons Learned in Data Automation
Future Trends in Data Automation
Closing Thoughts and Industry Gaps