Bringing AI Into The Inner Loop of Data Engineering With Ascend

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.

DataFold's AI powered migration agent changes all that.

Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

And they're so confident in their solution, they'll actually guarantee your timeline in writing.

Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds

today for the details.

Your host is Tobias Macy, and today I'm welcoming back Sean Knapp to talk about how Ascend is how Ascend is incorporating AI into their platform to help you keep up with the rapid rate of change. So, Sean, for folks who haven't heard your past appearances, can you just give a quick introduction?

Yeah. Happy to. Well, first and foremost, thanks for having me back. It's always fun to to get a chat about data together. Quick way of introduction, I'm the founder CEO at Ascend.io.

Started my career twenty years ago,

accidentally

doing data engineering back at Google. I led the web search front end team. And when you play around with the web search user experience, you move around a bunch of pixels, and you pose a lot of theories, and then you process a lot of data to see what actually happened. And so that was really how I got into data.

Started a company back in 02/2007,

built that up to

about four to 500 people.

We had over,

60 people in our data product,

team and got really excited about data, and that eventually took me here to Ascend.

And for folks who haven't,

come in contact with your work yet, can you give a bit of an overview about what Ascend is and maybe some of the story behind how you got to where you are today?

Yeah. Happy to. So Ascend provides an automation platform for data engineering teams. We focus really hard on taking the

both repeatable

parts of data engineering, a lot of the the toil and the the rote work, as well as a lot of the sort of harder parts around going from

simple basic data pipeline to far more complex, large scale, large complexity data pipelines.

And we look to help productize a lot of that experience and help automate a lot more of it, both through advanced layers of automation, but now, obviously,

given these days, a lot of, help around AI as well,

that makes the life of the data engineer actually delightful.

The last time we spoke, I don't think we talked too deeply about Ascend specifically, but it was in August of twenty twenty two, and I'll link to the episode for folks who want to listen back to that. But that happens to have been kind of right at the precipice of the current era that we're in right now where generative AI has sucked all the oxygen out of the room for everybody. And I'm wondering if you can talk to some of the notable ways that your platform and the ways that you think about the problems that you're solving have changed in that space

and the role that AI plays in the data ecosystem for people who are just trying to keep their business running and empower them with data?

Yeah. So a lot has changed

clearly

in the last two and a half years, and it's actually you know, when we last spoke, I think it was just right before so much really had had dropped into the space. And, you know, I think when we look at data teams, first, it it you know, that old saying where,

the cobbler's children have no shoes. It it feels oftentimes like this where data teams themselves,

while powering such incredible experiences

for their consumers, their their companies,

oftentimes get some of the weakest tooling and the weakest capabilities to actually make their lives easier as well. And so I think what, you know, what we've seen over the last couple of years is a tremendous excitement around AI, especially as it pertains to the software engineering realm, a tremendous amount of interest in more data

to drive

really powerful experiences,

but we haven't seen yet a lot of, I think, really interesting investments that just make the life of the data engineer easier, more delightful themselves. And I think we're we're on the cusp of some really cool new innovations there that will change that much for the better. ChatGPT

was the, I guess, shot heard around the world, if you will, in the AI space where for the first time,

generally, AI

broke

from something that nerds talked about and speculated about and worked on to something that everybody was exposed to and incorporating into their daily lives

because of the fact that it was available as an API in a web interface in a manner that didn't suck, frankly. Yeah. Since that time, there has been a lot of evolution in the capabilities of those AI as a service, as well as the open ecosystem of models, as well as the ways the AI is incorporated into different tools and technologies that we all use every day. For developers, it has taken the form of things like Copilot and some of these other language model assistants that live in your editor.

For people who are content creators, it's in the form of things like Grammarly or Gemini in your Google Docs where it can correct your grammar, generate content for you, etcetera.

In the data space,

there are different tools that you can use to bring to bear where maybe you feed your schema of all of your database tables to an AI model so that you can use that to propose

a dimensional structure that will incorporate all of that information.

But it still feels like there's a lot of friction to actually getting the data that you need to the thing that you need to make the decisions for you. Different platforms are trying to make that more native. The Snowflake folks, Databricks, you know, they're obviously very invested in making AI a native part of their experience, but I'm wondering what you see as some of the most impactful areas for the application of AI to the work that data teams are doing and the ways that they can

accelerate the pace at which they're able to iterate, their own problems and the problems of their organization?

Yeah. I think that's a great question. First thing I would say is data engineering has yet to have its own chat GPT moment. I think the industry as a whole

is applying

what are the earlier stages

of innovation, but we haven't had that incredible

moment yet for data engineering. And by that, I mean, you know, when you saw ChatGPT, it was sort of this moment. When, you know, you first use

Copilot or Cursor as a software engineer. That's a, wow. This is is really changing

how I work. I think

the prevailing sentiment to date that we've seen

when it comes to data engineering,

around and the assistance of AI in data engineering has kind of been this,

that helps a little bit. It's kinda nice. I'll take it. But it's not this mind blowing

game changing

experience yet. And I'll dive into, I think, a little bit of why we're at that. I think what we're going to see is the early stages, and we're starting to see that. I think, you know, we're seeing that from an integrated chat perspective in

both Databricks and Snowflake's,

experiences. I think we're going to see,

others also start to innovate a lot on, well, if I can get my entire catalog, can you then help me define a model that will, you know, generate a bunch of new data or will actually take into account everything out of my schema and help me generate new datasets. But when I think about where that tends to sit, that's going to be further downstream closer to the analytics engineer workload. What we haven't seen yet is what I would describe as, you know, either the cursor for data engineering or even the Devon for data engineering experience.

And I think of that as something that plays a far more proactive

role in the actual data engineering life cycle itself. The way that I like to think about this is in a few layers. You know, the baseline parts from a cursor, for example, is look. It it's doesn't even have to write entirely new applications for you. Cursor is quite delightful as a software and engineering experience because it just completes your thoughts really, really, really darn well. And its composer capabilities are incredible and amazing, but we we we don't even have that yet, I think, from a data engineering perspective. Once we get that, I think we'll then start to get into more of pipeline development

and architecture and design and even optimization. The other thing I think we'll start to see in data engineering is because data engineering is not just a write code, test it, ship it, but it's a far deeper in observability,

you want to scale it far greater across engineering orgs. We see a lot of interest

in allowing data engineers to even write their own bots and their own agents

to do specific actions and take specific activities

on their behalf as part of that data engineering life cycle. Now if I kind of hit one last point, I think, that's really important to this as well is, you know, why haven't we seen this yet? I think it's harder in the data world. In the software world, we're writing software, but in data, we're writing software, and it's being applied to to the data in in a really interesting way where most of the tooling, most of the architectures that we see from folks, even though they actually touch and process data, they tend to actually not keep track of a lot of that data. They tend not to keep track of the metadata long term. They tend not to keep really close tabs of what code was run on what piece of data, what was the historical performance over time, who made that change to code, when was it deployed,

what was the semantic change that happened

to that that new data operation? And so because of there isn't that unified metadata and understanding,

it's hard to create a data engineering agent or set of agents or utilities, if you will. Right now, we're seeing software engineering agents or tools just being exposed into the data engineering role. I think that's will be the next big step functions if we can actually create that combination

of data and software combined. And I think too that one of the reasons that data engineering has not caught up to where software engineering

is in the adoption of AI or the capabilities that it offers is that we never got to a point where the development environment caught up with where software engineering was because of the fact that it needs to be so close to the data and so aware of that data to be able to do that development and testing work. So there have been lots of efforts, especially over the past five years, to try and close that gap, but we never quite got there. And now AI is causing that gap to widen again because development is pulling ahead of where data engineering left off, and so we're we're still playing catch up in that regard of we don't have all of the wires connected to be able to say, on my local laptop, I can do all of these things and have confidence that whatever I'm changing is the right thing because I have to actually test it in production and see what breaks and then go back and try again where which is where we never quite figured out what is the pre prod for data engineering

for

the entirety of the actual workflow where we have bits and pieces of it of the copy and write tables in Snowflake or the versioning and branching capabilities. And things like Iceberg brings in CI for your orchestration system, but then there's still all the missing wiring to make sure that that ties into all of the copy on rate tables, etcetera. So you can get to a point where you sort of have that, but it requires a lot of effort that, at the end of the day, has no actual impact on the problem that you're trying to solve for the business. You're just trying to solve your own problems so that you can get to that point, and you don't necessarily

have enough time and leeway to do all of that engineering effort to solve your own problems to help you move faster. Yeah. I I agree. I think, you know, this might have been one of the

unfortunate and accidental side effects of the modern data stack in many ways, which was we had so many different tools and we're and that were propelling us forward. We didn't get enough time for that to really bake and settle for everybody.

And, you know, I was chatting with a bunch of folks yesterday about this, which was so much of the team's time was spent integrating and trying to actually integrate at a metadata layer as well, but nobody completed that job. And as a result, there was this rapid change in what you could do where you didn't have to use these verticalized,

tightly integrated,

tools, and you could use more modern day, capabilities. But as part of that, you had to do your own integrations.

And when we think about

where a lot of the advancements come from AI is are in these more tightly coupled systems because

AI, you get this exponential effect from just the the power and capabilities of those those models, the more data they can have access to and and the the more purvey that the system itself has access to. And so I think that's why the

it's changing so fast. It's becoming very hard for for teams to to keep

up. The other piece of this too is

that the introduction of AI, not even just in terms of the data engineering workflow, but at the organizational level is that all of these businesses are now hurrying to build their own AI based products or incorporate AI into their existing offerings,

which forces these data teams to try and leapfrog

the data maturity cycle of, I just barely got my business intelligence system set up, and now I have to be an AI expert. I never even got through the machine learning, build your own model cycle.

And so we're trying to, you know, race so far ahead of our skis that we don't even really know where we're going or, you know, overdriving our headlights as it were. And so I'm curious

how

the capabilities of AI as an accelerator to the work that we are doing can maybe allow us to catch up to the point where we're actually driving at a reasonable pace so that we can actually see where we're going rather than just racing through the dark and hoping that we don't see somebody coming around the next corner or whatever, however you wanna stretch that metaphor.

Yeah. Totally. Well, I so I think it you know, we are gonna see AI help in a number of ways. The, you know, what I would call code and data aware development environments, we are going to see a a pretty significant surge in. At least, I hope so. I mean, clearly, these are these are things that, obviously, we're working really hard on and and building very quickly at Ascend for that reason. And and so I think that will be the the very baseline level is, does your system understand what you're working on from both dimensions

to accelerate? I think we're going to get

additionally a lot of assistance in this classic shift left or move upstream model where we can help more people self serve. We can help more people democratize

their access

to increasingly

complex use cases where, you know, oftentimes we would try and solve these problems

with basically no code interfaces. And I think GenAI is a really powerful alternative to no code interfaces

as it helps bridge the gap where you can use Gen AI to generate code. Business analyst could actually use that, and an engineer could still actually understand and be comfortable working in the same platform if it boils down to to actual code itself. And so I think that will start to change a lot. I think then we're going to also see a lot more AI is we still see teams massively saddled with a huge amount of maintenance burden, on legacy systems.

It's one of the

primary draws on Teams time based on on our annual surveys that we do. And one of the challenges we see from most teams is I would love to either have AI manage this for me and make smarter decisions around how we just maintain

existing systems? Or even better yet, can I get AI to help migrate that off of this old

clunky legacy stack and actually get it into a modern architecture so I can run a a unified,

approach again? I think that's where we'll start to see some pretty powerful lift that will get data engineers out of this just crappy toil,

of of maintaining old systems.

I think another

barrier for data teams to the wholesale adoption of AI is the consideration

around governance

and data access control where you don't want to give unfettered access to some proprietary model or some third party company to be able

to operate across your data with impunity

because it has so many risks associated with it. And so then you're left trying to say, okay. Well, how do I either run my own model? Or I just can't do it, so I just have to continue with the status quo where with the software example, the software itself is typically

much less regulated, and there are fewer risks involved with it.

And a lot of the companies that do have their

AI focused code assistance already have some capabilities in place to say,

this is never going to be used in our training data, or this is being run in some sort of gated fashion so that your private repository is not ever going to be leaked or trained upon, etcetera. And so I'm curious how you're seeing the adoption of AI

in terms of the deployment and incorporation

of our people just saying, whatever. I'll just go ahead and use OpenAI or Anthropic. Are they running their own models in

Bedrock or self hosting it on VLLM or whatever,

and some of the ways that they're thinking about

what model capabilities they even need to be able to get their work done where maybe they don't need the 400,000,000,000

parameter model. They just need the 14,000,000,000

or whatever it is. Really great question. You know, we see a

pretty wide range

of, let's call it, readiness

and

hardened thinking,

across

our our customer base and across those we just interact with in the industry as well when it comes to AI. There's a number of companies that we'll talk to where we've been asked, are you allowed to use generative AI solutions in your development experience?

And we still get answers from from some folks where they they don't know. And there's no mandate that they should or that they shouldn't, but the company hasn't decided yet, which I'd say is more rare, but that's on the the sort of extreme trail end. Where we see with a lot of enterprises

nowadays and a lot of more mature teams

is, I'd say, pretty stringent

policies, but not overly restrictive.

And where they tend to focus pretty intensely is

you've picked your cloud or you've picked your data cloud, and you generally stick to that. So if you run if you're an Azure shop, you're fine with OpenAI as long as it's hosted on Azure. If you are a Snowflake customer,

and this is a huge value prop for, Snowflake customers is you have access to a ton of models. And as they they just recently announced,

even OpenAI models soon as part of your Snowflake environment. And I think that's what we're seeing is the sheer demand

for access to to these generative models ends up being

so compelling

that both the platform providers as well as are racing to at least get it within some sort of governed walls. And at the same time, companies are willing to, even with some of that ambiguity,

press forward because they know there's so much value. And so that's what we're seeing is the if I can at least stay within my data governed walls,

so to speak, where most companies tend to sit. And in terms of your own

work of integrating AI into the

platform capabilities,

what are some of the ways that having it natively offered in the substrate versus as a bolt on changes the ways that you think about what it can do, how it can do, and what it should do, and then also some of the ways that that enables your customers to

reduce some of the busy work that they would otherwise be engaging in? Yeah. Great question. You know, the a couple of fold. You know, one is, hey. Look. When when something goes bump in the middle of the night, something broke and you wanna figure out what went wrong. What classically can be a multi hour triage

of when did the last deployment go out? What was the, set of of changes that went out as, code changes as part of that? What was the state of data before? What's the state of the of data now? Do I have to actually replay or recover or

somehow fix and repair the the underlying data? Is it safe to revert? Is it not? All those kinds of questions, generally, as a data engineer, would require

your your Git history on one screen, a bunch of your observability tools on another, looking into your Snowflake or Databricks

ecosystem

on yet another screen and trying to deep dive. But the reality is

if you can actually have agents that actually have full access to your Git understand the entire run history for all time, for all of your data pipelines, They know what changed, who changed it, and when did that go live. All of those kinds of and and they can also look at the data if you've allowed it to to do so. That entire recommendation remediation step can largely be alleviated by an agent to at least make your life a heck of a lot easier. They're not gonna take over, I think, in the control seat anytime soon.

But the you know, imagine it is sort of the equivalent of, you know, for those who play with the deep deep research capabilities coming out of a a bunch of the

the AI providers. Imagine you had your own triage report ready to go by the time, you know, you've gotten, you know, the pager duty alert and you got up in front of your monitor and you could just digest all that information. Life would be a lot easier. So I think there's things like that that are really, compelling and helpful. I think in more of the day to day experience

where this becomes really exciting is, why do I even have to search a catalog?

Like, why do I have to remember the table names? Can't I just describe more of what I want to actually

do? Can I get a the intelligence of a cursor like experience,

but that is actually aware of all the data that I have access to and can start to suggest things that I should do? Or even better yet, if I'm writing a data pipeline, wouldn't it be amazing if I could also have it suggest, hey. Are there smart partitioning strategies to keep my compute cost down? Are there, data quality checks I should be running continuously on my datasets

based on the nature of the data that's already available to me that I can see? Those are the kinds of things that, as an engineer, I I shouldn't have to spend time on in this day and age, and we should be able to actually have AI offload huge amounts of that workload before. And in terms of the engineering effort of actually operationalizing

AI and incorporating it and and embedding it in your platform, what are some of the challenges that you had to address, some of the ways that you had to restructure the core of the system to be able to feed the necessary information

as well as the overall architecture of the AI system itself, whether it's just a straight chat based or tool calling or agentic and just some of the ways that you iterated on that problem space as well as you started to incorporate it into the the core of your functionality?

Yeah. Great question. You know, for us, this has been

a now a year long investment

in rearchitecting

from the core out. And one of the things we've done at Ascend is we've actually rebuilt the platform for the AI era. I think it's very hard to take an entire architecture that's been baked for

five, ten years, even a few years, and say, you know what? We're gonna try and, like, weave AI back into the core. I I think that's very challenging. I think that's where we're going to see a lot of, companies in the space

really struggle because they weren't architect and design with that to begin with. And what I mean by that is, you know, as part of our new architecture and design at Ascend,

we store everything. And we actually store it away from our cloud hosted environment. We actually store, for example, all get history we have, access to for our agents,

all metadata access, which is literally

everything that's ever been run, the performance characteristics,

the git SHA that was run, the SHA of the code that was run so we can detect changes over time. And that across every developer

workspace,

across every deployment of the environment, across dev to stage to prod. And by tracking all that and making that data available to a model, it becomes incredibly powerful. And then what we've ended up doing is as part of this entire unified metadata core, we keep full history of that in the underlying, big data, systems, whether it's Snowflake, Databricks, BigQuery, you name it. And then at the same time, we put it all on a real time event bus

so that, the system can take real time actions on it as well. And so by combining both the the streaming nature to to do instantaneous activity, but the full historical nature and feeding that into the the agents

creates a huge amount of power and capability for us. Digging now into the technical architecture that you've settled on,

obviously,

AI

inference can be a very expensive proposition. I can't imagine that you're just saying, we'll just throw up an AI at it. And so I'm wondering how you worked through some of that experimentation

and validation

of what are the actual runtime operating costs. How do we build up the operational maturity around actually having to own this capability,

and some of the ways that you thought about the size of the model that you needed, the inference costs, and the the compute necessary, and what you settled on as far as whether you run it yourself, rely on the cloud provider, rely on, one of these third party GPU clouds, etcetera?

So we've definitely decided to to rely on the cloud providers. And the primary driver behind this is is actually related to what we were talking about before, which is most companies we interact with have generally decided on a camp. I'm a Google camp, which means I'm going to use BigQuery and Cloud Pak for my data processing and storage, and I want to use Gemini for my models. Or I'm running on AWS, and I wanna use, maybe at some point in the future, s three tables and a bunch of glue capabilities,

and I wanna use Claude on Bedrock, for example.

Or I'm on Snowflake, and, similarly, I wanna use the anthropic models and use all of the the amazing sort of cataloging and capabilities

with, Snowpark and and native, Snowflake SQL. And so what we see is

across those different camps

that there's a a strong draw towards keep me in these walls, and and then it's very smooth sailing from a capabilities

perspective. What that means from the Ascend perspective is we've taken much more of a how would we from an architectural perspective, how would we design

sort of a Kubernetes

for data engineering from a platform perspective. And so when we think about that from an architectural

perspective,

we separate

out where you store your code, which becomes highly configurable. Then we separate out, where do you actually store your data, and we separate that from where do you store your metadata,

all historical operations, which usually ends up being the similar underlying,

data platform, but in a fully isolated, far more secure location. Then as we start to build up on top of that, a lot of our design is

to be model agnostic. So you may be running,

Snowflake on Azure, and you wanna use OpenAI for your models,

and for your agents. Whereas,

conversely, you may be running in Databricks and wanna use some native, Databricks hosted models. And so we we try and keep very agnostic

around what underlying models we support because we wanna be able to tap into the ones that are closest to the data. Then what we build on top are is our

tooling and agentic,

layer itself.

One of the really important things we've learned over the course of the last year is there's no one architecture and no one

structure to to rule them all. So, for example, our chat interface is a a multi agent

model that has multiple experts that are really good at writing SQL verse,

generating, YAML definitions versus writing Python. And each one has access to

different parts of our documentation repository

and rich sets of examples, so they they can really help you. And, of course, access to all of your own code. And that becomes really different than if I wanna create my own agents as a user of Ascend. The structure that we create behind that is very different than a structure for our chat and what we a a chat agent. And that's what we found is we've been trying to break it into each one of these sort of constituent elements and then expose through different agentic architectures and different sets of tools all of the underlying components that you would need to deliver different activities.

And

in terms of the actual

models themselves, feeding them the appropriate

data, have you found it most useful to go with the RAG based approach? Are you able to just inject directly into the prompt the context that you need? Are you investing in more fine tuning capabilities to be able to optimize the the model capabilities and the functionality? Just some of the ways that you think about the optimization and evolution of your capabilities.

Yeah. Great question. We've, to date, primarily focused

on RAG and prompt injection.

The reason we've done that is

the when we generally look at the

volume of data, distinct of the volume of context that you would put in. While there's some benefits to fine tuning, that gets much harder when,

customers are wanting to bring their own moments. If we start to fine tune, then we have to to focus on fine tuning across the various providers. And we've given the nature of the problems we see in data engineering,

the data itself isn't that large by comparison. The amount of code you have in your in your repository

in a a sort of tighter declarative architecture, it's not so large that you can't be highly effective with large amounts of RAG and prompt injection.

And what we focused really hard on is making our models, our agents, very, very good at doing reg and calling tools to get access to more of that data, and that's been really quite successful. Now I will say too the one of the really cool parts about building these tools for data engineering

teams

is data engineering is such a valuable role

and profession

that we don't have to tackle problems that a consumer tech company would have to solve, which is

we don't really worry about the cost per invocation

or the cost per token in the same way. We care about some of the response time speed, but when you're trying to help a data engineer that's such a re a a resource constrained or such a tapped resource in the tea in a team and is so so valuable, it's really easy to justify throwing the largest, most expensive model so there's so long as they can respond quickly to provide the best experience to the the user. And so we find that we just don't have to spend as much time in the data engineering side of the house trying to tune and refine smaller models that you wouldn't necessarily have to do if you're running,

either things locally or that you if you were trying to run things, at a consumer

scale. Now in terms of the

capabilities

that it offers to your customers

now that they actually have AI as part of their inner loop of development,

what are some of the ways that it changes the

types of work that they do, the way that they approach the work, maybe the volume of work that they're able to complete, and just some of the ways that it shifts their overall thinking about the work that needs to be done and how to approach it? Net effect is obviously you know, hopefully, people can produce a lot more. And I think, you know, more importantly, though, is

where teams can spend their time. The you know, what

we really are starting to see and and really hope that the whole industry will help deliver on too. But where we see is far less time should be spent on documenting and testing, though debatable how much teams truly do because that's usually the first thing that gets cut in a pinch.

But everything from development times should be much lower. One of the other really big things we see that we see wild variability across teams is even code review processes.

Platform code hygiene itself across teams varies quite wildly. All of these things should largely disappear.

We should see incredible consistency

across teams and and far more efficiently. So, yeah, I think the you know, if if you use SonarCloud or other kinds of of similar software that analyze the quality of your code, we expect to see dramatic increases in code quality across customer bases, where we'll see greater levels of consistency,

greater levels of reusability, higher testability, higher documentation.

And these are the kinds of things that I I think AI will will be knocking out faster than anything else, which I think is incredibly valuable, especially in

large, large teams where there's,

hundreds of developers

on these platforms.

That guaranteed level of consistency and durability then

helps new members of those teams ramp up faster, be effective quickly, pays into a very virtuous cycle. So those are a couple of areas I think in in a very immediate

term are are really powerful benefits that we're that we're seeing.

On the documentation

side too,

one of the aspects of that typically feeds into

governance and understanding

what is this field. And once I know what this field is, then I can have a better and more informed decision about how it can and should be used.

There's long been automation available for doing,

profiling and sampling of your data, doing automatic PII tagging.

What are some of the ways that

the AI capabilities

and the fact that you can auto generate some of this documentation

that is typically left to the wayside

improves the visibility and effectiveness

and

strategies around governance as that data becomes

more of a conversational

interface and less of a, well, let me just go through this spreadsheet. And after about the 15 flying, my eyes glaze over, and I'm not paying attention anymore.

Yeah. I mean, honestly, every time I cut a PR for

some sort of emerge off of a data pipeline, we should have AI

re updating all of my documentation for me. There there's no reason we shouldn't. In many ways,

and this is even a a development practice we we have internally at Ascend, we have what we call pre PRs.

And part of the goal of the pre PR is document the things that you're actually planning on doing. You know, what is it? Why does it matter? What's the customer impact? And we try and get a

a detailed enough pre PR

such that

anybody who's dropping in could even check and be like, hey. By the way, have you thought about x, or, you know, could I help you with y? What I would love to see is not just the post

PR, update all my documentation,

do all these other things, but I would love to see a model where whether it's through a chat interface or an actual, you know, draft pre PR, if we could get teams able to describe here's what I want to do, here's what I'm thinking about doing, And oftentimes, it's hard to get an engineer to do that into a chat prompt, actually. But if it was just a simple markdown file describing your thought process before you got started, not only is it an amazing ability for other members of the team to help and contribute some of the thinking, But that's where we should be able to see AI

be able to actually go and tap into, hey. Here's all the data that's in the catalog.

Here's the things that you probably would wanna look into that could really help you. And here's even some drafted updated models or brand new models that you could, leverage. And I think the if we can get into that sort of a practice

as data teams,

the

ease with which we should be able to build becomes

pretty

significant.

On the junior versus senior, another

aspect of AI

is that

for for both sides of that coin, the AI can serve as a stand in for the other. So as a junior, having an AI right there

can be your infinitely patient senior developer who has all of the context and history available to them.

And for the senior engineer, the AI can act as the stand in to the junior developer that you have to explain things to that helps you understand the problem more effectively.

And

often, the challenge is getting both of those participants to actually engage with an AI as part of their standard flow. I'm wondering how you're seeing that

reticence

manifest in the teams that you're working with of no. Just talk to it. Yes. It might be stupid or you might not understand it, but it's worth just engaging with it because this is a skill that you're going to need to develop and be able to feel comfortable with as it becomes more prevalent and part of absolutely everything in our lives going forward.

Mhmm. Mhmm. So I think you you highlight something that's that's really interesting too, which is look. Like, was it probably a year and a half ago I gave this presentation,

to the the company, which was none of our jobs will exist

in twelve to eighteen months as they do today. Like, we we all have to rethink this. And it's very akin to you know, we always have talked about digital native and digital non native companies, and the non natives had to go through their transformation.

Yet pretty much every company in existence today, minus a handful of very early startups, is an AI non native company, and we all have to go through this transformation

together. But the difference is,

well, Moore's law has actually driven a pretty

slow growth rate by in by comparison. We always thought Moore's law was was, you know, fast compounded growth rate. But when we look at the computational power of AI, it's growing 13.3

x year over year and has been for the last, decade. Looks like it's probably going to grow 13.3 x year over year for the next decade as well. It's basically like taking

all of the change we've seen since the launch of the iPhone back in, I think, 02/2007

to today from a a software perspective, computationally, at least, and saying, hey. All of that change is going to be possible and probably gonna happen in the next two to three years

instead of, you know, the last eighteen years. I hope everybody's ready for this. And

the the reason why we went through that conversation is I put up this this very funny GIF, which we have a very strong, like, GIF culture in in the company, and it was of this movie Shazam where there's, like, you know, teenage high school kid. He has spinesy superpowers. And he's like, ah, this is great. I have superpowers, but I don't even know how to go pee in my my super suit. And it's like, yeah. It's kinda like the awkward stage where all of us have these superpowers in AI and some things it does amazingly well and some things it is just absolutely terrible at, and we're left so disappointed. And I think the reason why that becomes so important is the just because you're disappointed in it today, which will happen often with AI, it is

improving

and

accelerating at such just this unprecedented

rate that you gotta stay on top of it because saying, ah, that being yeah. I tried that last quarter and it sucked. This quarter is probably gonna be pretty darn awesome. And then you'll hit a new threshold where it's gonna really disappoint you. But, like, figuring out where those boundaries and those parameters are, I think it's really, really important to stay on top of. Because if you don't,

how much you're going to miss out and how far behind from a skill set perspective you are of just intuitively knowing, ah, if I craft a prompt this way or I craft a prompt that way, which I would also contend is probably pretty in line with how you should talk to another

engineer around

consistent and clear communication.

If we don't learn that that model,

you're just you run the risk of falling behind. And I think the the world needs more really great data engineers solving really hard problems. And so we don't want anybody to get left behind on this on this journey.

Absolutely. And I have been reading the book Co Intelligence recently, which is a very accessible

book by,

a non computer scientist academic talking about the impact of AI on everyone's work and lives. And one of the quotes

that I think was noteworthy was assume that whatever AI you're using today is the worst one you will ever use.

Yeah. Totally. It's getting better so much faster. I mean, and and remember when it was chat GPT first came out, we also got GPT three five, and it seemed like magic. And now we look back, and it seems so basic by comparison.

It's the how quickly we've sort of been readjusted

to

the new capabilities.

In many ways,

I feel bad because we're all just given this, like, magic. Like, it's almost freaking magic. And somebody just showed us this amazing thing. It's like, look at how cool this is. And we're all like, yeah. But what have you done for me lately? And we're getting

drops of amazing new innovation

on a monthly basis

right now. And so I think that's the really cool part is they are just getting better and better and better. You know?

The the moment o three came out, we were playing with it extensively,

especially from an engineering perspective. And

already, like, the, you know, the we've been testing out a ton on on the o three mini models and their ability to generate incredible output for incredibly complex

data problems far greater than what we were seeing with four o and and even,

to cloud three five. Three seven's looking great as well. But it's just, like, every month better and better and better.

And as you have been incorporating

these AI capabilities, you went through this whole replatforming

to bring AI to the center of your offerings. What are some of the most interesting or innovative or unexpected ways that you've seen that combination applied by your customers?

Good question. The

I'd say for our customers,

it's more around

their productivity.

The customers of creating

AI products,

I think it's still

trailing a bit for a lot of especially big enterprise teams. They're so battling with how do I get out of a ton of legacy

challenges, and they're trying to figure out how do they get their teams off of Informatica,

off of traditional

SSIS

and other more traditional systems, or even how do they simplify down there and make more production grade their modern data stack investments. And so I think that's where the sort of true reality for a lot of, at least, enterprise data teams is still battling to get their head above water, you know, as we were talking about before. And in your own efforts of going through this replatforming,

reengineering,

understanding

the utility of AI in the context of Ascend, what are some of the most interesting or unexpected or challenging lessons that you learned?

I I think the the most important one

was, you know, for us, this really big bet of you do have to rethink all of the assumptions.

Without rethinking the those assumptions, I I don't I don't think we we would have got to the sort of incredible platform we have today. And that was a big bet because you have to to take a step back and say, hey. Look. Like, we're not gonna process data the way we did before. We're not gonna store metadata

the way we did before.

And when you'll even put in hard remits around some of the crazy things we did early on where we're not allowed to have a database

of our own. We have to store data in the customer's environments

so that the agents have access to it, which people throw all sorts of concerns when you do something like that, which sounds like that's crazy from a classic SaaS distributed systems architecture. You're talking crazy talk. But when you put it against the backdrop of the automated engines and the AI agents have to have access to everything, and we have to be able to run all of these inside of a customer's

environment, all of a sudden, it starts to make a a lot more sense given those parameters. And I think when we think about trying to do that

from any of the classic architectures, you're just you're fundamentally hamstrung. For people who are interested in the experience that you're offering of AI at the center,

automate away all of my toil, what are the cases where Ascend is the wrong choice?

Good question. If you like writing a lot of code, we're probably not your jam. And I say that jokingly, but there's definitely places where, you know, we encounter teams oftentimes who

really want more of a a software experience

that they want to do more imperative style work versus declarative systems. And there's always some of that push and pull or give and take between, an automated system that you can do things in a tenth as much code, but you have to relinquish some of the automation into the the underlying system so it can take care of those things for you. And what we oftentimes find are if you do like writing a bunch of code or you do want

really specific ways, that are nonstandard

with where a lot of the industry and sort of best practices have evolved, if you really wanna do those things, Ascent gives you escape hatches to do those kinds of things. But at that point,

you might be better off with just a generic code orchestrated by an airflow, for example. And that and that may be okay for for other use cases. And so we do try and and, you know, be really transparent to folks about, hey. We are a really advanced automation and and AI backed system. If you want less code, you want the system to take more of the toil for it for you, that's great. But it it comes with those trade offs. As you were mentioning, Airflow,

etcetera,

it made me also think

in the act of replatforming,

I know that at least in your initial formulation of Ascend, you were very invested in the Spark and Airflow

combination of that ecosystem.

How has this replatforming

and the centering

around AI changed the ways that you think about what tools you can bring off the shelf versus what you have to build custom?

Yeah. The, we have this joke internally where, you know, we try and embrace this mantra of, look, if for whatever the task or the job is today, if we could just swipe a credit card or PIP install something, we would be so delighted.

And the default approach is not just, hey. What exists today? And let's avoid that that that classic engineering, what what I like to call accidental ransomware,

trap, which is, oh, but I could build it better. This thing's missing that that feature. Like, I could just build it better. And and the question we always try and ask is, well, what's the trajectory

of that thing? Is the industry going to standardize on it? And how much time and energy is it gonna take? And is it really gonna differentiate us that much? And let's be really pragmatic about it. And what we've, you know, really come to to learn is, look, we like this Kubernetes for data architecture

where we think about, look, you have a a data plane. You know, these are sort of atomic units. Like, it stores data. It processes data. It generally carries a catalog with it. It maybe it's gonna toss in iceberg tables or delta lake tables, but it's going to take care of that for you just like you have code storage somewhere, just like you have metadata storage somewhere. And so what we did with this architecture was try and go very agnostic

from a you're gonna have your data processing in one of the hyperscalers

or one of the data clouds,

and you're going to similarly have your code in one of the really popular repositories. You're similarly going to use one of the the mega models. And for us, what we decided was let's be not opinionated about that from a platform perspective and instead be opinionated at the data engineering

side of, hey. What do good pipelines look like? They're combinations

of Python and SQL and generic tasks, and

they sometimes materialize data, and they sometimes don't materialize data. They sometimes

are incremental, and they're sometimes hyper partitioned.

Sometimes it's transactional data. Sometimes it's mutable data. And when we took that step back and just looked at that data engineering life cycle,

the the end factor of how many things there really are isn't that high. And if you can really intelligently support those, that's where you can streamline a lot of the the data engineering experience. And then let people plug in the parts that they want. And if you let's say the last piece is for those platform teams that support a lot of the data engineers and the analytics engineers, the key sort of final part of that is give them this tremendous ability to extend and customize those behaviors, just like what we've seen in in the Kubernetes world. If you do that, it it it really can can meet a lot of those really compelling needs from the platform team to the data engine team to the analytics engine team, even off to the data ops or, you know, support team depending on on on how you cluster that.

As you continue to

build and iterate on the Ascend platform,

the AI capabilities that you are centering around, what are some of the things you have planned for the near to medium term or any particular

trends or capabilities that you're keeping an eye on? A few fold. One of the things that we're really excited about is,

we want the platform team and the architect team well, any user, but we anticipate

more platform and architects. We want them to be able to write their own agents that operate inside of the Ascend environment. So, I mean, how many times for, you know, for you as, you know, a senior member of the team are like, I keep having to review all of these pull requests. I have the same feedback every single time. Couldn't I just delegate this to at least a first pass or a second pass? Or,

every time I get paged for memory pressure on a device,

or on a node, couldn't I try these two remediations first beforehand? And until we we really want to not just provide what Ascend envisions as really powerful AI capabilities, but for these teams that are working, collaborate together on data, wouldn't it be amazing if they could actually write their own? That we think it's super super,

coiling exciting. The other thing that we're paying a lot of attention to right now is

the surge on the reasoning models. That starts to get really interesting for a few reasons. While they're slower, so they're not as good in a chat based experience oftentimes,

that that will solve too. The things that we used to have to have a multi agent

complex

set of steps, things you would do asynchronously

or in more of a devon style workflow, which is, hey. Try these things. Get a feedback loop. Did the did this data pipeline compile? Like, does the flow actually run? Does it actually produce the data that I expect it to? That starts to get really, really interesting. And where I think this really becomes compelling is we have so many customers that are deeply, deeply interested in migrating

legacy workflows to Ascend. But at the same time, that is expensive to grab a bunch of bodies to say, look at this old Informatica data pipeline, or look at this old SSIS

pipeline and put it onto a modern architecture.

It still takes human time to go do so. And the just sheer volume of demand we see to be able to take that and say, you know what? If you can get agents to do eighty, ninety, even 95% of that work for me, that becomes economically compelling in a way that it wasn't when I had to think about throwing dozens, if not hundreds of developers for a year at at this migration problem. And so I think that's where we're gonna see a huge amount of of shift in the market. Are there any other aspects of the work that you're doing at Ascend, the ways that you're incorporating

AI into the experience or the overall impact of these AI models on the data engineering workflow that we didn't discuss yet that you would like to cover before we close out the show? Good question.

Nothing public,

I would say. But I will say, for sure, I think it's gonna be a very,

very exciting,

next couple of years in the data engineering world, especially for those who can bring AI to this unification of data and software. I I think we're gonna see our mantra internally as make data engineering delightful. I think as an industry, we may finally get to the point where data engineering is is a consistently

delightful experience.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as the current biggest gap in the tooling or technology that's available for data management today. The biggest gap is the tooling's not intelligent. Not just on the AI side, but I need too many tools to do too many straightforward things. There's too much repeatable process. I want my tooling to be smarter. Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing at Ascend, the lessons that you've learned, and how you've approached this challenge of bringing AI into the fold as it were. So appreciate all the time and energy that that you're putting into making data engineers' lives easier, and I hope enjoy the rest of your day. Thank you so much, Tobias. Really enjoyed it.

Thank you for listening, and don't forget to check out our other shows.

Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends

and

coworkers.