Summary
In this episode of the Data Engineering Podcast Gleb Mezhanskiy, CEO and co-founder of DataFold, talks about the intersection of AI and data engineering. He discusses the challenges and opportunities of integrating AI into data engineering, particularly using large language models (LLMs) to enhance productivity and reduce manual toil. The conversation covers the potential of AI to transform data engineering tasks, such as text-to-SQL interfaces and creating semantic graphs to improve data accessibility, and explores practical applications of LLMs in automating code reviews, testing, and understanding data lineage.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about applying AI to the work of data engineering
- Introduction
- How did you get involved in the area of data management?
- The "modern data stack is dead" narrative
- Where is AI in the data stack?
- "Buy our tool to ship AI"
- Opportunities for LLMs in the data engineering workflow
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey, and today, I'd like to welcome back Gleb Mezhanskiy, where we're going to talk about the work of data engineering to build AI, to build better data engineering, and all of the things that come out of that idea. So, Gleb, for folks who haven't heard any of your past appearances, if you could just give a quick introduction.
[00:01:07] Gleb Mezhanskiy:
Yeah. Thanks for having me again, Tobias. Always fun to be on the podcast. I am CEO and cofounder of Datafold. We work on automating data engineering workflows, now also with AI. Prior to starting Datafold, I was a data engineer, data scientist, and data product manager, and I got a chance to build three data platforms pretty much from scratch at three very different companies, including Autodesk and Lyft, where I was one of the first founding data engineers and got to build a lot of pipelines and infrastructure and also break a lot of pipelines and infrastructure. And I've always been fascinated by how important data engineering is to the business, in that it unlocks the delivery of the actual applications that are data driven, be that dashboards or machine learning models or, now increasingly, also AI applications.
And at the same time, as a data engineer, I have always been very frustrated with how manual, error prone, tedious, and toilsome my personal workflow was, and I pretty much started Datafold to solve that problem and remove all the manual work from the data engineering workflow so that we can ship high quality data faster and help all the wonderful businesses that are trying to leverage data actually do it. So excited to chat.
[00:02:34] Tobias Macey:
In the context of data engineering, AI, obviously, there's a lot of hype that's being thrown around about, oh, you just rub some AI on it. It'll be magical, and your problems are solved. You don't need to work anymore. It's going to replace all of your junior engineers or whatever the current marketing spin is for it. And it's undeniable that large language models, generative AI, the current era that we're in, has a lot of potential. There are a lot of useful applications of it, but the work to actually realize those capabilities is often a little bit opaque or misunderstood or confusing.
And so there are definitely a lot of opportunities for being able to bring large language models or other generative AI technologies into the context of data engineering work or development environments. But the work of actually getting it to the point where it is more help than hindrance is often where things start to fall apart. And I'm wondering if you can just start from the work that you're doing and the experience you've had of actually incorporating LLMs into some of your product, some of the lessons learned about what are some of those impedance mismatches, what are some of those stumbling blocks that you're going to run into on the path of saying, I've got a model. I've got a problem. Let's put them together.
[00:03:57] Gleb Mezhanskiy:
Yeah. Absolutely. And I think that's a spot-on observation, Tobias, in terms of there's a lot of noise and hype around AI everywhere. But, yeah, we don't have a really clear idea and consensus on how it actually impacts data engineering. And maybe before we dive into, like, okay, what is actually working, it's worth kind of disambiguating and cutting through the noise a little bit. And I've been thinking about this recently, and I think there are probably two main things that everyone gets a bit confused about. One is the confusion of software engineering and data engineering.
Software engineering and data engineering are very related. And in many ways, they are similar. In data engineering, we ultimately also write code that produces some outcome. But unlike software engineering, typically, we're not really building a deterministic application that performs a certain function. We write code that processes large amounts of data. And, usually, that data is highly imperfect. And so we're dealing not just with code. We're dealing also with extremely complex, extremely noisy inputs and, a lot of the time, also unpredictable outputs. And that makes the workflow quite different.
And I think one important distinction is when we see lots of different tools and advancements in tools that are affecting software engineers and impacting their workflows for the better. One example is, I think, over the past year, we've seen an amazing improvement in the kind of Copilot type of support within the software engineering workflow through various tools. We at Datafold, for example, use the Cursor IDE a lot, and we really like how it seamlessly plugs in and enables our engineers working on the application code to just be more productive and spend less time on a lot of, like, boilerplate, toilsome tasks.
And it's really exciting how those tools affect the software engineering workflow. There's also a huge part of the software engineering space right now that is devoted to agents. So, for example, with Cursor, the idea is that you plug it into the IDE at a few touch points for the developer, like code completion, and it kind of sits in the system and helps you mock up and refactor the code. And it's very seamless, but it's still kind of part of the core workflow for a human. And then there's a second school of thought where there's an agent that takes a task that can be very loosely defined and then basically builds an app from scratch, or takes a Jira or Linear ticket and does the work from scratch. And it's also very exciting. I would say, in our experience testing multiple tools, the results there are far less impressive, and the actual impact on the business for us in terms of software engineering has been far less impressive than with the more IDE-native enhancements.
But all of that is to say that while those tools are really impactful for software engineers and there's a lot happening also in other parts of the workflow, we've seen very limited impact of those particular tools on the data engineer's workflow. And the primary reason is that although we're also writing code as data engineers, the tools that are built for software engineers lack very important context about the data. And it is kind of a simple idea and a simple statement, but what's underneath is actually quite a bit of complexity. Because if you think about what a data engineer needs to do in order to do their job, they have to understand not just the code base, but they also have to have a really good grasp on the underlying data that their code base is processing, which is actually a very hard task by itself, starting from understanding what data you have in the first place, how the data was computed, where it's coming from, who is consuming it, and what the relationships between all the datasets are.
And absent that context, the tools that you may have supporting your workflow, yes, they can help you generate the code, but the impact of that will be quite limited relative to how complex your workflow is. And I think that means that for data engineers, we need to see a specialized class of tools that would be dedicated to improving the data engineer's workflow and would excel at doing that by having the context that is critical for a data engineer to do their job. That's, I think, one aspect of the confusion: all the advances in software engineering tools are exciting and inspiring, but it doesn't mean that the data engineer's workflow is now impacted as significantly as the software engineer's workflow.
I think the other type of confusion that I'm seeing is a lot of talk about AI in the data space. And all the vendors you see out there are, I think, smartly positioning themselves as really relevant and essential to the fundamental tectonic shift we've now seen in technology, meaning they try to position themselves as relevant in the world where LLMs are really providing a big opportunity for businesses to improve and grow and automate a lot of business processes. But if you double click into what exactly everyone is saying, it's pretty much: we're going to help you, the data team, the data engineer, ship AI to your business and to your stakeholders. Like, we are the best, you know, workflow engine so that you can get data delivered for AI, or we are the best data quality vendor that will help you ensure the quality of the data that goes into AI, or we have the most integrations with all the vector databases that are important for AI.
And the message that you're getting from all of this, and by no means am I saying this is unimportant, it is definitely important and relevant, is that we're saying, essentially: data engineer, you have so many things to do, and now you also have to ship AI. We're gonna help you ship AI. It's so important that you ship data for AI applications. We are the best tool to help you ship AI. But it almost sounds like this is data engineers in the service of AI. And I think what's really interesting to explore and to unpack, and what I would personally love for myself as a data engineer, is kind of reversing that question and asking, okay, so we now have this fundamental shift in technology, amazing capabilities from LLMs.
How does it actually help me in my workflow? So what does AI for the data engineer look like? And I think we need much more of that discussion, because if we make the people who are actually working on all these important problems more productive with the help of AI, then they will for sure do amazing things with data. And I think that's a really exciting opportunity to explore.
[00:11:10] Tobias Macey:
One of the first and most vocal applications of AI in that context of helping the data engineers, by maybe taking some of the burden off them, that I've seen is the idea of talk to your data warehouse in English, or text to SQL, or whatever formulation it ends up taking, where rather than saying, oh, now you need to build your complicated star or snowflake schema and then build all of the different dashboards and visualizations for your business intelligence, you just put an AI on top of it, and then your data consumers just talk to the AI and say, hey, what was my net promoter score last quarter, or what's my year over year revenue growth, or how much growth can I expect in the next quarter based on current sales? And it's going to just automatically generate the relevant queries. It's going to generate the visualizations for them, and you, as a data engineer or as an analytics engineer, don't need to worry about it anymore.
And from the description, it sounds amazing. It's like, great. K. Job done. I don't need to worry about that toilsome work. I do all of the interesting work of getting the data to where it needs to be, and then the AI does the rest. But then you still have to deal with issues of making sure that you have the appropriate semantic maps so that the AI understands what the question actually means in the context of the data that you have, which is the hardest problem in data anyway, no matter what. So the AI doesn't actually solve anything for you. It just maybe exacerbates the problem, because somebody asks the AI the question, the AI gives an answer, but it's answering it based on a misunderstanding of the data that you have. And so you still have those issues of hallucination, incorrect data, or variance in the way that the data is being interpreted. And I'm wondering what you have seen as far as the actual practical applications of the AI being that simplifying interface versus the amount of effort that's needed to be able to actually make that useful.
[00:13:10] Gleb Mezhanskiy:
Yeah. I think text to SQL is the holy grail of the data space. I would say for as long as I've worked in the space, over a decade, people have really tried to solve this problem multiple times. And, obviously, now in hindsight, it's obvious that pre-LLM, all of those approaches using traditional NLP were doomed. And now that we have LLMs, it seems like, okay, finally, we can actually solve this problem. And I'm very optimistic that it indeed will help make data way more accessible, and I think it eventually will have tremendous impact on how humans interact with data and how data is leveraged. But I think that the how, how it happens and how it's applied, is also very important, because I don't think that the fundamental problem is that people cannot write SQL.
SQL is actually not that hard to write and to master. I think the fundamental issue is that if we think about the life cycle of data in the organization, it's very important to understand that the raw data that gets collected from, you know, all the business systems and all the events and logs and everything we have in a data lake is pretty much unusable. And it's unusable both by machines and AI and by people if we just try to, you know, throw a bunch of queries at it and try to answer really key business questions. And in order for the data to become usable, we need what is currently the job of a data engineer: structuring, filtering, merging, aggregating this data, curating it, and creating a really structured representation of what our business is and what all the entities in the business are that we care about, like customers, products, orders.
So that then this data can be fed into all the applications. Right? Business intelligence, machine learning, AI. And I don't think that text to SQL replaces that, because if we just do that on top of the raw data, we basically get garbage in, garbage out. I do think that in certain applications of it, we can actually get very good results even today if we put that kind of system on top of highly curated, semantically structured datasets. Right? So if we have a number of tables that are well defined and describe how our business works, having a text to SQL interface could actually be extremely powerful, because we know that the questions that are asked and will be translated into code will be answered with data which has already been prepared and structured. And so it's actually quite easy for the system to be able to make sense of it.
But I don't think we are at the point where you just don't need the data team and can just ask a question. It's almost guaranteed that the answer will be wrong. So in that regard, data engineering and data engineers are definitely not going to lose their jobs because it's now easy to generate SQL from text.
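To make that distinction concrete, here is a minimal sketch of text to SQL grounded in a curated semantic layer rather than raw lake tables; the `llm_complete` helper, table names, and metric definitions are hypothetical placeholders, not any particular vendor's API:

```python
# Illustrative sketch only: grounding text-to-SQL in a curated semantic layer
# instead of raw lake tables. `llm_complete` and the table/metric names are
# hypothetical placeholders, not any specific vendor's API.

SEMANTIC_LAYER = {
    "dim_customers": {
        "customer_id": "primary key",
        "segment": "'SMB' or 'Enterprise'",
        "signup_date": "date the customer signed up",
    },
    "fct_orders": {
        "order_id": "primary key",
        "customer_id": "foreign key -> dim_customers.customer_id",
        "order_total": "order value in USD",
        "ordered_at": "order timestamp",
    },
    "metrics": {
        "revenue": "SUM(fct_orders.order_total)",
        "active_customers": "COUNT(DISTINCT fct_orders.customer_id)",
    },
}


def build_prompt(question: str) -> str:
    """Embed the curated model, not the raw lake, so generated SQL can only
    reference well-defined entities and agreed metric definitions."""
    return (
        "You write SQL against the following curated data model.\n"
        f"Tables, columns, and metric definitions: {SEMANTIC_LAYER}\n"
        "Use only these tables and definitions.\n"
        f"Question: {question}\n"
        "Return a single SQL query."
    )


def text_to_sql(question: str, llm_complete) -> str:
    # llm_complete(prompt) -> str is whatever LLM client the team uses.
    return llm_complete(build_prompt(question))
```

The same question asked against the raw lake, without the curated model in the prompt, is where the garbage-in, garbage-out failure mode shows up.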
[00:16:19] Tobias Macey:
And in the context even of that text to SQL use case, what I've been hearing a lot is that it's not even very good at that. One, because LLMs are bad at math, and SQL is just a manifestation of relational algebra, thereby math. But if you bring a knowledge graph into the system, where the AI is using the knowledge graph to understand what the relations are between all the different entities, from which it then generates the queries, it actually does a much better job. But, again, you'd have to build the knowledge graph first. And I think maybe that's one of the places where bringing AI earlier in the cycle is actually potentially useful, where you can use the AI to do some of that rote work of saying, here are all the different representations that I have of this entity or this concept across my different data sources.
Give me a first pass of what a unified model looks like to be able to represent that, given all of the data that I have about it and all the ways that it's being represented. And I'm wondering what you've seen in that context of bringing the AI into that data modeling, data curation workflow, where it's not the end user interacting with it. It's the data engineer using the AI as their copilot, if you will, or as their assistant, to be able to do some of that tedious work that would otherwise be, okay, well, I've got 15 different spreadsheets. I need to visually look across them and try and figure out the similarities and differences, etcetera.
[00:17:50] Gleb Mezhanskiy:
Yeah. That's a good point, Tobias. I have two thoughts there. On how the AI plugs in to actually make text to SQL work, yes, you absolutely need that kind of semantic graph of what datasets you have, how they are related, what all the metrics are, and how those metrics are computed. And in that regard, what's really interesting is the metrics layer that was, at some point, a really hot idea in the modern data stack, probably, you know, three to five years ago. And then everyone was really disappointed with how little impact it actually made on the data team's productivity and just overall on the data stack. It's almost like now it's the metrics layer's time. Because if you take the metrics layer, which gives you a really structured representation of the core entities and the metrics, putting text to SQL on top of it is almost, like, the most impactful thing that you can do, because then you have a structured representation of your data model, which allows the AI to be very, very effective at answering questions while operating on a structured graph.
And so I think we'll see really exciting applications coming out of the hybrid of that kind of fundamental metrics layer semantic graph and text to SQL, and we're already seeing the early impacts of that. But I think over the next two years, it probably will become a really popular way to open up data for stakeholders instead of classical BI with, like, drag-and-drop interfaces and passively consumed dashboards. But then the second point which you made is, basically, can AI actually help us get to that structured representation? And I think absolutely, for the data engineer's workflow. So not for, I would say, a business stakeholder or someone who is a data consumer, but for the data producer, I think that leveraging LLMs to help you build data models, and especially build them faster in the sense of understanding all the semantic relationships, not just writing code, is a very promising area. And that comes back to my point about how software tools are limited in their help for data engineers. Right? I can write SQL, but if my tool does not understand what the relationships between the datasets are, then it can't even help me write joins properly.
And one of the interesting things we've done at Datafold was actually build a system that essentially infers an entity relationship diagram from the raw data that you have, combined with all the ad hoc SQL queries that have been written by people. Previously, that would be a very hard problem to solve. But with the help of LLMs, we can actually have a really good shot at understanding what all the entities are that your business has in your data lake and how they're related. And it's almost like a probabilistic graph, because people can be writing joins correctly or incorrectly, and you have noisy data. And sometimes keys that you think are, like, primary keys or foreign keys are not perfect.
But if you have a large enough dataset of queries that were run against your warehouse, you can actually have a really good shot at understanding what the semantic graph looks like. And the context in which we actually did this was to help data teams build testing environments for their data. But the implications of having that knowledge are actually very powerful. Right? So to your point, we can use those tools to help write SQL. So I'm very bullish on the ability to help data engineers build pipelines by creating a semantic graph without the need for curation. Because previously, that problem was almost pushed onto people with all of the data governance tools. The idea was, let's have data stewards define all the canonical datasets and all the relationships. And, obviously, being purely people powered, this is completely non-scalable.
So now we're finally at the point where we can automate that kind of semantic data mining with LLMs.
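As a rough illustration of the idea (not Datafold's actual implementation), join conditions mined from a query log can be counted to suggest probable key relationships:

```python
import re
from collections import Counter

# Illustrative sketch only: mine join conditions out of a log of ad hoc SQL
# queries to guess which keys relate which tables. A real system would use a
# proper SQL parser and resolve aliases, subqueries, and dialect differences.

JOIN_PATTERN = re.compile(
    r"join\s+(\w+)\s+on\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)",
    re.IGNORECASE,
)


def mine_relationships(query_log):
    """Count how often each table.column = table.column pair appears in joins."""
    edges = Counter()
    for _joined, lt, lc, rt, rc in (m for sql in query_log for m in JOIN_PATTERN.findall(sql)):
        edge = tuple(sorted((f"{lt}.{lc}", f"{rt}.{rc}")))
        edges[edge] += 1
    return edges


queries = [
    "select * from orders join customers on orders.customer_id = customers.id",
    "select segment, sum(total) from orders join customers on orders.customer_id = customers.id group by 1",
    "select * from orders join refunds on refunds.order_id = orders.id",
]

for (left, right), count in mine_relationships(queries).most_common():
    # Join keys that repeat across many queries are strong candidates for
    # edges in the inferred (probabilistic) entity relationship graph.
    print(f"{left} <-> {right}: seen {count} time(s)")
```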
[00:22:11] Tobias Macey:
That brings us back around to another point that I wanted to dig into further in the context of how to actually integrate the LLMs into these different use cases and workflows. You brought up the example of Cursor as an IDE that was built specifically with LLM use cases in mind, juxtaposed with something like VS Code or Vim or Emacs, where the LLM is a bolt-on. It's something that you're trying to retrofit into the experience. And it can be useful, but it requires a lot more effort to actually set it up, configure it, make it aware of the code base that you're trying to operate on, etcetera, versus the prepackaged product.
And we're seeing that same type of thing in the context of data, where you mentioned there are all these different vendors of, oh, hey, we're gonna make it super easy for you to make your data ready for AI or use this AI on your data. But most teams already have some sort of system in place, and they just wanna be able to retrofit the LLM into it to be able to start getting some of those gains, with the eventual goal of having the LLM maybe be a core portion of their data system, their data product. And I'm wondering, in that process of bringing in an LLM, retrofitting it onto an existing system, whether that be your code editor, your deployment environment, your data warehouse, what have you.
What are some of those impedance mismatches or some of the issues in conceptual understanding about how to bring the appropriate, I'm gonna use the word knowledge, even though it's a bit of a misnomer, into the operating memory of the LLM so that it can actually do the thing that you're trying to tell it to do?
[00:23:55] Gleb Mezhanskiy:
Yeah. That's a great question, Tobias. I think that to answer this, we kinda need to go back to what the jobs to be done are for a data engineer and what the data engineer workflow actually looks like. And if we were to visualize it, it actually looks quite similar to the software engineering workflow in just the types of tasks that a data engineer does day to day to do their work. And by the way, we're saying data engineer as sort of a blanket label, but I don't necessarily mean just people who have data engineering in the title, because all the roles that work with data, including data scientists, analysts, analytics engineers, and, in many cases, software engineers, a lot of them actually do data engineering in terms of building and developing pipelines as part of their job. It's just that data engineers probably do this, you know, the majority of their time, and if I'm a data analyst or data scientist, I would be doing this maybe 40% of my week. And so if we think about what I need to do to, let's say, ship a new data model like a table, or extend an existing data model, you know, refactor definitions or add new types of information into an existing model, it starts with planning. Right? So I'm doing planning.
I'm trying to find the data that I need for my work. And a lot of the time, that information can be sourced from documentation, from a data catalog. I think right now, the data catalog, giving you the sense of, like, what datasets I have and what the profile of those datasets is, has been largely solved. There are great tools. You know, some are open source. Some are vendors. But overall, understanding what datasets you have is now way easier than it was five years ago. You also probably are consulting your tribal knowledge, and you go to Slack and you do, like, a search for certain definitions. And that's also now largely solved with a lot of the enterprise search tools. And then you go into writing code.
And writing code, I think this is also an important misconception. Like, if you are not really, you know, doing this for a living, you think that people spend most of their time actually writing SQL, in terms of, like, writing SQL for production. And in my experience, the actual writing of the SQL or other types of code is maybe, like, 10 to 15% of my time, whereas all the operational tasks around it, testing it, talking to people to get context, doing code reviews, shipping it to production, monitoring it, remediating issues, talking to more people, is where the bulk of the work is happening.
And if that's true, then that means that, as we talk about automation, these operational workflows are probably where the bulk of the lift coming from LLMs can actually happen. And so for actually writing code as a data engineer, I would still recommend probably using the best-in-class software tools these days, like Cursor. Even though it's not aware of the data, it will probably still help you write a lot of boilerplate and will speed up your workflow somewhat. Or you can use other IDEs with Copilot, like VS Code plus Copilot. I think those tools will just help you speed up the writing of the code itself. But back to the operational workflows that I think take the majority of the time within any kind of cycle of shipping something. When it comes to what happens after you wrote the code, typically, if you have people who care about the quality of the data, it means that you have to do a fair amount of testing of your work.
And testing is both about making sure that my code is correct. Right? Does it conform to the expectations? Does it produce the data that I expect? But it's also about understanding potential breakages. Data systems are historically fragile in the sense that you have layers and layers of dependencies that are often opaque, because I can be changing some definition of what an active user is somewhere in the pipeline, but then I can be completely oblivious to the fact that 10 jobs down the road, someone built a machine learning model that consumes that definition and tries to automate certain decisions, for example, around spend, using that metric. And so if I'm not aware of those downstream dependencies, I could actually be causing a massive business disruption just by the sheer fact of changing it. And so the testing that involves not just understanding how the data behaves, but also how the data is consumed and what the larger business implications are of making any kind of modification to the code is where a ton of time is spent in data engineering. And what's interesting is that this is the use case where, historically, we at Datafold spent a lot of time thinking, even pre-AI. And before LLMs were a thing, what we did there was come up with the concept of data diffing. And the idea is everyone can see a code diff. Right? My code looked like this before I made a change. Now it's a different set of characters that the code looks like. And diffing the code is something that is, like, embedded in GitHub. Right? You can see the diff. But the very hard question is understanding how the data changes based on the change in the code, because that is not obvious. That happens once you actually run the code against the database. And so data diff allows you to see the impact of a code change on the data. And that by itself was quite impactful, and we've seen a lot of teams adopt that, you know, large enterprise teams, fast-moving, you know, software startup teams. But we were not fully satisfied with the degree of automation that feature alone produced, because people are still required to, like, sift through all the data diffs and explore them for multiple tables and see how the downstream impacts propagate through lineage.
And it felt like, okay, now at least we can give people all the information, but they still have to sift through a lot of it, and some of the important details can be missed. And the big unlock that LLMs bring to this particular workflow, once LLMs became pretty good at comprehending code and actually semantically understanding the code, which pretty much happened over 2024 with the latest generation of foundational large language models, is that we were able to do two things. One, take a lot of information and condense it into, like, three bullet points, kind of like an executive summary. And those bullet points are essentially helping the data engineer understand, on a high level, what the most important impacts are that I need to worry about for any given change, and helping a code reviewer understand the same. And that just helps people get on the same page very quickly, saving a lot of time that otherwise would be spent in meetings, going back and forth, you know, putting comments on a code change. And the second unlock that we've seen is the opportunity to drill down and explore all the impacts and do the testing by, essentially, chatting with your pull request, chatting with your code. And that comes in the form of a chat interface where you're basically speaking to an agent that has the full context of your code, full context of the data change, the data diff, and also full context of your lineage, so that it can actually understand how every line of code that was modified is affecting the data and what that means for the business.
And you can ask questions, and it produces the answers way faster than you would by essentially looking at all the different, you know, code changes and data diffs. And that ended up saving a lot of time for data teams. And now that I'm describing this, you kind of feel that it sounds almost like having a buddy that just, like, helps you think through the code, almost like having a code reviewer, except with AI. With an LLM, this is a buddy that's always available to you twenty-four seven and probably makes fewer mistakes, because it has all the context and can sift through a lot of information really quickly. So that's an example of how an LLM could be applied to an operational use case that has historically been really time consuming, and take a lot of manual work out of that context.
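For a concrete sense of the mechanics described here, a heavily simplified sketch follows; `run_query` and `llm_complete` are hypothetical stand-ins for a warehouse client and an LLM client, and this is not the Datafold product itself:

```python
# Illustrative sketch only: diff the output of the main-branch and dev-branch
# versions of a model, then ask an LLM to condense the result into a short
# summary for the pull request. `run_query` and `llm_complete` are placeholders.

def diff_metric(run_query, table_main, table_dev, group_by, metric_expr):
    """Return rows where an aggregate disagrees between the two versions."""
    sql = f"""
        with main as (select {group_by} as k, {metric_expr} as v from {table_main} group by 1),
             dev  as (select {group_by} as k, {metric_expr} as v from {table_dev}  group by 1)
        select coalesce(main.k, dev.k) as key, main.v as main_value, dev.v as dev_value
        from main full outer join dev on main.k = dev.k
        where main.v is distinct from dev.v
    """
    return run_query(sql)


def summarize_for_review(llm_complete, code_diff, changed_rows):
    """Turn the raw code diff plus data diff into three reviewer-facing bullets."""
    prompt = (
        "You are reviewing a data pipeline change.\n"
        f"Code diff:\n{code_diff}\n"
        f"Rows where the output changed (main vs dev): {changed_rows}\n"
        "In three bullet points, explain the most important impacts a reviewer "
        "should know about before approving."
    )
    return llm_complete(prompt)
```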
[00:32:13] Tobias Macey:
And I really wanna dig into that one word that you said probably at least a half dozen times, if not a couple of dozen, which is context. That, I think, is the key piece that is so critical, and also probably the most difficult portion of making AI useful is context. What context does it need? How do you get that context to it? How do you model that context? How do you keep it up to date? And so I think that really is where the difference comes in between the Cursor example that we touched on earlier versus retrofitting onto Emacs or whatever your tool or workflow of choice is: how do you actually get the context to the place that it needs to be? And so you just discussed the use case that you have of being able to use the LLM in that use case of interpreting the various data diffs and understanding what the actual ramifications of this change are. And I'm wondering if you can just talk through some of the lessons learned about how you actually populate and maintain that context and how you're able to instruct the LLM how to take advantage of the context that you've given it?
[00:33:21] Gleb Mezhanskiy:
That's a great question, Tobias. And I think what's interesting is that, at face value, it seems like you wanna throw all the information you have at the LLM. Right? Just, like, tell it everything and then let it figure things out. And, in fact, it is obviously not as easy as that. In fact, it's actually counterproductive to oversupply the LLM with context, in part because the context window of large language models is limited. And the trade-off there is, one, you just, like, can't physically fit everything. And, two, even if you were dealing with a model that actually is designed to have a very large context window, if you overuse it and supply too much information, the LLM just gets lost. It starts being far less effective at understanding what's actually important versus not, and the overall effectiveness of your system goes down.
So back to your question of, like, what is the actual information that is important to provide as context to the LLM? It really depends on what workflow we're talking about. In the context of code review and testing, we are trying to fundamentally answer the question of, one, if we changed the code, was the change correct relative to what we tried to do, what the task was, or did we not conform to the business requirement? The second question is, did we follow the best practices, such as, you know, code guidelines and performance guidelines, or not? And the third question is, okay, let's say we conformed to the business requirements and we did a good job of following our coding best practices, but we may still cause a business disruption just by making a change that can be a surprise either for a human consumer of data downstream or could throw off a machine learning model that was trained on a different distribution of the data. Right? And so these are the three fundamental questions that we try to answer. And by the way, even without AI, that's what a good code review done by humans would ultimately accomplish.
So what is the context that is important for the LLM to have here? First, obviously, it is the code diff. Right? We already know what the original code was and what the new code is. And feeding that into the LLM is really important so that it can understand, okay, what are the actual changes in the code itself, in the logic. And I won't go into the details here, because, obviously, the code base can be very large and sometimes your PR can touch a lot of code, so you have to be quite strategic in terms of how you feed that in on the technical side. But conceptually, that's what we have to provide as input number one. The second important input is the data diff. Right? It's understanding, if I have the main-branch version of the code, what data it produces and what the metrics show. And then if I have a new version of the code, let's call it the developer branch, what data it produces and what the difference in the output is.
Let's say, with my main-branch code, I see that I have 37 orders on Monday. But with the new version of the code, I see that I have 39. And so that already tells me, okay, this is the important impact on the output data and on the metrics. And that's important both at the value level, understanding how the individual cells, rows, and columns are changing, but it's also important to do roll-ups and understand what the impact on metrics is. And coupling that context with the code diff allows us to understand how changes in the code affect the actual data output. And the third really important aspect is the lineage. Lineage is fundamentally understanding how the data flows throughout your system, how it's computed, how it's aggregated, and how it's consumed.
And the lineage is a graph, and there are two directions of exploration. One of them is upstream, which helps us understand how the data got to the point where you're looking at it. Right? So, for example, if I'm looking at the number of orders and I'm changing a formula, where does the information about orders come from in the first place? And that is important because it can tell us a lot about how a given metric is computed and what the source of truth is. Are we getting it from Salesforce? Are we getting it from our internal system? And then the downstream lineage is also important because it tells us how the data gets consumed, and that is absolutely essential information that can help us understand what downstream systems and metrics will be affected. And the lineage graph in itself can be very complex, and building it is actually a tough problem, because you have to essentially scrape all of your data platform information, all the queries, all the BI tools, to understand how data flows and how it's consumed and produced. But let's say you have this lineage graph. It's actually also a lot of information by itself. And so to properly supply that lineage information into an LLM context, you actually kind of need your system to be able to explore the lineage graph on its own to see, like, okay, if the developer made a change here, what are the important downstream implications of that? So now we're talking about the system being able to traverse that and do analysis on its own for the context. I would say these are the three most important types of context. And then the fourth one is kind of optional. If your team has any kind of best practices, SQL linting rules, documentation rules, you can also provide them as context, and then your AI code reviewer assistant can help you reason about, well, did you conform or not? And if not, make suggestions about what to correct, eventually, probably, going in and correcting your code itself. I think that's ultimately where this is going. But, again, it would pretty much be operating on the same set of input context.
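Pulling those pieces together, here is a minimal sketch of how the code diff, data diff, a bounded lineage slice, and optional guidelines might be assembled into one review prompt; all of the names are illustrative assumptions:

```python
# Illustrative sketch only: assemble the three (plus one optional) kinds of
# context described above into a single review prompt, walking only the
# downstream slice of lineage the change can reach. All names are made up.

def downstream_slice(lineage, changed_model, max_depth=3):
    """lineage maps model -> list of direct downstream consumers."""
    frontier, reached = [changed_model], []
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for child in lineage.get(node, []):
                if child not in reached:
                    reached.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return reached


def build_review_prompt(code_diff, data_diff, lineage, changed_model, guidelines=""):
    impacted = downstream_slice(lineage, changed_model)
    prompt = (
        "Review this data pipeline change.\n"
        f"1. Code diff:\n{code_diff}\n"
        f"2. Data diff (main vs dev output):\n{data_diff}\n"
        f"3. Downstream assets that consume {changed_model}: {impacted}\n"
    )
    if guidelines:
        prompt += f"4. Team guidelines:\n{guidelines}\n"
    prompt += (
        "Answer: is the change correct for its stated intent, does it follow "
        "the guidelines, and which downstream consumers could be disrupted?"
    )
    return prompt


lineage = {"stg_orders": ["fct_orders"], "fct_orders": ["orders_dashboard", "ltv_model"]}
print(build_review_prompt("-- code diff --", "-- data diff --", lineage, "stg_orders"))
```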
[00:39:13] Tobias Macey:
Another interesting element of bringing LLMs into the context of the data engineering workflow and use case, one is the privacy aspect, which is a whole other conversation. I don't wanna get too deep into that quagmire. But, also, when you're working as a data engineer, one of the things you need to be thinking about is, what is my data platform? What are the tools that I rely on? What are the ways that they link together? And if you're going to rely on an LLM or generative AI as part of that tool chain, how does that fit into that platform? What is some of the scaffolding? What are some of the workflows? What is some of the custom development that you need to do? Where a lot of the first-pass and naive use cases for generative AI and LLMs is, oh, well, just go and open up the ChatGPT UI, or just go run LM Studio, or use Claude, or what have you. But if you want to get into anything sophisticated where you're actually relying on this as a component of your workflow, you want to make sure that it's customized, that you own it in some fashion.
And so that is likely going to require doing some custom development using something like LangChain or LangGraph or CrewAI or whatever, where you're actually building additional scaffolding logic around just that kernel of the LLM. And I'm curious how you're seeing some of the needs and use cases of incorporating the LLM more closely into the actual core capabilities of the data platform through that effort of customization and software engineering.
[00:40:45] Gleb Mezhanskiy:
That's a great point, Tobias. I think that the models themselves are getting rapidly commoditized in the sense that the capabilities of the foundational large language models and their interfaces are very similar. We're seeing a race between the companies training those models in terms of beating each other on benchmarks. It looks like the whole industry is converging on adding more reasoning, and the way that's happening is also converging toward the same experience, and the difference is, like, who is doing this better. Right? Who is beating the benchmarks? Who provides the cheaper inference, the faster inference, more intelligence for the same price? And to that end, I don't think that the differentiation or the effectiveness of whatever automation you're trying to bring really depends on the choice of the model. Maybe for certain narrow applications, choosing a more specialized model or fine-tuning a model would be more applicable. But still, I don't think the model is where the magic really happens these days.
The model is important for the magic, but it's not something that allows you to build a really effective application just by, you know, choosing something better than what's available to everyone else. The actual magic and the value add and the automation happen in how you leverage that model in your workflow. So all the orchestration in terms of how you prompt the model, what kind of context you provide, how you tune the prompt, how you tune the inputs, how you evaluate the performance of the model in production, how you make various LLM-based actors that may be playing different roles interact with each other. That is where the hard work is happening, and that is where I think the actual value and impact are created. And that's where all the complexity is. So I think you don't have to be, you know, a PhD and really understand how the models are trained. Although, I would say, just like in computer science, it's obviously very helpful to understand how these models are trained and their architectures and their trade-offs. But you don't have to be good at, you know, training those models in order to effectively leverage them. But to leverage them, you have to do a lot of work to effectively plug them into the workflows. And I think that the applications and companies and teams that are thinking about what the workflow is, what the ideal user interface is, what all the information is that we can gather to make the LLM do a better job, and that are able to rapidly iterate, will ultimately create the most impact with LLMs.
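As one small example of that orchestration and evaluation work, here is a sketch of a fixed-case eval loop for an LLM-backed step; the cases and the `llm_complete` client are assumptions for illustration:

```python
# Illustrative sketch only: a tiny regression-style evaluation loop for an
# LLM-backed step. The cases and `llm_complete` are placeholders; the point is
# that every prompt or model change gets scored against the same fixed set.

EVAL_CASES = [
    {"change": "renamed a column in the orders staging model", "must_mention": "downstream"},
    {"change": "switched the revenue metric from gross to net", "must_mention": "metric"},
]


def run_eval(llm_complete, prompt_template):
    """Return the fraction of cases whose answer mentions the required keyword."""
    passed = 0
    for case in EVAL_CASES:
        answer = llm_complete(prompt_template.format(change=case["change"]))
        if case["must_mention"].lower() in answer.lower():
            passed += 1
    return passed / len(EVAL_CASES)


# Compare two prompt variants before shipping either:
# score_a = run_eval(llm_complete, "Summarize the risks of this change: {change}")
# score_b = run_eval(llm_complete, "List who is impacted when: {change}")
```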
[00:43:31] Tobias Macey:
And so on that note, in your experience of working with the LLMs, working with other data teams, and keeping apprised of the evolution of the space, what are some of the most interesting or innovative or unexpected ways that you've seen teams bring LLMs into that inner loop of building and maintaining and evolving their data systems?
[00:43:52] Gleb Mezhanskiy:
I think the realization that is obvious in hindsight, but not necessarily obvious when you're just starting, is that no one really knows how to ship LLM- and AI-based applications. There are, obviously, you know, guides and tutorials, and still, there's a lot you can learn from looking at what people are doing, but the field is evolving so fast that nothing replaces fast experimentation and just building things. It's not that you can just hire someone who worked on building an LLM-based application, like, six months ago or a year ago and all of a sudden, you, you know, gain a lot of advantage, as you would with many other technologies. Like, you know, if we were working in the space of video streaming, it would be very beneficial to have extensive experience working with video streaming and codecs. With LLMs, no one really knows exactly how they behave. Even the companies that are shipping them are discovering more and more novel ways of leveraging them more effectively every week.
And for the teams that are leveraging LLMs, like Datafold, the thing that we found matters the most is the ability to, a, just stay on top of the field and understand what the most exciting things are that people are doing, how they relate to our field, and how we can borrow some of those ideas. But most important is rapid experimentation with some sort of methodology that allows you to try new things, measure results quickly, and then be able to scrap the approach that you thought was great and just go with a different one. Because a lot of times when a new model is released, you have to adjust a lot of things. You have to adjust the prompts. You have to even rearchitect some of the flows that you've built.
And that is both difficult but also incredibly exciting because the pace of innovation and what is possible to solve is evolving extremely fast. I would say the fastest of any previous technological wave of disruption that we've seen.
[00:46:17] Tobias Macey:
In your experience and in your work of investing in this space, figuring out how best to apply LLMs to the problems facing data engineers and how to incorporate that into your products, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:46:34] Gleb Mezhanskiy:
Yeah. I think that the interesting realization was that, specifically for the data engineering domain, if you just take the problem at face value, you think, well, let's just build a copilot or an agent that would kind of try to automate the data engineer away. And I don't think we have the tech ready for an agent to just, like, really take a task and run with it yet. I don't think it's been solved in the software space. I think it's, in some ways, even harder to solve in the data space. We'll eventually get there. I don't think we are there yet. And I don't think that the biggest impact you can make on the data engineering workflow is, again, having a copilot, because that's not where data engineers spend most of their time; it's not, like, writing production code. It's all the operational tasks. And there are certain kinds of problems in the data engineering space where it's not even day-to-day help where you save, like, an hour, two hours, three hours.
But there are certain types of workflows where, to complete a task, a team needs to spend, like, ten thousand hours. And a good example of such a project would be a data platform migration where, for example, you have millions of lines of code on a legacy database. You have to move them over to a new, modern data warehouse. You have to refactor them, optimize them, repackage them into a new kind of framework. Right? You may be moving from, like, stored procedures on Oracle to dbt plus Databricks. And doing that requires a certain number of hours for every object. And because you're dealing with a large database, at the enterprise level that sums up to an enormous amount of work.
And, historically, these projects would last years and be done, a lot of the time, by outsourced talent from, you know, consultants or SIs. And for a data engineer, that's, like, probably one of the most miserable projects to do. I've led such a project at Lyft, and it was an absolute grind, where you're not shipping new things. You're not shipping AI. You're not shipping even data pipelines. You're just, like, solving technical debt for years. And what's interesting is that those types of projects and workflows are actually, I would say, where AI and LLMs can make the most impact today, because we can take a task.
We can reverse engineer it. We know exactly what the target is: you move the code, you do all of these things with the code, and, ultimately, the data has to be the same. Right? You're going through multiple complex steps, but what's important for the business is that once you move from, let's say, you know, Teradata to Snowflake, your output is the same, because, otherwise, the business wouldn't accept it. And that allows us to, a, leverage LLMs for a lot of the tasks that are historically manual, but also to have a really clear objective function for the LLMs, like, diffing the output of the legacy system against the modern system and using it as a constraint.
And if you put those two things together, you have a very powerful system that is, a, extremely flexible and scalable thanks to LLMs, but also can be constrained to a very objective definition of what's good, unlike a lot of this text to SQL generation that cannot be constrained to a definition of what's good. Because, like, how do you know? But by the end of a migration, you do know. And that allows AI to make a tremendous impact on the productivity of a data team by essentially taking a project that would last for years, cost millions of dollars, and go over budget, and compressing that into weeks at, you know, just a fraction of the price. I think that is where we can see real impact of AI that's, like, useful. It's working.
And we also see the parallels in the software space as well. A lot of the really thoughtful enterprise applications of AI are actually taking these legacy code bases and, you know, helping teams maintain them or migrate them. And I think that there are more opportunities like that in the data engineering space where we'll see AI make tremendous impacts.
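To illustrate the "data as the objective function" idea, here is a minimal sketch of validating a migrated table by running the same aggregates on both systems; the connections, table, and columns are hypothetical:

```python
# Illustrative sketch only: use the data itself as the acceptance test for a
# migration. `legacy_query` and `new_query` stand in for clients connected to,
# say, Teradata and Snowflake; the table and column names are made up.

CHECKS = {
    "row_count": "select count(*) from {table}",
    "sum_order_total": "select sum(order_total) from {table}",
    "distinct_customers": "select count(distinct customer_id) from {table}",
}


def validate_migration(legacy_query, new_query, table):
    """Run the same aggregates on both systems and report which ones match."""
    results = {}
    for name, sql in CHECKS.items():
        results[name] = legacy_query(sql.format(table=table)) == new_query(sql.format(table=table))
    return results


# A translated model only "passes" when every check matches:
# assert all(validate_migration(teradata_query, snowflake_query, "fct_orders").values())
```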
[00:51:03] Tobias Macey:
And as you continue to keep in touch with the evolution of the space, work with data teams, and evaluate the cases where LLMs are beneficial versus where you're better off going with good old human ingenuity, what are some of the things you're keeping a particularly close eye on, or any projects or problem areas you're excited to explore?
[00:51:27] Gleb Mezhanskiy:
In terms of where I think LLMs would really make a huge impact on the workflow?
[00:51:33] Tobias Macey:
Just LLMs in general, how to apply them to data engineering problems, how to incorporate them more closely and with less legwork into the actual problem solving apparatus of an organization.
[00:51:46] Gleb Mezhanskiy:
Yeah. So I think that on multiple levels, there are a lot of exciting things. Like, for example, being able to prompt an LLM from SQL as a function call, which is available these days in modern data platforms, is incredibly impactful. Right? Because in many instances, we're dealing with extremely massive data, and instead of having to write, like, complex CASE WHEN statements and regexes and, like, UDFs to be able to clean the data, to classify things, and to untangle the mess, we can now apply LLMs from within SQL, from within the query, to solve that problem. And that is incredibly impactful for a whole variety of different applications. So I'm very excited about all these capabilities that are now, you know, brought by the major data platforms like, you know, Snowflake, Databricks, BigQuery.
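As a sketch of what that looks like in practice, the query below classifies raw text from inside SQL; the exact function name and model vary by platform, so treat this as an assumption to check against your warehouse's docs:

```python
# Illustrative sketch only: calling an LLM from inside SQL to classify messy
# text instead of hand-writing CASE statements, regexes, and UDFs. Function
# names differ by platform (for example, SNOWFLAKE.CORTEX.COMPLETE on Snowflake
# or ai_query on Databricks); treat the call below as an assumption and check
# your warehouse's documentation. `run_query` stands in for your DB client.

CLASSIFY_TICKETS_SQL = """
select
    ticket_id,
    snowflake.cortex.complete(
        'llama3-8b',
        'Classify this support ticket as billing, bug, or feature request. '
        || 'Reply with one word only: ' || ticket_text
    ) as category
from raw.support_tickets
limit 100
"""


def classify_tickets(run_query):
    # One SQL statement replaces what used to be a pile of regexes and UDFs.
    return run_query(CLASSIFY_TICKETS_SQL)
```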
Then, if we go into the workflow itself, like, what does a data engineer do and how do we make that work better? I think there's a ton of opportunity to further automate a lot of tasks. A big one is data observability and monitoring. I honestly think that data observability in its current state is a dead end in terms of, like, let's cover all the data with alerts and monitors and then be the first to know about any anomalies. It's useful, but then it quickly leads to a lot of noise, alert fatigue, and ultimately could even be net negative on the workflow of a data engineer.
I think that this is the type of workflow where putting an AI to work investigating those alerts, doing the root cause analysis, and potentially the remediation is where I see a lot of opportunity for saving a ton of time for the data team while also improving the SLAs and the overall quality of the output of a data engineering team. And that's something that we are really excited about, something we're working on at Datafold, and we are excited about it coming later this year.
[00:53:56] Tobias Macey:
Are there any other aspects of this overall space of using LLMs to improve the lives of data engineers, and the work that data engineers can do to improve the effectiveness of those LLMs, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:12] Gleb Mezhanskiy:
I think that, you know, we talked a lot about the workflow improvements. I think that, overall, my recommendation to data engineers today would be to learn how to ship LLM applications. It's not that hard. Frameworks like LangChain make it very easy to compose multiple blocks together and ship something that works. Whether or not you end up using LangChain or another framework in production, and whether your, you know, team allows that, doesn't really matter, but it's really, really useful to try to build and learn all the components.
And it's just like software engineering. You know? Learning how to code opens up so many opportunities for you to solve problems. Right? You see a problem and you're like, I can write a Python script for that. And I think that with LLMs, it's almost like a new skill that both software engineers and data engineers need to learn, where you see a problem and you think, okay, I actually think I can split this problem into three tasks that I can give to an LLM. Like, one could be extraction, one could be, like, reasoning, and one classification. And now it just solves the problem.
But really learning how to build, and trying, helps you build that intuition. And so my recommendation for all data engineers listening to this is: try to build your own application that solves either a business problem or helps you in your own workflow, because knowing how to build with LLMs just gives you tremendous superpowers and will definitely be helpful in your career in the coming years.
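In that spirit, here is a minimal sketch of decomposing one small problem into an extraction step and a classification step; the prompts and the `llm_complete` client are placeholders for whatever you would actually use:

```python
# Illustrative sketch only: the "split a problem into a few LLM tasks" idea --
# extract fields first, then classify -- chained in plain Python. The prompts
# and `llm_complete` client are placeholders you would swap for your own.

def extract_fields(llm_complete, raw_text):
    return llm_complete(
        "Extract the vendor name, amount, and date from this invoice text. "
        f"Return them as a short comma-separated list.\n{raw_text}"
    )


def classify_expense(llm_complete, fields):
    return llm_complete(
        "Given these invoice fields, classify the expense as travel, software, "
        f"or other. Reply with one word.\nFields: {fields}"
    )


def process_invoice(llm_complete, raw_text):
    fields = extract_fields(llm_complete, raw_text)
    return {"fields": fields, "category": classify_expense(llm_complete, fields)}
```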
[00:55:52] Tobias Macey:
I definitely would like to reinforce that statement, because despite the AI maximalists and the AI skeptics, no matter what you think about it, LLMs aren't going anywhere. They're going to continue to grow in their usage and their capabilities, so it's worth understanding how to use them and investing in that skill, because it is going to be one of those core tools in your toolbox for many years to come. And so for anybody who wants to get in touch with you and follow along with the work that you are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your current perspective on the biggest gap in the tooling or technology for data management today.
[00:56:35] Gleb Mezhanskiy:
I think that there's a lot of kind of skepticism and some bitterness around kind of modern data stack failed us in a sense that we were so excited that more data stack will make things so great five years ago, and we're kind of disappointed. And I think that I'm an optimist here. I think that modern data stack in the sense of infrastructure and getting a lot of the fundamental challenges out of the way, like running queries and getting data in and out of different databases and visualizing the query outputs and having amazing null books.
All of that that we now take for granted is actually so great relative to where we were, you know, five, seven, eight, ten years ago. I don't think it's enough. So I think that, I am with the data practitioners for, like, well, it's 01/25. We have all these amazing models. Why is it still so hard to ship data? Absolutely with you. And I think what I'm excited about is now that we have this really great foundation with modern data stack in the sense of infrastructure, I'm excited about, one, getting everyone on modern data stack to the point of migrations. Right? Let's get everyone on modern infrastructure so that they can ship faster. Obviously, a problem that I'm really passionate about in solving and working.
Second, once you are on the modern data infrastructure, how to keep modernizing your team's workflows so that data engineers are spending more and more time on solving hard problems and thinking and planning on the valued activities that are really worth their time and less and less on operational toil that just is burnout inducing and keeps everyone back. So I'm excited about the modern data stack renaissance, thanks to the fundamental capabilities of large language models.
[00:58:30] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and sharing your thoughts and experiences around building with LLMs to improve the capabilities of data engineers. It's definitely an area that we all need to be keeping track of and investing some time into. So I appreciate the insights that you've been able to share, and I hope you enjoy the rest of your day.
[00:58:50] Gleb Mezhanskiy:
Thank you so much, Tobias.
[00:58:59] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.net covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Your host is Tobias Macey, and today I'd like to welcome back Gleb Mezhanskiy, where we're going to talk about the work of data engineering to build AI, using AI to build better data engineering, and all of the things that come out of that idea. So, Gleb, for folks who haven't heard any of your past appearances, if you could just give a quick introduction.
[00:01:07] Gleb Mezhanskiy:
Yeah. Thanks for having me again, Tobias. Always fun to be on the podcast. I'm Gleb. I am CEO and cofounder of Datafold. We work on automating data engineering workflows, now also with AI. Prior to starting Datafold, I was a data engineer, data scientist, and data product manager, and I got a chance to build three data platforms pretty much from scratch at three very different companies, including Autodesk and Lyft, where I was one of the first founding data engineers and got to build a lot of pipelines and infrastructure, and also break a lot of pipelines and infrastructure. And I've always been fascinated by how important data engineering is to the business, in that it unlocks the delivery of the actual applications that are data driven, be that dashboards or machine learning models or, now increasingly, also AI applications.
And at the same time, as a data engineer, I have always been very frustrated with how manual, error prone, tedious, and toilsome my personal workflow was, and I pretty much started Datafold to solve that problem and remove all the manual work from the data engineering workflow so that we can ship high quality data faster and help all the wonderful businesses that are trying to leverage data actually do it. So excited to chat.
[00:02:34] Tobias Macey:
In the context of data engineering, AI, obviously, there's a lot of hype that's being thrown around about, oh, you just rub some AI on it. It'll be magical, and your problems are solved. You don't need to work anymore. It's going to replace all of your junior engineers or whatever the current marketing spin is for it. And it's undeniable that large language models, generative AI, the current era that we're in, has a lot of potential. There are a lot of useful applications of it, but the work to actually realize those capabilities is often a little bit opaque or misunderstood or confusing.
And so there are definitely a lot of opportunities for being able to bring large language models or other generative AI technologies into the context of data engineering work or development environments. But the work of actually getting it to the point where it is more help than hindrance is often where things start to fall apart. And I'm wondering if you can just start from the work that you're doing and the experience you've had of actually incorporating LLMs into some of your product, some of the lessons learned about what are some of those impedance mismatches, what are some of those stumbling blocks that you're going to run into on the path of saying, I've got a model. I've got a problem. Let's put them together.
[00:03:57] Gleb Mezhanskiy:
Yeah. Absolutely. And I think that's a spot-on observation, Tobias, in terms of there being a lot of noise and hype around AI everywhere. But we don't have a really clear idea and consensus on how it actually impacts data engineering. And maybe before we dive into, like, okay, what is actually working, it's worth kind of disambiguating and cutting through the noise a little bit. I've been thinking about this recently, and I think there are probably two main things that everyone gets a bit confused about. One is the confusion of software engineering and data engineering.
Software engineering and data engineering are very related, and in many ways, they are similar. In data engineering, we ultimately also write code that produces some outcome. But unlike software engineering, typically, we're not really building a deterministic application that performs a certain function. We write code that processes large amounts of data. And, usually, that data is highly imperfect. And so we're dealing not just with code. We're dealing also with extremely complex, extremely noisy inputs and, a lot of the time, also unpredictable outputs. And that makes the workflow quite different.
And I think one important distinction is when we see lots of different tools and advancements in tools that are affecting software engineers and impacting their workflows for the better. One example is, I think, over the past year, we've seen amazing improvement in the kind of Copilot type of support within the software engineering workflow through various tools. We at Datafold, for example, use the Cursor IDE a lot, and we really like how it seamlessly plugs in and enables our engineers working on the application code to just be more productive and spend less time on a lot of, like, boilerplate, toilsome tasks.
And it's really exciting how those tools affect the software engineering workflow. There's also a huge part of the software engineering space right now that is devoted to agents. So, for example, with Cursor, the idea is that you plug it into the IDE at a few touch points for the developer, like code completion, and it kind of sits in the system and helps you mock up and refactor the code. It's very seamless, but it's still part of the core workflow for the human. And then there's a second school of thought where there's an agent that takes a task that can be very loosely defined and then basically builds an app from scratch, or takes a Jira or Linear ticket and does the work from scratch. And that's also very exciting. I would say, in our experience testing multiple tools, the results there are far less impressive, and the actual impact on the business for us in terms of software engineering has been far less impressive than with the more IDE-native enhancements.
But all of that is to say that while those tools are really impactful for software engineers, and there's a lot happening also in other parts of the workflow, we've seen very limited impact of those particular tools on the data engineer's workflow. And the primary reason is that although we're also writing code as data engineers, the tools that are built for software engineers lack very important context about the data. It is kind of a simple idea and a simple statement, but what's underneath is actually quite a bit of complexity. Because if you think about what a data engineer needs to do in order to do their job, they have to understand not just the code base, but they also have to have a really good grasp on the underlying data that their code base is processing, which is actually a very hard task by itself, starting from understanding what data you have in the first place, how the data was computed, where it's coming from, who is consuming it, and what the relationships between all the datasets are.
And absent that context, the tools that you may have supporting your workflow, yes, they can help you generate the code, but the impact of that will be quite limited relative to how complex your workflow is. And I think that means that for data engineers, we need to see a specialized class of tools that would be dedicated to improving data engineers' workflow and would excel at doing that by having the context that is critical for a data engineer to do their job. That's kind of, I think, one aspect of the confusion. All the advances in software engineering tools are exciting and inspiring, but it doesn't mean that the data engineer's workflow is now impacted as significantly as the software engineer's workflow.
I think the other type of confusion that I'm seeing is a lot of talk about AI in the data space. And all the vendors you see out there are, I think, smartly positioning themselves as really relevant and essential to the fundamental tectonic shift we've now seen in technology, meaning they try to position themselves as relevant in the world where LLMs are really providing a big opportunity for businesses to improve and grow and automate a lot of business processes. But if you double-click into what exactly everyone is saying, it's pretty much: we're going to help you, the data team, the data engineer, ship AI to your business and to your stakeholders. Like, we are the best workflow engine so that you can get data delivered for AI, or we are the best data quality vendor that will help you ensure the quality of the data that goes into AI, or we have the most integrations with all the vector databases that are important for AI.
And the message that you're getting from all of this, and by no means is this unimportant, it is definitely important and relevant, is essentially: data engineer, you have so many things to do, and now you also have to ship AI. We're going to help you ship AI. It's so important that you ship data for AI applications. We are the best tool to help you ship AI. But it almost sounds like this is data engineers in the service of AI. And I think what's really interesting to explore and to unpack, and what I would personally love for myself as a data engineer, is kind of reversing that and asking the question of, okay, we now have this fundamental shift in technology, amazing capabilities from LLMs.
How does it actually help me in my workflow? So what does AI for the data engineer look like? And I think we need much more of that discussion, because I think that if we make the people who are actually working on all these important problems more productive with the help of AI, then they will for sure do amazing things with data. And I think that's a really exciting opportunity to explore.
[00:11:10] Tobias Macey:
One of the first and most vocal applications of AI in that context of helping the data engineers, by maybe taking some of the burden off them, that I've seen is the idea of talk to your data warehouse in English, or text to SQL, or whatever formulation it ends up taking. Rather than saying, oh, now you need to build your complicated star or snowflake schema and then build all of the different dashboards and visualizations for your business intelligence, you just put an AI on top of it, and then your data consumers just talk to the AI and say, hey, what was my net promoter score last quarter, or what's my year over year revenue growth, or how much growth can I expect in the next quarter based on current sales? And it's going to just automatically generate the relevant queries. It's going to generate the visualizations for them, and you, as a data engineer or as an analytics engineer, don't need to worry about it anymore.
And from the description, it sounds amazing. It's like, great. Okay. Job done. I don't need to worry about that toilsome work. I do all of the interesting work of getting the data to where it needs to be, and then the AI does the rest. But then you still have to deal with the issue of making sure that you have the appropriate semantic maps so that the AI understands what the question actually means in the context of the data that you have, which is the hardest problem in data anyway, no matter what. So the AI doesn't actually solve anything for you. It maybe just exacerbates the problem, because somebody asks the AI the question, the AI gives an answer, but it's answering it based on a misunderstanding of the data that you have. And so you still have those issues of hallucination, incorrect data, or variance in the way that the data is being interpreted. And I'm wondering what you have seen as far as the actual practical applications of the AI being that simplifying interface versus the amount of effort that's needed to be able to actually make that useful.
[00:13:10] Gleb Mezhanskiy:
Yeah. I think text to SQL is the holy grail of the data space. For as long as I've worked in the space, which is over a decade, people have really tried to solve this problem multiple times. And, obviously, now in hindsight, it's obvious that pre-LLM, all of those approaches using traditional NLP were doomed. And now that we have LLMs, it seems like, okay, finally, we can actually solve this problem. And I'm very optimistic that it indeed will help make data way more accessible, and I think it eventually will have tremendous impact on how humans interact with data and how data is leveraged. But I think that how it happens and how it's applied is also very important, because I don't think that the fundamental problem is that people cannot write SQL.
SQL is actually not that hard to write and to master. I think the fundamental issue is that if we think about the life cycle of data in the organization, it's very important to understand that the raw data that gets collected from all the business systems and all the events and logs, everything we have in a data lake, is pretty much unusable. And it's unusable both by machines and AI and by people if we just try to throw a bunch of queries at it to answer really key business questions. And in order for the data to become usable, we need what is currently the job of a data engineer: structuring, filtering, merging, and aggregating this data, curating it, and creating a really structured representation of what our business is and what all the entities in the business are that we care about, like customers, products, orders.
So that then this data can be fed into all the applications. Right? Business intelligence, machine learning, AI. And I don't think that text to SQL replaces that, because if we just do that on top of the raw data, we basically get garbage in, garbage out. I do think that in certain applications of that, we can actually get very good results even today if we put that kind of system on top of highly curated, semantically structured datasets. Right? So if we have a number of tables that are well defined that describe how our business works, having a text to SQL interface could actually be extremely powerful, because we know that the questions that are asked and translated into code will be answered with data which has already been prepared and structured. And so it's actually quite easy for the system to be able to make sense of it.
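To make that concrete, here is a minimal sketch of what text to SQL constrained to a curated model can look like. The schema, the prompt wording, and the call_llm stub are hypothetical stand-ins for illustration, not any particular vendor's implementation.

```python
# Minimal sketch: text-to-SQL constrained to a curated, documented data model.
# The schema and `call_llm` are placeholders, not a specific product's API.

CURATED_SCHEMA = """
-- dim_customers(customer_id, signup_date, region, plan)
-- fct_orders(order_id, customer_id, order_ts, amount_usd, status)
-- Metric: net_revenue = SUM(amount_usd) WHERE status = 'complete'
"""

def call_llm(prompt: str) -> str:
    # Stand-in for whatever model provider you use.
    raise NotImplementedError("wire up your model provider here")

def question_to_sql(question: str) -> str:
    prompt = (
        "You translate business questions into SQL.\n"
        "Only use the tables and metric definitions below; if the question "
        "cannot be answered from them, say so instead of guessing.\n"
        f"{CURATED_SCHEMA}\n"
        f"Question: {question}\nSQL:"
    )
    return call_llm(prompt)

# question_to_sql("What was net revenue by region last quarter?")
```

The key design choice is that the model only ever sees the curated tables and metric definitions, so a question that cannot be answered from them gets refused rather than answered from raw, unprepared data.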
But I don't think we are at the point where you just don't need the data team and can simply ask a question. It's almost guaranteed that the answer will be wrong. So in that regard, data engineering and data engineers are definitely not going to lose their jobs just because it's now easy to generate SQL from text.
[00:16:19] Tobias Macey:
And in the context even of that text to SQL use case, what I've been hearing a lot is that it's not even very good at that. One, because LLMs are bad at math, and SQL is just a manifestation of relational algebra, and thereby math. But if you bring a knowledge graph into the system, where the AI is using the knowledge graph to understand what the relations between all the different entities are and from that generates the queries, it actually does a much better job. But, again, you have to build the knowledge graph first. And I think maybe that's one of the places where bringing AI earlier in the cycle is actually potentially useful, where you can use the AI to do some of that grunt work of saying, here are all the different representations that I have of this entity or this concept across my different data sources.
Give me a first pass of what a unified model looks like to be able to represent that, given all of the data that I have about it and all the ways that it's being represented. And I'm wondering what you've seen in that context of bringing the AI into that data modeling, data curation workflow, where it's not the end user interacting with it. It's the data engineer using the AI as their copilot, if you will, or as their assistant to be able to do some of that tedious work that would otherwise be, okay, well, I've got 15 different spreadsheets. I need to visually look across them and try and figure out the similarities and differences, etcetera.
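As a rough illustration of that first-pass idea, a sketch along these lines could ask a model to propose a unified entity model from several source representations. The source systems, field names, and the call_llm stub are all invented for illustration, and the output is a proposal for a human to review, not something to ship directly.

```python
# Sketch: ask an LLM for a first-pass unified model of one entity ("customer")
# that shows up in several source systems. Sources and fields are illustrative.

SOURCES = {
    "salesforce.Account": ["Id", "Name", "BillingCountry", "CreatedDate"],
    "app_db.users": ["user_id", "email", "country_code", "created_at"],
    "stripe.customers": ["id", "email", "address_country", "created"],
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model provider")

def propose_unified_model(entity: str, sources: dict) -> str:
    listing = "\n".join(f"{name}: {', '.join(cols)}" for name, cols in sources.items())
    prompt = (
        f"These systems each hold a representation of the '{entity}' entity:\n"
        f"{listing}\n"
        "Propose a unified dimensional model: canonical column names, types, "
        "and which source field maps to each, flagging any conflicts."
    )
    return call_llm(prompt)  # a human reviews the proposed mapping
```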
[00:17:50] Gleb Mezhanskiy:
Yeah. That's a good point, Tobias. I have two thoughts there. On how the AI plugs in to actually make text to SQL work: yes, you absolutely need that kind of semantic graph of what datasets you have, how they are related, what all the metrics are, and how those metrics are computed. And in that regard, what's really interesting is the metrics layer that was, at some point, a really hot idea in the modern data stack, probably about three to five years ago. And then everyone was really disappointed with how little impact it actually made on the data team's productivity and just overall on the data stack. It's almost like now it's the metrics layer's time. Because if you take the metrics layer, which gives you a really structured representation of the core entities and the metrics, putting text to SQL on top of it is almost the most impactful thing that you can do, because then you have a structured representation of your data model, which allows AI to be very, very effective at answering questions while operating on a structured graph.
And so I think we'll see really exciting applications coming out of the hybrid of that kind of fundamental metrics layer, semantic graph, and text to SQL. We're already seeing the early impacts of that. But I think over the next two years, it probably will become a really popular way to open up data for the ultimate stakeholders, instead of classical BI with drag-and-drop interfaces and passively consumed dashboards. But then the second point which you made is, basically, can AI actually help us get to that structured representation? And I think absolutely, for the data engineer's workflow. So not for, I would say, a business stakeholder or someone who is a data consumer, but for the data producer, I think that leveraging LLMs to help you build data models, and especially build them faster in the sense of understanding all the semantic relationships, not just writing code, is a very promising area. And that comes back to my point about how software tools are limited in their help for data engineers. Right? I can write SQL, but if my tool does not understand what the relationships between the datasets are, then it can't even help me write joins properly.
And one of the interesting things we've done at Datafold was actually build a system that essentially infers an entity relationship diagram from the raw data that you have, combined with all the ad hoc SQL queries that have been written by people. Previously, that would be a very hard problem to solve. But with the help of LLMs, we can actually have a really good shot at understanding what all the entities are that your business has in its data lake and how they're related. And that's almost like a probabilistic graph, because people can be writing joins correctly or incorrectly, and you have noisy data. And sometimes keys that you think are primary keys or foreign keys are not perfect.
But if you have a large enough dataset of queries that were run against your warehouse, you can actually have a really good shot at understanding what the semantic graph looks like. And the context in which we actually did this was to help data teams build testing environments for their data. But the implications of having that knowledge are actually very powerful. Right? So to your point, we can use those tools to help write SQL. So I'm very bullish on the ability to help data engineers build pipelines by creating a semantic graph without the need for manual curation. Because previously, that problem was pushed to people with all the data governance tools. The idea was, let's have data stewards define all the canonical datasets and all the relationships. And, obviously, that proved to be completely non scalable.
So now we're finally at the point where we can automate that kind of semantic data mining with LLMs.
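A toy version of that query-mining idea, not Datafold's actual system, could start by counting how often pairs of columns are joined across historical queries. Real SQL is far messier, which is where an LLM earns its keep, but the skeleton looks roughly like this:

```python
import re
from collections import Counter

# Toy sketch: mine join relationships from historical queries to approximate an
# entity-relationship graph. The query log below is invented for illustration.

QUERY_LOG = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id WHERE region = 'EU'",
    "SELECT * FROM payments JOIN orders ON payments.order_id = orders.id",
]

JOIN_ON = re.compile(r"ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)

def join_edge_counts(queries):
    edges = Counter()
    for query in queries:
        for table_a, col_a, table_b, col_b in JOIN_ON.findall(query):
            edge = tuple(sorted([f"{table_a}.{col_a}", f"{table_b}.{col_b}"]))
            edges[edge] += 1
    return edges

for (left, right), count in join_edge_counts(QUERY_LOG).items():
    print(f"{left} <-> {right}: seen in {count} queries")
```

Edges that appear in many queries become high-confidence relationships in the probabilistic graph, while rare or conflicting joins stay low-confidence until a person, or a model with more context, weighs in.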
[00:22:11] Tobias Macey:
That brings us back around to another point that I wanted to dig into further in the context of how to actually integrate the LLMs into these different use cases and workflows. You brought up the example of Cursor as an IDE that was built specifically with LLM use cases in mind, juxtaposed with something like VS Code or Vim or Emacs, where the LLM is a bolt-on. It's something that you're trying to retrofit into the experience. And it can be useful, but it requires a lot more effort to actually set it up, configure it, make it aware of the code base that you're trying to operate on, etcetera, versus the prepackaged product.
And we're seeing that same type of thing in the context of data, where you mentioned there are all these different vendors of, oh, hey, we're gonna make it super easy for you to make your data ready for AI or use this AI on your data. But most teams already have some sort of system in place, and they just wanna be able to retrofit the LLM into it to start getting some of those gains, with the eventual goal of having the LLM maybe be a core portion of their data system, their data product. And I'm wondering, in that process of bringing in an LLM, retrofitting it onto an existing system, whether that be your code editor, your deployment environment, your data warehouse, what have you.
What are some of those impedance mismatches or some of the issues in conceptual understanding about how to bring the appropriate, I'm gonna use the word knowledge, even though it's a bit of a misnomer, into the operating memory of the LLM so that it can actually do the thing that you're trying to tell it to do?
[00:23:55] Gleb Mezhanskiy:
Yeah. That's a great question, Tobias. I think that to answer this, we kind of need to go back to what the jobs to be done are for a data engineer, and what the data engineer workflow actually looks like. And if we were to visualize it, it actually looks quite similar to the software engineering workflow in just the types of tasks that a data engineer does day to day to do their work. And by the way, we're saying data engineer as sort of a blanket label, but I don't necessarily mean just people who have data engineering in the title, because all roles that are working with data, including data scientists, analysts, analytics engineers, and, in many cases, software engineers, a lot of them actually do data engineering in terms of building and developing pipelines as part of their job. It's just that data engineers probably do this for most of their time, whereas if I'm a data analyst or data scientist, I would be doing this maybe 40% of my week. And so if we think about what I need to do to, let's say, ship a new data model like a table, or extend an existing data model, you know, refactor definitions or add new types of information into an existing model, it starts with planning. Right? So I'm doing planning.
I'm trying to find the data that I need for my work. And a lot of the time, that information can be sourced from documentation, from a data catalog. I think right now, the data catalog, giving you a sense of what datasets you have and what the profile of those datasets is, has been largely solved. There are great tools. Some are open source. Some are vendors. But overall, understanding what datasets you have is now way easier than it was five years ago. You also probably are consulting your tribal knowledge, and you go to Slack and you search for certain definitions. And that's also now largely solved with a lot of the enterprise search tools. And then you go into writing code.
And writing code, I think this is also an important misconception. If you are not really doing this for a living, you think that people spend most of their time actually writing SQL, in terms of writing SQL for production. And in my experience, the actual writing of the SQL or other types of code is maybe 10 to 15% of my time, whereas all the operational tasks around testing it, talking to people to get context, doing code reviews, shipping it to production, monitoring it, remediating issues, and talking to more people is where the bulk of the work is happening.
And if that's true, then that means that, as we talk about automation, these operational workflows are where the bulk of the lift coming from LLMs can actually happen. And so for actually writing code as a data engineer, I would still recommend using the best-in-class software tools these days, like Cursor. Even though it's not aware of the data, it will probably still help you write a lot of boilerplate and will speed up your workflow somewhat. Or you can use other IDEs with Copilot, like VS Code plus Copilot. I think those tools will just help you speed up the writing of the code itself. But back to the operational workflows that I think take the majority of the time within any cycle of shipping something. When it comes to what happens after you wrote the code, typically, if you have people who care about the quality of the data, it means that you have to do a fair amount of testing of your work.
And testing is both about making sure that my code is correct. Right? Does it conform to the expectations? Does it produce the data that I expect? But it's also about understanding potential breakages. Data systems are historically fragile in the sense that you have layers and layers of dependencies that are often opaque, because I can be changing some definition of what an active user is somewhere in the pipeline, but then be completely oblivious to the fact that 10 jobs down the road, someone built a machine learning model that consumes that definition and tries to automate certain decisions, for example around spend, based on that metric. And so if I'm not aware of those downstream dependencies, I could actually be causing a massive business disruption just by the sheer fact of changing it. And so the testing that involves not just understanding how the data behaves, but also how the data is consumed and what the larger business implications are for making any kind of modification to the code, is where a ton of time is spent in data engineering. And what's interesting is that this is the use case where, historically, we at Datafold spent a lot of time thinking, even pre-AI. And before LLMs were a thing, what we did there was come up with a concept of data diffing. The idea is everyone can see a code diff. Right? My code looked like this before I made a change. Now it's a different set of characters that the code looks like. And diffing the code is something that is embedded in GitHub. Right? You can see the diff. But the very hard question is understanding how the data changes based on the change in the code, because that is not obvious. That only happens once you actually run the code against the database. And so data diff allows you to see the impact of a code change on the data. And that by itself was quite impactful, and we've seen a lot of teams adopt that, large enterprise teams, fast-moving startup teams. But we were not fully satisfied with the degree of automation that feature alone produced, because people are still required to sift through all the data diffs and explore them for multiple tables and see how the downstream impacts propagate through lineage.
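To make the data diff idea concrete before going further, here is a toy, in-memory sketch. Real data diffs run inside the warehouse at much larger scale; the tables, keys, and numbers below are invented.

```python
# Toy illustration of a data diff: the same primary key, two versions of a
# table (main branch vs. development branch), and a summary of what changed.

main_branch = {1001: 49.0, 1002: 120.0, 1003: 15.5}  # order_id -> amount
dev_branch = {1001: 49.0, 1002: 125.0, 1004: 9.9}    # 1002 changed, 1003 dropped, 1004 added

def data_diff(old: dict, new: dict) -> dict:
    keys_old, keys_new = set(old), set(new)
    return {
        "added_rows": sorted(keys_new - keys_old),
        "removed_rows": sorted(keys_old - keys_new),
        "changed_values": {k: (old[k], new[k])
                           for k in keys_old & keys_new if old[k] != new[k]},
        "metric_delta": round(sum(new.values()) - sum(old.values()), 2),
    }

print(data_diff(main_branch, dev_branch))
```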
And it felt like, okay, now at least we can give people all the information, but they still have to sift through a lot of it, and some of the important details can be missed. And the big unlock that LLMs bring to this particular workflow, once LLMs became pretty good at comprehending code and actually semantically understanding the code, which pretty much happened over 2024 with the latest generation of foundational large language models, is that we were able to do two things. One, take a lot of information and condense it into, like, three bullet points, kind of like an executive summary. And those bullet points are essentially helping the data engineer understand, at a high level, what the most important impacts are that they need to worry about for any given change, and helping a code reviewer understand the same. And that just helps people get on the same page very quickly, saving a lot of time that otherwise would be spent in meetings, going back and forth, putting comments on a code change. And the second unlock that we've seen is the opportunity to drill down and explore all the impacts and do the testing by essentially chatting with your pull request, chatting with your code. And that comes in the form of a chat interface where you're basically speaking to an agent that has the full context of your code, full context of the data change, the data diff, and also full context of your lineage, so that it can actually understand how every line of code that was modified is affecting the data and what that means for the business.
And you can ask questions, and it produces the answers way faster than you would by essentially looking at all the different code changes and data diffs. And that ended up saving a lot of time for data teams. And now that I'm describing this, it sounds almost like having a buddy that just helps you think through the code, almost like having a code reviewer, except with AI. With an LLM, this is a buddy that's always available to you twenty-four seven and probably makes fewer mistakes, because it has all the context and can sift through a lot of information really quickly. So that's an example of how an LLM can be applied to an operational use case that has historically been really time consuming, and take a lot of manual work out of that workflow.
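A minimal sketch of that summarization step, with hypothetical helper names and a stubbed model call rather than any specific product's API, might look like this:

```python
# Sketch: condense the code diff, the data diff, and downstream lineage into a
# single prompt and ask the model for a three-bullet executive summary.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model provider")

def summarize_change(code_diff: str, data_diff_summary: str, downstream: list) -> str:
    prompt = (
        "You are reviewing a data pipeline change.\n"
        f"Code diff:\n{code_diff}\n\n"
        f"Data diff summary:\n{data_diff_summary}\n\n"
        f"Downstream consumers: {', '.join(downstream)}\n\n"
        "In three bullet points, state the most important impacts a reviewer "
        "should worry about and whether any downstream consumer is at risk."
    )
    return call_llm(prompt)
```

The same assembled context can then back the chat interface described above, since the agent answers follow-up questions against the exact same code diff, data diff, and lineage inputs.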
[00:32:13] Tobias Macey:
And I really wanna dig into that one word that you said probably at least a half dozen times, if not a couple of dozen, which is context. That, I think, is the key piece that is so critical, and also probably the most difficult portion of making AI useful is context. What context does it need? How do you get that context to it? How do you model that context? How do you keep it up to date? And so I think that really is where the difference comes in between the Cursor example that we touched on earlier versus the retrofitting onto Emacs or whatever your tool or workflow of choice is: how do you actually get the context to the place that it needs to be? And so you just discussed the use case that you have of being able to use the LLM for interpreting the various data diffs and understanding what the actual ramifications of a change are. And I'm wondering if you can just talk through some of the lessons learned about how you actually populate and maintain that context, and how you're able to instruct the LLM to take advantage of the context that you've given it?
[00:33:21] Gleb Mezhanskiy:
That's a great question, Tobias. And I think what's interesting is that, at face value, it seems like you want to throw all the information you have at the LLM. Right? Just tell it everything and then let it figure things out. And in fact, it is obviously not as easy as that. It's actually counterproductive to oversupply the LLM with context, in part because the context window of large language models is limited. And the trade-off there is, one, you just can't physically fit everything. And, two, even if you were dealing with a model that is actually designed to have a very large context window, if you overuse it and supply too much information, the LLM just gets lost. It starts being far less effective at understanding what's actually important versus not, and the overall effectiveness of your system goes down.
So back to your question of what the actual information is that is important to provide as context to the LLM: it really depends on what workflow we're talking about. In the context of code review and testing, we are fundamentally trying to answer three questions. The first is, if we changed the code, was the change correct relative to what we were trying to do, what the task was, or did we not conform to the business requirement? The second question is, did we follow the best practices, such as code guidelines and performance guidelines, or not? And the third question is, okay, let's say we conformed to the business requirements and we did a good job of following our coding best practices, but we may still cause a business disruption just by making a change that comes as a surprise, either to a human consumer of data downstream, or by throwing off a machine learning model that was trained on a different distribution of data. Right? And so these are the fundamental three questions that we try to answer. And by the way, even without AI, that's what a good code review done by humans would ultimately accomplish.
So what is the context that is important for the LLM to have here? First, obviously, it is the code diff. Right? We already know what the original code was and what the new code is. And feeding that into the LLM is really important so that it can understand what the actual changes in the code itself, in the logic, are. I won't go into the details here, because, obviously, the code base can be very large and sometimes your PR can touch a lot of code, so you have to be quite strategic in how you feed that in on the technical side. But conceptually, that's what we have to provide as input number one. The second important input is the data diff. Right? It's understanding, if I have the main branch version of the code, what data it produces and what the metrics are showing. And then, if I have a new version of the code, let's call it the developer branch, what data it produces and what the difference in the output is.
Let's say, with my main branch code, I see that I have 37 orders on Monday. But with the new version of the code, I see that I have 39. And so that already tells me, okay, this is the important impact on the output data and on the metrics. And that's important both on the value level, understanding how the individual cells, rows, and columns are changing, but it's also important to do roll-ups and understand what the impact on metrics is. And coupling that context with the code diff allows us to understand how changes in the code affect the actual data output. And the third really important aspect is the lineage. Lineage is fundamentally understanding how the data flows throughout your system, how it's computed, how it's aggregated, and how it's consumed.
And the lineage is a graph, and there are two directions of exploration. One of them is upstream, which helps us understand how the data got to the point where you're looking at it. Right? So, for example, if I'm looking at the number of orders and I'm changing a formula, where does the information about orders come from in the first place? That is important because it can tell us a lot about how a given metric is computed and what the sources of truth are. Are we getting it from Salesforce? Are we getting it from our internal system? And then the downstream lineage is also important because it tells us how the data gets consumed, and that is absolutely essential information that can help us understand what downstream systems and metrics will be affected. And the lineage graph in itself can be very complex, and building it is actually a tough problem, because you have to essentially scrape all of your data platform information, all the queries, all the BI tools, to understand how data flows, how it's consumed and produced. But let's say you have this lineage graph. It's actually also a lot of information by itself. And so to properly supply that lineage information into an LLM's context, you kind of need your system to be able to explore the lineage graph on its own, to see, okay, if the developer made a change here, what are the important downstream implications of that? So now we're talking about the system being able to traverse that graph and do analysis on its own for the context. I would say these are the three most important types of context. And then the fourth one is kind of optional. If your team has any best practices, SQL linting rules, documentation rules, you can also provide them as context, and then your AI code reviewer assistant can help you reason about whether you conformed or not, and if not, make suggestions about what to correct. Eventually, it will probably go in and correct your code itself. I think that's ultimately where this is going. But, again, it would pretty much be operating on the same set of input context.
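The lineage-traversal piece is the easiest to sketch in isolation. A toy downstream walk over a made-up lineage graph, bounded by a hop limit so that only the relevant slice of the graph ends up in the model's context, could look like:

```python
from collections import deque

# Sketch: downstream lineage traversal so only the affected subgraph is handed
# to the model as context. The graph below is invented for illustration.

LINEAGE = {  # node -> direct downstream consumers
    "stg_orders": ["fct_orders"],
    "fct_orders": ["rev_dashboard", "ltv_model_features"],
    "ltv_model_features": ["ltv_model"],
}

def downstream_of(node: str, graph: dict, max_hops: int = 3) -> list:
    seen, queue = set(), deque([(node, 0)])
    while queue:
        current, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, hops + 1))
    return sorted(seen)

print(downstream_of("stg_orders", LINEAGE))
# ['fct_orders', 'ltv_model', 'ltv_model_features', 'rev_dashboard']
```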
[00:39:13] Tobias Macey:
Another interesting element of bringing LLMs into the context of the data engineering workflow and use case, one is the privacy aspect, which is a whole other conversation. I don't wanna get too deep into that quagmire. But, also, when you're working as a data engineer, one of the things you need to be thinking about is, what is my data platform? What are the tools that I rely on? What are the ways that they link together? And if you're going to rely on an LLM or generative AI as part of that tool chain, how does that fit into that platform? What is some of the scaffolding? What are some of the workflows? What is some of the custom development that you need to do? A lot of the first-pass and naive use cases for generative AI and LLMs are, oh, well, just go and open up the ChatGPT UI, or just go run LM Studio, or use Claude, or what have you. But if you want to get into anything sophisticated, where you're actually relying on this as a component of your workflow, you want to make sure that it's customized, that you own it in some fashion.
And so that is likely going to require doing some custom development, using something like LangChain or LangGraph or CrewAI or whatever, where you're actually building additional scaffolding logic around just that kernel of the LLM. And I'm curious how you're seeing some of the needs and use cases of incorporating the LLM more closely into the actual core capabilities of the data platform through that effort of customization and software engineering.
[00:40:45] Gleb Mezhanskiy:
That's a great point, Tobias. I think that the models themselves are getting rapidly commoditized, in the sense that the foundational large language models' interfaces are very similar and their capabilities are similar. We're seeing a race between the companies training those models in terms of beating each other in benchmarks. It looks like the whole industry is converging on adding more reasoning, and the way that this is happening is also converging on the same experience, and the difference is, who is doing this better? Right? Who is beating the benchmarks? Who provides the cheaper inference, the faster inference, more intelligence for the same price? And so I don't think that the differentiation, or the effectiveness of whatever automation you're trying to bring, really depends on the choice of the model. Maybe for certain narrow applications, choosing a more specialized model or fine-tuning a model would be more applicable. But still, I don't think the model is really where the magic happens these days.
The model is important for the magic, but it's not something that allows you to build a really effective application just by choosing something better than what's available to everyone else. The actual magic and the value add and the automation happen in how you leverage that model in your workflow. So all the orchestration in terms of how you prompt the model, what kind of context you provide, how you tune the prompt, how you tune the inputs, how you evaluate the performance of the model in production, how you make various LLM-based actors that may be playing different roles interact with each other. That is where the hard work is happening, and that is where I think the actual value and impact are created. And that's where all the complexity is. So I think you don't have to be a PhD and really understand how the models are trained. Although, I would say, just like in computer science, it's obviously very helpful to understand how these models are trained and their architectures and their trade-offs. But you don't have to be good at training those models in order to effectively leverage them. To leverage them, though, you have to do a lot of work to effectively plug them into the workflows. And I think that the applications and companies and teams that are thinking about what the workflow is, what the ideal user interface is, what all the information is that we can gather to make the LLM do a better job, and that are able to rapidly iterate, will ultimately create the most impact with LLMs.
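One small, hypothetical illustration of that point is keeping the model behind a narrow seam so that prompts, providers, and evaluation can change independently; none of this reflects any particular vendor's architecture.

```python
import json
import time
from typing import Callable

# Sketch: the model sits behind one narrow function, and every call is logged
# so prompts and models can be swapped and evaluated offline later.

def run_step(task: str, context: str, model: Callable[[str], str],
             log_path: str = "llm_calls.jsonl") -> str:
    prompt = f"Task: {task}\nContext:\n{context}\nAnswer concisely."
    started = time.time()
    answer = model(prompt)
    with open(log_path, "a") as log:
        log.write(json.dumps({
            "task": task, "prompt": prompt, "answer": answer,
            "latency_s": round(time.time() - started, 3),
        }) + "\n")
    return answer

# Swapping providers or prompt templates only touches this seam, and the log
# becomes the evaluation dataset whenever a new model is released.
fake_model = lambda prompt: "stub answer"
run_step("classify alert severity", "row count dropped 40% overnight", fake_model)
```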
[00:43:31] Tobias Macey:
And so on that note, in your experience of working with the LLMs, working with other data teams, and keeping apprised of the evolution of the space, what are some of the most interesting or innovative or unexpected ways that you've seen teams bring LLMs into that inner loop of building and maintaining and evolving their data systems?
[00:43:52] Gleb Mezhanskiy:
I think the realization that is obvious in hindsight, but not necessarily obvious when you're just starting, is that no one really knows how to ship LLM-based AI applications. There are obviously guides and tutorials, and there's a lot you can learn from looking at what people are doing, but the field is evolving so fast that nothing replaces fast experimentation and just building things. It's not that you can just hire someone who worked on building an LLM-based application six months ago or a year ago and all of a sudden gain a lot of advantage, as you would with many other technologies. If we were working in the space of video streaming, for instance, it would be very beneficial to have extensive experience with video streaming and codecs. With LLMs, no one really knows exactly how they behave; even the companies that are shipping them are discovering more and more novel ways of leveraging them effectively every week.
And for the teams that are leveraging LLMs, like Datafold, the thing that we found matters the most is the ability to, a, just stay on top of the field and understand what the most exciting things are that people are doing, how they relate to our field, and how we can borrow some of those ideas. But most important is rapid experimentation, with some sort of methodology that allows you to try new things, measure results quickly, and then be able to scrap an approach that you thought was great and just go with a different one. Because a lot of the time, when a new model is released, you have to adjust a lot of things. You have to adjust the prompts. You may even have to rearchitect some of the flows that you've built.
And that is both difficult but also incredibly exciting because the pace of innovation and what is possible to solve is evolving extremely fast. I would say the fastest of any previous technological wave of disruption that we've seen.
[00:46:17] Tobias Macey:
In your experience and in your work of investing in this space, figuring out how best to apply LLMs to the problems facing data engineers and how to incorporate that into your products, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:46:34] Gleb Mezhanskiy:
Yeah. I think that the interesting realization was that, specifically for the data engineering domain, if you just take the problem at face value, you think, well, let's just build a Copilot or an agent that would try to automate the data engineer away. And I don't think we have the tech ready for an agent to just really take a task and run with it yet. I don't think it's been solved in the software space. I think it's, in some ways, even harder to solve in the data space. We'll eventually get there. I don't think we are there yet. I also don't think that the biggest impact you can make on the data engineering workflow is having a copilot, because that's not where data engineers spend most of their time; it's not writing production code. It's all the operational tasks. And there are certain kinds of problems in the data engineering space where it's not even the day-to-day help where you save an hour, two hours, three hours.
But there are certain types of workflows where, to complete a task, a team needs to spend, like, ten thousand hours. And a good example of such a project would be a data platform migration where, for example, you have millions of lines of code on a legacy database. You have to move them over to a new, modern data warehouse. You have to refactor them, optimize them, repackage them into a new kind of framework. Right? You may be moving from, like, stored procedures on Oracle to dbt plus Databricks. And doing that requires a certain number of hours for every object. And because you're dealing with a large database, at the enterprise level that sums up to an enormous amount of work.
And, historically, these projects would last years and be done, a lot of times, by outsourced talent from consultants or SIs. And for a data engineer, that's probably one of the most miserable projects to do. I've led such a project at Lyft, and it's been an absolute grind, where you're not shipping new things. You're not shipping AI. You're not even shipping data pipelines. You're just solving technical debt for years. And what's interesting is that those types of projects and workflows are actually, I would say, where AI and LLMs can make the most impact today, because we can take a task.
We can reverse engineer it. We know exactly what the target is: you move the code, you do all of these things with the code, and, ultimately, the data has to be the same. Right? You're going through multiple complex steps, but what's important for the business is that once you move from, let's say, Teradata to Snowflake, your output is the same, because, otherwise, the business wouldn't accept it. And that allows us to, a, leverage LLMs for a lot of the tasks that are historically manual, but also have a really clear objective function for the LLMs, like diffing the output of the legacy system against the modern system and using it as a constraint.
And if you put those two things together, you have a very powerful system that is, a, extremely flexible and scalable thanks to LLMs, but also can be constrained to a very objective definition of what's good, unlike a lot of this text to SQL generation that cannot be constrained to a definition of what's good. Because, like, how do you know? By the end of a migration, you do know. And that allows AI to make a tremendous impact on the productivity of a data team, by essentially taking a project that would otherwise last years, cost millions of dollars, and go over budget, and constraining that into weeks and just a fraction of the price. I think that is where we can see real impact of AI that's, like, useful. It's working.
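As a rough sketch of that translate-and-verify loop, with every function a placeholder for the real LLM call and warehouse connections, the control flow is the interesting part: the diff of outputs is the objective function, and the loop only accepts a translation when it is empty.

```python
# Sketch of a migration loop: translate a legacy object, run both versions,
# diff the outputs, and only accept when they match. All functions are stubs.

def translate(legacy_sql: str, feedback: str = "") -> str:
    raise NotImplementedError("LLM call: legacy dialect -> target dialect")

def run_legacy(sql: str) -> list:
    raise NotImplementedError("execute on the legacy warehouse")

def run_modern(sql: str) -> list:
    raise NotImplementedError("execute on the target warehouse")

def migrate_object(legacy_sql: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        candidate = translate(legacy_sql, feedback)
        # Symmetric difference of result rows as a crude output diff.
        diff = set(run_legacy(legacy_sql)) ^ set(run_modern(candidate))
        if not diff:
            return candidate  # outputs match: the objective is satisfied
        feedback = f"Outputs differ on {len(diff)} rows; fix the translation."
    raise RuntimeError("translation did not converge; escalate to a human")
```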
And we also see the parallels in the software space as well. A lot of the really thoughtful enterprise applications of AI are actually taking these legacy code bases and helping teams maintain them or migrate them. And I think that there are more opportunities like that in the data engineering space where we'll see AI make a tremendous impact.
[00:51:03] Tobias Macey:
And as you continue to keep in touch with the evolution of the space, work with data teams, and evaluate the cases where LLMs are beneficial versus where you're better off going with good old human ingenuity, what are some of the things you're keeping a particularly close eye on, or any projects or contexts you're excited to explore?
[00:51:27] Gleb Mezhanskiy:
In terms of where I think LLMs would really make a huge impact on the workflow?
[00:51:33] Tobias Macey:
Just LLMs in general, how to apply them to data engineering problems, how to incorporate them more closely and with less legwork into the actual problem solving apparatus of an organization.
[00:51:46] Gleb Mezhanskiy:
Yeah. So I think that on multiple levels, there's a lot of exciting things. For example, being able to prompt an LLM from SQL as a function call, which is available these days in modern data platforms, is incredibly impactful. Right? Because in many instances, we're dealing with extremely massive data, and instead of having to write complex CASE WHEN statements and regexes and UDFs to be able to clean the data, to classify things, and to just untangle the mess, we can now apply LLMs from within SQL, from within the query, to solve that problem. And that is incredibly impactful for a whole variety of different applications. So I'm very excited about all these capabilities that are now brought by the major data platforms like Snowflake, Databricks, and BigQuery.
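As an illustration of the shape of that capability, the snippet below embeds an LLM call in a query to classify messy free text. LLM_CLASSIFY is a placeholder name; each platform exposes its own function and signature (Snowflake's Cortex functions, Databricks' ai_query, and BigQuery's ML.GENERATE_TEXT are the usual entry points), so check your warehouse's documentation for the real one.

```python
# Sketch: calling an LLM from inside SQL to classify free-text values instead
# of maintaining a pile of regexes and CASE WHEN branches.

CLEANUP_QUERY = """
SELECT
    ticket_id,
    raw_subject,
    LLM_CLASSIFY(  -- hypothetical warehouse LLM function; name varies by platform
        'Classify this support ticket as billing, bug, or feature_request: '
        || raw_subject
    ) AS ticket_category
FROM support_tickets
"""

def run_query(sql: str):
    raise NotImplementedError("execute via your warehouse connector")

# run_query(CLEANUP_QUERY)
```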
I think that if we go into the workflow itself, like, what does a data engineer do and how do we make that work better, I think there's a ton of opportunity to further automate a lot of tasks. I think a big one is data observability and monitoring. I honestly think that data observability in its current state is a dead end in terms of, let's cover all data with alerts and monitors and then be the first to know about any anomalies. It's useful, but it quickly leads to a lot of noise and alert fatigue, and it could ultimately even be net negative on the workflow of a data engineer.
I think that this is the type of workflow where putting an AI to work investigating those alerts, doing the root cause analysis, and potentially remediating them is where I see a lot of opportunity for saving a ton of time for data teams while also improving the SLAs and the overall quality of the output of a data engineering team. And that's something that we are really excited about. It's something we're working on at Datafold, and we are excited about it coming later this year.
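A minimal sketch of that triage idea, where the gather_* helpers and the call_llm stub are hypothetical stand-ins rather than a description of Datafold's product, could look like this:

```python
# Sketch: instead of paging a human for every anomaly, gather the context an
# on-call engineer would gather and ask the model for a likely root cause.

def gather_recent_code_changes(table: str) -> str:
    raise NotImplementedError("e.g. recently merged PRs touching this model")

def gather_upstream_runs(table: str) -> str:
    raise NotImplementedError("e.g. freshness and volume of upstream tables")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model provider")

def triage_alert(table: str, anomaly: str) -> str:
    prompt = (
        f"Alert: {anomaly} on table {table}.\n"
        f"Recent code changes:\n{gather_recent_code_changes(table)}\n"
        f"Upstream run status:\n{gather_upstream_runs(table)}\n"
        "Rank the most likely root causes and suggest a remediation, or say "
        "'escalate to a human' if the evidence is inconclusive."
    )
    return call_llm(prompt)
```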
[00:53:56] Tobias Macey:
Are there any other aspects of this overall space of using LLMs to improve the lives of data engineers, and the work that data engineers can do to improve the effectiveness of those LLMs, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:12] Gleb Mezhanskiy:
I think that we talked a lot about the workflow improvements. Overall, my recommendation to data engineers today would be to learn how to ship LLM applications. It's not that hard. Frameworks like LangChain make it very easy to compose multiple blocks together and ship something that works. Whether or not you end up using LangChain or another framework in production, and whether your team allows that, doesn't really matter, but it's really, really useful to try to build and learn all the components.
And it's just like software engineering. Learning how to code opens up so many opportunities for you to solve problems. Right? You see a problem and you're like, I can write a Python script for that. And I think that with LLMs, it's almost like a new skill that both software engineers and data engineers need to learn, where you see a problem and you think, okay, I actually think I can split the problem into three tasks that I can give to an LLM. Like, one would be extraction. Another could be, like, reasoning and classification. And now it just solves the problem.
But really, learning how to build and trying things helps you build that intuition. And so my recommendation for all data engineers listening to this is to try to build your own application that solves either a business problem or helps you in your own workflow, because knowing how to build with LLMs just gives you tremendous superpowers and will definitely be helpful in your career in the coming years.
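In that spirit, a tiny, framework-free sketch of the split-the-problem habit, with an invented invoice example and a stubbed model call, is enough to practice the pattern:

```python
# Sketch: one call extracts structured fields, a second call classifies them.
# The invoice scenario and call_llm stub are invented for illustration.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model provider")

def extract_fields(email_text: str) -> str:
    return call_llm(
        "Extract vendor name, invoice amount, and due date as JSON from:\n"
        + email_text
    )

def classify_urgency(fields_json: str) -> str:
    return call_llm(
        "Given this invoice JSON, answer 'urgent' if it is due within 7 days, "
        "otherwise 'normal':\n" + fields_json
    )

def handle_invoice_email(email_text: str) -> str:
    return classify_urgency(extract_fields(email_text))
```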
[00:55:52] Tobias Macey:
I definitely would like to reinforce that statement, because despite the AI maximalists and the AI skeptics, no matter what you think about it, LLMs aren't going anywhere. They're going to continue to grow in their usage and their capabilities, so it's worth understanding how to use them and investing in that skill, because it is going to be one of those core tools in your toolbox for many years to come. And so for anybody who wants to get in touch with you and follow along with the work that you are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your current perspective on the biggest gap in the tooling or technology for data management today.
[00:56:35] Gleb Mezhanskiy:
I think that there's a lot of skepticism and some bitterness around the idea that the modern data stack failed us, in the sense that we were so excited five years ago that the modern data stack would make things so great, and we're kind of disappointed. I'm an optimist here. I think that the modern data stack, in the sense of infrastructure and getting a lot of the fundamental challenges out of the way, like running queries and getting data in and out of different databases and visualizing the query outputs and having amazing notebooks.
All of that, which we now take for granted, is actually so great relative to where we were five, seven, eight, ten years ago. But I don't think it's enough. So I am with the data practitioners who say, well, it's 2025. We have all these amazing models. Why is it still so hard to ship data? I'm absolutely with you. And I think what I'm excited about is, now that we have this really great foundation with the modern data stack in the sense of infrastructure, I'm excited about, one, getting everyone onto the modern data stack, to the point of migrations. Right? Let's get everyone on modern infrastructure so that they can ship faster. Obviously, a problem that I'm really passionate about solving and working on.
Second, once you are on the modern data infrastructure, how to keep modernizing your team's workflows so that data engineers are spending more and more time on solving hard problems and thinking and planning, on the valuable activities that are really worth their time, and less and less on operational toil that is just burnout inducing and holds everyone back. So I'm excited about a modern data stack renaissance, thanks to the fundamental capabilities of large language models.
[00:58:30] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share your thoughts and experiences around building with LLMs to improve the capabilities of data engineers. It's definitely an area that we all need to be keeping track of and investing some time into. So I appreciate the insights that you've been able to share, and I hope you enjoy the rest of your day.
[00:58:50] Gleb Mezhanskiy:
Thank you so much, Tobias.
[00:58:59] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Welcome
Gleb Mezhanskiy's Background and DataFold
AI in Data Engineering: Hype vs Reality
AI for Data Engineers: Opportunities and Challenges
Text to SQL: Potential and Pitfalls
Data Engineer's Workflow and AI Integration
The Importance of Context in AI Applications
Customizing LLMs for Data Platforms
Lessons from Applying LLMs in Data Engineering
Future of LLMs in Data Engineering