Evolving Responsibilities in AI Data Management

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.

DataFold's AI powered migration agent changes all that.

Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

And they're so confident in their solution, they'll actually guarantee your timeline in writing.

Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds

today for the details.

Your host is Tobias Macy, and today I'm interviewing Bartov Mikulski about how to prepare data for use in AI applications. So Bartov, can you start by introducing yourself?

So I'm Bartov. I'm an MLOps engineer. I've been working as a data engineer for some time, then I switched to MLOps.

And,

along the way, I realized that

that the real problem is not really the software that if you do, maybe the data is more important,

like, not

even the data that you process, but the way how to test it.

And this kind of applies to AI also.

So this was the the way how how it went from data engineering to AI.

And do you remember how you first got started working in the space of data and AI and ML?

Kind of by accident. I mean, we had some

data engineering work,

to to do in the back end project, and I liked it. So then I stayed in data engineering.

And,

then along the way,

I got interested in, mapping,

but I was never very good at training those models.

But somehow, I was better to deploy them and keep it running so I got involved in MLOps,

and that I just stayed in this idea.

And so now that

a lot of the industry has started moving into the space of generative AI and building AI applications,

obviously, there are a lot of data requirements that go with that. But I'm wondering if you could just start by outlining some of the main categories of the types of data assets that are needed specifically for AI applications

and some of the ways that that maybe differs from the, I guess, traditional, I'll I'll say, data assets that engineering teams are used to working with.

Okay. So

first of all, this is machine learning, so it doesn't really differ that much.

You need some test dataset.

In the case of generative AI, we call it evaluation dataset because, you know, we need fancy naming.

But this is your test, dataset,

and you will use it to

verify if this thing whatever you are building works works correctly.

You don't need the training data set that much though unless you are going to fine tune something.

So

it should be easier,

to get the required data.

Although,

then quickly you realize that if you are building something that involves multiple steps,

like, you do one

call and you retrieve some data, you do another call to do your AI model,

you need the tray and the testing datasets

for all of the steps separately because you will be

testing them separately

and to figure out what doesn't work because

some

somehow, always something doesn't work well,

and you have to figure out what it is. Yeah.

So

you will have a lot of test data, and,

this is the the data asset that

you would need to gather somehow. You can

well, you you can generate it to some extent,

but at some point,

you will have to start getting the real data.

And in this space of generative AI applications, there are a few different

styles that have been emerging.

RAG was kind of the first one once we moved past the initial phase of just prompt engineering.

And now we've moved into these agentic architectures,

and there are a few different, styles of AI apps that have been coming up. Another one being graph rag where you incorporate knowledge graphs.

And I'm wondering

how the particular type of application

changes the requirements around the types of data assets that are available to those AI applications for feeding into the models or or for storing some of the outputs or

metrics around the model generation itself.

Okay. So besides the data in the database, obviously,

what what you need is,

as we discussed, the evaluation dataset. So in all of those cases, it looks a little bit different. It's It's because for a rack, you have the user question,

and then the AI is calling some

service, let's say, database

with some query. So it needs a test set that

consists of at least two things, the input from the user and the query you want to send or multiple queries.

And then you

check if this really happened or if the query that was sent is similar enough to what you expect.

Then you get the response,

and you have to generate the answers. And this needs it's separate,

dataset

for testing

because this is another step, and it it can fail too.

And then, of course, you will test it as a entire workflow.

So you have to

be prepared for this also.

And this is was one interaction, really. Yeah. You receive something and you generate a dancer. And if you have multiple steps, you will have to multiply your datasets.

And for the agents, it gets even funnier because, you have no control of of the of the process.

Well, I have some control over the process, but the agent can

choose a tool and choose the parameters for the tool.

And then your dataset has to contain

the queries and the tools that you want to use and the parameters you would expect to see when this query is sent.

So, basically,

keep multiplying the test datasets.

In terms of the areas of responsibility

for what the

role looks like and who is responsible for what pieces of the life cycle of the application and the different

data that gets fed into or retrieved from those different stages, I'm wondering how you're seeing that

breakdown in terms of different organizations

and how that maybe is influenced

by the either size and scale of the organization or the type of application or use case that they're powering.

K. So as a freelance AI engineer, I can say that everything that is even remotely

related to AI is

responsibility of the AI engineer,

but it doesn't have to be this way. So

we already had some setup, yeah, because you probably had some ML models.

So it can stay this way. You have the data and training into this gathering the data

and maybe cleaning it. You have the data scientist.

In this case, we'll write the prompts and do the experimentation on the prompts.

You might have the MLOps team deploying it. In this case, deployment is really just changing the prompt unless you

use the open source modem, then you have to redeploy something.

But on the other hand,

the step when you are getting the

production data is more, work intensive because you have those

intermediate

calls to the to the model.

So in this case, the envelope steam is still required.

So I think it doesn't today, it doesn't change that much.

It does

requires maybe

something that we are not used to. We're just working with text

on both ends of the

the model. So just not feeding it text, but also getting the text from it.

And for data engineering teams in particular who are used to working with more structured datasets, doing something along the lines of data warehousing, business intelligence, or maybe even

feeding some of those curated datasets back into application contexts.

What are some of the types of skills that transfer well to this world of unstructured data and preparing it for

AI applications, particularly working with things like vector databases?

And what are some of the skills that need to be acquired for people in that situation so that they can more effectively

work with and support the MLOps and AI engineer teams?

Okay. So if you

outline the process

in detail, you will always find something that we are right now. So if you do calls to the database, you probably know the query language for the database, whatever it is. If it's SQL or

any other thing,

you you might know it already.

So this is a skill that

you can just use.

In the other areas okay. Maybe vector databases might be kind of new for you.

So this,

the data entry team might need to learn

because

it is like a normal database when you are inserting data to into it. So it's maybe not that relevant, but when you are receiving,

it's

a little bit surprising at first.

So

the matching of of documents.

The other things that you can transfer,

I think, the entire machine learning process,

like the deployment,

follow deployments, AB testing, testing in general, experimentation,

this doesn't change. Change the tool, but the process stays the same.

So

people already know a lot that they need to know when they use generative AI. Maybe they just don't

realize it yet.

On that

vector database side,

they take a number of different forms where you have document oriented vector databases

in the shape of things like Qdren.

You have pure vector databases,

sort of like Pinecone,

and then you have vector

add ons for relational databases like PG vector and postgres,

as well as a whole slew of other

formulations of vector storage in various contexts.

And I'm wondering how

the inclusion of vectors

as a data type and as a core asset that is consumed

and produced by these AI applications

changes some of the ways that

teams need to think about data modeling,

in particular around things like trunking strategies,

metadata management.

What are the pieces of information that you want to strip out before you run it through an embedding model? What are some of the pieces of information that are actually useful for putting into the embedding model? I know that, for instance, HTML, there have been conversations about whether to keep or strip out the tags, whether they're helpful or harmful, and just some of those types of,

you know, tactical elements of building these data assets that teams need to be thinking about and trained up on.

Okay. So trunking is definitely something new for data engineers,

and you just have to get used to this.

So from the strategy start that you have basically, you have to remember that

you will probably have to chunk the documents

because they will not fit in the context window of the model. Even if they do, that might be too expensive to use it this way.

So even though the best

way

might be

possibly,

we will have to test this always to always send the entire document to the model.

But in reality, you will chunk it. So you have several ways to do it. You can just decide that there's a fixed

size of the trunk.

So let's say, for the sake of the example,

500 characters,

and then you just cut the text every 500 characters, maybe too small a number I think this too small, but just for example.

And then you start to build on top of this idea because you probably don't want to,

cut it in the middle of the word.

So you might have some tracking started with that. Okay.

It's 500.

But if there's in the middle of the word, we do it a little bit earlier. And then you realize, okay. That's not amid the middle of the word, but still in the middle of the sentence.

And then you go back. Yeah. It's not in the middle of the sentence, but maybe this inside of the paragraph. And so, basically,

just

invented the recursive trunking strategy. You are cutting the text

when it makes sense

to preserve as bigger chunk of as you can. But if you can't, then you just resolve

just just

stay with

the trunking of,

in the middle of the word.

But,

still, it might not be enough because,

possibly, you can just be unlucky.

Yeah. So,

the sentence that you need might end up in the other trunk.

So then we added overlapping,

trunking strategies to take some

part of another trunk. You not even considered it,

like, to be the part of the trunk that you want, but you just overlap with another thing. And you have duplicates, but it's it's supposed to help you find the the relevant information.

But still,

it's,

may not be enough because,

sometimes when you write the document, you have first

description of the problem, like a few paragraphs, and then you start writing the description of the solution. And your tracking extension might perfectly,

allow you to find the description of the problem, but you are not interested in the problem. Or even know what is the problem. You have it. You want the solution.

And it's somehow another trunk that was not matched. So then you can use some we got parent document ready, but when you match by chunks, but you get the entire document.

And

you can still build on top of those those ideas because sometimes out of strength

the topic in the middle of the text. And you can use something called semantic trunking. So use use the generative AI model to tell you where this trunk ends. So from the basic idea of cutting the text,

at some point, you can build a lot of, more advanced, techniques.

And then you realize that if you have document,

the the trunk

that you want to match and you match it by the query from the user, you are not really

matching the same things. You have the answer and the question, and that's supposed to be similar.

But maybe you would be looking for an answer that is similar to some other answer. So you just invented the hypothetical document embeddings, but you are generating the,

fake, it's not fake, answer to the user question. But you hope that the language vocabulary in this fake answer is similar to the actual answer. So

you can keep adding new things, and then you have metadata that

you can use

to narrow the,

space of the vectors that you have to, set. But this,

this is not something you can retrofit into the pipeline

because you have to start those metadata fields.

So if you start thinking of metadata, you have to go back to data engineering

and just add them.

And this might require a lot of work done again

that you have done already. But

this is what it is. Yeah.

Another

divergence from what data engineers are typically used to in the context of these embeddings and vector databases

is that

there's not a lot of opportunity for

being able to do sort of a backfill or an incremental reload,

at least in the case where you need to change your chunking strategy or change your embedding model. You need to effectively

rerun all of the data every time whenever you make a change of that nature

versus just I need to add a new document to the database using all of the same parameters.

Whereas in

more structured data

contexts, you can

either

mutate the data in place or,

you know, append to it without necessarily having to do as drastic of a rebuild.

And

given the fact that you might be dealing with large volumes of data,

it likely brings in requirements of more

complex or more sophisticated

parallel processing. And I'm wondering how you're seeing some of those requirements

change the

tool sets or

platform capabilities that

engineering teams need to incorporate and invest in to be able to support these

embedding

experimentation

and being able to evolve embeddings over time as new embedding models come up or as they need to change the trunking strategies or etcetera?

K. I'm

I think

this is not solved yet. At least, I'm not not I'm not aware of any solution to this

as of now.

So

for now, what what what I have been doing is just

creating new collections of, data,

with the different trunk strategies,

trunking strategies,

and using those. And, of course, you have to ingest them again, and it takes time, and you have to process them.

And if you use some

SaaS embedding model, you pay for the embeddings all the time when you do it.

So this is the problematic

part, and I'm not aware of any solutions. Maybe

definitely someone's working on them, and I would love to hear about it. But, I don't know it.

Yeah. In particular, I imagine that teams who are

doing sort of the traditional

extract transform and load or extract load and transform workflows for filling their data warehouse,

whatever

batch or even streaming tools they're using to do that likely aren't going to be able to,

provide the

timeliness or scalability that they would need for doing massive reprocessing

of all of the documents for regenerating embeddings, which likely pushes them into adopting something like a spark or array where maybe they didn't already have that as part of their infrastructure.

Yep.

And, then you have to,

explain the engineers that they have done something, and it was perfect, but we need something else,

which

is probably not

not not the thing to want to say to people rather often.

Yeah. But this is what it is.

It will be great to have a solution, but I think we don't have it yet.

And beyond

the embeddings,

as you move into some of the

more sophisticated

AI applications

where maybe you need to incorporate something like long and short term memory for

a chatbot or an agent style application.

You also have the management of conversational history and responses

and maybe also,

additional data collection to support fine tuning of some of those models.

How does that introduce new

requirements and new workload capabilities to data engineering and ML ops teams to be able to support those types of applications.

Okay. So I'm not that familiar with agent memory, so let's maybe focus on fine tuning.

So

first of all,

like in the machine learning classical machine learning, you need the training data set, and it consists of the input and out. This is pretty obvious.

But,

gathering the quality output gets sticky because,

you can

get the data from the chat. For example, if you're building a chatbot,

from from the chat,

and assume that if nobody is complaining about it, then it's probably correct.

But, this is not the case because people might just stop using this tool if they are not satisfied.

It doesn't mean that, like, someone doesn't didn't bother to click the button that they don't like it,

then which means that they have liked

the the output and you can use it.

What's even worse, how you are going to get the correction?

You

got something wrong.

The person who is using your app is not satisfied. They click the feedback button, but they don't like it. And now to show them the

message that okay. So write down what you want to get instead. So if you wanted to have a helpful tool and it already disappointed you, and now you also got a homework.

So this is this is not going to work,

this way.

So,

sadly, what you are going to need,

in all of the cases, you

might

get away with getting the data from the user, but you will need some data levels. So someone who can just write the

okay. Inputs you can get from the actual user, but the output that you expect, you need someone who know what is the output

and who can write them down

and what is the sad part.

In most cases, that person might be you. So

I will be writing this

because there's

no no one else was going to to do it. Yeah. But you need the data, and it's not not going to appear magically from nowhere.

Because of the fact that the overall space of generative AI applications and the different ways that these large language models are being incorporated

into

different application architectures,

thinking in terms again of things like agents

versus straight workflows versus just a back and forth chatbot and even just going from single turn to multi turn.

How has that

evolution of capabilities and use cases

changed

the types of work and responsibilities

for data engineers and ML ops engineers over that period? And maybe what are some of the ways that you are forecasting those changes to continue as we go forward?

K.

The generative AI by by itself was a big evolution in the AI space,

but,

I think it wasn't the biggest,

because

for me, the biggest trend was the

coding tools. Like, Cursor,

before it was, of course, a GitHub Copilot.

But it

it could finish the current line, but it wasn't reduced. Okay. It was useful. But, in comparison to Cursor, it's

almost nothing.

And, I think this is the the biggest, trend in responsibilities

because

now,

I, as a data engineer, I I can do front end now. This may not be the prettiest front end, but but I can do it. Yeah. I can I can I can make it work? So,

with this tool, you can have teams

who are really almost like full stacks in everything.

You might specialize in data engineering still, but but you can do other work.

And,

this really

makes a lot of things possible, and you don't need

to involve someone from other team when you are just building something,

maybe not even internally, like,

for the real long team, but

for just building something, might be a prototype that is

good enough to fall to other other people, and you don't need to involve,

the person from from another team, like a front end engineer who probably is not on your data engineering team because you don't need this kind of skills.

And and you can still still do it.

So for for me, this this is the biggest trend. Yeah? Like, the tools you can use to generate code. And, of course, it's not a pity code, might have some bugs, might be inefficient, but

doesn't really matter if it's can it allows you to do something that was not possible to do for you.

So

this was the biggest change

and, the biggest shift in the responsibilities

because now you have more responsibilities

in a sense,

because you can do everything.

Okay. Almost.

But

but you can still do it. Yeah. It's not that you got a responsibility

and you have not you're capable of doing it. Can you get it work?

As far as the skills and capabilities

for these engineers who are tasked with supporting

AI applications,

working with some of these vector databases, document embeddings,

getting involved in data collection for fine tuning datasets,

all of the various pieces that come into

supporting these applications.

What are some of the common skills gaps that you see or that teams should be aware of and watching out for and identifying opportunities for training on? So there is a huge,

huge gap between copy pasting some code to get it

work from some tutorial. They will have your first version of a chatbot

and making it work in production and not be ashamed of the result.

And it's not even

that much of the engineering work

as,

realizing that

you're

mostly like every other software. Your

software is going to only to to be as good as your test. And so if

you cannot test it and you cannot prove that it works, then it probably doesn't. And in case of the native AI, this testing might be very extensive because you have the entire workflow. We have the steps.

You have

the examples that you don't really want to see in production when someone is trying

to abuse your tool, but and they will try to abuse your tool.

So,

you have to handle this tool.

So this this is the the skill gap. You know? You can

get it work pretty easy using some old online tutorials

and then

spend months to get it to the

quality level that is required for production.

In your experience of working in this space and working with teams who are building these different AI applications,

building the data feeds that support them, what are some of the most interesting or innovative or unexpected ways that you've seen those teams address these evolving needs of AI systems and be able to support them as they evolve and scale?

K. Maybe not even

the needs of the system, but the way how you build it.

What was

most unexpected for me as a data entry is that

you can make the biggest difference with the user experience, with the UI.

I mean, because people expect that to

see a chat, maybe the summarize with AI button,

and you don't need this. You can just hide it in

the back and show them the the final result. Yeah? One of the projects we have built a reporting to top us just just a page. Yeah. And, you could get data extracted from some online reviews

on this page, and

it didn't scream this is AI based.

It's vast AI based, but you don't have to tell it everyone. Yeah.

Because right now, it

seems to be

the way to market the story. And now now this is the thing with AI. But people don't care, and

a lot of people actually don't like it. So maybe you don't need to show that this is AI based. You just use it and you show them the results, and they don't have to know it.

I I think another interesting

evolution,

particularly for data engineers in terms of the scope of their work, is going back to that

chunking and embedding generation.

The inclusion of ML and AI in the data pipeline itself, I think, think, is another notable shift from

maybe five or ten years ago where it was largely just deterministic processing and transformation

using

coded logic,

where now you're relying on these different AI models

being able

to generate those embeddings, process that content on your behalf,

particularly if you're doing something like generating,

semantic chunks where you actually feed the text through an LLM to summarize before it gets embedded. So I'm wondering how you're seeing some of that

shift in terms of toolset impact data teams who maybe don't have that history already.

Well,

not sure if I have seen a team like this because I've worked with, teams who worked with, natural language processing before.

So they're very used to,

to use something,

for embeddings

and text processing on those embeddings.

So this was not something fucking for those teams.

But, yeah, I I can imagine,

that it might

be might be something new because, for example, I have seen people

replacing OCR software with

models that can recognize data from images. And we have multimodal model

and,

just replace something that

used to be hard,

like, character recognition from menus

with

with a model, and it turns out to be cheaper.

So

there there is something

that might be for some teams, but

I don't think that this is,

as much as it was fucked like it used to be, like, two years ago when people

suddenly realized that this thing exists. And it it was there for some time already,

but

a a lot of people discovered it suddenly. Yeah.

And in your experience

of working in this space,

learning about some of these newer and evolving techniques for building and supporting these AI applications,

what are some of the most interesting or unexpected or challenging lessons that you've learned personally?

Okay. People will get very, very creative at trying to break the filter you build.

As soon after they realize this is this is AI, they will have they they will just have to break it. Someone will come and try to,

try to use it in some way that, it's not supposed to be used. Right?

But in the best case, not really. It's

very bad best case.

They will use it as a free pro,

to track the PTE. And just because you can process the the request,

and you have to be prepared for this.

And, really, if you have a chatbot, then maybe

switching it off when you detect that it's getting abused is not a bad idea.

You might laugh at at this,

this approach, but

it it this is something you might consider because, otherwise,

at the

k. Maybe it's not the best case, but not such a bad case when you

become

a topic of, memes on the Internet. I have a screen for from your chatbot, and people are laughing at it. And it's bad, but it could be worse.

And people will try to break it. That's just

not not something that you have seen before with any other app.

There are people who break apps for fun, but

the way way more of when you start to use it.

And as you continue to

invest in your own knowledge and work with the teams that you're involved with and just try to stay abreast of what's happening in the industry.

What are some of the emerging trends that you are paying particular attention to and investing your own learning efforts into?

I have just discovered

prompt comparison. Apparently, it is possible to use

a a model,

generative AI model,

to transform the prompt that you that's for another model, make it shorter,

with useless, fewer tokens,

but still get similar or the same performance. And, I really got interested in that. I cannot

say much about it yet because I have not learned enough,

but

I didn't know if it's possible, and I discovered, like, a week or two ago.

Yeah. That's,

really something I

want to

spend some time working on because

looks cool. Yeah. It makes

the makes the call strippers, first of all.

Then

maybe you can feed more data in your prompt

so you have bigger context.

And that just

sounds cool. Yeah. Just converted your prompt, compressed it, and it works the same.

So just just for this those three reasons, the this is the thing that I I want to

take a look. And,

when I learn

enough about it, I will probably write a blog post of Asifuall.

So may maybe

I can find it later.

And so far, what I have found is this, LLM lingua library from Microsoft.

I think I got to the name right.

So yeah.

And this is the

maybe not a trend, but

some area

of interest.

And are there any other aspects of

data engineering requirements around AI applications

and just supporting these applications and the data that they consume and produce that we didn't discuss yet that you'd like to cover before we close out the show?

Maybe one thing, you don't have to support every,

input.

You can just choose what what the tool is supposed to do and maybe,

cluster the data, do some topic modeling, decide if people are already asking the questions you expect,

to see.

And if they don't,

then maybe just filter it out. I mean, it doesn't have to do everything. It's not general. If it's not general purpose app, then

just decide if this is the thing you want to support.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling,

or technology that's available for data management today, maybe in particular, and how it relates to supporting AI apps.

Okay. We have a lot of tools for,

evaluation, like monitoring

or,

just doing evaluation testing.

And,

it's not really a gap in the tooling because I think we have already too many.

But they're kind of,

trying to do everything,

and,

I

think

we need some consolidation.

I would like to have one tool for this.

Like, it will do everything, but at least do it in

some way the

creators of the tool choose to to do it,

because,

right now, you have to start try to do everything, but they really don't, and we need several of them.

The documentation is usually,

let's say, politely lagging.

Most likely not even existing.

So,

yeah,

I I would love to see a tool that

just gets the job done.

May it may have some

opinions

about how to do it. I might need to adjust my code to do it. It's fine.

Just

I I don't want I don't need three tools for for everything.

So this is the gap that I see right now.

Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience

of the

data requirements around these AI applications and some of the ways that it's shifting the responsibilities

and the tooling and the work required for data engineers and MLOps engineers. So appreciate the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Bye bye.

Thank you for listening, and don't forget to check out our other shows. Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and and

tell

your

friends

and

coworkers.