Summary
In this episode of the Data Engineering Podcast Bartosz Mikulski talks about preparing data for AI applications. Bartosz shares his journey from data engineering to MLOps and emphasizes the importance of data testing over software development in AI contexts. He discusses the types of data assets required for AI applications, including extensive test datasets, especially in generative AI, and explains the differences in data requirements for various AI application styles. The conversation also explores the skills data engineers need to transition into AI, such as familiarity with vector databases and new data modeling strategies, and highlights the challenges of evolving AI applications, including frequent reprocessing of data when changing chunking strategies or embedding models.
Announcements
Parting Question
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
In this episode of the Data Engineering Podcast Bartosz Mikulski talks about preparing data for AI applications. Bartosz shares his journey from data engineering to MLOps and emphasizes the importance of data testing over software development in AI contexts. He discusses the types of data assets required for AI applications, including extensive test datasets, especially in generative AI, and explains the differences in data requirements for various AI application styles. The conversation also explores the skills data engineers need to transition into AI, such as familiarity with vector databases and new data modeling strategies, and highlights the challenges of evolving AI applications, including frequent reprocessing of data when changing chunking strategies or embedding models.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Bartosz Mikulski about how to prepare data for use in AI applications
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining some of the main categories of data assets that are needed for AI applications?
- How does the nature of the application change those requirements? (e.g. RAG app vs. agent, etc.)
- How do the different assets map to the stages of the application lifecycle?
- What are some of the common roles and divisions of responsibility that you see in the construction and operation of a "typical" AI application?
- For data engineers who are used to data warehousing/BI, what are the skills that map to AI apps?
- What are some of the data modeling patterns that are needed to support AI apps?
- chunking strategies
- metadata management
- What are the new categories of data that data engineers need to manage in the context of AI applications?
- agent memory generation/evolution
- conversation history management
- data collection for fine tuning
- What are some of the notable evolutions in the space of AI applications and their patterns that have happened in the past ~1-2 years that relate to the responsibilities of data engineers?
- What are some of the skills gaps that teams should be aware of and identify training opportunities for?
- What are the most interesting, innovative, or unexpected ways that you have seen data teams address the needs of AI applications?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI applications and their reliance on data?
- What are some of the emerging trends that you are paying particular attention to?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
[00:00:48] Tobias Macey:
Your host is Tobias Macy, and today I'm interviewing Bartov Mikulski about how to prepare data for use in AI applications. So Bartov, can you start by introducing yourself?
[00:00:57] Bartosz Mikulski:
So I'm Bartov. I'm an MLOps engineer. I've been working as a data engineer for some time, then I switched to MLOps. And, along the way, I realized that that the real problem is not really the software that if you do, maybe the data is more important, like, not even the data that you process, but the way how to test it. And this kind of applies to AI also. So this was the the way how how it went from data engineering to AI.
[00:01:31] Tobias Macey:
And do you remember how you first got started working in the space of data and AI and ML?
[00:01:36] Bartosz Mikulski:
Kind of by accident. I mean, we had some data engineering work, to to do in the back end project, and I liked it. So then I stayed in data engineering. And, then along the way, I got interested in, mapping, but I was never very good at training those models. But somehow, I was better to deploy them and keep it running so I got involved in MLOps, and that I just stayed in this idea.
[00:02:09] Tobias Macey:
And so now that a lot of the industry has started moving into the space of generative AI and building AI applications, obviously, there are a lot of data requirements that go with that. But I'm wondering if you could just start by outlining some of the main categories of the types of data assets that are needed specifically for AI applications and some of the ways that that maybe differs from the, I guess, traditional, I'll I'll say, data assets that engineering teams are used to working with.
[00:02:40] Bartosz Mikulski:
Okay. So first of all, this is machine learning, so it doesn't really differ that much. You need some test dataset. In the case of generative AI, we call it evaluation dataset because, you know, we need fancy naming. But this is your test, dataset, and you will use it to verify if this thing whatever you are building works works correctly. You don't need the training data set that much though unless you are going to fine tune something. So it should be easier, to get the required data. Although, then quickly you realize that if you are building something that involves multiple steps, like, you do one call and you retrieve some data, you do another call to do your AI model, you need the tray and the testing datasets for all of the steps separately because you will be testing them separately and to figure out what doesn't work because some somehow, always something doesn't work well, and you have to figure out what it is. Yeah.
So you will have a lot of test data, and, this is the the data asset that you would need to gather somehow. You can well, you you can generate it to some extent, but at some point, you will have to start getting the real data.
[00:04:08] Tobias Macey:
And in this space of generative AI applications, there are a few different styles that have been emerging. RAG was kind of the first one once we moved past the initial phase of just prompt engineering. And now we've moved into these agentic architectures, and there are a few different, styles of AI apps that have been coming up. Another one being graph rag where you incorporate knowledge graphs. And I'm wondering how the particular type of application changes the requirements around the types of data assets that are available to those AI applications for feeding into the models or or for storing some of the outputs or metrics around the model generation itself.
[00:04:52] Bartosz Mikulski:
Okay. So besides the data in the database, obviously, what what you need is, as we discussed, the evaluation dataset. So in all of those cases, it looks a little bit different. It's It's because for a rack, you have the user question, and then the AI is calling some service, let's say, database with some query. So it needs a test set that consists of at least two things, the input from the user and the query you want to send or multiple queries. And then you check if this really happened or if the query that was sent is similar enough to what you expect. Then you get the response, and you have to generate the answers. And this needs it's separate, dataset for testing because this is another step, and it it can fail too.
And then, of course, you will test it as a entire workflow. So you have to be prepared for this also. And this is was one interaction, really. Yeah. You receive something and you generate a dancer. And if you have multiple steps, you will have to multiply your datasets. And for the agents, it gets even funnier because, you have no control of of the of the process. Well, I have some control over the process, but the agent can choose a tool and choose the parameters for the tool. And then your dataset has to contain the queries and the tools that you want to use and the parameters you would expect to see when this query is sent.
So, basically, keep multiplying the test datasets.
[00:06:38] Tobias Macey:
In terms of the areas of responsibility for what the role looks like and who is responsible for what pieces of the life cycle of the application and the different data that gets fed into or retrieved from those different stages, I'm wondering how you're seeing that breakdown in terms of different organizations and how that maybe is influenced by the either size and scale of the organization or the type of application or use case that they're powering.
[00:07:08] Bartosz Mikulski:
K. So as a freelance AI engineer, I can say that everything that is even remotely related to AI is responsibility of the AI engineer, but it doesn't have to be this way. So we already had some setup, yeah, because you probably had some ML models. So it can stay this way. You have the data and training into this gathering the data and maybe cleaning it. You have the data scientist. In this case, we'll write the prompts and do the experimentation on the prompts. You might have the MLOps team deploying it. In this case, deployment is really just changing the prompt unless you use the open source modem, then you have to redeploy something.
But on the other hand, the step when you are getting the production data is more, work intensive because you have those intermediate calls to the to the model. So in this case, the envelope steam is still required. So I think it doesn't today, it doesn't change that much. It does requires maybe something that we are not used to. We're just working with text on both ends of the the model. So just not feeding it text, but also getting the text from it.
[00:08:35] Tobias Macey:
And for data engineering teams in particular who are used to working with more structured datasets, doing something along the lines of data warehousing, business intelligence, or maybe even feeding some of those curated datasets back into application contexts. What are some of the types of skills that transfer well to this world of unstructured data and preparing it for AI applications, particularly working with things like vector databases? And what are some of the skills that need to be acquired for people in that situation so that they can more effectively work with and support the MLOps and AI engineer teams?
[00:09:16] Bartosz Mikulski:
Okay. So if you outline the process in detail, you will always find something that we are right now. So if you do calls to the database, you probably know the query language for the database, whatever it is. If it's SQL or any other thing, you you might know it already. So this is a skill that you can just use. In the other areas okay. Maybe vector databases might be kind of new for you. So this, the data entry team might need to learn because it is like a normal database when you are inserting data to into it. So it's maybe not that relevant, but when you are receiving, it's a little bit surprising at first.
So the matching of of documents. The other things that you can transfer, I think, the entire machine learning process, like the deployment, follow deployments, AB testing, testing in general, experimentation, this doesn't change. Change the tool, but the process stays the same. So people already know a lot that they need to know when they use generative AI. Maybe they just don't realize it yet.
[00:10:38] Tobias Macey:
On that vector database side, they take a number of different forms where you have document oriented vector databases in the shape of things like Qdren. You have pure vector databases, sort of like Pinecone, and then you have vector add ons for relational databases like PG vector and postgres, as well as a whole slew of other formulations of vector storage in various contexts. And I'm wondering how the inclusion of vectors as a data type and as a core asset that is consumed and produced by these AI applications changes some of the ways that teams need to think about data modeling, in particular around things like trunking strategies, metadata management.
What are the pieces of information that you want to strip out before you run it through an embedding model? What are some of the pieces of information that are actually useful for putting into the embedding model? I know that, for instance, HTML, there have been conversations about whether to keep or strip out the tags, whether they're helpful or harmful, and just some of those types of, you know, tactical elements of building these data assets that teams need to be thinking about and trained up on.
[00:11:53] Bartosz Mikulski:
Okay. So trunking is definitely something new for data engineers, and you just have to get used to this. So from the strategy start that you have basically, you have to remember that you will probably have to chunk the documents because they will not fit in the context window of the model. Even if they do, that might be too expensive to use it this way. So even though the best way might be possibly, we will have to test this always to always send the entire document to the model. But in reality, you will chunk it. So you have several ways to do it. You can just decide that there's a fixed size of the trunk.
So let's say, for the sake of the example, 500 characters, and then you just cut the text every 500 characters, maybe too small a number I think this too small, but just for example. And then you start to build on top of this idea because you probably don't want to, cut it in the middle of the word. So you might have some tracking started with that. Okay. It's 500. But if there's in the middle of the word, we do it a little bit earlier. And then you realize, okay. That's not amid the middle of the word, but still in the middle of the sentence. And then you go back. Yeah. It's not in the middle of the sentence, but maybe this inside of the paragraph. And so, basically, just invented the recursive trunking strategy. You are cutting the text when it makes sense to preserve as bigger chunk of as you can. But if you can't, then you just resolve just just stay with the trunking of, in the middle of the word.
But, still, it might not be enough because, possibly, you can just be unlucky. Yeah. So, the sentence that you need might end up in the other trunk. So then we added overlapping, trunking strategies to take some part of another trunk. You not even considered it, like, to be the part of the trunk that you want, but you just overlap with another thing. And you have duplicates, but it's it's supposed to help you find the the relevant information. But still, it's, may not be enough because, sometimes when you write the document, you have first description of the problem, like a few paragraphs, and then you start writing the description of the solution. And your tracking extension might perfectly, allow you to find the description of the problem, but you are not interested in the problem. Or even know what is the problem. You have it. You want the solution.
And it's somehow another trunk that was not matched. So then you can use some we got parent document ready, but when you match by chunks, but you get the entire document. And you can still build on top of those those ideas because sometimes out of strength the topic in the middle of the text. And you can use something called semantic trunking. So use use the generative AI model to tell you where this trunk ends. So from the basic idea of cutting the text, at some point, you can build a lot of, more advanced, techniques. And then you realize that if you have document, the the trunk that you want to match and you match it by the query from the user, you are not really matching the same things. You have the answer and the question, and that's supposed to be similar.
But maybe you would be looking for an answer that is similar to some other answer. So you just invented the hypothetical document embeddings, but you are generating the, fake, it's not fake, answer to the user question. But you hope that the language vocabulary in this fake answer is similar to the actual answer. So you can keep adding new things, and then you have metadata that you can use to narrow the, space of the vectors that you have to, set. But this, this is not something you can retrofit into the pipeline because you have to start those metadata fields.
So if you start thinking of metadata, you have to go back to data engineering and just add them. And this might require a lot of work done again that you have done already. But this is what it is. Yeah.
[00:16:33] Tobias Macey:
Another divergence from what data engineers are typically used to in the context of these embeddings and vector databases is that there's not a lot of opportunity for being able to do sort of a backfill or an incremental reload, at least in the case where you need to change your chunking strategy or change your embedding model. You need to effectively rerun all of the data every time whenever you make a change of that nature versus just I need to add a new document to the database using all of the same parameters. Whereas in more structured data contexts, you can either mutate the data in place or, you know, append to it without necessarily having to do as drastic of a rebuild.
And given the fact that you might be dealing with large volumes of data, it likely brings in requirements of more complex or more sophisticated parallel processing. And I'm wondering how you're seeing some of those requirements change the tool sets or platform capabilities that engineering teams need to incorporate and invest in to be able to support these embedding experimentation and being able to evolve embeddings over time as new embedding models come up or as they need to change the trunking strategies or etcetera?
[00:18:03] Bartosz Mikulski:
K. I'm I think this is not solved yet. At least, I'm not not I'm not aware of any solution to this as of now. So for now, what what what I have been doing is just creating new collections of, data, with the different trunk strategies, trunking strategies, and using those. And, of course, you have to ingest them again, and it takes time, and you have to process them. And if you use some SaaS embedding model, you pay for the embeddings all the time when you do it. So this is the problematic part, and I'm not aware of any solutions. Maybe definitely someone's working on them, and I would love to hear about it. But, I don't know it.
[00:18:58] Tobias Macey:
Yeah. In particular, I imagine that teams who are doing sort of the traditional extract transform and load or extract load and transform workflows for filling their data warehouse, whatever batch or even streaming tools they're using to do that likely aren't going to be able to, provide the timeliness or scalability that they would need for doing massive reprocessing of all of the documents for regenerating embeddings, which likely pushes them into adopting something like a spark or array where maybe they didn't already have that as part of their infrastructure.
[00:19:35] Bartosz Mikulski:
Yep. And, then you have to, explain the engineers that they have done something, and it was perfect, but we need something else, which is probably not not not the thing to want to say to people rather often. Yeah. But this is what it is. It will be great to have a solution, but I think we don't have it yet.
[00:20:04] Tobias Macey:
And beyond the embeddings, as you move into some of the more sophisticated AI applications where maybe you need to incorporate something like long and short term memory for a chatbot or an agent style application. You also have the management of conversational history and responses and maybe also, additional data collection to support fine tuning of some of those models. How does that introduce new requirements and new workload capabilities to data engineering and ML ops teams to be able to support those types of applications.
[00:20:44] Bartosz Mikulski:
Okay. So I'm not that familiar with agent memory, so let's maybe focus on fine tuning. So first of all, like in the machine learning classical machine learning, you need the training data set, and it consists of the input and out. This is pretty obvious. But, gathering the quality output gets sticky because, you can get the data from the chat. For example, if you're building a chatbot, from from the chat, and assume that if nobody is complaining about it, then it's probably correct. But, this is not the case because people might just stop using this tool if they are not satisfied. It doesn't mean that, like, someone doesn't didn't bother to click the button that they don't like it, then which means that they have liked the the output and you can use it.
What's even worse, how you are going to get the correction? You got something wrong. The person who is using your app is not satisfied. They click the feedback button, but they don't like it. And now to show them the message that okay. So write down what you want to get instead. So if you wanted to have a helpful tool and it already disappointed you, and now you also got a homework. So this is this is not going to work, this way. So, sadly, what you are going to need, in all of the cases, you might get away with getting the data from the user, but you will need some data levels. So someone who can just write the okay. Inputs you can get from the actual user, but the output that you expect, you need someone who know what is the output and who can write them down and what is the sad part.
In most cases, that person might be you. So I will be writing this because there's no no one else was going to to do it. Yeah. But you need the data, and it's not not going to appear magically from nowhere.
[00:22:53] Tobias Macey:
Because of the fact that the overall space of generative AI applications and the different ways that these large language models are being incorporated into different application architectures, thinking in terms again of things like agents versus straight workflows versus just a back and forth chatbot and even just going from single turn to multi turn. How has that evolution of capabilities and use cases changed the types of work and responsibilities for data engineers and ML ops engineers over that period? And maybe what are some of the ways that you are forecasting those changes to continue as we go forward?
[00:23:38] Bartosz Mikulski:
K. The generative AI by by itself was a big evolution in the AI space, but, I think it wasn't the biggest, because for me, the biggest trend was the coding tools. Like, Cursor, before it was, of course, a GitHub Copilot. But it it could finish the current line, but it wasn't reduced. Okay. It was useful. But, in comparison to Cursor, it's almost nothing. And, I think this is the the biggest, trend in responsibilities because now, I, as a data engineer, I I can do front end now. This may not be the prettiest front end, but but I can do it. Yeah. I can I can I can make it work? So, with this tool, you can have teams who are really almost like full stacks in everything.
You might specialize in data engineering still, but but you can do other work. And, this really makes a lot of things possible, and you don't need to involve someone from other team when you are just building something, maybe not even internally, like, for the real long team, but for just building something, might be a prototype that is good enough to fall to other other people, and you don't need to involve, the person from from another team, like a front end engineer who probably is not on your data engineering team because you don't need this kind of skills. And and you can still still do it.
So for for me, this this is the biggest trend. Yeah? Like, the tools you can use to generate code. And, of course, it's not a pity code, might have some bugs, might be inefficient, but doesn't really matter if it's can it allows you to do something that was not possible to do for you. So this was the biggest change and, the biggest shift in the responsibilities because now you have more responsibilities in a sense, because you can do everything. Okay. Almost. But but you can still do it. Yeah. It's not that you got a responsibility and you have not you're capable of doing it. Can you get it work?
[00:25:57] Tobias Macey:
As far as the skills and capabilities for these engineers who are tasked with supporting AI applications, working with some of these vector databases, document embeddings, getting involved in data collection for fine tuning datasets, all of the various pieces that come into supporting these applications. What are some of the common skills gaps that you see or that teams should be aware of and watching out for and identifying opportunities for training on? So there is a huge,
[00:26:30] Bartosz Mikulski:
huge gap between copy pasting some code to get it work from some tutorial. They will have your first version of a chatbot and making it work in production and not be ashamed of the result. And it's not even that much of the engineering work as, realizing that you're mostly like every other software. Your software is going to only to to be as good as your test. And so if you cannot test it and you cannot prove that it works, then it probably doesn't. And in case of the native AI, this testing might be very extensive because you have the entire workflow. We have the steps. You have the examples that you don't really want to see in production when someone is trying to abuse your tool, but and they will try to abuse your tool.
So, you have to handle this tool. So this this is the the skill gap. You know? You can get it work pretty easy using some old online tutorials and then spend months to get it to the quality level that is required for production.
[00:27:42] Tobias Macey:
In your experience of working in this space and working with teams who are building these different AI applications, building the data feeds that support them, what are some of the most interesting or innovative or unexpected ways that you've seen those teams address these evolving needs of AI systems and be able to support them as they evolve and scale?
[00:28:04] Bartosz Mikulski:
K. Maybe not even the needs of the system, but the way how you build it. What was most unexpected for me as a data entry is that you can make the biggest difference with the user experience, with the UI. I mean, because people expect that to see a chat, maybe the summarize with AI button, and you don't need this. You can just hide it in the back and show them the the final result. Yeah? One of the projects we have built a reporting to top us just just a page. Yeah. And, you could get data extracted from some online reviews on this page, and it didn't scream this is AI based.
It's vast AI based, but you don't have to tell it everyone. Yeah. Because right now, it seems to be the way to market the story. And now now this is the thing with AI. But people don't care, and a lot of people actually don't like it. So maybe you don't need to show that this is AI based. You just use it and you show them the results, and they don't have to know it.
[00:29:16] Tobias Macey:
I I think another interesting evolution, particularly for data engineers in terms of the scope of their work, is going back to that chunking and embedding generation. The inclusion of ML and AI in the data pipeline itself, I think, think, is another notable shift from maybe five or ten years ago where it was largely just deterministic processing and transformation using coded logic, where now you're relying on these different AI models being able to generate those embeddings, process that content on your behalf, particularly if you're doing something like generating, semantic chunks where you actually feed the text through an LLM to summarize before it gets embedded. So I'm wondering how you're seeing some of that shift in terms of toolset impact data teams who maybe don't have that history already.
[00:30:14] Bartosz Mikulski:
Well, not sure if I have seen a team like this because I've worked with, teams who worked with, natural language processing before. So they're very used to, to use something, for embeddings and text processing on those embeddings. So this was not something fucking for those teams. But, yeah, I I can imagine, that it might be might be something new because, for example, I have seen people replacing OCR software with models that can recognize data from images. And we have multimodal model and, just replace something that used to be hard, like, character recognition from menus with with a model, and it turns out to be cheaper.
So there there is something that might be for some teams, but I don't think that this is, as much as it was fucked like it used to be, like, two years ago when people suddenly realized that this thing exists. And it it was there for some time already, but a a lot of people discovered it suddenly. Yeah.
[00:31:33] Tobias Macey:
And in your experience of working in this space, learning about some of these newer and evolving techniques for building and supporting these AI applications, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:31:49] Bartosz Mikulski:
Okay. People will get very, very creative at trying to break the filter you build. As soon after they realize this is this is AI, they will have they they will just have to break it. Someone will come and try to, try to use it in some way that, it's not supposed to be used. Right? But in the best case, not really. It's very bad best case. They will use it as a free pro, to track the PTE. And just because you can process the the request, and you have to be prepared for this. And, really, if you have a chatbot, then maybe switching it off when you detect that it's getting abused is not a bad idea.
You might laugh at at this, this approach, but it it this is something you might consider because, otherwise, at the k. Maybe it's not the best case, but not such a bad case when you become a topic of, memes on the Internet. I have a screen for from your chatbot, and people are laughing at it. And it's bad, but it could be worse. And people will try to break it. That's just not not something that you have seen before with any other app. There are people who break apps for fun, but the way way more of when you start to use it.
[00:33:16] Tobias Macey:
And as you continue to invest in your own knowledge and work with the teams that you're involved with and just try to stay abreast of what's happening in the industry. What are some of the emerging trends that you are paying particular attention to and investing your own learning efforts into?
[00:33:36] Bartosz Mikulski:
I have just discovered prompt comparison. Apparently, it is possible to use a a model, generative AI model, to transform the prompt that you that's for another model, make it shorter, with useless, fewer tokens, but still get similar or the same performance. And, I really got interested in that. I cannot say much about it yet because I have not learned enough, but I didn't know if it's possible, and I discovered, like, a week or two ago. Yeah. That's, really something I want to spend some time working on because looks cool. Yeah. It makes the makes the call strippers, first of all.
Then maybe you can feed more data in your prompt so you have bigger context. And that just sounds cool. Yeah. Just converted your prompt, compressed it, and it works the same. So just just for this those three reasons, the this is the thing that I I want to take a look. And, when I learn enough about it, I will probably write a blog post of Asifuall. So may maybe I can find it later. And so far, what I have found is this, LLM lingua library from Microsoft. I think I got to the name right. So yeah. And this is the maybe not a trend, but some area of interest.
[00:35:13] Tobias Macey:
And are there any other aspects of data engineering requirements around AI applications and just supporting these applications and the data that they consume and produce that we didn't discuss yet that you'd like to cover before we close out the show?
[00:35:28] Bartosz Mikulski:
Maybe one thing, you don't have to support every, input. You can just choose what what the tool is supposed to do and maybe, cluster the data, do some topic modeling, decide if people are already asking the questions you expect, to see. And if they don't, then maybe just filter it out. I mean, it doesn't have to do everything. It's not general. If it's not general purpose app, then just decide if this is the thing you want to support.
[00:36:01] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling, or technology that's available for data management today, maybe in particular, and how it relates to supporting AI apps.
[00:36:21] Bartosz Mikulski:
Okay. We have a lot of tools for, evaluation, like monitoring or, just doing evaluation testing. And, it's not really a gap in the tooling because I think we have already too many. But they're kind of, trying to do everything, and, I think we need some consolidation. I would like to have one tool for this. Like, it will do everything, but at least do it in some way the creators of the tool choose to to do it, because, right now, you have to start try to do everything, but they really don't, and we need several of them. The documentation is usually, let's say, politely lagging.
Most likely not even existing. So, yeah, I I would love to see a tool that just gets the job done. May it may have some opinions about how to do it. I might need to adjust my code to do it. It's fine. Just I I don't want I don't need three tools for for everything. So this is the gap that I see right now.
[00:37:46] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience of the data requirements around these AI applications and some of the ways that it's shifting the responsibilities and the tooling and the work required for data engineers and MLOps engineers. So appreciate the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Bye bye.
[00:38:17] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.net covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and and tell
[00:38:50] Tobias Macey:
your friends
[00:38:53] Bartosz Mikulski:
and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
[00:00:48] Tobias Macey:
Your host is Tobias Macy, and today I'm interviewing Bartov Mikulski about how to prepare data for use in AI applications. So Bartov, can you start by introducing yourself?
[00:00:57] Bartosz Mikulski:
So I'm Bartov. I'm an MLOps engineer. I've been working as a data engineer for some time, then I switched to MLOps. And, along the way, I realized that that the real problem is not really the software that if you do, maybe the data is more important, like, not even the data that you process, but the way how to test it. And this kind of applies to AI also. So this was the the way how how it went from data engineering to AI.
[00:01:31] Tobias Macey:
And do you remember how you first got started working in the space of data and AI and ML?
[00:01:36] Bartosz Mikulski:
Kind of by accident. I mean, we had some data engineering work, to to do in the back end project, and I liked it. So then I stayed in data engineering. And, then along the way, I got interested in, mapping, but I was never very good at training those models. But somehow, I was better to deploy them and keep it running so I got involved in MLOps, and that I just stayed in this idea.
[00:02:09] Tobias Macey:
And so now that a lot of the industry has started moving into the space of generative AI and building AI applications, obviously, there are a lot of data requirements that go with that. But I'm wondering if you could just start by outlining some of the main categories of the types of data assets that are needed specifically for AI applications and some of the ways that that maybe differs from the, I guess, traditional, I'll I'll say, data assets that engineering teams are used to working with.
[00:02:40] Bartosz Mikulski:
Okay. So first of all, this is machine learning, so it doesn't really differ that much. You need some test dataset. In the case of generative AI, we call it evaluation dataset because, you know, we need fancy naming. But this is your test, dataset, and you will use it to verify if this thing whatever you are building works works correctly. You don't need the training data set that much though unless you are going to fine tune something. So it should be easier, to get the required data. Although, then quickly you realize that if you are building something that involves multiple steps, like, you do one call and you retrieve some data, you do another call to do your AI model, you need the tray and the testing datasets for all of the steps separately because you will be testing them separately and to figure out what doesn't work because some somehow, always something doesn't work well, and you have to figure out what it is. Yeah.
So you will have a lot of test data, and, this is the the data asset that you would need to gather somehow. You can well, you you can generate it to some extent, but at some point, you will have to start getting the real data.
[00:04:08] Tobias Macey:
And in this space of generative AI applications, there are a few different styles that have been emerging. RAG was kind of the first one once we moved past the initial phase of just prompt engineering. And now we've moved into these agentic architectures, and there are a few different, styles of AI apps that have been coming up. Another one being graph rag where you incorporate knowledge graphs. And I'm wondering how the particular type of application changes the requirements around the types of data assets that are available to those AI applications for feeding into the models or or for storing some of the outputs or metrics around the model generation itself.
[00:04:52] Bartosz Mikulski:
Okay. So besides the data in the database, obviously, what what you need is, as we discussed, the evaluation dataset. So in all of those cases, it looks a little bit different. It's It's because for a rack, you have the user question, and then the AI is calling some service, let's say, database with some query. So it needs a test set that consists of at least two things, the input from the user and the query you want to send or multiple queries. And then you check if this really happened or if the query that was sent is similar enough to what you expect. Then you get the response, and you have to generate the answers. And this needs it's separate, dataset for testing because this is another step, and it it can fail too.
And then, of course, you will test it as a entire workflow. So you have to be prepared for this also. And this is was one interaction, really. Yeah. You receive something and you generate a dancer. And if you have multiple steps, you will have to multiply your datasets. And for the agents, it gets even funnier because, you have no control of of the of the process. Well, I have some control over the process, but the agent can choose a tool and choose the parameters for the tool. And then your dataset has to contain the queries and the tools that you want to use and the parameters you would expect to see when this query is sent.
So, basically, keep multiplying the test datasets.
[00:06:38] Tobias Macey:
In terms of the areas of responsibility for what the role looks like and who is responsible for what pieces of the life cycle of the application and the different data that gets fed into or retrieved from those different stages, I'm wondering how you're seeing that breakdown in terms of different organizations and how that maybe is influenced by the either size and scale of the organization or the type of application or use case that they're powering.
[00:07:08] Bartosz Mikulski:
K. So as a freelance AI engineer, I can say that everything that is even remotely related to AI is responsibility of the AI engineer, but it doesn't have to be this way. So we already had some setup, yeah, because you probably had some ML models. So it can stay this way. You have the data and training into this gathering the data and maybe cleaning it. You have the data scientist. In this case, we'll write the prompts and do the experimentation on the prompts. You might have the MLOps team deploying it. In this case, deployment is really just changing the prompt unless you use the open source modem, then you have to redeploy something.
But on the other hand, the step when you are getting the production data is more, work intensive because you have those intermediate calls to the to the model. So in this case, the envelope steam is still required. So I think it doesn't today, it doesn't change that much. It does requires maybe something that we are not used to. We're just working with text on both ends of the the model. So just not feeding it text, but also getting the text from it.
[00:08:35] Tobias Macey:
And for data engineering teams in particular who are used to working with more structured datasets, doing something along the lines of data warehousing, business intelligence, or maybe even feeding some of those curated datasets back into application contexts. What are some of the types of skills that transfer well to this world of unstructured data and preparing it for AI applications, particularly working with things like vector databases? And what are some of the skills that need to be acquired for people in that situation so that they can more effectively work with and support the MLOps and AI engineer teams?
[00:09:16] Bartosz Mikulski:
Okay. So if you outline the process in detail, you will always find something that we are right now. So if you do calls to the database, you probably know the query language for the database, whatever it is. If it's SQL or any other thing, you you might know it already. So this is a skill that you can just use. In the other areas okay. Maybe vector databases might be kind of new for you. So this, the data entry team might need to learn because it is like a normal database when you are inserting data to into it. So it's maybe not that relevant, but when you are receiving, it's a little bit surprising at first.
So the matching of of documents. The other things that you can transfer, I think, the entire machine learning process, like the deployment, follow deployments, AB testing, testing in general, experimentation, this doesn't change. Change the tool, but the process stays the same. So people already know a lot that they need to know when they use generative AI. Maybe they just don't realize it yet.
[00:10:38] Tobias Macey:
On that vector database side, they take a number of different forms where you have document oriented vector databases in the shape of things like Qdren. You have pure vector databases, sort of like Pinecone, and then you have vector add ons for relational databases like PG vector and postgres, as well as a whole slew of other formulations of vector storage in various contexts. And I'm wondering how the inclusion of vectors as a data type and as a core asset that is consumed and produced by these AI applications changes some of the ways that teams need to think about data modeling, in particular around things like trunking strategies, metadata management.
What are the pieces of information that you want to strip out before you run it through an embedding model? What are some of the pieces of information that are actually useful for putting into the embedding model? I know that, for instance, HTML, there have been conversations about whether to keep or strip out the tags, whether they're helpful or harmful, and just some of those types of, you know, tactical elements of building these data assets that teams need to be thinking about and trained up on.
[00:11:53] Bartosz Mikulski:
Okay. So trunking is definitely something new for data engineers, and you just have to get used to this. So from the strategy start that you have basically, you have to remember that you will probably have to chunk the documents because they will not fit in the context window of the model. Even if they do, that might be too expensive to use it this way. So even though the best way might be possibly, we will have to test this always to always send the entire document to the model. But in reality, you will chunk it. So you have several ways to do it. You can just decide that there's a fixed size of the trunk.
So let's say, for the sake of the example, 500 characters, and then you just cut the text every 500 characters, maybe too small a number I think this too small, but just for example. And then you start to build on top of this idea because you probably don't want to, cut it in the middle of the word. So you might have some tracking started with that. Okay. It's 500. But if there's in the middle of the word, we do it a little bit earlier. And then you realize, okay. That's not amid the middle of the word, but still in the middle of the sentence. And then you go back. Yeah. It's not in the middle of the sentence, but maybe this inside of the paragraph. And so, basically, just invented the recursive trunking strategy. You are cutting the text when it makes sense to preserve as bigger chunk of as you can. But if you can't, then you just resolve just just stay with the trunking of, in the middle of the word.
But, still, it might not be enough because, possibly, you can just be unlucky. Yeah. So, the sentence that you need might end up in the other trunk. So then we added overlapping, trunking strategies to take some part of another trunk. You not even considered it, like, to be the part of the trunk that you want, but you just overlap with another thing. And you have duplicates, but it's it's supposed to help you find the the relevant information. But still, it's, may not be enough because, sometimes when you write the document, you have first description of the problem, like a few paragraphs, and then you start writing the description of the solution. And your tracking extension might perfectly, allow you to find the description of the problem, but you are not interested in the problem. Or even know what is the problem. You have it. You want the solution.
And it's somehow another trunk that was not matched. So then you can use some we got parent document ready, but when you match by chunks, but you get the entire document. And you can still build on top of those those ideas because sometimes out of strength the topic in the middle of the text. And you can use something called semantic trunking. So use use the generative AI model to tell you where this trunk ends. So from the basic idea of cutting the text, at some point, you can build a lot of, more advanced, techniques. And then you realize that if you have document, the the trunk that you want to match and you match it by the query from the user, you are not really matching the same things. You have the answer and the question, and that's supposed to be similar.
But maybe you would be looking for an answer that is similar to some other answer. So you just invented the hypothetical document embeddings, but you are generating the, fake, it's not fake, answer to the user question. But you hope that the language vocabulary in this fake answer is similar to the actual answer. So you can keep adding new things, and then you have metadata that you can use to narrow the, space of the vectors that you have to, set. But this, this is not something you can retrofit into the pipeline because you have to start those metadata fields.
So if you start thinking of metadata, you have to go back to data engineering and just add them. And this might require a lot of work done again that you have done already. But this is what it is. Yeah.
[00:16:33] Tobias Macey:
Another divergence from what data engineers are typically used to in the context of these embeddings and vector databases is that there's not a lot of opportunity for being able to do sort of a backfill or an incremental reload, at least in the case where you need to change your chunking strategy or change your embedding model. You need to effectively rerun all of the data every time whenever you make a change of that nature versus just I need to add a new document to the database using all of the same parameters. Whereas in more structured data contexts, you can either mutate the data in place or, you know, append to it without necessarily having to do as drastic of a rebuild.
And given the fact that you might be dealing with large volumes of data, it likely brings in requirements of more complex or more sophisticated parallel processing. And I'm wondering how you're seeing some of those requirements change the tool sets or platform capabilities that engineering teams need to incorporate and invest in to be able to support these embedding experimentation and being able to evolve embeddings over time as new embedding models come up or as they need to change the trunking strategies or etcetera?
[00:18:03] Bartosz Mikulski:
K. I'm I think this is not solved yet. At least, I'm not not I'm not aware of any solution to this as of now. So for now, what what what I have been doing is just creating new collections of, data, with the different trunk strategies, trunking strategies, and using those. And, of course, you have to ingest them again, and it takes time, and you have to process them. And if you use some SaaS embedding model, you pay for the embeddings all the time when you do it. So this is the problematic part, and I'm not aware of any solutions. Maybe definitely someone's working on them, and I would love to hear about it. But, I don't know it.
[00:18:58] Tobias Macey:
Yeah. In particular, I imagine that teams who are doing sort of the traditional extract transform and load or extract load and transform workflows for filling their data warehouse, whatever batch or even streaming tools they're using to do that likely aren't going to be able to, provide the timeliness or scalability that they would need for doing massive reprocessing of all of the documents for regenerating embeddings, which likely pushes them into adopting something like a spark or array where maybe they didn't already have that as part of their infrastructure.
[00:19:35] Bartosz Mikulski:
Yep. And, then you have to, explain the engineers that they have done something, and it was perfect, but we need something else, which is probably not not not the thing to want to say to people rather often. Yeah. But this is what it is. It will be great to have a solution, but I think we don't have it yet.
[00:20:04] Tobias Macey:
And beyond the embeddings, as you move into some of the more sophisticated AI applications where maybe you need to incorporate something like long and short term memory for a chatbot or an agent style application. You also have the management of conversational history and responses and maybe also, additional data collection to support fine tuning of some of those models. How does that introduce new requirements and new workload capabilities to data engineering and ML ops teams to be able to support those types of applications.
[00:20:44] Bartosz Mikulski:
Okay. So I'm not that familiar with agent memory, so let's maybe focus on fine tuning. So first of all, like in the machine learning classical machine learning, you need the training data set, and it consists of the input and out. This is pretty obvious. But, gathering the quality output gets sticky because, you can get the data from the chat. For example, if you're building a chatbot, from from the chat, and assume that if nobody is complaining about it, then it's probably correct. But, this is not the case because people might just stop using this tool if they are not satisfied. It doesn't mean that, like, someone doesn't didn't bother to click the button that they don't like it, then which means that they have liked the the output and you can use it.
What's even worse, how you are going to get the correction? You got something wrong. The person who is using your app is not satisfied. They click the feedback button, but they don't like it. And now to show them the message that okay. So write down what you want to get instead. So if you wanted to have a helpful tool and it already disappointed you, and now you also got a homework. So this is this is not going to work, this way. So, sadly, what you are going to need, in all of the cases, you might get away with getting the data from the user, but you will need some data levels. So someone who can just write the okay. Inputs you can get from the actual user, but the output that you expect, you need someone who know what is the output and who can write them down and what is the sad part.
In most cases, that person might be you. So I will be writing this because there's no no one else was going to to do it. Yeah. But you need the data, and it's not not going to appear magically from nowhere.
[00:22:53] Tobias Macey:
Because of the fact that the overall space of generative AI applications and the different ways that these large language models are being incorporated into different application architectures, thinking in terms again of things like agents versus straight workflows versus just a back and forth chatbot and even just going from single turn to multi turn. How has that evolution of capabilities and use cases changed the types of work and responsibilities for data engineers and ML ops engineers over that period? And maybe what are some of the ways that you are forecasting those changes to continue as we go forward?
[00:23:38] Bartosz Mikulski:
K. The generative AI by by itself was a big evolution in the AI space, but, I think it wasn't the biggest, because for me, the biggest trend was the coding tools. Like, Cursor, before it was, of course, a GitHub Copilot. But it it could finish the current line, but it wasn't reduced. Okay. It was useful. But, in comparison to Cursor, it's almost nothing. And, I think this is the the biggest, trend in responsibilities because now, I, as a data engineer, I I can do front end now. This may not be the prettiest front end, but but I can do it. Yeah. I can I can I can make it work? So, with this tool, you can have teams who are really almost like full stacks in everything.
You might specialize in data engineering still, but but you can do other work. And, this really makes a lot of things possible, and you don't need to involve someone from other team when you are just building something, maybe not even internally, like, for the real long team, but for just building something, might be a prototype that is good enough to fall to other other people, and you don't need to involve, the person from from another team, like a front end engineer who probably is not on your data engineering team because you don't need this kind of skills. And and you can still still do it.
So for for me, this this is the biggest trend. Yeah? Like, the tools you can use to generate code. And, of course, it's not a pity code, might have some bugs, might be inefficient, but doesn't really matter if it's can it allows you to do something that was not possible to do for you. So this was the biggest change and, the biggest shift in the responsibilities because now you have more responsibilities in a sense, because you can do everything. Okay. Almost. But but you can still do it. Yeah. It's not that you got a responsibility and you have not you're capable of doing it. Can you get it work?
[00:25:57] Tobias Macey:
As far as the skills and capabilities for these engineers who are tasked with supporting AI applications, working with some of these vector databases, document embeddings, getting involved in data collection for fine tuning datasets, all of the various pieces that come into supporting these applications. What are some of the common skills gaps that you see or that teams should be aware of and watching out for and identifying opportunities for training on? So there is a huge,
[00:26:30] Bartosz Mikulski:
huge gap between copy pasting some code to get it work from some tutorial. They will have your first version of a chatbot and making it work in production and not be ashamed of the result. And it's not even that much of the engineering work as, realizing that you're mostly like every other software. Your software is going to only to to be as good as your test. And so if you cannot test it and you cannot prove that it works, then it probably doesn't. And in case of the native AI, this testing might be very extensive because you have the entire workflow. We have the steps. You have the examples that you don't really want to see in production when someone is trying to abuse your tool, but and they will try to abuse your tool.
So, you have to handle this tool. So this this is the the skill gap. You know? You can get it work pretty easy using some old online tutorials and then spend months to get it to the quality level that is required for production.
[00:27:42] Tobias Macey:
In your experience of working in this space and working with teams who are building these different AI applications, building the data feeds that support them, what are some of the most interesting or innovative or unexpected ways that you've seen those teams address these evolving needs of AI systems and be able to support them as they evolve and scale?
[00:28:04] Bartosz Mikulski:
K. Maybe not even the needs of the system, but the way how you build it. What was most unexpected for me as a data entry is that you can make the biggest difference with the user experience, with the UI. I mean, because people expect that to see a chat, maybe the summarize with AI button, and you don't need this. You can just hide it in the back and show them the the final result. Yeah? One of the projects we have built a reporting to top us just just a page. Yeah. And, you could get data extracted from some online reviews on this page, and it didn't scream this is AI based.
It's vast AI based, but you don't have to tell it everyone. Yeah. Because right now, it seems to be the way to market the story. And now now this is the thing with AI. But people don't care, and a lot of people actually don't like it. So maybe you don't need to show that this is AI based. You just use it and you show them the results, and they don't have to know it.
[00:29:16] Tobias Macey:
I I think another interesting evolution, particularly for data engineers in terms of the scope of their work, is going back to that chunking and embedding generation. The inclusion of ML and AI in the data pipeline itself, I think, think, is another notable shift from maybe five or ten years ago where it was largely just deterministic processing and transformation using coded logic, where now you're relying on these different AI models being able to generate those embeddings, process that content on your behalf, particularly if you're doing something like generating, semantic chunks where you actually feed the text through an LLM to summarize before it gets embedded. So I'm wondering how you're seeing some of that shift in terms of toolset impact data teams who maybe don't have that history already.
[00:30:14] Bartosz Mikulski:
Well, not sure if I have seen a team like this because I've worked with, teams who worked with, natural language processing before. So they're very used to, to use something, for embeddings and text processing on those embeddings. So this was not something fucking for those teams. But, yeah, I I can imagine, that it might be might be something new because, for example, I have seen people replacing OCR software with models that can recognize data from images. And we have multimodal model and, just replace something that used to be hard, like, character recognition from menus with with a model, and it turns out to be cheaper.
So there there is something that might be for some teams, but I don't think that this is, as much as it was fucked like it used to be, like, two years ago when people suddenly realized that this thing exists. And it it was there for some time already, but a a lot of people discovered it suddenly. Yeah.
[00:31:33] Tobias Macey:
And in your experience of working in this space, learning about some of these newer and evolving techniques for building and supporting these AI applications, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:31:49] Bartosz Mikulski:
Okay. People will get very, very creative at trying to break the filter you build. As soon after they realize this is this is AI, they will have they they will just have to break it. Someone will come and try to, try to use it in some way that, it's not supposed to be used. Right? But in the best case, not really. It's very bad best case. They will use it as a free pro, to track the PTE. And just because you can process the the request, and you have to be prepared for this. And, really, if you have a chatbot, then maybe switching it off when you detect that it's getting abused is not a bad idea.
You might laugh at at this, this approach, but it it this is something you might consider because, otherwise, at the k. Maybe it's not the best case, but not such a bad case when you become a topic of, memes on the Internet. I have a screen for from your chatbot, and people are laughing at it. And it's bad, but it could be worse. And people will try to break it. That's just not not something that you have seen before with any other app. There are people who break apps for fun, but the way way more of when you start to use it.
[00:33:16] Tobias Macey:
And as you continue to invest in your own knowledge and work with the teams that you're involved with and just try to stay abreast of what's happening in the industry. What are some of the emerging trends that you are paying particular attention to and investing your own learning efforts into?
[00:33:36] Bartosz Mikulski:
I have just discovered prompt comparison. Apparently, it is possible to use a a model, generative AI model, to transform the prompt that you that's for another model, make it shorter, with useless, fewer tokens, but still get similar or the same performance. And, I really got interested in that. I cannot say much about it yet because I have not learned enough, but I didn't know if it's possible, and I discovered, like, a week or two ago. Yeah. That's, really something I want to spend some time working on because looks cool. Yeah. It makes the makes the call strippers, first of all.
Then maybe you can feed more data in your prompt so you have bigger context. And that just sounds cool. Yeah. Just converted your prompt, compressed it, and it works the same. So just just for this those three reasons, the this is the thing that I I want to take a look. And, when I learn enough about it, I will probably write a blog post of Asifuall. So may maybe I can find it later. And so far, what I have found is this, LLM lingua library from Microsoft. I think I got to the name right. So yeah. And this is the maybe not a trend, but some area of interest.
[00:35:13] Tobias Macey:
And are there any other aspects of data engineering requirements around AI applications and just supporting these applications and the data that they consume and produce that we didn't discuss yet that you'd like to cover before we close out the show?
[00:35:28] Bartosz Mikulski:
Maybe one thing, you don't have to support every, input. You can just choose what what the tool is supposed to do and maybe, cluster the data, do some topic modeling, decide if people are already asking the questions you expect, to see. And if they don't, then maybe just filter it out. I mean, it doesn't have to do everything. It's not general. If it's not general purpose app, then just decide if this is the thing you want to support.
[00:36:01] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling, or technology that's available for data management today, maybe in particular, and how it relates to supporting AI apps.
[00:36:21] Bartosz Mikulski:
Okay. We have a lot of tools for, evaluation, like monitoring or, just doing evaluation testing. And, it's not really a gap in the tooling because I think we have already too many. But they're kind of, trying to do everything, and, I think we need some consolidation. I would like to have one tool for this. Like, it will do everything, but at least do it in some way the creators of the tool choose to to do it, because, right now, you have to start try to do everything, but they really don't, and we need several of them. The documentation is usually, let's say, politely lagging.
Most likely not even existing. So, yeah, I I would love to see a tool that just gets the job done. May it may have some opinions about how to do it. I might need to adjust my code to do it. It's fine. Just I I don't want I don't need three tools for for everything. So this is the gap that I see right now.
[00:37:46] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience of the data requirements around these AI applications and some of the ways that it's shifting the responsibilities and the tooling and the work required for data engineers and MLOps engineers. So appreciate the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Bye bye.
[00:38:17] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.net covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and and tell
[00:38:50] Tobias Macey:
your friends
[00:38:53] Bartosz Mikulski:
and coworkers.
Introduction and Guest Introduction
Transition from Data Engineering to AI
Data Requirements for AI Applications
Generative AI Application Styles
Roles and Responsibilities in AI Projects
Skills for Data Engineers in AI
Vector Databases and Data Modeling
Challenges in Data Processing for AI
Fine-Tuning and Data Collection
Evolution of AI Capabilities and Responsibilities
Skills Gaps in AI Support
Innovative Approaches in AI Systems
Lessons Learned in AI Development
Emerging Trends in AI
Final Thoughts and Closing Remarks