Summary
In this episode of the Data Engineering Podcast Gleb Mezhanskiy, CEO and co-founder of DataFold, talks about the intersection of AI and data engineering. He discusses the challenges and opportunities of integrating AI into data engineering, particularly using large language models (LLMs) to enhance productivity and reduce manual toil. The conversation covers the potential of AI to transform data engineering tasks, such as text-to-SQL interfaces and creating semantic graphs to improve data accessibility, and explores practical applications of LLMs in automating code reviews, testing, and understanding data lineage.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about applying AI to the work of data engineering
- Introduction
- How did you get involved in the area of data management?
- The "modern data stack is dead" narrative
- Where is AI in the data stack?
- "Buy our tool to ship AI"
- Opportunities for LLMs in the data engineering workflow
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey, and today, I'd like to welcome back Gleb Mezhanskiy, where we're going to talk about the work of data engineering to build AI, to build better data engineering, and all of the things that come out of that idea. So, Gleb, for folks who haven't heard any of your past appearances, if you could just give a quick introduction.
[00:01:07] Gleb Mezhanskiy:
Yeah. Thanks for having me again, Tobias. Always fun to be on the podcast. I am CEO and cofounder of Datafold. We work on automating data engineering workflows, now also with AI. Prior to starting Datafold, I was a data engineer, data scientist, and data product manager, and I got a chance to build three data platforms pretty much from scratch at three very different companies, including Autodesk and Lyft, where I was one of the first founding data engineers and got to build a lot of pipelines and infrastructure and also break a lot of pipelines and infrastructure. And I've always been fascinated by how important data engineering is to the business, in that it unlocks the delivery of the actual applications that are data driven, be that dashboards or machine learning models or, now increasingly, also AI applications.
And at the same time, as a data engineer, I have always been very frustrated with how manual, error prone, tedious, and toilsome my personal workflow was, and I pretty much started Datafold to solve that problem and remove all the manual work from the data engineering workflow so that we can ship high quality data faster and help all the wonderful businesses that are trying to leverage data actually do it. So excited to chat.
[00:02:34] Tobias Macey:
In the context of data engineering, AI, obviously, there's a lot of hype that's being thrown around about, oh, you just rub some AI on it. It'll be magical, and your problems are solved. You don't need to work anymore. It's going to replace all of your junior engineers or whatever the current marketing spin is for it. And it's undeniable that large language models, generative AI, the current era that we're in, has a lot of potential. There are a lot of useful applications of it, but the work to actually realize those capabilities is often a little bit opaque or misunderstood or confusing.
And so there are definitely a lot of opportunities for being able to bring large language models or other generative AI technologies into the context of data engineering work or development environments. But the work of actually getting it to the point where it is more help than hindrance is often where things start to fall apart. And I'm wondering if you can just start from the work that you're doing and the experience you've had of actually incorporating LLMs into some of your product, some of the lessons learned about what are some of those impedance mismatches, what are some of those stumbling blocks that you're going to run into on the path of saying, I've got a model. I've got a problem. Let's put them together.
[00:03:57] Gleb Mezhanskiy:
Yeah. Absolutely. And I think that's a spot-on observation, Tobias, in terms of there's a lot of noise and hype around AI everywhere. But, yeah, we don't have a really clear idea and consensus on how it actually impacts data engineering. And maybe before we dive into, like, okay, what is actually working, it's worth kind of disambiguating and cutting through the noise a little bit. And I've been thinking about this recently, and I think there are probably two main things that everyone gets a bit confused about. One is the confusion of software engineering and data engineering.
Software engineering and data engineering are very related. And in many ways, they are similar. In data engineering, we ultimately also write code that produces some outcome. But unlike software engineering, typically, we're not really building a deterministic application that performs a certain function. We write code that processes large amounts of data. And, usually, that data is highly imperfect. And so we're dealing not just with code. We're dealing also with extremely complex, extremely noisy inputs and, a lot of the time, also unpredictable outputs. And that makes the workflow quite different.
And I think one important distinction is when we see lots of different tools and advancements in tools that are affecting software engineers and impacting their workflows for the better. One example is, I think, over the past year, we've seen an amazing improvement in the kind of Copilot type of support within the software engineering workflow through various tools. We at Datafold, for example, use the Cursor IDE a lot, and we really like how it seamlessly plugs in and enables our engineers working on the application code to just be more productive and spend less time on a lot of, like, boilerplate, toilsome tasks.
And it's really exciting how those tools affect the software engineering workflow. There's also a huge part of the software engineering space right now that is devoted to agents. So, for example, with Cursor, the idea is that you plug it into the IDE at a few touch points for the developer, like code completion, and it kind of sits in the system and helps you mock up and refactor the code. And it's very seamless, but it's still kind of part of the core workflow for a human. And then there's a second school of thought where there's an agent that takes a task that can be very loosely defined and then basically builds an app from scratch, or takes a Jira or Linear ticket and does the work from scratch. And it's also very exciting. I would say, in our experience testing multiple tools, the results there are far less impressive, and the actual impact on the business for us in terms of software engineering has been far less impressive than with the more IDE-native enhancements.
But all of that is to say that while those tools are really impactful for software engineers and there's a lot happening also in other parts of the workflow, we've seen very limited impact of those particular tools on the data engineer's workflow. And the primary reason is that although we're also writing code as data engineers, the tools that are built for software engineers lack very important context about the data. And it is kind of a simple idea and a simple statement, but what's underneath is actually quite a bit of complexity. Because if you think about what a data engineer needs to do in order to do their job, they have to understand not just the code base, but they also have to have a really good grasp on the underlying data that their code base is processing, which is actually a very hard task by itself, starting from understanding what data you have in the first place, how the data was computed, where it's coming from, who is consuming it, and what the relationships between all the datasets are.
And absent that context, the tools that you may have supporting your workflow, yes, they can help you generate the code, but the impact of that will be quite limited relative to how complex your workflow is. And I think that means that for data engineers, we need to see a specialized class of tools that would be dedicated to improving the data engineer's workflow and would excel at doing that by having the context that is critical for a data engineer to do their job. That's, I think, one aspect of the confusion: all the advances in software engineering tools are exciting and inspiring, but it doesn't mean that the data engineer's workflow is now impacted as significantly as the software engineer's workflow.
I think the other type of confusion that I'm seeing is a lot of talk about AI in the data space. And all the vendors you see out there are, I think, smartly positioning themselves as really relevant and essential to the fundamental tectonic shift we've now seen in technology, meaning they try to position themselves as relevant in the world where LLMs are really providing a big opportunity for businesses to improve and grow and automate a lot of business processes. But if you double click into what exactly everyone is saying, it's pretty much: we're going to help you, the data team, the data engineer, ship AI to your business and to your stakeholders. Like, we are the best, you know, workflow engine so that you can get data delivered for AI, or we are the best data quality vendor that will help you ensure the quality of the data that goes into AI, or we have the most integrations with all the vector databases that are important for AI.
And the message that you're getting from all of this, and by no means am I saying this is unimportant, it is definitely important and relevant, is that we're saying, essentially: data engineer, you have so many things to do, and now you also have to ship AI. We're gonna help you ship AI. It's so important that you ship data for AI applications. We are the best tool to help you ship AI. But it almost sounds like this is data engineers in the service of AI. And I think what's really interesting to explore and to unpack, and what I would personally love for myself as a data engineer, is kind of reversing that question and asking, okay, so we now have this fundamental shift in technology, amazing capabilities from LLMs.
How does it actually help me in my workflow? So what does AI for the data engineer look like? And I think we need much more of that discussion, because if we make the people who are actually working on all these important problems more productive with the help of AI, then they will for sure do amazing things with data. And I think that's a really exciting opportunity to explore.
[00:11:10] Tobias Macey:
One of the first and most vocal applications of AI in that context of helping the data engineers, by maybe taking some of the burden off them, that I've seen is the idea of talk to your data warehouse in English, or text to SQL, or whatever formulation it ends up taking, where rather than saying, oh, now you need to build your complicated star or snowflake schema and then build all of the different dashboards and visualizations for your business intelligence, you just put an AI on top of it, and then your data consumers just talk to the AI and say, hey, what was my net promoter score last quarter, or what's my year over year revenue growth, or how much growth can I expect in the next quarter based on current sales? And it's going to just automatically generate the relevant queries. It's going to generate the visualizations for them, and you, as a data engineer or as an analytics engineer, don't need to worry about it anymore.
And from the description, it sounds amazing. It's like, great. K. Job done. I don't need to worry about that toilsome work. I do all of the interesting work of getting the data to where it needs to be, and then the AI does the rest. But then you still have to deal with issues of making sure that you have the appropriate semantic maps so that the AI understands what the question actually means in the context of the data that you have, which is the hardest problem in data anyway, no matter what. So the AI doesn't actually solve anything for you. It just maybe exacerbates the problem, because somebody asks the AI the question, the AI gives an answer, but it's answering it based on a misunderstanding of the data that you have. And so you still have those issues of hallucination, incorrect data, or variance in the way that the data is being interpreted. And I'm wondering what you have seen as far as the actual practical applications of the AI being that simplifying interface versus the amount of effort that's needed to be able to actually make that useful.
[00:13:10] Gleb Mezhanskiy:
Yeah. I think text to SQL is the holy grail of the data space. I would say for as long as I've worked in the space, over a decade, people have really tried to solve this problem multiple times. And, obviously, now in hindsight, it's obvious that pre-LLM, all of those approaches using traditional NLP were doomed. And now that we have LLMs, it seems like, okay, finally, we can actually solve this problem. And I'm very optimistic that it indeed will help make data way more accessible, and I think it eventually will have tremendous impact on how humans interact with data and how data is leveraged. But I think that the how, how it happens and how it's applied, is also very important, because I don't think that the fundamental problem is that people cannot write SQL.
SQL is actually not that hard to write and to master. I think the fundamental issue is that if we think about the life cycle of data in the organization, it's very important to understand that the raw data that gets collected from, you know, all the business systems and all the events and logs and everything we have in a data lake is pretty much unusable. And it's unusable both by machines and AI and by people if we just try to, you know, throw a bunch of queries at it and try to answer really key business questions. And in order for the data to become usable, we need what is currently the job of a data engineer: structuring, filtering, merging, aggregating this data, curating it, and creating a really structured representation of what our business is and what all the entities in the business are that we care about, like customers, products, orders.
So that then this data can be fed into all the applications. Right? Business intelligence, machine learning, AI. And I don't think that text to SQL replaces that, because if we just do that on top of the raw data, we basically get garbage in, garbage out. I do think that in certain applications of it, we can actually get very good results even today if we put that kind of system on top of highly curated, semantically structured datasets. Right? So if we have a number of tables that are well defined and describe how our business works, having a text to SQL interface could actually be extremely powerful, because we know that the questions that are asked and will be translated into code will be answered with data which has already been prepared and structured. And so it's actually quite easy for the system to be able to make sense of it.
But I don't think we are at the point where you just don't need the data team and can just ask a question. It's almost guaranteed that the answer will be wrong. So in that regard, data engineering and data engineers are definitely not going to lose their jobs because it's now easy to generate SQL from text.
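To make that distinction concrete, here is a minimal sketch of text to SQL grounded in a curated semantic layer rather than raw lake tables; the `llm_complete` helper, table names, and metric definitions are hypothetical placeholders, not any particular vendor's API:

```python
# Illustrative sketch only: grounding text-to-SQL in a curated semantic layer
# instead of raw lake tables. `llm_complete` and the table/metric names are
# hypothetical placeholders, not any specific vendor's API.

SEMANTIC_LAYER = {
    "dim_customers": {
        "customer_id": "primary key",
        "segment": "'SMB' or 'Enterprise'",
        "signup_date": "date the customer signed up",
    },
    "fct_orders": {
        "order_id": "primary key",
        "customer_id": "foreign key -> dim_customers.customer_id",
        "order_total": "order value in USD",
        "ordered_at": "order timestamp",
    },
    "metrics": {
        "revenue": "SUM(fct_orders.order_total)",
        "active_customers": "COUNT(DISTINCT fct_orders.customer_id)",
    },
}


def build_prompt(question: str) -> str:
    """Embed the curated model, not the raw lake, so generated SQL can only
    reference well-defined entities and agreed metric definitions."""
    return (
        "You write SQL against the following curated data model.\n"
        f"Tables, columns, and metric definitions: {SEMANTIC_LAYER}\n"
        "Use only these tables and definitions.\n"
        f"Question: {question}\n"
        "Return a single SQL query."
    )


def text_to_sql(question: str, llm_complete) -> str:
    # llm_complete(prompt) -> str is whatever LLM client the team uses.
    return llm_complete(build_prompt(question))
```

The same question asked against the raw lake, without the curated model in the prompt, is where the garbage-in, garbage-out failure mode shows up.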
[00:16:19] Tobias Macey:
And in the context even of that text to SQL use case, what I've been hearing a lot is that it's not even very good at that. One, because LLMs are bad at math, and SQL is just a manifestation of relational algebra, thereby math. But if you bring a knowledge graph into the system, where the AI is using the knowledge graph to understand what the relations are between all the different entities, from which it then generates the queries, it actually does a much better job. But, again, you'd have to build the knowledge graph first. And I think maybe that's one of the places where bringing AI earlier in the cycle is actually potentially useful, where you can use the AI to do some of that rote work of saying, here are all the different representations that I have of this entity or this concept across my different data sources.
Give me a first pass of what a unified model looks like to be able to represent that, given all of the data that I have about it and all the ways that it's being represented. And I'm wondering what you've seen in that context of bringing the AI into that data modeling, data curation workflow, where it's not the end user interacting with it. It's the data engineer using the AI as their copilot, if you will, or as their assistant, to be able to do some of that tedious work that would otherwise be, okay, well, I've got 15 different spreadsheets. I need to visually look across them and try and figure out the similarities and differences, etcetera.
[00:17:50] Gleb Mezhanskiy:
Yeah. That's a good point, Tobias. I have two thoughts there. On how the AI plugs in to actually make text to SQL work, yes, you absolutely need that kind of semantic graph of what datasets you have, how they are related, what all the metrics are, and how those metrics are computed. And in that regard, what's really interesting is the metrics layer that was, at some point, a really hot idea in the modern data stack, probably, you know, three to five years ago. And then everyone was really disappointed with how little impact it actually made on the data team's productivity and just overall on the data stack. It's almost like now it's the metrics layer's time. Because if you take the metrics layer, which gives you a really structured representation of the core entities and the metrics, putting text to SQL on top of it is almost, like, the most impactful thing that you can do, because then you have a structured representation of your data model, which allows the AI to be very, very effective at answering questions while operating on a structured graph.
And so I think we'll see really exciting applications coming out of the hybrid of that kind of fundamental metrics layer semantic graph and text to SQL, and we're already seeing the early impacts of that. But I think over the next two years, it probably will become a really popular way to open up data for stakeholders instead of classical BI with, like, drag-and-drop interfaces and passively consumed dashboards. But then the second point which you made is, basically, can AI actually help us get to that structured representation? And I think absolutely, for the data engineer's workflow. So not for, I would say, a business stakeholder or someone who is a data consumer, but for the data producer, I think that leveraging LLMs to help you build data models, and especially build them faster in the sense of understanding all the semantic relationships, not just writing code, is a very promising area. And that comes back to my point about how software tools are limited in their help for data engineers. Right? I can write SQL, but if my tool does not understand what the relationships between the datasets are, then it can't even help me write joins properly.
And one of the interesting things we've done at Datafold was actually build a system that essentially infers an entity relationship diagram from the raw data that you have, combined with all the ad hoc SQL queries that have been written by people. Previously, that would be a very hard problem to solve. But with the help of LLMs, we can actually have a really good shot at understanding what all the entities are that your business has in your data lake and how they're related. And it's almost like a probabilistic graph, because people can be writing joins correctly or incorrectly, and you have noisy data. And sometimes keys that you think are, like, primary keys or foreign keys are not perfect.
But if you have a large enough dataset of queries that were run against your warehouse, you can actually have a really good shot at understanding what the semantic graph looks like. And the context in which we actually did this was to help data teams build testing environments for their data. But the implications of having that knowledge are actually very powerful. Right? So to your point, we can use those tools to help write SQL. So I'm very bullish on the ability to help data engineers build pipelines by creating a semantic graph without the need for curation. Because previously, that problem was almost pushed onto people with all of the data governance tools. The idea was, let's have data stewards define all the canonical datasets and all the relationships. And, obviously, being purely people powered, this is completely non-scalable.
So now we're finally at the point where we can automate that kind of semantic data mining with LLMs.
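As a rough illustration of the idea (not Datafold's actual implementation), join conditions mined from a query log can be counted to suggest probable key relationships:

```python
import re
from collections import Counter

# Illustrative sketch only: mine join conditions out of a log of ad hoc SQL
# queries to guess which keys relate which tables. A real system would use a
# proper SQL parser and resolve aliases, subqueries, and dialect differences.

JOIN_PATTERN = re.compile(
    r"join\s+(\w+)\s+on\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)",
    re.IGNORECASE,
)


def mine_relationships(query_log):
    """Count how often each table.column = table.column pair appears in joins."""
    edges = Counter()
    for _joined, lt, lc, rt, rc in (m for sql in query_log for m in JOIN_PATTERN.findall(sql)):
        edge = tuple(sorted((f"{lt}.{lc}", f"{rt}.{rc}")))
        edges[edge] += 1
    return edges


queries = [
    "select * from orders join customers on orders.customer_id = customers.id",
    "select segment, sum(total) from orders join customers on orders.customer_id = customers.id group by 1",
    "select * from orders join refunds on refunds.order_id = orders.id",
]

for (left, right), count in mine_relationships(queries).most_common():
    # Join keys that repeat across many queries are strong candidates for
    # edges in the inferred (probabilistic) entity relationship graph.
    print(f"{left} <-> {right}: seen {count} time(s)")
```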
[00:22:11] Tobias Macey:
That brings us back around to another point that I wanted to dig into further in the context of how to actually integrate the LLMs into these different use cases and workflows. You brought up the example of Cursor as an IDE that was built specifically with LLM use cases in mind, juxtaposed with something like VS Code or Vim or Emacs, where the LLM is a bolt-on. It's something that you're trying to retrofit into the experience. And it can be useful, but it requires a lot more effort to actually set it up, configure it, make it aware of the code base that you're trying to operate on, etcetera, versus the prepackaged product.
And we're seeing that same type of thing in the context of data, where you mentioned there are all these different vendors of, oh, hey, we're gonna make it super easy for you to make your data ready for AI or use this AI on your data. But most teams already have some sort of system in place, and they just wanna be able to retrofit the LLM into it to be able to start getting some of those gains, with the eventual goal of having the LLM maybe be a core portion of their data system, their data product. And I'm wondering, in that process of bringing in an LLM, retrofitting it onto an existing system, whether that be your code editor, your deployment environment, your data warehouse, what have you.
What are some of those impedance mismatches or some of the issues in conceptual understanding about how to bring the appropriate, I'm gonna use the word knowledge, even though it's a bit of a misnomer, into the operating memory of the LLM so that it can actually do the thing that you're trying to tell it to do?
[00:23:55] Gleb Mezhanskiy:
Yeah. That's a great question, Tobias. I think that to answer this, we kinda need to go back to what the jobs to be done are for a data engineer and what the data engineer workflow actually looks like. And if we were to visualize it, it actually looks quite similar to the software engineering workflow in just the types of tasks that a data engineer does day to day to do their work. And by the way, we're saying data engineer as sort of a blanket label, but I don't necessarily mean just people who have data engineering in the title, because all the roles that work with data, including data scientists, analysts, analytics engineers, and, in many cases, software engineers, a lot of them actually do data engineering in terms of building and developing pipelines as part of their job. It's just that data engineers probably do this, you know, the majority of their time, and if I'm a data analyst or data scientist, I would be doing this maybe 40% of my week. And so if we think about what I need to do to, let's say, ship a new data model like a table, or extend an existing data model, you know, refactor definitions or add new types of information into an existing model, it starts with planning. Right? So I'm doing planning.
I'm trying to find the data that I need for my work. And a lot of the time, that information can be sourced from documentation, from a data catalog. I think right now, the data catalog, giving you the sense of, like, what datasets I have and what the profile of those datasets is, has been largely solved. There are great tools. You know, some are open source. Some are vendors. But overall, understanding what datasets you have is now way easier than it was five years ago. You also probably are consulting your tribal knowledge, and you go to Slack and you do, like, a search for certain definitions. And that's also now largely solved with a lot of the enterprise search tools. And then you go into writing code.
And writing code, I think this is also an important misconception. Like, if you are not really, you know, doing this for a living, you think that people spend most of their time actually writing SQL, in terms of, like, writing SQL for production. And in my experience, the actual writing of the SQL or other types of code is maybe, like, 10 to 15% of my time, whereas all the operational tasks around it, testing it, talking to people to get context, doing code reviews, shipping it to production, monitoring it, remediating issues, talking to more people, is where the bulk of the work is happening.
And if that's true, then that means that, as we talk about automation, these operational workflows are probably where the bulk of the lift coming from LLMs can actually happen. And so for actually writing code as a data engineer, I would still recommend probably using the best-in-class software tools these days, like Cursor. Even though it's not aware of the data, it will probably still help you write a lot of boilerplate and will speed up your workflow somewhat. Or you can use other IDEs with Copilot, like VS Code plus Copilot. I think those tools will just help you speed up the writing of the code itself. But back to the operational workflows that I think take the majority of the time within any kind of cycle of shipping something. When it comes to what happens after you wrote the code, typically, if you have people who care about the quality of the data, it means that you have to do a fair amount of testing of your work.
And testing is both about making sure that my code is correct. Right? Does it conform to the expectations? Does it produce the data that I expect? But it's also about understanding potential breakages. Data systems are historically fragile in the sense that you have layers and layers of dependencies that are often opaque, because I can be changing some definition of what an active user is somewhere in the pipeline, but then I can be completely oblivious to the fact that 10 jobs down the road, someone built a machine learning model that consumes that definition and tries to automate certain decisions, for example, around spend, using that metric. And so if I'm not aware of those downstream dependencies, I could actually be causing a massive business disruption just by the sheer fact of changing it. And so the testing that involves not just understanding how the data behaves, but also how the data is consumed and what the larger business implications are of making any kind of modification to the code is where a ton of time is spent in data engineering. And what's interesting is that this is the use case where, historically, we at Datafold spent a lot of time thinking, even pre-AI. And before LLMs were a thing, what we did there was come up with the concept of data diffing. And the idea is everyone can see a code diff. Right? My code looked like this before I made a change. Now it's a different set of characters that the code looks like. And diffing the code is something that is, like, embedded in GitHub. Right? You can see the diff. But the very hard question is understanding how the data changes based on the change in the code, because that is not obvious. That happens once you actually run the code against the database. And so data diff allows you to see the impact of a code change on the data. And that by itself was quite impactful, and we've seen a lot of teams adopt that, you know, large enterprise teams, fast-moving, you know, software startup teams. But we were not fully satisfied with the degree of automation that feature alone produced, because people are still required to, like, sift through all the data diffs and explore them for multiple tables and see how the downstream impacts propagate through lineage.
And it felt like, okay, now at least we can give people all the information, but they still have to sift through a lot of it, and some of the important details can be missed. And the big unlock that LLMs bring to this particular workflow, once LLMs became pretty good at comprehending code and actually semantically understanding the code, which pretty much happened over 2024 with the latest generation of foundational large language models, is that we were able to do two things. One, take a lot of information and condense it into, like, three bullet points, kind of like an executive summary. And those bullet points are essentially helping the data engineer understand, on a high level, what the most important impacts are that I need to worry about for any given change, and helping a code reviewer understand the same. And that just helps people get on the same page very quickly, saving a lot of time that otherwise would be spent in meetings, going back and forth, you know, putting comments on a code change. And the second unlock that we've seen is the opportunity to drill down and explore all the impacts and do the testing by, essentially, chatting with your pull request, chatting with your code. And that comes in the form of a chat interface where you're basically speaking to an agent that has the full context of your code, full context of the data change, the data diff, and also full context of your lineage, so that it can actually understand how every line of code that was modified is affecting the data and what that means for the business.
And you can ask questions, and it produces the answers way faster than you would by essentially looking at all the different, you know, code changes and data diffs. And that ended up saving a lot of time for data teams. And now that I'm describing this, you kind of feel that it sounds almost like having a buddy that just, like, helps you think through the code, almost like having a code reviewer, except with AI. With an LLM, this is a buddy that's always available to you twenty-four seven and probably makes fewer mistakes, because it has all the context and can sift through a lot of information really quickly. So that's an example of how an LLM could be applied to an operational use case that has historically been really time consuming, and take a lot of manual work out of that context.
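For a concrete sense of the mechanics described here, a heavily simplified sketch follows; `run_query` and `llm_complete` are hypothetical stand-ins for a warehouse client and an LLM client, and this is not the Datafold product itself:

```python
# Illustrative sketch only: diff the output of the main-branch and dev-branch
# versions of a model, then ask an LLM to condense the result into a short
# summary for the pull request. `run_query` and `llm_complete` are placeholders.

def diff_metric(run_query, table_main, table_dev, group_by, metric_expr):
    """Return rows where an aggregate disagrees between the two versions."""
    sql = f"""
        with main as (select {group_by} as k, {metric_expr} as v from {table_main} group by 1),
             dev  as (select {group_by} as k, {metric_expr} as v from {table_dev}  group by 1)
        select coalesce(main.k, dev.k) as key, main.v as main_value, dev.v as dev_value
        from main full outer join dev on main.k = dev.k
        where main.v is distinct from dev.v
    """
    return run_query(sql)


def summarize_for_review(llm_complete, code_diff, changed_rows):
    """Turn the raw code diff plus data diff into three reviewer-facing bullets."""
    prompt = (
        "You are reviewing a data pipeline change.\n"
        f"Code diff:\n{code_diff}\n"
        f"Rows where the output changed (main vs dev): {changed_rows}\n"
        "In three bullet points, explain the most important impacts a reviewer "
        "should know about before approving."
    )
    return llm_complete(prompt)
```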
[00:32:13] Tobias Macey:
And I really wanna dig into that one word that you said probably at least a half dozen times, if not a couple of dozen, which is context. That, I think, is the key piece that is so critical, and also probably the most difficult portion of making AI useful is context. What context does it need? How do you get that context to it? How do you model that context? How do you keep it up to date? And so I think that really is where the difference comes in between the Cursor example that we touched on earlier versus retrofitting onto Emacs or whatever your tool or workflow of choice is: how do you actually get the context to the place that it needs to be? And so you just discussed the use case that you have of being able to use the LLM in that use case of interpreting the various data diffs and understanding what the actual ramifications of this change are. And I'm wondering if you can just talk through some of the lessons learned about how you actually populate and maintain that context and how you're able to instruct the LLM how to take advantage of the context that you've given it?
[00:33:21] Gleb Mezhanskiy:
That's a great question, Tobias. And I think what's interesting is that, at face value, it seems like you wanna throw all the information you have at the LLM. Right? Just, like, tell it everything and then let it figure things out. And, in fact, it is obviously not as easy as that. In fact, it's actually counterproductive to oversupply the LLM with context, in part because the context window of large language models is limited. And the trade-off there is, one, you just, like, can't physically fit everything. And, two, even if you were dealing with a model that actually is designed to have a very large context window, if you overuse it and supply too much information, the LLM just gets lost. It starts being far less effective at understanding what's actually important versus not, and the overall effectiveness of your system goes down.
So back to your question of, like, what is the actual information that is important to provide as context to the LLM? It really depends on what workflow we're talking about. In the context of code review and testing, we are trying to fundamentally answer the question of, one, if we changed the code, was the change correct relative to what we tried to do, what the task was, or did we not conform to the business requirement? The second question is, did we follow the best practices, such as, you know, code guidelines and performance guidelines, or not? And the third question is, okay, let's say we conformed to the business requirements and we did a good job of following our coding best practices, but we may still cause a business disruption just by making a change that can be a surprise either for a human consumer of data downstream or could throw off a machine learning model that was trained on a different distribution of the data. Right? And so these are the three fundamental questions that we try to answer. And by the way, even without AI, that's what a good code review done by humans would ultimately accomplish.
So what is the context that is important for the LLM to have here? First, obviously, it is the code diff. Right? We already know what the original code was and what the new code is. And feeding that into the LLM is really important so that it can understand, okay, what are the actual changes in the code itself, in the logic. And I won't go into the details here, because, obviously, the code base can be very large and sometimes your PR can touch a lot of code, so you have to be quite strategic in terms of how you feed that in on the technical side. But conceptually, that's what we have to provide as input number one. The second important input is the data diff. Right? It's understanding, if I have the main-branch version of the code, what data it produces and what the metrics show. And then if I have a new version of the code, let's call it the developer branch, what data it produces and what the difference in the output is.
Let's say, with my main-branch code, I see that I have 37 orders on Monday. But with the new version of the code, I see that I have 39. And so that already tells me, okay, this is the important impact on the output data and on the metrics. And that's important both at the value level, understanding how the individual cells, rows, and columns are changing, but it's also important to do roll-ups and understand what the impact on metrics is. And coupling that context with the code diff allows us to understand how changes in the code affect the actual data output. And the third really important aspect is the lineage. Lineage is fundamentally understanding how the data flows throughout your system, how it's computed, how it's aggregated, and how it's consumed.
And the lineage is a graph, and there are two directions of exploration. One of them is upstream, which helps us understand how the data got to the point where you're looking at it. Right? So, for example, if I'm looking at the number of orders and I'm changing a formula, where does the information about orders come from in the first place? And that is important because it can tell us a lot about how a given metric is computed and what the source of truth is. Are we getting it from Salesforce? Are we getting it from our internal system? And then the downstream lineage is also important because it tells us how the data gets consumed, and that is absolutely essential information that can help us understand what downstream systems and metrics will be affected. And the lineage graph in itself can be very complex, and building it is actually a tough problem, because you have to essentially scrape all of your data platform information, all the queries, all the BI tools, to understand how data flows and how it's consumed and produced. But let's say you have this lineage graph. It's actually also a lot of information by itself. And so to properly supply that lineage information into an LLM context, you actually kind of need your system to be able to explore the lineage graph on its own to see, like, okay, if the developer made a change here, what are the important downstream implications of that? So now we're talking about the system being able to traverse that and do analysis on its own for the context. I would say these are the three most important types of context. And then the fourth one is kind of optional. If your team has any kind of best practices, SQL linting rules, documentation rules, you can also provide them as context, and then your AI code reviewer assistant can help you reason about, well, did you conform or not? And if not, make suggestions about what to correct, eventually, probably, going in and correcting your code itself. I think that's ultimately where this is going. But, again, it would pretty much be operating on the same set of input context.
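Pulling those pieces together, here is a minimal sketch of how the code diff, data diff, a bounded lineage slice, and optional guidelines might be assembled into one review prompt; all of the names are illustrative assumptions:

```python
# Illustrative sketch only: assemble the three (plus one optional) kinds of
# context described above into a single review prompt, walking only the
# downstream slice of lineage the change can reach. All names are made up.

def downstream_slice(lineage, changed_model, max_depth=3):
    """lineage maps model -> list of direct downstream consumers."""
    frontier, reached = [changed_model], []
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for child in lineage.get(node, []):
                if child not in reached:
                    reached.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return reached


def build_review_prompt(code_diff, data_diff, lineage, changed_model, guidelines=""):
    impacted = downstream_slice(lineage, changed_model)
    prompt = (
        "Review this data pipeline change.\n"
        f"1. Code diff:\n{code_diff}\n"
        f"2. Data diff (main vs dev output):\n{data_diff}\n"
        f"3. Downstream assets that consume {changed_model}: {impacted}\n"
    )
    if guidelines:
        prompt += f"4. Team guidelines:\n{guidelines}\n"
    prompt += (
        "Answer: is the change correct for its stated intent, does it follow "
        "the guidelines, and which downstream consumers could be disrupted?"
    )
    return prompt


lineage = {"stg_orders": ["fct_orders"], "fct_orders": ["orders_dashboard", "ltv_model"]}
print(build_review_prompt("-- code diff --", "-- data diff --", lineage, "stg_orders"))
```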
[00:39:13] Tobias Macey:
Another interesting element of bringing LLMs into the context of the data engineering workflow and use case, one is the privacy aspect, which is a whole other conversation. I don't wanna get too deep into that quagmire. But, also, when you're working as a data engineer, one of the things you need to be thinking about is, what is my data platform? What are the tools that I rely on? What are the ways that they link together? And if you're going to rely on an LLM or generative AI as part of that tool chain, how does that fit into that platform? What is some of the scaffolding? What are some of the workflows? What is some of the custom development that you need to do? Where a lot of the first-pass and naive use cases for generative AI and LLMs is, oh, well, just go and open up the ChatGPT UI, or just go run LM Studio, or use Claude, or what have you. But if you want to get into anything sophisticated where you're actually relying on this as a component of your workflow, you want to make sure that it's customized, that you own it in some fashion.
And so that is likely going to require doing some custom development using something like LangChain or LangGraph or CrewAI or whatever, where you're actually building additional scaffolding logic around just that kernel of the LLM. And I'm curious how you're seeing some of the needs and use cases of incorporating the LLM more closely into the actual core capabilities of the data platform through that effort of customization and software engineering.
[00:40:45] Gleb Mezhanskiy:
That's a great point, Tobias. I think that the models themselves are getting rapidly commoditized in the sense that the capabilities of the foundational large language models and their interfaces are very similar. We're seeing a race between the companies training those models in terms of beating each other on benchmarks. It looks like the whole industry is converging on adding more reasoning, and the way that's happening is also converging toward the same experience, and the difference is, like, who is doing this better. Right? Who is beating the benchmarks? Who provides the cheaper inference, the faster inference, more intelligence for the same price? And to that end, I don't think that the differentiation or the effectiveness of whatever automation you're trying to bring really depends on the choice of the model. Maybe for certain narrow applications, choosing a more specialized model or fine-tuning a model would be more applicable. But still, I don't think the model is where the magic really happens these days.
The model is important for the magic, but it's not something that allows you to build a really effective application just by, you know, choosing something better than what's available to everyone else. The actual magic and the value add and the automation happen in how you leverage that model in your workflow. So all the orchestration in terms of how you prompt the model, what kind of context you provide, how you tune the prompt, how you tune the inputs, how you evaluate the performance of the model in production, how you make various LLM-based actors that may be playing different roles interact with each other. That is where the hard work is happening, and that is where I think the actual value and impact are created. And that's where all the complexity is. So I think you don't have to be, you know, a PhD and really understand how the models are trained. Although, I would say, just like in computer science, it's obviously very helpful to understand how these models are trained and their architectures and their trade-offs. But you don't have to be good at, you know, training those models in order to effectively leverage them. But to leverage them, you have to do a lot of work to effectively plug them into the workflows. And I think that the applications and companies and teams that are thinking about what the workflow is, what the ideal user interface is, what all the information is that we can gather to make the LLM do a better job, and that are able to rapidly iterate, will ultimately create the most impact with LLMs.
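As one small example of that orchestration and evaluation work, here is a sketch of a fixed-case eval loop for an LLM-backed step; the cases and the `llm_complete` client are assumptions for illustration:

```python
# Illustrative sketch only: a tiny regression-style evaluation loop for an
# LLM-backed step. The cases and `llm_complete` are placeholders; the point is
# that every prompt or model change gets scored against the same fixed set.

EVAL_CASES = [
    {"change": "renamed a column in the orders staging model", "must_mention": "downstream"},
    {"change": "switched the revenue metric from gross to net", "must_mention": "metric"},
]


def run_eval(llm_complete, prompt_template):
    """Return the fraction of cases whose answer mentions the required keyword."""
    passed = 0
    for case in EVAL_CASES:
        answer = llm_complete(prompt_template.format(change=case["change"]))
        if case["must_mention"].lower() in answer.lower():
            passed += 1
    return passed / len(EVAL_CASES)


# Compare two prompt variants before shipping either:
# score_a = run_eval(llm_complete, "Summarize the risks of this change: {change}")
# score_b = run_eval(llm_complete, "List who is impacted when: {change}")
```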
[00:43:31] Tobias Macey:
And so on that note, in your experience of working with the LLMs, working with other data teams, and keeping apprised of the evolution of the space, what are some of the most interesting or innovative or unexpected ways that you've seen teams bring LLMs into that inner loop of building and maintaining and evolving their data systems?
[00:43:52] Gleb Mezhanskiy:
I think the realization that is obvious in hindsight, but not necessarily obvious when you're just starting, is that no one really knows how to ship LLM- and AI-based applications. There are, obviously, you know, guides and tutorials, and still, there's a lot you can learn from looking at what people are doing, but the field is evolving so fast that nothing replaces fast experimentation and just building things. It's not that you can just hire someone who worked on building an LLM-based application, like, six months ago or a year ago and all of a sudden, you, you know, gain a lot of advantage, as you would with many other technologies. Like, you know, if we were working in the space of video streaming, it would be very beneficial to have extensive experience working with video streaming and codecs. With LLMs, no one really knows exactly how they behave. Even the companies that are shipping them are discovering more and more novel ways of leveraging them more effectively every week.
And for the teams that are leveraging LLMs, like Datafold, the thing that we found matters the most is the ability to, a, just stay on top of the field and understand what the most exciting things are that people are doing, how they relate to our field, and how we can borrow some of those ideas. But most important is rapid experimentation with some sort of methodology that allows you to try new things, measure results quickly, and then be able to scrap the approach that you thought was great and just go with a different one. Because a lot of times when a new model is released, you have to adjust a lot of things. You have to adjust the prompts. You have to even rearchitect some of the flows that you've built.
And that is both difficult but also incredibly exciting because the pace of innovation and what is possible to solve is evolving extremely fast. I would say the fastest of any previous technological wave of disruption that we've seen.
[00:46:17] Tobias Macey:
In your experience and in your work of investing in this space, figuring out how best to apply LLMs to the problems facing data engineers and how to incorporate that into your products, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:46:34] Gleb Mezhanskiy:
Yeah. I think that the interesting realization was that, specifically for the data engineering domain, if you just take the problem at face value, you think, well, let's just build a copilot or an agent that would kind of try to automate the data engineer away. And I don't think we have the tech ready for an agent to just, like, really take a task and run with it yet. I don't think it's been solved in the software space. I think it's, in some ways, even harder to solve in the data space. We'll eventually get there. I don't think we are there yet. And I don't think that the biggest impact you can make on the data engineering workflow is, again, having a copilot, because that's not where data engineers spend most of their time; it's not, like, writing production code. It's all the operational tasks. And there are certain kinds of problems in the data engineering space where it's not even day-to-day help where you save, like, an hour, two hours, three hours.
But there are certain types of workflows where, to complete a task, a team needs to spend, like, ten thousand hours. And a good example of such a project would be a data platform migration where, for example, you have millions of lines of code on a legacy database. You have to move them over to a new, modern data warehouse. You have to refactor them, optimize them, repackage them into a new kind of framework. Right? You may be moving from, like, stored procedures on Oracle to dbt plus Databricks. And doing that requires a certain number of hours for every object. And because you're dealing with a large database, at the enterprise level that sums up to an enormous amount of work.
And, historically, these projects would last years and be done, a lot of the time, by outsourced talent from, you know, consultants or SIs. And for a data engineer, that's, like, probably one of the most miserable projects to do. I've led such a project at Lyft, and it was an absolute grind, where you're not shipping new things. You're not shipping AI. You're not shipping even data pipelines. You're just, like, solving technical debt for years. And what's interesting is that those types of projects and workflows are actually, I would say, where AI and LLMs can make the most impact today, because we can take a task.
We can reverse engineer it. We know exactly what the target is: you move the code, you do all of these things with the code, and, ultimately, the data has to be the same. Right? You're going through multiple complex steps, but what's important for the business is that once you move from, let's say, you know, Teradata to Snowflake, your output is the same, because, otherwise, the business wouldn't accept it. And that allows us to, a, leverage LLMs for a lot of the tasks that are historically manual, but also to have a really clear objective function for the LLMs, like, diffing the output of the legacy system against the modern system and using it as a constraint.
And if you put those two things together, you have a very powerful system that is, a, extremely flexible and scalable thanks to LLMs, but also can be constrained to a very objective definition of what's good, unlike a lot of this text to SQL generation that cannot be constrained to a definition of what's good. Because, like, how do you know? But by the end of a migration, you do know. And that allows AI to make a tremendous impact on the productivity of a data team by essentially taking a project that would last for years, cost millions of dollars, and go over budget, and compressing that into weeks at, you know, just a fraction of the price. I think that is where we can see real impact of AI that's, like, useful. It's working.
And we also see the parallels in the software space as well. A lot of the really thoughtful enterprise applications of AI are actually taking these legacy code bases and, you know, helping teams maintain them or migrate them. And I think that there are more opportunities like that in the data engineering space where we'll see AI make tremendous impacts.
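To illustrate the "data as the objective function" idea, here is a minimal sketch of validating a migrated table by running the same aggregates on both systems; the connections, table, and columns are hypothetical:

```python
# Illustrative sketch only: use the data itself as the acceptance test for a
# migration. `legacy_query` and `new_query` stand in for clients connected to,
# say, Teradata and Snowflake; the table and column names are made up.

CHECKS = {
    "row_count": "select count(*) from {table}",
    "sum_order_total": "select sum(order_total) from {table}",
    "distinct_customers": "select count(distinct customer_id) from {table}",
}


def validate_migration(legacy_query, new_query, table):
    """Run the same aggregates on both systems and report which ones match."""
    results = {}
    for name, sql in CHECKS.items():
        results[name] = legacy_query(sql.format(table=table)) == new_query(sql.format(table=table))
    return results


# A translated model only "passes" when every check matches:
# assert all(validate_migration(teradata_query, snowflake_query, "fct_orders").values())
```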
[00:51:03] Tobias Macey:
And as you continue to keep in touch with the evolution of the space, work with data teams, and evaluate the cases where LLMs are beneficial versus where you're better off going with good old human ingenuity, what are some of the things you're keeping a particularly close eye on, or any projects or problem areas you're excited to explore?
[00:51:27] Gleb Mezhanskiy:
In terms of where I think LLMs would really make a huge impact on the workflow?
[00:51:33] Tobias Macey:
Just LLMs in general, how to apply them to data engineering problems, how to incorporate them more closely and with less legwork into the actual problem solving apparatus of an organization.
[00:51:46] Gleb Mezhanskiy:
Yeah. So I think that on multiple levels, there are a lot of exciting things. Like, for example, being able to prompt an LLM from SQL as a function call, which is available these days in modern data platforms, is incredibly impactful. Right? Because in many instances, we're dealing with extremely massive data, and instead of having to write, like, complex CASE WHEN statements and regexes and, like, UDFs to be able to clean the data, to classify things, and to untangle the mess, we can now apply LLMs from within SQL, from within the query, to solve that problem. And that is incredibly impactful for a whole variety of different applications. So I'm very excited about all these capabilities that are now, you know, brought by the major data platforms like, you know, Snowflake, Databricks, BigQuery.
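As a sketch of what that looks like in practice, the query below classifies raw text from inside SQL; the exact function name and model vary by platform, so treat this as an assumption to check against your warehouse's docs:

```python
# Illustrative sketch only: calling an LLM from inside SQL to classify messy
# text instead of hand-writing CASE statements, regexes, and UDFs. Function
# names differ by platform (for example, SNOWFLAKE.CORTEX.COMPLETE on Snowflake
# or ai_query on Databricks); treat the call below as an assumption and check
# your warehouse's documentation. `run_query` stands in for your DB client.

CLASSIFY_TICKETS_SQL = """
select
    ticket_id,
    snowflake.cortex.complete(
        'llama3-8b',
        'Classify this support ticket as billing, bug, or feature request. '
        || 'Reply with one word only: ' || ticket_text
    ) as category
from raw.support_tickets
limit 100
"""


def classify_tickets(run_query):
    # One SQL statement replaces what used to be a pile of regexes and UDFs.
    return run_query(CLASSIFY_TICKETS_SQL)
```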
Then, if we go into the workflow itself, like, what does a data engineer do and how do we make that work better? I think there's a ton of opportunity to further automate a lot of tasks. A big one is data observability and monitoring. I honestly think that data observability in its current state is a dead end in terms of, like, let's cover all the data with alerts and monitors and then be the first to know about any anomalies. It's useful, but then it quickly leads to a lot of noise, alert fatigue, and ultimately could even be net negative on the workflow of a data engineer.
I think that this is the type of workflow where putting an AI to work investigating those alerts, doing the root cause analysis, and potentially the remediation is where I see a lot of opportunity for saving a ton of time for the data team while also improving the SLAs and the overall quality of the output of a data engineering team. And that's something that we are really excited about, something we're working on at Datafold, and we are excited about it coming later this year.
[00:53:56] Tobias Macey:
Are there any other aspects of this overall space of using LLMs to improve the lives of data engineers, and the work that data engineers can do to improve the effectiveness of those LLMs, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:12] Gleb Mezhanskiy:
I think that, you know, we talked a lot about the workflow improvements. I think that, overall, my recommendation to data engineers today would be to learn how to ship LLM applications. It's not that hard. Frameworks like LangChain make it very easy to compose multiple blocks together and ship something that works. Whether or not you end up using LangChain or another framework in production, and whether your, you know, team allows that, doesn't really matter, but it's really, really useful to try to build and learn all the components.
And it's just like software engineering. You know? Learning how to code opens up so many opportunities for you to solve problems. Right? You see a problem and you're like, I can write a Python script for that. And I think that with LLMs, it's almost like a new skill that both software engineers and data engineers need to learn, where you see a problem and you think, okay, I actually think I can split this problem into three tasks that I can give to an LLM. Like, one could be extraction, one could be, like, reasoning, and one classification. And now it just solves the problem.
But really learning how to build, and trying, helps you build that intuition. And so my recommendation for all data engineers listening to this is: try to build your own application that solves either a business problem or helps you in your own workflow, because knowing how to build with LLMs just gives you tremendous superpowers and will definitely be helpful in your career in the coming years.
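In that spirit, here is a minimal sketch of decomposing one small problem into an extraction step and a classification step; the prompts and the `llm_complete` client are placeholders for whatever you would actually use:

```python
# Illustrative sketch only: the "split a problem into a few LLM tasks" idea --
# extract fields first, then classify -- chained in plain Python. The prompts
# and `llm_complete` client are placeholders you would swap for your own.

def extract_fields(llm_complete, raw_text):
    return llm_complete(
        "Extract the vendor name, amount, and date from this invoice text. "
        f"Return them as a short comma-separated list.\n{raw_text}"
    )


def classify_expense(llm_complete, fields):
    return llm_complete(
        "Given these invoice fields, classify the expense as travel, software, "
        f"or other. Reply with one word.\nFields: {fields}"
    )


def process_invoice(llm_complete, raw_text):
    fields = extract_fields(llm_complete, raw_text)
    return {"fields": fields, "category": classify_expense(llm_complete, fields)}
```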
[00:55:52] Tobias Macey:
I definitely would like to reinforce that statement, because despite the AI maximalists and the AI skeptics, no matter what you think about it, LLMs aren't going anywhere. They're going to continue to grow in their usage and their capabilities, so it's worth understanding how to use them and investing in that skill, because it is going to be one of those core tools in your toolbox for many years to come. And so for anybody who wants to get in touch with you and follow along with the work that you are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your current perspective on the biggest gap in the tooling or technology for data management today.
[00:56:35] Gleb Mezhanskiy:
I think that there's a lot of kind of skepticism and some bitterness around kind of modern data stack failed us in a sense that we were so excited that more data stack will make things so great five years ago, and we're kind of disappointed. And I think that I'm an optimist here. I think that modern data stack in the sense of infrastructure and getting a lot of the fundamental challenges out of the way, like running queries and getting data in and out of different databases and visualizing the query outputs and having amazing null books.
All of that that we now take for granted is actually so great relative to where we were, you know, five, seven, eight, ten years ago. I don't think it's enough. So I think that, I am with the data practitioners for, like, well, it's 01/25. We have all these amazing models. Why is it still so hard to ship data? Absolutely with you. And I think what I'm excited about is now that we have this really great foundation with modern data stack in the sense of infrastructure, I'm excited about, one, getting everyone on modern data stack to the point of migrations. Right? Let's get everyone on modern infrastructure so that they can ship faster. Obviously, a problem that I'm really passionate about in solving and working.
Second, once you are on the modern data infrastructure, how to keep modernizing your team's workflows so that data engineers are spending more and more time on solving hard problems and thinking and planning on the valued activities that are really worth their time and less and less on operational toil that just is burnout inducing and keeps everyone back. So I'm excited about the modern data stack renaissance, thanks to the fundamental capabilities of large language models.
[00:58:30] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and sharing your thoughts and experiences around building with LLMs to improve the capabilities of data engineers. It's definitely an area that we all need to be keeping track of and investing some time into. So I appreciate the insights that you've been able to share, and I hope you enjoy the rest of your day.
[00:58:50] Gleb Mezhanskiy:
Thank you so much, Tobias.
[00:58:59] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.net covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Your host is Tobias Macey, and today I'd like to welcome back Gleb Mezhanskiy, where we're going to talk about the work of data engineering to build AI, using AI to build better data engineering, and all of the things that come out of that idea. So, Gleb, for folks who haven't heard any of your past appearances, if you could just give a quick introduction.
[00:01:07] Gleb Mezhanskiy:
Yeah. Thanks for having me again, Tobias. Always fun to be on the podcast. I'm Gleb. I am CEO and cofounder of Datafold. We work on automating data engineering workflows, now also with AI. Prior to starting Datafold, I was a data engineer, data scientist, and data product manager, and I got a chance to build three data platforms pretty much from scratch at three very different companies, including Autodesk and Lyft, where I was one of the first founding data engineers and got to build a lot of pipelines and infrastructure, and also break a lot of pipelines and infrastructure. And I've always been fascinated by how important data engineering is to the business, in that it unlocks the delivery of the actual applications that are data driven, be that dashboards or machine learning models or, now increasingly, also AI applications.
And at the same time, as a data engineer, I have always been very frustrated with how manual, error prone, tedious, and toilsome my personal workflow was, and I pretty much started Datafold to solve that problem and remove all the manual work from the data engineering workflow so that we can ship high quality data faster and help all the wonderful businesses that are trying to leverage data actually do it. So excited to chat.
[00:02:34] Tobias Macey:
In the context of data engineering, AI, obviously, there's a lot of hype that's being thrown around about, oh, you just rub some AI on it. It'll be magical, and your problems are solved. You don't need to work anymore. It's going to replace all of your junior engineers or whatever the current marketing spin is for it. And it's undeniable that large language models, generative AI, the current era that we're in, has a lot of potential. There are a lot of useful applications of it, but the work to actually realize those capabilities is often a little bit opaque or misunderstood or confusing.
And so there are definitely a lot of opportunities for being able to bring large language models or other generative AI technologies into the context of data engineering work or development environments. But the work of actually getting it to the point where it is more help than hindrance is often where things start to fall apart. And I'm wondering if you can just start from the work that you're doing and the experience you've had of actually incorporating LLMs into some of your product, some of the lessons learned about what are some of those impedance mismatches, what are some of those stumbling blocks that you're going to run into on the path of saying, I've got a model. I've got a problem. Let's put them together.
[00:03:57] Gleb Mezhanskiy:
Yeah. Absolutely. And I think that's a spot-on observation, Tobias, in terms of there being a lot of noise and hype around AI everywhere. But we don't have a really clear idea and consensus on how it actually impacts data engineering. And maybe before we dive into, like, okay, what is actually working, it's worth kind of disambiguating and cutting through the noise a little bit. I've been thinking about this recently, and I think there are probably two main things that everyone gets a bit confused about. One is the confusion of software engineering and data engineering.
Software engineering and data engineering are very related, and in many ways, they are similar. In data engineering, we ultimately also write code that produces some outcome. But unlike software engineering, typically, we're not really building a deterministic application that performs a certain function. We write code that processes large amounts of data. And, usually, that data is highly imperfect. And so we're dealing not just with code. We're dealing also with extremely complex, extremely noisy inputs and, a lot of the time, also unpredictable outputs. And that makes the workflow quite different.
And I think one important distinction is when we see lots of different tools and advancements in tools that are affecting software engineers and impacting their workflows for the better. One example is, I think, over the past year, we've seen amazing improvement in the kind of Copilot type of support within the software engineering workflow through various tools. We at Datafold, for example, use the Cursor IDE a lot, and we really like how it seamlessly plugs in and enables our engineers working on the application code to just be more productive and spend less time on a lot of, like, boilerplate, toilsome tasks.
And it's really exciting how those tools affect the software engineering workflow. There's also a huge part of the software engineering space right now that is devoted to agents. So, for example, with Cursor, the idea is that you plug it into the IDE at a few touch points for the developer, like code completion, and it kind of sits in the system and helps you mock up and refactor the code. It's very seamless, but it's still part of the core workflow for the human. And then there's a second school of thought where there's an agent that takes a task that can be very loosely defined and then basically builds an app from scratch, or takes a Jira or Linear ticket and does the work from scratch. And that's also very exciting. I would say, in our experience testing multiple tools, the results there are far less impressive, and the actual impact on the business for us in terms of software engineering has been far less impressive than with the more IDE-native enhancements.
But all of that is to say that while those tools are really impactful for software engineers, and there's a lot happening also in other parts of the workflow, we've seen very limited impact of those particular tools on the data engineer's workflow. And the primary reason is that although we're also writing code as data engineers, the tools that are built for software engineers lack very important context about the data. It is kind of a simple idea and a simple statement, but what's underneath is actually quite a bit of complexity. Because if you think about what a data engineer needs to do in order to do their job, they have to understand not just the code base, but they also have to have a really good grasp on the underlying data that their code base is processing, which is actually a very hard task by itself, starting from understanding what data you have in the first place, how the data was computed, where it's coming from, who is consuming it, and what the relationships between all the datasets are.
And absent that context, the tools that you may have supporting your workflow, yes, they can help you generate the code, but the impact of that will be quite limited relative to how complex your workflow is. And I think that means that for data engineers, we need to see a specialized class of tools that would be dedicated to improving data engineers' workflow and would excel at doing that by having the context that is critical for a data engineer to do their job. That's kind of, I think, one aspect of the confusion. All the advances in software engineering tools are exciting and inspiring, but it doesn't mean that the data engineer's workflow is now impacted as significantly as the software engineer's workflow.
I think the other type of confusion that I'm seeing is a lot of talk about AI in the data space. And all the vendors you see out there are, I think, smartly positioning themselves as really relevant and essential to the fundamental tectonic shift we've now seen in technology, meaning they try to position themselves as relevant in the world where LLMs are really providing a big opportunity for businesses to improve and grow and automate a lot of business processes. But if you double-click into what exactly everyone is saying, it's pretty much: we're going to help you, the data team, the data engineer, ship AI to your business and to your stakeholders. Like, we are the best workflow engine so that you can get data delivered for AI, or we are the best data quality vendor that will help you ensure the quality of the data that goes into AI, or we have the most integrations with all the vector databases that are important for AI.
And the message that you're getting from all of this, and by no means is this unimportant, it is definitely important and relevant, is essentially: data engineer, you have so many things to do, and now you also have to ship AI. We're going to help you ship AI. It's so important that you ship data for AI applications. We are the best tool to help you ship AI. But it almost sounds like this is data engineers in the service of AI. And I think what's really interesting to explore and to unpack, and what I would personally love for myself as a data engineer, is kind of reversing that and asking the question of, okay, we now have this fundamental shift in technology, amazing capabilities from LLMs.
How does it actually help me in my workflow? So what does AI for the data engineer look like? And I think we need much more of that discussion, because I think that if we make the people who are actually working on all these important problems more productive with the help of AI, then they will for sure do amazing things with data. And I think that's a really exciting opportunity to explore.
[00:11:10] Tobias Macey:
One of the first and most vocal applications of AI in that context of helping the data engineers, by maybe taking some of the burden off them, that I've seen is the idea of talk to your data warehouse in English, or text to SQL, or whatever formulation it ends up taking. Rather than saying, oh, now you need to build your complicated star or snowflake schema and then build all of the different dashboards and visualizations for your business intelligence, you just put an AI on top of it, and then your data consumers just talk to the AI and say, hey, what was my net promoter score last quarter, or what's my year over year revenue growth, or how much growth can I expect in the next quarter based on current sales? And it's going to just automatically generate the relevant queries. It's going to generate the visualizations for them, and you, as a data engineer or as an analytics engineer, don't need to worry about it anymore.
And from the description, it sounds amazing. It's like, great. Okay. Job done. I don't need to worry about that toilsome work. I do all of the interesting work of getting the data to where it needs to be, and then the AI does the rest. But then you still have to deal with the issue of making sure that you have the appropriate semantic maps so that the AI understands what the question actually means in the context of the data that you have, which is the hardest problem in data anyway, no matter what. So the AI doesn't actually solve anything for you. It maybe just exacerbates the problem, because somebody asks the AI the question, the AI gives an answer, but it's answering it based on a misunderstanding of the data that you have. And so you still have those issues of hallucination, incorrect data, or variance in the way that the data is being interpreted. And I'm wondering what you have seen as far as the actual practical applications of the AI being that simplifying interface versus the amount of effort that's needed to be able to actually make that useful.
[00:13:10] Gleb Mezhanskiy:
Yeah. I think text to SQL is the holy grail of the data space. For as long as I've worked in the space, which is over a decade, people have really tried to solve this problem multiple times. And, obviously, now in hindsight, it's obvious that pre-LLM, all of those approaches using traditional NLP were doomed. And now that we have LLMs, it seems like, okay, finally, we can actually solve this problem. And I'm very optimistic that it indeed will help make data way more accessible, and I think it eventually will have tremendous impact on how humans interact with data and how data is leveraged. But I think that how it happens and how it's applied is also very important, because I don't think that the fundamental problem is that people cannot write SQL.
SQL is actually not that hard to write and to master. I think the fundamental issue is that if we think about the life cycle of data in the organization, it's very important to understand that the raw data that gets collected from all the business systems and all the events and logs, everything we have in a data lake, is pretty much unusable. And it's unusable both by machines and AI and by people if we just try to throw a bunch of queries at it to answer really key business questions. And in order for the data to become usable, we need what is currently the job of a data engineer: structuring, filtering, merging, and aggregating this data, curating it, and creating a really structured representation of what our business is and what all the entities in the business are that we care about, like customers, products, orders.
So that then this data can be fed into all the applications. Right? Business intelligence, machine learning, AI. And I don't think that text to SQL replaces that, because if we just do that on top of the raw data, we basically get garbage in, garbage out. I do think that in certain applications of that, we can actually get very good results even today if we put that kind of system on top of highly curated, semantically structured datasets. Right? So if we have a number of tables that are well defined that describe how our business works, having a text to SQL interface could actually be extremely powerful, because we know that the questions that are asked and translated into code will be answered with data which has already been prepared and structured. And so it's actually quite easy for the system to be able to make sense of it.
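To make that concrete, here is a minimal sketch of what text to SQL constrained to a curated model can look like. The schema, the prompt wording, and the call_llm stub are hypothetical stand-ins for illustration, not any particular vendor's implementation.

```python
# Minimal sketch: text-to-SQL constrained to a curated, documented data model.
# The schema and `call_llm` are placeholders, not a specific product's API.

CURATED_SCHEMA = """
-- dim_customers(customer_id, signup_date, region, plan)
-- fct_orders(order_id, customer_id, order_ts, amount_usd, status)
-- Metric: net_revenue = SUM(amount_usd) WHERE status = 'complete'
"""

def call_llm(prompt: str) -> str:
    # Stand-in for whatever model provider you use.
    raise NotImplementedError("wire up your model provider here")

def question_to_sql(question: str) -> str:
    prompt = (
        "You translate business questions into SQL.\n"
        "Only use the tables and metric definitions below; if the question "
        "cannot be answered from them, say so instead of guessing.\n"
        f"{CURATED_SCHEMA}\n"
        f"Question: {question}\nSQL:"
    )
    return call_llm(prompt)

# question_to_sql("What was net revenue by region last quarter?")
```

The key design choice is that the model only ever sees the curated tables and metric definitions, so a question that cannot be answered from them gets refused rather than answered from raw, unprepared data.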
But I don't think we are at the point where you just don't need the data team and can simply ask a question. It's almost guaranteed that the answer will be wrong. So in that regard, data engineering and data engineers are definitely not going to lose their jobs just because it's now easy to generate SQL from text.
[00:16:19] Tobias Macey:
And in the context even of that text to SQL use case, what I've been hearing a lot is that it's not even very good at that. One, because LLMs are bad at math, and SQL is just a manifestation of relational algebra, and thereby math. But if you bring a knowledge graph into the system, where the AI is using the knowledge graph to understand what the relations between all the different entities are and from that generates the queries, it actually does a much better job. But, again, you have to build the knowledge graph first. And I think maybe that's one of the places where bringing AI earlier in the cycle is actually potentially useful, where you can use the AI to do some of that grunt work of saying, here are all the different representations that I have of this entity or this concept across my different data sources.
Give me a first pass of what a unified model looks like to be able to represent that, given all of the data that I have about it and all the ways that it's being represented. And I'm wondering what you've seen in that context of bringing the AI into that data modeling, data curation workflow, where it's not the end user interacting with it. It's the data engineer using the AI as their copilot, if you will, or as their assistant to be able to do some of that tedious work that would otherwise be, okay, well, I've got 15 different spreadsheets. I need to visually look across them and try and figure out the similarities and differences, etcetera.
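As a rough illustration of that first-pass idea, a sketch along these lines could ask a model to propose a unified entity model from several source representations. The source systems, field names, and the call_llm stub are all invented for illustration, and the output is a proposal for a human to review, not something to ship directly.

```python
# Sketch: ask an LLM for a first-pass unified model of one entity ("customer")
# that shows up in several source systems. Sources and fields are illustrative.

SOURCES = {
    "salesforce.Account": ["Id", "Name", "BillingCountry", "CreatedDate"],
    "app_db.users": ["user_id", "email", "country_code", "created_at"],
    "stripe.customers": ["id", "email", "address_country", "created"],
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model provider")

def propose_unified_model(entity: str, sources: dict) -> str:
    listing = "\n".join(f"{name}: {', '.join(cols)}" for name, cols in sources.items())
    prompt = (
        f"These systems each hold a representation of the '{entity}' entity:\n"
        f"{listing}\n"
        "Propose a unified dimensional model: canonical column names, types, "
        "and which source field maps to each, flagging any conflicts."
    )
    return call_llm(prompt)  # a human reviews the proposed mapping
```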
[00:17:50] Gleb Mezhanskiy:
Yeah. That's a good point, Tobias. I have two thoughts there. On how the AI plugs in to actually make text to SQL work: yes, you absolutely need that kind of semantic graph of what datasets you have, how they are related, what all the metrics are, and how those metrics are computed. And in that regard, what's really interesting is the metrics layer that was, at some point, a really hot idea in the modern data stack, probably about three to five years ago. And then everyone was really disappointed with how little impact it actually made on the data team's productivity and just overall on the data stack. It's almost like now it's the metrics layer's time. Because if you take the metrics layer, which gives you a really structured representation of the core entities and the metrics, putting text to SQL on top of it is almost the most impactful thing that you can do, because then you have a structured representation of your data model, which allows AI to be very, very effective at answering questions while operating on a structured graph.
And so I think we'll see really exciting applications coming out of the hybrid of that kind of fundamental metrics layer, semantic graph, and text to SQL. We're already seeing the early impacts of that. But I think over the next two years, it probably will become a really popular way to open up data for the ultimate stakeholders, instead of classical BI with drag-and-drop interfaces and passively consumed dashboards. But then the second point which you made is, basically, can AI actually help us get to that structured representation? And I think absolutely, for the data engineer's workflow. So not for, I would say, a business stakeholder or someone who is a data consumer, but for the data producer, I think that leveraging LLMs to help you build data models, and especially build them faster in the sense of understanding all the semantic relationships, not just writing code, is a very promising area. And that comes back to my point about how software tools are limited in their help for data engineers. Right? I can write SQL, but if my tool does not understand what the relationships between the datasets are, then it can't even help me write joins properly.
And one of the interesting things we've done at Datafold was actually build a system that essentially infers an entity relationship diagram from the raw data that you have, combined with all the ad hoc SQL queries that have been written by people. Previously, that would be a very hard problem to solve. But with the help of LLMs, we can actually have a really good shot at understanding what all the entities are that your business has in its data lake and how they're related. And that's almost like a probabilistic graph, because people can be writing joins correctly or incorrectly, and you have noisy data. And sometimes keys that you think are primary keys or foreign keys are not perfect.
But if you have a large enough dataset of queries that were run against your warehouse, you can actually have a really good shot at understanding what the semantic graph looks like. And the context in which we actually did this was to help data teams build testing environments for their data. But the implications of having that knowledge are actually very powerful. Right? So to your point, we can use those tools to help write SQL. So I'm very bullish on the ability to help data engineers build pipelines by creating a semantic graph without the need for manual curation. Because previously, that problem was pushed to people with all the data governance tools. The idea was, let's have data stewards define all the canonical datasets and all the relationships. And, obviously, that proved to be completely non scalable.
So now we're finally at the point where we can automate that kind of semantic data mining with LLMs.
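A toy version of that query-mining idea, not Datafold's actual system, could start by counting how often pairs of columns are joined across historical queries. Real SQL is far messier, which is where an LLM earns its keep, but the skeleton looks roughly like this:

```python
import re
from collections import Counter

# Toy sketch: mine join relationships from historical queries to approximate an
# entity-relationship graph. The query log below is invented for illustration.

QUERY_LOG = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id WHERE region = 'EU'",
    "SELECT * FROM payments JOIN orders ON payments.order_id = orders.id",
]

JOIN_ON = re.compile(r"ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)

def join_edge_counts(queries):
    edges = Counter()
    for query in queries:
        for table_a, col_a, table_b, col_b in JOIN_ON.findall(query):
            edge = tuple(sorted([f"{table_a}.{col_a}", f"{table_b}.{col_b}"]))
            edges[edge] += 1
    return edges

for (left, right), count in join_edge_counts(QUERY_LOG).items():
    print(f"{left} <-> {right}: seen in {count} queries")
```

Edges that appear in many queries become high-confidence relationships in the probabilistic graph, while rare or conflicting joins stay low-confidence until a person, or a model with more context, weighs in.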
[00:22:11] Tobias Macey:
That brings us back around to another point that I wanted to dig into further in the context of how to actually integrate the LLMs into these different use cases and workflows. You brought up the example of Cursor as an IDE that was built specifically with LLM use cases in mind, juxtaposed with something like VS Code or Vim or Emacs, where the LLM is a bolt-on. It's something that you're trying to retrofit into the experience. And it can be useful, but it requires a lot more effort to actually set it up, configure it, make it aware of the code base that you're trying to operate on, etcetera, versus the prepackaged product.
And we're seeing that same type of thing in the context of data, where you mentioned there are all these different vendors of, oh, hey, we're gonna make it super easy for you to make your data ready for AI or use this AI on your data. But most teams already have some sort of system in place, and they just wanna be able to retrofit the LLM into it to start getting some of those gains, with the eventual goal of having the LLM maybe be a core portion of their data system, their data product. And I'm wondering, in that process of bringing in an LLM, retrofitting it onto an existing system, whether that be your code editor, your deployment environment, your data warehouse, what have you.
What are some of those impedance mismatches or some of the issues in conceptual understanding about how to bring the appropriate, I'm gonna use the word knowledge, even though it's a bit of a misnomer, into the operating memory of the LLM so that it can actually do the thing that you're trying to tell it to do?
[00:23:55] Gleb Mezhanskiy:
Yeah. That's a great question, Tobias. I think that to answer this, we kind of need to go back to what the jobs to be done are for a data engineer, and what the data engineer workflow actually looks like. And if we were to visualize it, it actually looks quite similar to the software engineering workflow in just the types of tasks that a data engineer does day to day to do their work. And by the way, we're saying data engineer as sort of a blanket label, but I don't necessarily mean just people who have data engineering in the title, because all roles that are working with data, including data scientists, analysts, analytics engineers, and, in many cases, software engineers, a lot of them actually do data engineering in terms of building and developing pipelines as part of their job. It's just that data engineers probably do this for most of their time, whereas if I'm a data analyst or data scientist, I would be doing this maybe 40% of my week. And so if we think about what I need to do to, let's say, ship a new data model like a table, or extend an existing data model, you know, refactor definitions or add new types of information into an existing model, it starts with planning. Right? So I'm doing planning.
I'm trying to find the data that I need for my work. And a lot of the time, that information can be sourced from documentation, from a data catalog. I think right now, the data catalog, giving you a sense of what datasets you have and what the profile of those datasets is, has been largely solved. There are great tools. Some are open source. Some are vendors. But overall, understanding what datasets you have is now way easier than it was five years ago. You also probably are consulting your tribal knowledge, and you go to Slack and you search for certain definitions. And that's also now largely solved with a lot of the enterprise search tools. And then you go into writing code.
And writing code, I think this is also an important misconception. If you are not really doing this for a living, you think that people spend most of their time actually writing SQL, in terms of writing SQL for production. And in my experience, the actual writing of the SQL or other types of code is maybe 10 to 15% of my time, whereas all the operational tasks around testing it, talking to people to get context, doing code reviews, shipping it to production, monitoring it, remediating issues, and talking to more people is where the bulk of the work is happening.
And if that's true, then that means that, as we talk about automation, these operational workflows are where the bulk of the lift coming from LLMs can actually happen. And so for actually writing code as a data engineer, I would still recommend using the best-in-class software tools these days, like Cursor. Even though it's not aware of the data, it will probably still help you write a lot of boilerplate and will speed up your workflow somewhat. Or you can use other IDEs with Copilot, like VS Code plus Copilot. I think those tools will just help you speed up the writing of the code itself. But back to the operational workflows that I think take the majority of the time within any cycle of shipping something. When it comes to what happens after you wrote the code, typically, if you have people who care about the quality of the data, it means that you have to do a fair amount of testing of your work.
And testing is both about making sure that my code is correct. Right? Does it conform to the expectations? Does it produce the data that I expect? But it's also about understanding potential breakages. Data systems are historically fragile in the sense that you have layers and layers of dependencies that are often opaque, because I can be changing some definition of what an active user is somewhere in the pipeline, but then be completely oblivious to the fact that 10 jobs down the road, someone built a machine learning model that consumes that definition and tries to automate certain decisions, for example around spend, based on that metric. And so if I'm not aware of those downstream dependencies, I could actually be causing a massive business disruption just by the sheer fact of changing it. And so the testing that involves not just understanding how the data behaves, but also how the data is consumed and what the larger business implications are for making any kind of modification to the code, is where a ton of time is spent in data engineering. And what's interesting is that this is the use case where, historically, we at Datafold spent a lot of time thinking, even pre-AI. And before LLMs were a thing, what we did there was come up with a concept of data diffing. The idea is everyone can see a code diff. Right? My code looked like this before I made a change. Now it's a different set of characters that the code looks like. And diffing the code is something that is embedded in GitHub. Right? You can see the diff. But the very hard question is understanding how the data changes based on the change in the code, because that is not obvious. That only happens once you actually run the code against the database. And so data diff allows you to see the impact of a code change on the data. And that by itself was quite impactful, and we've seen a lot of teams adopt that, large enterprise teams, fast-moving startup teams. But we were not fully satisfied with the degree of automation that feature alone produced, because people are still required to sift through all the data diffs and explore them for multiple tables and see how the downstream impacts propagate through lineage.
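To make the data diff idea concrete before going further, here is a toy, in-memory sketch. Real data diffs run inside the warehouse at much larger scale; the tables, keys, and numbers below are invented.

```python
# Toy illustration of a data diff: the same primary key, two versions of a
# table (main branch vs. development branch), and a summary of what changed.

main_branch = {1001: 49.0, 1002: 120.0, 1003: 15.5}  # order_id -> amount
dev_branch = {1001: 49.0, 1002: 125.0, 1004: 9.9}    # 1002 changed, 1003 dropped, 1004 added

def data_diff(old: dict, new: dict) -> dict:
    keys_old, keys_new = set(old), set(new)
    return {
        "added_rows": sorted(keys_new - keys_old),
        "removed_rows": sorted(keys_old - keys_new),
        "changed_values": {k: (old[k], new[k])
                           for k in keys_old & keys_new if old[k] != new[k]},
        "metric_delta": round(sum(new.values()) - sum(old.values()), 2),
    }

print(data_diff(main_branch, dev_branch))
```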
And it felt like, okay, now at least we can give people all the information, but they still have to sift through a lot of it, and some of the important details can be missed. And the big unlock that LLMs bring to this particular workflow, once LLMs became pretty good at comprehending code and actually semantically understanding the code, which pretty much happened over 2024 with the latest generation of foundational large language models, is that we were able to do two things. One, take a lot of information and condense it into, like, three bullet points, kind of like an executive summary. And those bullet points are essentially helping the data engineer understand, at a high level, what the most important impacts are that they need to worry about for any given change, and helping a code reviewer understand the same. And that just helps people get on the same page very quickly, saving a lot of time that otherwise would be spent in meetings, going back and forth, putting comments on a code change. And the second unlock that we've seen is the opportunity to drill down and explore all the impacts and do the testing by essentially chatting with your pull request, chatting with your code. And that comes in the form of a chat interface where you're basically speaking to an agent that has the full context of your code, full context of the data change, the data diff, and also full context of your lineage, so that it can actually understand how every line of code that was modified is affecting the data and what that means for the business.
And you can ask questions, and it produces the answers way faster than you would by essentially looking at all the different code changes and data diffs. And that ended up saving a lot of time for data teams. And now that I'm describing this, it sounds almost like having a buddy that just helps you think through the code, almost like having a code reviewer, except with AI. With an LLM, this is a buddy that's always available to you twenty-four seven and probably makes fewer mistakes, because it has all the context and can sift through a lot of information really quickly. So that's an example of how an LLM can be applied to an operational use case that has historically been really time consuming, and take a lot of manual work out of that workflow.
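A minimal sketch of that summarization step, with hypothetical helper names and a stubbed model call rather than any specific product's API, might look like this:

```python
# Sketch: condense the code diff, the data diff, and downstream lineage into a
# single prompt and ask the model for a three-bullet executive summary.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model provider")

def summarize_change(code_diff: str, data_diff_summary: str, downstream: list) -> str:
    prompt = (
        "You are reviewing a data pipeline change.\n"
        f"Code diff:\n{code_diff}\n\n"
        f"Data diff summary:\n{data_diff_summary}\n\n"
        f"Downstream consumers: {', '.join(downstream)}\n\n"
        "In three bullet points, state the most important impacts a reviewer "
        "should worry about and whether any downstream consumer is at risk."
    )
    return call_llm(prompt)
```

The same assembled context can then back the chat interface described above, since the agent answers follow-up questions against the exact same code diff, data diff, and lineage inputs.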
[00:32:13] Tobias Macey:
And I really wanna dig into that one word that you said probably at least a half dozen times, if not a couple of dozen, which is context. That, I think, is the key piece that is so critical, and also probably the most difficult portion of making AI useful is context. What context does it need? How do you get that context to it? How do you model that context? How do you keep it up to date? And so I think that really is where the difference comes in between the Cursor example that we touched on earlier versus the retrofitting onto Emacs or whatever your tool or workflow of choice is: how do you actually get the context to the place that it needs to be? And so you just discussed the use case that you have of being able to use the LLM for interpreting the various data diffs and understanding what the actual ramifications of a change are. And I'm wondering if you can just talk through some of the lessons learned about how you actually populate and maintain that context, and how you're able to instruct the LLM to take advantage of the context that you've given it?
[00:33:21] Gleb Mezhanskiy:
That's a great question, Tobias. And I think what's interesting is that, at face value, it seems like you want to throw all the information you have at the LLM. Right? Just tell it everything and then let it figure things out. And in fact, it is obviously not as easy as that. It's actually counterproductive to oversupply the LLM with context, in part because the context window of large language models is limited. And the trade-off there is, one, you just can't physically fit everything. And, two, even if you were dealing with a model that is actually designed to have a very large context window, if you overuse it and supply too much information, the LLM just gets lost. It starts being far less effective at understanding what's actually important versus not, and the overall effectiveness of your system goes down.
So back to your question of what the actual information is that is important to provide as context to the LLM: it really depends on what workflow we're talking about. In the context of code review and testing, we are fundamentally trying to answer three questions. The first is, if we changed the code, was the change correct relative to what we were trying to do, what the task was, or did we not conform to the business requirement? The second question is, did we follow the best practices, such as code guidelines and performance guidelines, or not? And the third question is, okay, let's say we conformed to the business requirements and we did a good job of following our coding best practices, but we may still cause a business disruption just by making a change that comes as a surprise, either to a human consumer of data downstream, or by throwing off a machine learning model that was trained on a different distribution of data. Right? And so these are the fundamental three questions that we try to answer. And by the way, even without AI, that's what a good code review done by humans would ultimately accomplish.
So what is the context that is important for the LLM to have here? First, obviously, it is the code diff. Right? We already know what the original code was and what the new code is. And feeding that into the LLM is really important so that it can understand what the actual changes in the code itself, in the logic, are. I won't go into the details here, because, obviously, the code base can be very large and sometimes your PR can touch a lot of code, so you have to be quite strategic in how you feed that in on the technical side. But conceptually, that's what we have to provide as input number one. The second important input is the data diff. Right? It's understanding, if I have the main branch version of the code, what data it produces and what the metrics are showing. And then, if I have a new version of the code, let's call it the developer branch, what data it produces and what the difference in the output is.
Let's say, with my main branch code, I see that I have 37 orders on Monday. But with the new version of the code, I see that I have 39. And so that already tells me, okay, this is the important impact on the output data and on the metrics. And that's important both on the value level, understanding how the individual cells, rows, and columns are changing, but it's also important to do roll-ups and understand what the impact on metrics is. And coupling that context with the code diff allows us to understand how changes in the code affect the actual data output. And the third really important aspect is the lineage. Lineage is fundamentally understanding how the data flows throughout your system, how it's computed, how it's aggregated, and how it's consumed.
And the lineage is a graph, and there are two directions of exploration. One of them is upstream, which helps us understand how the data got to the point where you're looking at it. Right? So, for example, if I'm looking at the number of orders and I'm changing a formula, where does the information about orders come from in the first place? That is important because it can tell us a lot about how a given metric is computed and what the sources of truth are. Are we getting it from Salesforce? Are we getting it from our internal system? And then the downstream lineage is also important because it tells us how the data gets consumed, and that is absolutely essential information that can help us understand what downstream systems and metrics will be affected. And the lineage graph in itself can be very complex, and building it is actually a tough problem, because you have to essentially scrape all of your data platform information, all the queries, all the BI tools, to understand how data flows, how it's consumed and produced. But let's say you have this lineage graph. It's actually also a lot of information by itself. And so to properly supply that lineage information into an LLM's context, you kind of need your system to be able to explore the lineage graph on its own, to see, okay, if the developer made a change here, what are the important downstream implications of that? So now we're talking about the system being able to traverse that graph and do analysis on its own for the context. I would say these are the three most important types of context. And then the fourth one is kind of optional. If your team has any best practices, SQL linting rules, documentation rules, you can also provide them as context, and then your AI code reviewer assistant can help you reason about whether you conformed or not, and if not, make suggestions about what to correct. Eventually, it will probably go in and correct your code itself. I think that's ultimately where this is going. But, again, it would pretty much be operating on the same set of input context.
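The lineage-traversal piece is the easiest to sketch in isolation. A toy downstream walk over a made-up lineage graph, bounded by a hop limit so that only the relevant slice of the graph ends up in the model's context, could look like:

```python
from collections import deque

# Sketch: downstream lineage traversal so only the affected subgraph is handed
# to the model as context. The graph below is invented for illustration.

LINEAGE = {  # node -> direct downstream consumers
    "stg_orders": ["fct_orders"],
    "fct_orders": ["rev_dashboard", "ltv_model_features"],
    "ltv_model_features": ["ltv_model"],
}

def downstream_of(node: str, graph: dict, max_hops: int = 3) -> list:
    seen, queue = set(), deque([(node, 0)])
    while queue:
        current, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, hops + 1))
    return sorted(seen)

print(downstream_of("stg_orders", LINEAGE))
# ['fct_orders', 'ltv_model', 'ltv_model_features', 'rev_dashboard']
```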
[00:39:13] Tobias Macey:
Another interesting element of bringing LLMs into the context of the data engineering workflow and use case, one is the privacy aspect, which is a whole other conversation. I don't wanna get too deep into that quagmire. But, also, when you're working as a data engineer, one of the things you need to be thinking about is, what is my data platform? What are the tools that I rely on? What are the ways that they link together? And if you're going to rely on an LLM or generative AI as part of that tool chain, how does that fit into that platform? What is some of the scaffolding? What are some of the workflows? What is some of the custom development that you need to do? A lot of the first-pass and naive use cases for generative AI and LLMs are, oh, well, just go and open up the ChatGPT UI, or just go run LM Studio, or use Claude, or what have you. But if you want to get into anything sophisticated, where you're actually relying on this as a component of your workflow, you want to make sure that it's customized, that you own it in some fashion.
And so that is likely going to require doing some custom development, using something like LangChain or LangGraph or CrewAI or whatever, where you're actually building additional scaffolding logic around just that kernel of the LLM. And I'm curious how you're seeing some of the needs and use cases of incorporating the LLM more closely into the actual core capabilities of the data platform through that effort of customization and software engineering.
[00:40:45] Gleb Mezhanskiy:
That's a great point, Tobias. I think that the models themselves are getting rapidly commoditized, in the sense that the foundational large language models' interfaces are very similar and their capabilities are similar. We're seeing a race between the companies training those models in terms of beating each other in benchmarks. It looks like the whole industry is converging on adding more reasoning, and the way that this is happening is also converging on the same experience, and the difference is, who is doing this better? Right? Who is beating the benchmarks? Who provides the cheaper inference, the faster inference, more intelligence for the same price? And so I don't think that the differentiation, or the effectiveness of whatever automation you're trying to bring, really depends on the choice of the model. Maybe for certain narrow applications, choosing a more specialized model or fine-tuning a model would be more applicable. But still, I don't think the model is really where the magic happens these days.
The model is important for the magic, but it's not something that allows you to build a really effective application just by choosing something better than what's available to everyone else. The actual magic and the value add and the automation happen in how you leverage that model in your workflow. So all the orchestration in terms of how you prompt the model, what kind of context you provide, how you tune the prompt, how you tune the inputs, how you evaluate the performance of the model in production, how you make various LLM-based actors that may be playing different roles interact with each other. That is where the hard work is happening, and that is where I think the actual value and impact are created. And that's where all the complexity is. So I think you don't have to be a PhD and really understand how the models are trained. Although, I would say, just like in computer science, it's obviously very helpful to understand how these models are trained and their architectures and their trade-offs. But you don't have to be good at training those models in order to effectively leverage them. To leverage them, though, you have to do a lot of work to effectively plug them into the workflows. And I think that the applications and companies and teams that are thinking about what the workflow is, what the ideal user interface is, what all the information is that we can gather to make the LLM do a better job, and that are able to rapidly iterate, will ultimately create the most impact with LLMs.
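One small, hypothetical illustration of that point is keeping the model behind a narrow seam so that prompts, providers, and evaluation can change independently; none of this reflects any particular vendor's architecture.

```python
import json
import time
from typing import Callable

# Sketch: the model sits behind one narrow function, and every call is logged
# so prompts and models can be swapped and evaluated offline later.

def run_step(task: str, context: str, model: Callable[[str], str],
             log_path: str = "llm_calls.jsonl") -> str:
    prompt = f"Task: {task}\nContext:\n{context}\nAnswer concisely."
    started = time.time()
    answer = model(prompt)
    with open(log_path, "a") as log:
        log.write(json.dumps({
            "task": task, "prompt": prompt, "answer": answer,
            "latency_s": round(time.time() - started, 3),
        }) + "\n")
    return answer

# Swapping providers or prompt templates only touches this seam, and the log
# becomes the evaluation dataset whenever a new model is released.
fake_model = lambda prompt: "stub answer"
run_step("classify alert severity", "row count dropped 40% overnight", fake_model)
```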
[00:43:31] Tobias Macey:
And so on that note, in your experience of working with the LLMs, working with other data teams, and keeping apprised of the evolution of the space, what are some of the most interesting or innovative or unexpected ways that you've seen teams bring LLMs into that inner loop of building and maintaining and evolving their data systems?
[00:43:52] Gleb Mezhanskiy:
I think the realization that is obvious in hindsight, but not necessarily obvious when you're just starting, is that no one really knows how to ship LLM-based AI applications. There are obviously guides and tutorials, and there's a lot you can learn from looking at what people are doing, but the field is evolving so fast that nothing replaces fast experimentation and just building things. It's not that you can just hire someone who worked on building an LLM-based application six months ago or a year ago and all of a sudden gain a lot of advantage, as you would with many other technologies. If we were working in the space of video streaming, for instance, it would be very beneficial to have extensive experience with video streaming and codecs. With LLMs, no one really knows exactly how they behave; even the companies that are shipping them are discovering more and more novel ways of leveraging them effectively every week.
And for the teams that are leveraging LLMs, like Datafold, the thing that we found matters the most is the ability to, a, just stay on top of the field and understand what the most exciting things are that people are doing, how they relate to our field, and how we can borrow some of those ideas. But most important is rapid experimentation, with some sort of methodology that allows you to try new things, measure results quickly, and then be able to scrap an approach that you thought was great and just go with a different one. Because a lot of the time, when a new model is released, you have to adjust a lot of things. You have to adjust the prompts. You may even have to rearchitect some of the flows that you've built.
And that is both difficult but also incredibly exciting because the pace of innovation and what is possible to solve is evolving extremely fast. I would say the fastest of any previous technological wave of disruption that we've seen.
[00:46:17] Tobias Macey:
In your experience and in your work of investing in this space, figuring out how best to apply LLMs to the problems facing data engineers and how to incorporate that into your products, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:46:34] Gleb Mezhanskiy:
Yeah. I think that the interesting realization was that, specifically for the data engineering domain, if you just take the problem at face value, you think, well, let's just build a Copilot or an agent that would try to automate the data engineer away. And I don't think we have the tech ready for an agent to just really take a task and run with it yet. I don't think it's been solved in the software space. I think it's, in some ways, even harder to solve in the data space. We'll eventually get there. I don't think we are there yet. I also don't think that the biggest impact you can make on the data engineering workflow is having a copilot, because that's not where data engineers spend most of their time; it's not writing production code. It's all the operational tasks. And there are certain kinds of problems in the data engineering space where it's not even the day-to-day help where you save an hour, two hours, three hours.
But there are certain types of workflows where, to complete a task, a team needs to spend, like, ten thousand hours. And a good example of such a project would be a data platform migration where, for example, you have millions of lines of code on a legacy database. You have to move them over to a new, modern data warehouse. You have to refactor them, optimize them, repackage them into a new kind of framework. Right? You may be moving from, like, stored procedures on Oracle to dbt plus Databricks. And doing that requires a certain number of hours for every object. And because you're dealing with a large database, at the enterprise level that sums up to an enormous amount of work.
And, historically, these projects would last years and be done, a lot of times, by outsourced talent from consultants or SIs. And for a data engineer, that's probably one of the most miserable projects to do. I've led such a project at Lyft, and it's been an absolute grind, where you're not shipping new things. You're not shipping AI. You're not even shipping data pipelines. You're just solving technical debt for years. And what's interesting is that those types of projects and workflows are actually, I would say, where AI and LLMs can make the most impact today, because we can take a task.
We can reverse engineer it. We know exactly what the target is: you move the code, you do all of these things with the code, and, ultimately, the data has to be the same. Right? You're going through multiple complex steps, but what's important for the business is that once you move from, let's say, Teradata to Snowflake, your output is the same, because, otherwise, the business wouldn't accept it. And that allows us to, a, leverage LLMs for a lot of the tasks that are historically manual, but also have a really clear objective function for the LLMs, like diffing the output of the legacy system against the modern system and using it as a constraint.
And if you put those two things together, you have a very powerful system that is, a, extremely flexible and scalable thanks to LLMs, but also can be constrained to a very objective definition of what's good, unlike a lot of this text to SQL generation that cannot be constrained to a definition of what's good. Because, like, how do you know? By the end of a migration, you do know. And that allows AI to make a tremendous impact on the productivity of a data team, by essentially taking a project that would otherwise last years, cost millions of dollars, and go over budget, and constraining that into weeks and just a fraction of the price. I think that is where we can see real impact of AI that's, like, useful. It's working.
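As a rough sketch of that translate-and-verify loop, with every function a placeholder for the real LLM call and warehouse connections, the control flow is the interesting part: the diff of outputs is the objective function, and the loop only accepts a translation when it is empty.

```python
# Sketch of a migration loop: translate a legacy object, run both versions,
# diff the outputs, and only accept when they match. All functions are stubs.

def translate(legacy_sql: str, feedback: str = "") -> str:
    raise NotImplementedError("LLM call: legacy dialect -> target dialect")

def run_legacy(sql: str) -> list:
    raise NotImplementedError("execute on the legacy warehouse")

def run_modern(sql: str) -> list:
    raise NotImplementedError("execute on the target warehouse")

def migrate_object(legacy_sql: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        candidate = translate(legacy_sql, feedback)
        # Symmetric difference of result rows as a crude output diff.
        diff = set(run_legacy(legacy_sql)) ^ set(run_modern(candidate))
        if not diff:
            return candidate  # outputs match: the objective is satisfied
        feedback = f"Outputs differ on {len(diff)} rows; fix the translation."
    raise RuntimeError("translation did not converge; escalate to a human")
```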
And we also see the parallels in the software space as well. A lot of the really thoughtful enterprise applications of AI are actually taking these legacy code bases and helping teams maintain them or migrate them. And I think that there are more opportunities like that in the data engineering space where we'll see AI make a tremendous impact.
[00:51:03] Tobias Macey:
And as you continue to keep in touch with the evolution of the space, work with data teams, and evaluate the cases where LLMs are beneficial versus where you're better off going with good old human ingenuity, what are some of the things you're keeping a particularly close eye on, or any projects or contexts you're excited to explore?
[00:51:27] Gleb Mezhanskiy:
In terms of where I think LLMs would really make a huge impact on the workflow?
[00:51:33] Tobias Macey:
Just LLMs in general, how to apply them to data engineering problems, how to incorporate them more closely and with less legwork into the actual problem solving apparatus of an organization.
[00:51:46] Gleb Mezhanskiy:
Yeah. So I think that on multiple levels, there's a lot of exciting things. For example, being able to prompt an LLM from SQL as a function call, which is available these days in modern data platforms, is incredibly impactful. Right? Because in many instances, we're dealing with extremely massive data, and instead of having to write complex CASE WHEN statements and regexes and UDFs to be able to clean the data, to classify things, and to just untangle the mess, we can now apply LLMs from within SQL, from within the query, to solve that problem. And that is incredibly impactful for a whole variety of different applications. So I'm very excited about all these capabilities that are now brought by the major data platforms like Snowflake, Databricks, and BigQuery.
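As an illustration of the shape of that capability, the snippet below embeds an LLM call in a query to classify messy free text. LLM_CLASSIFY is a placeholder name; each platform exposes its own function and signature (Snowflake's Cortex functions, Databricks' ai_query, and BigQuery's ML.GENERATE_TEXT are the usual entry points), so check your warehouse's documentation for the real one.

```python
# Sketch: calling an LLM from inside SQL to classify free-text values instead
# of maintaining a pile of regexes and CASE WHEN branches.

CLEANUP_QUERY = """
SELECT
    ticket_id,
    raw_subject,
    LLM_CLASSIFY(  -- hypothetical warehouse LLM function; name varies by platform
        'Classify this support ticket as billing, bug, or feature_request: '
        || raw_subject
    ) AS ticket_category
FROM support_tickets
"""

def run_query(sql: str):
    raise NotImplementedError("execute via your warehouse connector")

# run_query(CLEANUP_QUERY)
```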
I think that if we go into the workflow itself, like, what does a data engineer do and how do we make that work better, I think there's a ton of opportunity to further automate a lot of tasks. I think a big one is data observability and monitoring. I honestly think that data observability in its current state is a dead end in terms of, let's cover all data with alerts and monitors and then be the first to know about any anomalies. It's useful, but it quickly leads to a lot of noise and alert fatigue, and it could ultimately even be net negative on the workflow of a data engineer.
I think that this is the type of workflow where putting an AI to work investigating those alerts, doing the root cause analysis, and potentially remediating them is where I see a lot of opportunity for saving a ton of time for data teams while also improving the SLAs and the overall quality of the output of a data engineering team. And that's something that we are really excited about. It's something we're working on at Datafold, and we are excited about it coming later this year.
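A minimal sketch of that triage idea, where the gather_* helpers and the call_llm stub are hypothetical stand-ins rather than a description of Datafold's product, could look like this:

```python
# Sketch: instead of paging a human for every anomaly, gather the context an
# on-call engineer would gather and ask the model for a likely root cause.

def gather_recent_code_changes(table: str) -> str:
    raise NotImplementedError("e.g. recently merged PRs touching this model")

def gather_upstream_runs(table: str) -> str:
    raise NotImplementedError("e.g. freshness and volume of upstream tables")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model provider")

def triage_alert(table: str, anomaly: str) -> str:
    prompt = (
        f"Alert: {anomaly} on table {table}.\n"
        f"Recent code changes:\n{gather_recent_code_changes(table)}\n"
        f"Upstream run status:\n{gather_upstream_runs(table)}\n"
        "Rank the most likely root causes and suggest a remediation, or say "
        "'escalate to a human' if the evidence is inconclusive."
    )
    return call_llm(prompt)
```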
[00:53:56] Tobias Macey:
Are there any other aspects of this overall space of using LLMs to improve the lives of data engineers, and the work that data engineers can do to improve the effectiveness of those LLMs, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:12] Gleb Mezhanskiy:
I think that we talked a lot about the workflow improvements. Overall, my recommendation to data engineers today would be to learn how to ship LLM applications. It's not that hard. Frameworks like LangChain make it very easy to compose multiple blocks together and ship something that works. Whether or not you end up using LangChain or another framework in production, and whether your team allows that, doesn't really matter, but it's really, really useful to try to build and learn all the components.
And it's just like software engineering. Learning how to code opens up so many opportunities for you to solve problems. Right? You see a problem and you're like, I can write a Python script for that. And I think that with LLMs, it's almost like a new skill that both software engineers and data engineers need to learn, where you see a problem and you think, okay, I actually think I can split the problem into three tasks that I can give to an LLM. Like, one would be extraction. Another could be, like, reasoning and classification. And now it just solves the problem.
But really, learning how to build and trying things helps you build that intuition. And so my recommendation for all data engineers listening to this is to try to build your own application that solves either a business problem or helps you in your own workflow, because knowing how to build with LLMs just gives you tremendous superpowers and will definitely be helpful in your career in the coming years.
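In that spirit, a tiny, framework-free sketch of the split-the-problem habit, with an invented invoice example and a stubbed model call, is enough to practice the pattern:

```python
# Sketch: one call extracts structured fields, a second call classifies them.
# The invoice scenario and call_llm stub are invented for illustration.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model provider")

def extract_fields(email_text: str) -> str:
    return call_llm(
        "Extract vendor name, invoice amount, and due date as JSON from:\n"
        + email_text
    )

def classify_urgency(fields_json: str) -> str:
    return call_llm(
        "Given this invoice JSON, answer 'urgent' if it is due within 7 days, "
        "otherwise 'normal':\n" + fields_json
    )

def handle_invoice_email(email_text: str) -> str:
    return classify_urgency(extract_fields(email_text))
```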
[00:55:52] Tobias Macey:
I definitely would like to reinforce that statement, because despite the AI maximalists and the AI skeptics, no matter what you think about it, LLMs aren't going anywhere. They're going to continue to grow in their usage and their capabilities, so it's worth understanding how to use them and investing in that skill, because it is going to be one of those core tools in your toolbox for many years to come. And so for anybody who wants to get in touch with you and follow along with the work that you are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your current perspective on the biggest gap in the tooling or technology for data management today.
[00:56:35] Gleb Mezhanskiy:
I think that there's a lot of skepticism and some bitterness around the idea that the modern data stack failed us, in the sense that we were so excited five years ago that the modern data stack would make things so great, and we're kind of disappointed. I'm an optimist here. I think that the modern data stack, in the sense of infrastructure and getting a lot of the fundamental challenges out of the way, like running queries and getting data in and out of different databases and visualizing the query outputs and having amazing notebooks.
All of that, which we now take for granted, is actually so great relative to where we were five, seven, eight, ten years ago. But I don't think it's enough. So I am with the data practitioners who say, well, it's 2025. We have all these amazing models. Why is it still so hard to ship data? I'm absolutely with you. And I think what I'm excited about is, now that we have this really great foundation with the modern data stack in the sense of infrastructure, I'm excited about, one, getting everyone onto the modern data stack, to the point of migrations. Right? Let's get everyone on modern infrastructure so that they can ship faster. Obviously, a problem that I'm really passionate about solving and working on.
Second, once you are on the modern data infrastructure, how to keep modernizing your team's workflows so that data engineers are spending more and more time on solving hard problems and thinking and planning, on the valuable activities that are really worth their time, and less and less on operational toil that is just burnout inducing and holds everyone back. So I'm excited about a modern data stack renaissance, thanks to the fundamental capabilities of large language models.
[00:58:30] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share your thoughts and experiences around building with LLMs to improve the capabilities of data engineers. It's definitely an area that we all need to be keeping track of and investing some time into. So I appreciate the insights that you've been able to share, and I hope you enjoy the rest of your day.
[00:58:50] Gleb Mezhanskiy:
Thank you so much, Tobias.
[00:58:59] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Welcome
Gleb Mezhanskiy's Background and DataFold
AI in Data Engineering: Hype vs Reality
AI for Data Engineers: Opportunities and Challenges
Text to SQL: Potential and Pitfalls
Data Engineer's Workflow and AI Integration
The Importance of Context in AI Applications
Customizing LLMs for Data Platforms
Lessons from Applying LLMs in Data Engineering
Future of LLMs in Data Engineering