Summary
Generative AI has rapidly gained adoption for numerous use cases. To support those applications, organizational data platforms need to add new features, and data teams have increased responsibility. In this episode Lior Gavish, co-founder of Monte Carlo, discusses the various ways that data teams are evolving to support AI-powered features and how they are incorporating AI into their work.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Lior Gavish about the impact of AI on data engineers
- Introduction
- How did you get involved in the area of data management?
- Can you start by clarifying what we are discussing when we say "AI"?
- Previous generations of machine learning (e.g. deep learning, reinforcement learning, etc.) required new features in the data platform. What new demands is the current generation of AI introducing?
- Generative AI also has the potential to be incorporated in the creation/execution of data pipelines. What are the risk/reward tradeoffs that you have seen in practice?
- What are the areas where LLMs have proven useful/effective in data engineering?
- Vector embeddings have rapidly become a ubiquitous data format as a result of the growth in retrieval augmented generation (RAG) for AI applications. What are the end-to-end operational requirements to support this use case effectively?
- As with all data, the reliability and quality of the vectors will impact the viability of the AI application. What are the different failure modes/quality metrics/error conditions that they are subject to?
- As much as vectors, vector databases, RAG, etc. seem exotic and new, it is all ultimately shades of the same work that we have been doing for years. What are the areas of overlap in the work required for running the current generation of AI, and what are the areas where it diverges?
- What new skills do data teams need to acquire to be effective in supporting AI applications?
- What are the most interesting, innovative, or unexpected ways that you have seen AI impact data engineering teams?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with the current generation of AI?
- When is AI the wrong choice?
- What are your predictions for the future impact of AI on data engineering teams?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Monte Carlo
- NLP == Natural Language Processing
- Large Language Models
- Generative AI
- MLOps
- ML Engineer
- Feature Store
- Retrieval Augmented Generation (RAG)
- Langchain
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey. And today, I'm welcoming back Lior Gavish to talk about the impact of AI on data engineers. So, Lior, can you start by introducing yourself?
[00:01:01] Lior Gavish:
Hi, Tobias. Thanks for having me today. I'm Lior, the co-founder of a company called Monte Carlo. We're the data observability company, which means we help data teams, data engineers, data analysts, and data scientists create reliable and trusted data for whatever they're using it for, whether it's analytics or machine learning or, increasingly, AI. We've been around for about 5 years, we're now serving over 400 data teams, and we're growing quickly. And, you know, we love everything data, so it's always exciting to be on the show.
[00:01:41] Tobias Macey:
As I mentioned, you've been on a couple of times before, but for anybody who hasn't heard your past appearances, which I'll link in the show notes, can you just refresh our memories as to how you got started in data?
[00:01:51] Lior Gavish:
Oh, absolutely. I got started in data, I wanna say, over 15 years ago as what you would typically call today a machine learning engineer. I was actually building NLP models to help summarize and classify news articles. That's how I started. I then went on to start my own company in cybersecurity and used analytics and machine learning to solve certain kinds of fraud use cases. The company got acquired by a bigger, public cybersecurity firm, and I went on to lead the data and engineering teams there, about a 100 people, you know, building various kinds of real time protection from fraud and cyber attacks that heavily relied on analytics and machine learning. So my background is a combination of engineering and data, and therefore also my interest in how you operationalize data and how you make it as reliable and as trusted as it can be when it serves, you know, real time applications and sometimes millions of users. So that's my background, in gist.
[00:03:11] Tobias Macey:
It's always interesting, in this world that we're in now of large language models and generative AI, hearing the term natural language processing and the fact that you were working so hard to be able to summarize those news articles, when now people just say, oh, I'll just throw it at ChatGPT, and it does it for me. But there are also, of course, the issues around quality and accuracy in those summarizations.
[00:03:34] Lior Gavish:
Yeah. Scratches on my back. What I spent, you know, months on end trying to build with very custom and specialized algorithms, you can now do with an API call that costs, you know, fractions of a cent. So, there you go. There's some progress in the world.
[00:03:56] Tobias Macey:
And nowadays, whenever somebody says the word AI, the assumption is that we're talking about ChatGPT and the like. But for the purposes of this conversation, when we're talking about the impact that the current phase of AI is having on data teams, and data engineers in particular, I'm wondering if you can give some clarifying detail about what you mean when we're saying AI.
[00:04:22] Lior Gavish:
It's a great point. I think, you know, the debate of AI versus ML and what counts as AI has been going on for as long as I can remember, and there have always been a lot of opinions on it. I think, for the sake of this discussion, we could probably think of AI as, generally speaking, the generative AI models that emerged in the last, you know, almost 2 years now. So think, you know, ChatGPT and OpenAI, and now more recently, you know, the various kinds of Llamas, and Mistral and Anthropic and whatever.
I think those models introduced kind of a foundational shift in how we think about using AI. And I'm not sure we're super close, but it's probably the most promising avenue to true, you know, machine intelligence, potentially getting closer and closer to human level intelligence. And so, for right now, I'm thinking about AI mostly as those large language models that have been introduced pretty recently.
[00:05:36] Tobias Macey:
Machine learning went through a bit of a resurgence 5 years ago or thereabouts with the idea of ML engineers, the work of actually bringing machine learning into the real time application experience for end users, and MLOps. So there's been a lot of that conversation happening for a little while now. With the advent of generative AI and the new demands these models are placing, particularly thinking in terms of things like prompt engineering, retrieval augmented generation, and the fact of the models themselves being so much bigger, I'm curious if you can start by giving a bit of an overview of what you're seeing as the new requirements and the new features that are required in data platforms for being able to support these new categories of model in an operational environment.
[00:06:31] Lior Gavish:
Absolutely. You're right. About 5 or 10 years ago, there was an explosion of tools to bring, you know, machine learning into production, whether it's feature stores and model serving frameworks and model versioning and all kinds of different technologies. Generative AI has brought in, you know, a new stack, and I'm happy to go through the different components of that stack. First and foremost, and the thing that probably many of our listeners have used, is the model APIs. Right? OpenAI did something very cool, which didn't exactly exist in the ML world. It basically made models a commodity by serving them as an API. Right? You can make a very simple HTTP request to OpenAI, and now to many other providers, and get access to the latest and greatest model that's been trained by, I mean, 500 PhDs and a $1,000,000,000 worth of GPUs. Right?
And that's a really core part of the stack that everybody's familiar with. Having said that, over the last 2 years there have been a lot of other components that kind of emerged to help teams build with generative AI. First and foremost, RAG, you mentioned it. It's this idea of, well, how do we add long term memory and other capabilities to these large language models? Retrieval augmented generation is probably the most dominant way to do so. And it's this idea of, you know, let's take a lot of data, typically unstructured data but sometimes structured data, put it in a database, and make it available to the model while it is responding to user prompts.
Right? So, a very simple example: if I want to use a model to answer a lot of questions about my, you know, documentation, my developer documentation, I can put all of these documents that have been built over the years into a database. Specifically, I might use a vector database. And then, when the model gets a new question from a user about how to do this or that, it might use the database to retrieve documents relevant to the user's prompt and then create an answer to that question using those documents, you know, typically by some form of summarization or extraction.
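To make that flow concrete, here is a minimal sketch of the ingest-retrieve-generate loop Lior describes, using the OpenAI Python client. The model names, the toy documents, and the in-memory cosine-similarity "index" are illustrative stand-ins; a real pipeline would chunk documents deliberately and store the vectors in a vector database.

```python
# Minimal RAG sketch (illustrative): embed docs, retrieve by cosine
# similarity, answer grounded in the retrieved context. Model names and
# documents are placeholders, not anything discussed in the episode.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

docs = [
    "To rotate an API key, open Settings > Keys and click Rotate.",
    "Webhook deliveries are retried up to five times with backoff.",
]
index = embed(docs)  # ingestion: one vector per document chunk

def answer(question, k=1):
    q = embed([question])[0]
    # Retrieval: cosine similarity between the question and each chunk.
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(docs[i] for i in sims.argsort()[-k:])
    # Generation: answer the user's prompt using only the retrieved docs.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Answer using only these documents:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("How do I rotate an API key?"))
```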
And so RAG has prompted, if you will, a bunch of different technologies, most prominently the vector databases that probably many have heard of, but also, you know, frameworks and libraries that help build those RAG applications. Next in line is probably the various tools for fine tuning. Fine tuning is this idea of, you know, let's take a large language model that's been trained over maybe the entire Internet and all of the books ever written and whatever it is, and then let's customize it to, you know, a more specific need. Maybe I want it to have specific knowledge about some topic that's very relevant to my business problem.
You can basically use a set of documents as a way to fine tune the model and add a certain specialization to it. There's a bunch of tools that will help you do that effectively, most prominently, you know, APIs coming from the model providers that allow you to add a dataset, train a model, and then serve that model, but also, again, other software tools to make that easier. The fourth technology that's emerging for generative AI applications is what I broadly call orchestration technologies.
So think frameworks like Langchain, but there are increasingly more and more frameworks, agent frameworks, prompt management tools, all kinds of different tools that help you basically create an application using generative AI. It allows you to orchestrate a series of calls to models, you know, using the outputs of one call to trigger another call, using RAG in the middle, and combining models and prompts and information from a database in kinda interesting ways to create higher level abstractions, if you will, of these models.
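As a rough illustration of what those orchestration frameworks abstract away, here is a hand-rolled chain in plain Python: one model call's output feeds the next, with a retrieval step in the middle. The prompts and the `search` helper are hypothetical placeholders, not any real framework's API.

```python
# Hand-rolled "orchestration" sketch: chain model calls with retrieval
# in the middle. search() is a stand-in for a vector database lookup;
# the prompts and model name are illustrative only.
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def search(query: str) -> str:
    # Placeholder: a real implementation would query a vector database.
    return f"[documents retrieved for: {query}]"

def pipeline(user_request: str) -> str:
    # Call 1: rewrite the request into a focused search query.
    query = llm(f"Rewrite as a short search query: {user_request}")
    # Retrieval step sits between the two model calls.
    context = search(query)
    # Call 2: the first call's output, via retrieval, feeds the second.
    return llm(f"Using this context:\n{context}\n\nAnswer: {user_request}")

print(pipeline("How do our webhook retries work?"))
```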
So that's been pretty exciting too. And those four, from what I've seen, allow users to create pretty sophisticated applications using generative AI. But then, as you take these things to production and expose them to, you know, a broader set of users, potentially external users outside of your own company, there are a couple more sets of tools that come up. First and foremost, security. Generative AI opens a lot of possibilities in terms of things that could go wrong with security and privacy, and so we're seeing more and more tools that help you manage that in real time.
Generally speaking, AI firewalls, that's kind of interesting. And then, of course, the topic that's near and dear to my heart, the reliability and quality tooling, whether it's, you know, observability of the type that we work on at Monte Carlo, you know, the idea of monitoring the quality of the Gen AI system in production, but also, obviously, preproduction tools that help with evaluating new versions of an application or new versions of the model, etcetera, etcetera. And so all these kinda different tools are coming up. Lots of companies are trying to build those. Lots of teams are trying to build those in house. Very exciting.
I think it's just helpful to remember how you might get those tools. I think there are kinda 3 big options here. You can buy those tools from cloud providers, like AWS, GCP, Azure, and increasingly OpenAI, and that's what most teams do as far as I can tell. Then, increasingly, there are good, you know, end-to-end offerings from the data clouds. Snowflake and Databricks both offer a pretty robust set of tools across all these categories to help data teams build with AI. And then you'll, of course, find specialized solutions for different parts of the stack, whether it's, you know, Pinecone for vector databases on the RAG side or Monte Carlo on the observability side.
And when you need to upgrade from the basic version to the advanced use cases, you can definitely find, you know, highly specialized and professional tools for each part of the stack now. So, yeah, there's a lot that's coming into the stack to enable generative AI, lots of new technologies, lots of new tools. Having said that, I do still think the foundation is, you know, the classic data pipelines. Right? The one thing that I think people are realizing is that no matter how you go about generative AI, the core piece is marrying those models with the data that you manage anyways. Right? If you don't combine your own data with the models, you're basically building a commodity. Right? Something that ChatGPT could do.
The whole point is, how do you create a unique, specialized, personalized experience for your own users that heavily relies on your own data? And so what everybody ends up doing is building lots of data pipelines to feed those models with their own proprietary data and make those models useful in the context of their own business and, you know, their own users.
[00:15:52] Tobias Macey:
There are a lot of different pieces that you talked through there. Some of them are new infrastructure components. Some of them are just new practices in how to actually manage the model and the end user application. I'm curious what you have seen as the ways that teams are thinking about which of those components are the responsibility of the data engineers, which are the responsibility of ML engineers, if you have them, which of them belong to the operations and infrastructure teams, which of them belong to application engineers, and just some of the ways that you see the breakdown of who owns which piece.
[00:16:33] Lior Gavish:
Such a good question. It's a melee right now. It's so hard. I wouldn't say there's an established path to building generative AI. Every company does it slightly differently based on, you know, the specific org structure and talents that exist in that team. To be clear, you know, you need all of it. Right? There's an element of software engineering here, because you are building an application, typically a web service, typically with some form of user facing application. There's a good element of data engineering here, with data pipelines, as we discussed.
There's some element of machine learning engineering, of ML engineering and data science, here when it comes to exploring the models and understanding how to use the data. And, you know, to be honest, generative AI, at least right now, is most effective when it's combined with more traditional ML and sometimes deterministic approaches. The combination is actually very powerful, and data scientists are actually good at making this whole thing work nicely together. And, of course, there are, you know, product managers and product designers involved because, again, it's an application. It needs to work in a way that makes sense for its consumers.
And so what I typically see is all of those teams involved in various capacities, everybody focusing on their, you know, own pieces. And all those teams also employ those different pieces of the stack in different ways. Just to give you an example, a software engineer might use, you know, a model API, right, something like OpenAI, to generate a response, maybe in real time, to a user prompt. And a data engineer might use that same API to process text documents in bulk as part of a pipeline. Right? And a data scientist might use it in, you know, a third way. And so right now, it's a mix. I don't think there's clear ownership of who does what, and we're definitely seeing, you know, software engineers building data pipelines and data engineers building user facing applications.
I think over time, you know, we'll probably get to some, you know, best practice or some understanding of how to split all these different components between the different teams. And what I suspect is, you know, we'll definitely see much more multidisciplinary teams tackling this. Right? And, you know, this has existed for a while now, teams that are made up of software engineers, data engineers, and data scientists working together, but it was probably the exception, not the rule. I think with generative AI, it's increasingly going to be the rule, if you will, and we'll see those teams kind of working together to build, you know, solutions and full applications.
[00:20:00] Tobias Macey:
Another interesting aspect of the ways that generative AI is turning everything on its head is that, for a long time, it was the case that data engineering was there to support data science, and then that turned into machine learning. And now we're seeing it come full circle, where these generative AI technologies are also being used in the data engineering workflow of pipeline design, transformation generation, and code generation. And I'm wondering how you're seeing data engineers start to bring generative AI into that development flow and into the work of building and maintaining the data pipelines that then go on to feed the generative AI.
[00:20:44] Lior Gavish:
Yeah, absolutely. I think it's kinda what you mentioned, Tobias. Right? Like, it starts from what a lot of engineers are doing right now, which is using, you know, various kinds of copilots to accelerate development. And in the case of data engineers, you can very effectively use generative AI to build, you know, your pipelines, whether using PySpark or SQL or what have you. Generative AI can probably accelerate some elements of it. We're not quite there in terms of, you know, replacing data engineers with AI. We may never get there, but it can certainly make people more productive.
The other thing that's happening a little bit is that it generally democratizes access to these things. Right? So, you know, the whole text to SQL thing is working pretty decently, which means that maybe certain things that used to be delegated to data engineers, when there was a need to create, you know, a new pipeline or a new analysis, now maybe someone with less technical skills can do using a generative AI model. So it's kinda democratizing access to data and to pipelines, and that actually frees up data engineers to do the things that they do best rather than, you know, answering ad hoc requests.
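For flavor, a hedged sketch of that text-to-SQL pattern: the schema and the question go into the prompt, and the generated SQL is treated as untrusted until it is reviewed or run read-only. The schema, table names, and prompt here are invented for illustration.

```python
# Text-to-SQL sketch (illustrative): schema + question in, SQL out.
# The schema and names are hypothetical; generated SQL should run under
# a read-only role with a row limit, never be executed blindly.
from openai import OpenAI

client = OpenAI()

SCHEMA = """
orders(order_id INT, customer_id INT, total NUMERIC, ordered_at TIMESTAMP)
customers(customer_id INT, region TEXT)
"""

def text_to_sql(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Write a single SQL query for this schema:\n{SCHEMA}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(text_to_sql("Total revenue by region over the last 30 days"))
```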
I think the most exciting thing, though, for data engineers is actually that generative AI unlocks access to unstructured data. And what I mean by that is, and there are plenty of examples, a lot of enterprises have a lot of very useful unstructured data, documents, basically, generated in the business, you know, whether it's legal documents or, you know, technical documentation or lots of other corpuses of information that are useful. If you wanted to make these things useful for the business in the past, you know, a data engineer couldn't do it alone. They would need, you know, a data scientist and a machine learning engineer to come in and process that data and, generally, extract structured data from it. Right? If you wanted to process all your legal documents and extract information from there, you actually needed to hire, you know, highly specialized data people that would build, you know, NLP algorithms, quite like I used to do 15 years ago, in order to do that. And the data engineer would help them kinda string things together and build the pipeline, but still, a lot of the work would have to be outsourced, in a way.
And now data engineers can actually do that on their own. Right? Especially with models being available natively in tools like Snowflake or Databricks, a data engineer can take, you know, a body of legal documents and extract information from there without getting any help. It's as easy as creating a prompt and applying a function to that dataset. Right? And so I think that's a force multiplier. Right? It opens up opportunities for data engineers to use enterprise data much more effectively, and with much less help from other teams that they might have depended on in the past. So that's a pretty exciting shift and change, you know, in my view.
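Here is a sketch of that "prompt as a function over a dataset" idea in plain Python: batch-extracting structured fields from contracts. The field names, the prompt, and the `load_documents` helper are hypothetical; in Snowflake or Databricks the same pattern runs as a model function applied over a table of documents.

```python
# Bulk extraction sketch: apply one extraction prompt to every document
# in a batch. Field names and load_documents() are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "From the contract below, return JSON with keys: counterparty, "
    "effective_date, has_termination_clause (true/false).\n\n{doc}"
)

def extract(doc_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # ask for parseable JSON
        messages=[{"role": "user", "content": PROMPT.format(doc=doc_text)}],
    )
    return json.loads(resp.choices[0].message.content)

# In a pipeline, this is just a map step over the documents table.
rows = [extract(doc) for doc in load_documents()]  # load_documents: your source
```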
[00:24:40] Tobias Macey:
One of the other major ways that AI and machine learning models have found their way into the data engineering workflow is, in particular, in that context of retrieval augmented generation, where you need to be able to generate the vector embeddings of the data that you want to use for that context. And so you have to have some capacity for being able to run those embedding models in that pipeline workflow and pipeline environment, and you also need to be thinking about the considerations around how you want to generate the embeddings, and what you need to be thinking about as far as chunking and the different sizes of embeddings that you're creating. And I'm wondering if you can talk to how those new requirements are being addressed in data teams and some of the new skills and new training that's necessary to be able to build pipelines to support that RAG use case effectively.
[00:25:39] Lior Gavish:
Yeah, absolutely. So, as I said, it's incredibly helpful for data teams. Right? There's a lot of value in being able to process unstructured data. And, you know, to be honest, I would dare say that at this point there isn't yet, you know, a best practice. Like, you can't, you know, buy a book and understand how to build RAG, although I'm sure someone is selling a book. Nobody really has done that a ton. You can't hire, you know, a RAG expert that will tell you how to do it. It's mostly about getting hands on experience and being curious and experimenting and finding, you know, what's right for your particular use case and your particular need. And so, you know, I probably couldn't give people advice on how to build RAG pipelines at this point, and I've seen so many different approaches, oftentimes highly dependent on the background of the people building the pipelines. Right? Software engineers attack it in a certain way, and data engineers do it in a completely different way, and, you know, both are very valid right now. Having said that, there are probably a few things that people need to start thinking about in terms of how to build that effectively.
Right? And, essentially, how do you go from that prototype phase of, you know, take a bunch of documents, run them through Langchain, and put them in a vector database, which is something that we're all learning how to do, to the next level? There are kind of higher level questions that we're seeing teams increasingly ask. It's things around security and privacy, for example. Like, how do you make sure that whatever access controls apply to that data, and, you know, whatever sensitive information is there, continue to be governed as it goes through data pipelines?
Over the years, we've developed various methodologies around that in the structured data world, but applying it to unstructured data is quite different, because it's messy, because you take, you know, a body of documents and, honestly, you don't even know what's there and how it should be protected. So there's much more potential for leaks and for people having access to things they're not supposed to see. So that's definitely an area where people need to develop, you know, knowledge and a set of best practices, again, that apply to their business.
The other one is cost efficiency. You know, again, building a prototype will probably not cost you a lot of money, and nobody would care. But now, you know, if you process the entire corpus of enterprise data, and run it in batch every single day in your pipelines, and make I-don't-know-how-many API calls per document, the bill can run pretty high pretty quickly. And so data teams need to figure out, first, how to create visibility into and manage that cost, but also how to optimize it, because a lot of use cases end up being exorbitantly expensive. And so you need to think about how to minimize the number of calls and then how to potentially use faster, cheaper models where it applies.
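As a rough illustration of two of those cost levers, here's a hedged sketch: cache calls so a daily batch run doesn't re-pay for unchanged documents, and route simpler tasks to a cheaper model. The cache layout, the model names, and the routing heuristic are all invented for illustration.

```python
# Cost-control sketch: a content-addressed cache plus naive model routing.
# Model names and the routing rule are illustrative assumptions.
import hashlib
import json
import os

CACHE_DIR = ".llm_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_call(model: str, prompt: str, call_fn):
    # Key on model + prompt so unchanged documents cost nothing on re-runs.
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = call_fn(model, prompt)  # only new or changed inputs are billed
    with open(path, "w") as f:
        json.dump(result, f)
    return result

def pick_model(task: str) -> str:
    # Route cheap, constrained tasks (classification, extraction) to a
    # smaller model; reserve the big model for open-ended generation.
    return "small-cheap-model" if task in {"classify", "extract"} else "big-model"
```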
Cost is a thing. And, of course, reliability and quality. Right? These pipelines tend to, you know, have a lot of issues with that. It's first and foremost hallucinations. Right? Everybody that's used ChatGPT has probably gone to a restaurant that never existed. They asked for a recommendation for dinner. True story. And so you need to figure out how to deal with those things, how to deal with the fact that it's not deterministic. Right? Like, how do you even understand the quality of the output? And even if you painstakingly go and look at a lot of results and evaluate them in various ways, and there are lots of techniques around that, how do you make sure that remains true as you release changes to the pipeline, or changes to the underlying model that you're using, and things like that?
And so that's something that, I think, you know, there are a lot of skills to be learned around. So, yeah, lots of things to learn. I have more questions than answers around this, but, yeah, it's exciting times. And I'm sure that over the next couple of years, we'll, you know, gradually learn how to do all these things effectively.
[00:30:35] Tobias Macey:
Yeah. It's funny how much that is the case right now: I have a lot of questions and I have a lot of ideas, but there aren't any established answers yet because all of it is too new. And so everybody's flailing around in the dark and making their best attempt, and eventually we'll converge on a set of best practices, but we're not there yet.
[00:30:56] Lior Gavish:
Yeah. Too early to call, I'd say.
[00:31:01] Tobias Macey:
One of the other interesting side effects of the current stage of AI, and the rise of generative AI in particular, is the growth in interest in vector databases and vector indexing. So we're seeing new growth in that particular segment, where it seems like every day I hear of a new vector database that's out there and optimized for some particular use case. Vector databases as a technology predate the current AI craze of generative models, in particular for things like semantic and similarity searching. And I'm wondering if there are any other aspects of the ways that AI applications, and generative models in particular, are bringing some of these new technologies or newly created categories into the data engineering ecosystem, and how those are being used outside of that context of just supporting generative AI models.
[00:32:04] Lior Gavish:
With vector databases generally, in addition to their use in kind of RAG scenarios, I think people are really excited about, you know, traditional search. I've, you know, talked to this company that's in ad tech, one of our customers, and, you know, there are certain workflows there where you help marketers find relevant places to place ads. And, you know, traditionally that's been done with basically Elasticsearch. Right? Some form of keyword search with a lot of bells and whistles around it. And they've gradually introduced vector embeddings and a vector database into that process. Right? And not in the context of RAG, necessarily. But just, you know, if you wanna find a place to put your ad about, I don't know, a fitness subscription or what have you, finding, you know, places online that match that, vector databases actually do that extremely well, apparently much better than traditional search.
And so there's a resurgence of search now with vector databases available. And, of course, those powerful language models that create the embeddings seem to create, you know, better, higher relevance results with kinda deeper semantic understanding. That's, of course, been available, you know, on Google or, you know, in other search engines that have been highly optimized over the years. But now, you know, even smaller use cases can be built quite rapidly and have very, very powerful semantic search using vector databases and embeddings.
[00:34:08] Tobias Macey:
Now, bringing it back around to the question of reliability and quality. We have been fighting that battle for a few years now in just the pure business intelligence, data analyst type use case of how do we build more reliable systems. Now, with the added stresses and requirements around AI applications and these very probabilistic, non-deterministic use cases, how are you seeing that impact the way teams are thinking about reliability, data quality, and data observability, and some of the ways that you're thinking about that at Monte Carlo to be able to help support those teams?
[00:34:50] Lior Gavish:
Yeah. Great question. First of all, like everything, you know, we're still learning. There are more questions than answers, but I will call out a few things based on what we're seeing. First of all, it all goes back to basics in a sense. Right? Pipelines are pipelines, and you have to make sure they're working reliably. Right? Whether it's vectors or tables, you have to make sure they're getting updated on time. You wanna make sure the whole dataset is there, that nothing is missing. You also wanna make sure there are no duplications, and duplications are particularly bad in vector databases because they really hamper your ability to get, you know, the k-nearest neighbors and get the most relevant documents, because you might get 5 copies of the same document if you have it in the database.
And so you wanna make sure these things are working. And, of course, all the structural stuff. Right? Do you have the right vector dimensions that you're expecting, and did you use the right embedding model, and a consistent embedding model? Right? It's very important to use the same embedding model in the pipeline and in retrieval. And, of course, the metadata around it. Right? Vectors are always accompanied by metadata that helps, you know, trace them back to where they came from, or associate them with, you know, an account or a user or other things. And you need to make sure that the metadata is there and it's complete and accurate, in the same way that structured data was tested.
So all these things are there, and, of course, lineage too. Right? In order to effectively manage the quality and reliability, you need to understand where those vectors came from and how they're consumed downstream, all that good stuff in observability. Right? When you have a problem, you need to understand what its impact downstream is, and to find the root cause, you need to understand what's upstream. So all these things are kind of classic things that translate from structured data to unstructured data pretty directly.
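A hedged sketch of what a few of those structural checks might look like on a batch of vectors before they are loaded: consistent dimensions, near-duplicate detection, and metadata completeness. The expected dimension, the required metadata fields, and the threshold are assumptions for illustration.

```python
# Vector pipeline checks sketch: dimensions, near-duplicates, metadata.
# EXPECTED_DIM, REQUIRED_META, and the threshold are illustrative.
import numpy as np

EXPECTED_DIM = 1536                 # must match the embedding model in use
REQUIRED_META = {"source_uri", "account_id", "embedded_at"}

def validate_batch(vectors, metadata, dup_threshold=0.999):
    vecs = np.asarray(vectors, dtype=float)
    # Structural check: every vector has the dimension the index expects.
    assert vecs.shape[1] == EXPECTED_DIM, "unexpected embedding dimension"
    # Metadata check: every vector can be traced back to its source.
    for meta in metadata:
        missing = REQUIRED_META - meta.keys()
        assert not missing, f"missing metadata fields: {missing}"
    # Duplicate check: near-identical vectors crowd out other documents
    # in k-nearest-neighbor retrieval.
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, 0.0)
    dupes = np.argwhere(sims > dup_threshold)
    assert dupes.size == 0, f"near-duplicate vector pairs: {dupes[:5].tolist()}"
```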
The one thing that's a little bit different is how you measure the quality of the data itself. In the structured world, you know, we developed a lot of methodologies. Right? Like, we can calculate a lot of quality metrics, like, you know, how many nulls you have and how many duplications, and all sorts of these things. And we can have people that understand the pipeline build, you know, more precise metrics around it that maybe take into account, you know, a deeper understanding of the dataset and the business. And that doesn't translate as well to the text or image data that goes into vector databases.
But we're seeing, increasingly, methods to deal with that. So, as an example, you know, we're seeing people doing things like, well, I could probably use generative AI to calculate quality metrics about unstructured data. As an example, I could take all the texts that I have, and I could use a model to, for example, classify them into topics, right, and track what topics I have in the dataset and make sure that's stable and behaving as expected over time as the dataset changes and shifts or goes through the pipeline. So that's an example. I could also use generative models to determine whether there's sensitive data or PII inside the dataset, right, and track it over time, and so on and so forth. You can get really creative with how you use generative AI to create quality metrics on top of unstructured data. And I think it's a very promising avenue, but, of course, a lot to be learned there.
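To make that pattern concrete, a hedged sketch of one such LLM-derived quality metric: classify each document into a topic and track the topic distribution over time, so drift shows up like any other monitored metric. The topic list, model name, and prompt are invented for illustration.

```python
# LLM-derived quality metric sketch: topic distribution over a corpus.
# Topics, model name, and the prompt are illustrative assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()
TOPICS = ["billing", "security", "api-usage", "other"]

def classify(doc: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Classify into exactly one of {TOPICS}. "
                              f"Reply with the topic only.\n\n{doc}"}],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in TOPICS else "other"

def topic_distribution(docs):
    counts = Counter(classify(d) for d in docs)
    return {t: counts[t] / len(docs) for t in TOPICS}

# Monitoring: compare today's distribution against yesterday's; a big shift
# in any topic's share is an alert, like a sudden null-rate change on a table.
```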
It's complicated, and it's early days. And, you know, if I'm honest, if you talk to data teams building generative AI pipelines today, a lot of it still relies on people eyeballing results and trying to make sure that they make sense. It's not always easy to scale and automate these things. And so we're gradually learning how to take bites out of that manual process and turn it into a more consistent and automated approach.
[00:39:40] Tobias Macey:
As you have been navigating the rapidly changing landscape of data in the face of generative AI, the ways that data teams are working to adapt, and the ways that you at Monte Carlo are trying to help support them, I'm curious: what are some of the most interesting or innovative or unexpected ways that you're seeing AI impact those data engineering teams?
[00:40:04] Lior Gavish:
Yeah. So many exciting examples here. I'll name a few. I think one really exciting trend is that the fact that we're unlocking unstructured data is bringing data engineering closer to the forefront of decision making. And the example I like to quote here is, you know, talking to this professional sports team. It's one of our customers. And, you know, in the past, you know, data engineers could provide some structured data out there about players, right, certain statistics about how they performed in matches. But that data is pretty sparse. Right? And it's only available in the top leagues.
But if you're trying to spot, you know, the next generation of talent, the people coming from lower leagues or from high schools, and you're trying to scout, you know, the next star, it's really, really hard to get structured data about that, and it can be extremely unreliable. But there is a lot of unstructured data. Lots of scouts out there writing reports about players, you know, all around the world. And so data engineers were actually able to use AI to parse out that unstructured data and, in fact, structure it. Right? Measure sentiment there, and then use the methodologies of structured data to extract intelligence out of it. Right? You can benchmark the same player over time. You can benchmark the scout. You can really get a lot of good insight, and that really brought the data engineers to the forefront for, you know, the sporting professionals, and, you know, made them superstars, essentially. Right? And automated something that was extremely hard to do prior. Another example is where generative AI actually enabled data teams to get in front of the customers of the organization.
And I'm thinking here about a cybersecurity company. In cybersecurity, there is a lot of unstructured data, things like security policies and various documents in the enterprise, and lots of exchanges of documentation between companies trying to deal with each other and buy technology from each other. And that data team was able to do a bunch of different cool things with generative AI, whether it's responding to questions based on, you know, a body of documents that describe all the security policies, or whether it's, you know, in compliance, translating policies to controls automatically, things that are really, really high value for that company's customers, that have been built almost exclusively by data engineers.
So it kinda really brought them to the forefront. No longer, you know, a platform team working for analysts, but rather serving the company's customers directly. That was pretty exciting. And then I even saw this one data engineering team that was able to actually generate new revenue for the company in a pretty direct manner. It's an energy company. They get requests for quotes all the time, people asking them to, you know, provide a certain supply of energy at a certain time and location, and historically, they haven't been able to serve all those requests for quotes.
They get a lot of these, and that's potential revenue. Right? If they're able to respond to a quote, that might turn into a customer. Right? And so they basically took this very manual process where humans, you know, take in request by request and try to respond to it, and they automated it using generative AI, and data engineers were able to suddenly generate a whole lot of new revenue just from automating that process and solving the manual labor attached to it. So I think the point is, you know, data engineering is becoming even more valuable and impactful in the business, you know, more than it's ever been. And that's exciting news.
[00:44:43] Tobias Macey:
And in your own work of building a product that is supporting people in this space, trying to understand the impact that AI is having on data and the ways that teams are building these systems, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:45:01] Lior Gavish:
I think probably the biggest challenge, well, a couple of big challenges. One is understanding where and how generative AI can be applied, you know, to serve our customers. You know, we brainstormed, I wanna say, over 70 use cases in data observability that generative AI can support. And it's pretty challenging to understand, well, where does it work well, where can it be applied to the maximum effect, where can it make the most impact, and where can it work as reliably as we need it to. Which leads to the other challenge: it's not incredibly hard to get to, like, really cool demos with generative AI.
Right? For almost any use case you can think of, you can find a few examples where generative AI really blows your mind in terms of what it can do, which I think is why the world is so excited about it. It's a whole other ballgame to bring it into production. Right? It's a whole other ballgame to do it in an environment where you can't predict all the inputs, or maybe you can predict them, but models are non-deterministic, so they work only part of the time. And there's a lot of art and science that goes into figuring out how to make these generative models work well enough so that they can serve a human trying to accomplish, you know, a task in their day to day life. And I think the biggest learning there was really that combination between, you know, generative AI and more deterministic approaches. For example, you know, using generative AI to get an answer, but then validating it against, you know, heuristics or statistical models in order to make sure that, you know, the response makes sense and is valid for a human to use. And so those are probably the biggest challenges I've experienced personally while building with generative AI.
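A tiny sketch of that generate-then-validate pattern: accept a model's answer only if deterministic checks pass first. The checks here, a keyword requirement and a simple z-score bound, are invented placeholders for whatever heuristics or statistical models fit the use case.

```python
# Generate-then-validate sketch: pair model output with deterministic
# checks before showing it to a human. The bounds are illustrative.
import statistics

def validate_anomaly_explanation(answer: str, observed: float,
                                 history: list) -> bool:
    # Heuristic 1: the answer must reference the metric it claims to explain.
    if "row count" not in answer.lower():
        return False
    # Heuristic 2: only surface the explanation when a simple statistical
    # model agrees something is actually anomalous (z-score here).
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0
    return abs(observed - mean) / stdev > 3.0  # suppress unremarkable cases

# Usage: answer = llm(...); show it only if validate_anomaly_explanation(...).
```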
[00:47:20] Tobias Macey:
For people who are trying to navigate this current stage of the data ecosystem, what are some of the cases where you're seeing AI as just being the wrong choice and not something that is worthwhile to invest in or try to incorporate into your data stack?
[00:47:41] Lior Gavish:
Good question. It's not that it's the wrong choice, but the common pitfall I see is the Microsoft approach of, let's throw a copilot or a chat interface on every problem, right, and just assume that what people want is to interact in natural language with whatever they were doing before. And that's sometimes valid, and it can be useful in certain scenarios. Right? In certain cases, yeah, and I'll use an example from data engineering, text to SQL can be very effective. But you really have to think about the end user. Right? Natural language questions are not the answer to every problem.
And sometimes people would just rather interact with interfaces the way they are. Right? Sometimes that's easier and simpler and works more reliably. And so I'd probably say it's important to think about where generative AI really allows people to do something that they haven't been able to do before, and where generative AI has an unfair advantage. Right? And we talked about some of these things, but, yeah, extracting structured information from unstructured data is really exciting. It's something that people couldn't do before. But interacting in natural language with a dashboard, maybe not as groundbreaking as you might think. Right? Sometimes it is, but oftentimes it's not. And so I would just think about that kind of idea of where generative AI can really, really stand out and have a competitive advantage, quote, unquote.
And and the answers are not always obvious.
[00:49:40] Tobias Macey:
What are some of the trends that you're keeping an eye on, or some of the predictions that you have, for the ways that AI is going to impact data engineering in the medium to long term?
[00:49:53] Lior Gavish:
Good question. I think we'll continue to see some of the things that we've already talked about. You know, it'll get easier to build pipelines, it'll get easier to democratize data, and it'll get easier to put, you know, data engineers to work on the hardest, most high value problems. So that's definitely happening and will continue to happen. You know, we'll continue to see more unstructured data being processed and fully put to use by data engineers. I think that's probably the most exciting element of generative AI.
And then also, you know, as a result of these two things, I think it'll, you know, bring data engineers closer to the customer, closer to the revenue, closer to decision making, in a way that data engineering teams have never been before. So we'll continue to see that in the long term.
[00:50:56] Tobias Macey:
Are there any other aspects of this topic of AI and the impact that it's having on data engineers and data engineering teams that we didn't discuss yet that you'd like to cover before we close out the show?
[00:51:08] Lior Gavish:
I'll probably, again, kinda go back to the basics. Right? Generative AI is cool and shiny, but in order to make it work in the real world, we have to do all the things that we know we have to do. Right? We need to build solid pipelines. We need to make sure they're cost effective, that they're properly governed from a security and compliance perspective, and, of course, make sure they're reliable and high quality, because at the end of the day, you know, those models, as smart as they are, can't overcome a security breach.
They can't overcome garbage data, and they won't be successful if they're cost prohibitive. Right? And so we need to make sure all these fundamentals work well. And at Monte Carlo, we're excited to tackle the reliability and quality aspects of it. And, you know, we're also excited to see how other vendors and SaaS products will help with the two other challenges. So, yeah, back to the basics.
[00:52:15] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today.
[00:52:38] Lior Gavish:
I think it's not necessarily one tool, but, you know, with the themes that we've spoken about today in mind, I think we're seeing a consolidation of the data engineering stack and the software engineering stack. And I'm kinda curious to see how we're going to marry, or I'd like to see more and easier ways to marry, those data pipelines and those customer facing applications, you know, in the same platform, working nicely together. I think there are some interesting opportunities in stitching those two worlds together and making the whole system work nicely for, you know, a team where there's, you know, a software engineer, a data engineer, and a machine learning engineer all working together to build a single platform, a single application.
[00:53:33] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the ways that you're seeing AI impact, and be an asset for, the data engineering teams out there. I appreciate all of the time and energy that you and your team are putting into helping to support those folks. So, thank you again for taking the time today, and I hope you enjoy the rest of your day.
[00:53:55] Lior Gavish:
Thank you, Tobias. Super fun.
[00:54:05] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end to end data lake has platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to data engineering podcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey. And today, I'm welcoming back Lior Gavish to talk about the impact of AI on data engineers. So, Lior, can you start by introducing yourself?
[00:01:01] Lior Gavish:
Hi, Tobias. Thanks for having me today. I'm Lior. I'm the cofounder of of a company called, Monte Carlo. We're the data observability company, which means we we help data teams, data engineers, data analysts, data scientists create reliable and and trusted data for for whatever they're using, whether it's analytics or machine learning or increasingly AI. We've been around for about 5 years. We're now serving over 400 data teams and and growing quickly. And, you know, we love everything data, so it's it's always exciting to to be to be on the show.
[00:01:41] Tobias Macey:
As I mentioned, you've been on a couple of times before, but for anybody who hasn't heard your past appearances, which I'll link in the show notes, can you just refresh our memories to how you got started in data?
[00:01:51] Lior Gavish:
Oh, absolutely. I got started in data, I I wanna say over 15 years ago as what you would typically call today, a machine learning engineer. I was actually building NLP models to help summarize and and and classify, news articles. That's how I started. I then went on to start my own company in in in cybersecurity and used analytics and machine learning to, solve certain kinds of of of fraud use cases. Company got acquired by a company by by a a bigger, larger than public cybersecurity firm, and I went on to to lead the, the data and engineering teams. There are about a 100 people, you know, building various kinds of of real time protection from fraud and and cyber attacks and heavily relied on on analytics and machine learning. So kind of a, my background is a combination of of engineering and data, and therefore also my my interest in in how do you operationalize data and and how would you make it actually as reliable and as trusted as it could be when it serves, know, real time applications and and sometimes millions of users. So that's that's my, background and gist.
[00:03:11] Tobias Macey:
It's always interesting in this world that we're in now of large language models and generative AI hearing the terms natural language processing and the fact that you were working so hard to be able to summarize those news articles when now people just say, oh, I'll just throw it at chat GPT, and it does it for me. But there are also, of course, the the issues around quality and accuracy in those summarizations. So
[00:03:34] Lior Gavish:
Yeah. Scratches on my back. What I, spent, you know, months on end trying to to build with very custom and and and and specialized algorithms you can now do with an API call that that costs, you know, fractions of of a cent. So, there you go. There's some progress in the world.
[00:03:56] Tobias Macey:
And nowadays, whenever somebody says the word AI, the assumption is that we're talking about chat TPT and the like. But for purposes of this conversation, when we're talking about the impact that the current phase of AI is having on data teams and data engineers in particular. I'm wondering if you can give some clarifying detail about what you mean when we're saying AI for the purposes of this conversation.
[00:04:22] Lior Gavish:
It's a great point. I think, you you know, AI the the debate of AI and ML and what's AI has been going on for for, for as long as I can remember. And it's always been there's been a lot of opinions on it. I I I think for to or, you know, for for the sake of this discussion, we could probably think of AI as, generally speaking, the the generative AI models that emerged in the last, you know, almost 2 years now. So think, you know, ChatGPT, OpenAI, and now more recently, you know, various kinds of of lamas and and and, mistral and mistropic and whatever.
I think those models introduce kind of a a a foundational shift in how we think about using AI. And I'm not sure we're super close, but it's it's probably probably the the most promising avenue to true, you know, machine intelligence potentially getting closer and closer to human level intelligence. And so for, you know, for right now, I'm thinking about AI mostly as those large language models that have been introduced pretty recently.
[00:05:36] Tobias Macey:
Machine learning went through a bit of a resurgence 5 years ago thereabouts with the idea of ML engineers and the, the work of actually bringing machine learning into the real time application experience for end users and MLOps. So there's been a lot of that conversation happening for a little while now. With the advent of generative AI and the new demands that they're placing, particularly thinking in terms of things like prompt engineering, retrieval augmented generation, the fact of the models themselves being so much bigger. I'm curious if you can start by giving a bit of an overview of what you're seeing as the new requirements and the new features that are required in data platforms for being able to support these new categories of model in an operational environment?
[00:06:31] Lior Gavish:
Absolutely. You're right. About 5 or 10 years ago, there was an explosion of tools to bring machine learning into production, whether it's feature stores, model serving frameworks, model versioning, and all kinds of different technologies. Generative AI has brought in a new stack, and I'm happy to go through the different components of that stack. First and foremost, and the thing that probably many of our listeners have used, is the model APIs. OpenAI did something very cool, which didn't exactly exist in the ML world: it basically made models a commodity by serving them as an API. You can make a very simple HTTP request to OpenAI, and now to many other providers, and get access to the latest and greatest model that's been trained by, I mean, 500 PhDs and a billion dollars' worth of GPUs. Right?
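To make that concrete, here is a minimal sketch of that kind of model API call, using the OpenAI-style chat completions endpoint over plain HTTP; the model name and prompt are illustrative, and the API key is assumed to live in an environment variable:

```python
import os
import requests

# One plain HTTP request to a hosted model API (OpenAI-style chat completions).
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # illustrative model name
        "messages": [{"role": "user", "content": "Summarize this article: ..."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```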
And that's a really core part of the stack that everybody's familiar with. Having said that, over the last 2 years there's been a lot of other components that emerged to help teams build with generative AI. First and foremost, RAG, which you mentioned. It's this idea of, well, how do we add long term memory and other capabilities to these large language models? Retrieval augmented generation is probably the most dominant way to do so. And it's this idea of, you know, let's take a lot of data, typically unstructured data but sometimes structured data, put it in a database, and make it available to the model while it is responding to user prompts.
A very simple example: if I want to use a model to answer a lot of questions about my developer documentation, I can put all of these documents that have been built up over the years into a database. Specifically, I might use a vector database. Then, when the model gets a new question from a user about how to do this or that, it can use the database to retrieve documents relevant to the user's prompt and then create an answer to that question using those documents, typically by some form of summarization or extraction.
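As a rough sketch of that flow, assuming an OpenAI-style embeddings and chat API and an in-memory stand-in for the vector database (a real pipeline would also chunk long documents before embedding them):

```python
import os
import numpy as np
import requests

API = "https://api.openai.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

def embed(texts):
    # Embed a batch of texts; the model name is illustrative.
    r = requests.post(f"{API}/embeddings", headers=HEADERS,
                      json={"model": "text-embedding-3-small", "input": texts})
    r.raise_for_status()
    return np.array([d["embedding"] for d in r.json()["data"]])

# 1. Index the docs. A real system stores these vectors in a vector database.
docs = ["To rotate an API key, open Settings > Keys and click Rotate.",
        "Webhook deliveries are retried up to five times with backoff."]
doc_vecs = embed(docs)

# 2. Retrieve the document nearest to the user's question (cosine similarity).
question = "How do I rotate my API key?"
q = embed([question])[0]
scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
context = docs[int(np.argmax(scores))]

# 3. Generate an answer grounded in the retrieved document.
r = requests.post(f"{API}/chat/completions", headers=HEADERS,
                  json={"model": "gpt-4o-mini",
                        "messages": [{"role": "user", "content":
                            f"Using only this context:\n{context}\n\nAnswer: {question}"}]})
print(r.json()["choices"][0]["message"]["content"])
```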
And so RAG has prompted, if you will, a bunch of different technologies, most prominently the vector databases that probably many have heard of, but also frameworks and libraries that help build those RAG applications. Next in line are probably the various tools for fine tuning. Fine tuning is this idea of, you know, let's take a large language model that's been trained over maybe the entire Internet and all of the books ever written and whatever it is, and then customize it to a more specific need. Maybe I want it to have specific knowledge about some topic that's very relevant to my business problem.
You can basically use a set of documents as a way to fine tune the model and add a certain specialization to it. There's a bunch of tools that will help you do that effectively, most prominently APIs coming from the model providers that allow you to add a dataset, train a model, and then serve that model, but then also, again, other software tools to make that easier.
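A hedged sketch of what those provider fine-tuning APIs look like, modeled on OpenAI's file upload and fine-tuning job endpoints; the base model name and training file are placeholders:

```python
import os
import requests

API = "https://api.openai.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

# 1. Upload the training set: a JSONL file of example conversations.
with open("train.jsonl", "rb") as f:
    upload = requests.post(f"{API}/files", headers=HEADERS,
                           data={"purpose": "fine-tune"}, files={"file": f})
upload.raise_for_status()

# 2. Start a fine-tuning job against a base model (name is illustrative).
job = requests.post(f"{API}/fine_tuning/jobs", headers=HEADERS,
                    json={"training_file": upload.json()["id"],
                          "model": "gpt-4o-mini-2024-07-18"})
job.raise_for_status()
print(job.json()["id"])  # poll until done, then call the resulting custom model
```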
The fourth technology that's emerging for generative AI applications is what I'd broadly call orchestration technologies. So think frameworks like LangChain, but there are increasingly more of them: agent frameworks, prompt management tools, all kinds of different tools that help you create an application using generative AI. They allow you to orchestrate a series of calls to models, using the output of one call to trigger another call, using RAG in the middle, and combining models and prompts and information from a database in interesting ways to create higher level abstractions, if you will, on top of these models.
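The pattern itself is simple even without a framework. Here's a bare-bones sketch where `complete` stands in for whichever model client you use and `retrieve` for the RAG lookup, both hypothetical helpers:

```python
def complete(prompt: str) -> str:
    # Stand-in for a model call (see the HTTP sketch above).
    raise NotImplementedError

def retrieve(query: str) -> str:
    # Stand-in for a vector-database lookup.
    raise NotImplementedError

def orchestrated_answer(question: str) -> str:
    # Call 1: rewrite the user's question into a standalone search query.
    query = complete(f"Rewrite as a standalone search query: {question}")
    # RAG in the middle: pull supporting documents for that query.
    context = retrieve(query)
    # Call 2: draft an answer from the retrieved context.
    draft = complete(f"Context:\n{context}\n\nAnswer this: {question}")
    # Call 3: one call's output feeding another -- a final critique pass.
    return complete(f"Remove any claim not supported by the context:\n{context}\n\n{draft}")
```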
So that's been pretty exciting too. From what I've seen, those 4 allow users to create pretty sophisticated applications using generative AI. But then as you take these things to production and expose them to a broader set of users, potentially external users outside of your own company, there are a couple more sets of tools that come up. First and foremost, security. Generative AI opens a lot of possibilities in terms of things that could go wrong around security and privacy, and so we're seeing more and more tools that help you manage that in real time.
Generally speaking, AI firewalls, which is kind of interesting. And then, of course, the topic that's near and dear to my heart: the reliability and quality tooling, whether it's observability of the type that we work on at Monte Carlo, the idea of monitoring the quality of the gen AI system in production, but also, obviously, preproduction tools that help with evaluating new versions of an application or new versions of the model, etcetera. And so all these different tools are coming up. Lots of companies are trying to build them. Lots of teams are trying to build them in house. Very exciting.
It's also helpful to remember how you might get those tools. I think there are 3 big options here. You can buy those tools from the cloud providers, like AWS, GCP, Azure, and increasingly OpenAI, and that's what most teams do as far as I can tell. Then, increasingly, there are good end to end offerings from the data clouds: Snowflake and Databricks both offer a pretty robust set of tools across all these categories to help data teams build with AI. And then you'll, of course, find specialized solutions for different parts of the stack, whether it's Pinecone for vector databases on the RAG side or Monte Carlo on the observability side.
And when you need to upgrade from the basic version to advanced use cases, you can definitely find highly specialized, professional tools for each part of the stack now. So, yeah, there's a lot coming into the stack to enable generative AI, lots of new technologies and lots of new tools. Having said that, I do still think the foundation is the classic data pipelines. The one thing that I think people are realizing is that no matter how you go about generative AI, the core piece is marrying those models with the data that you manage anyway. If you don't combine your own data with the models, you're basically building a commodity, something that ChatGPT could do.
The whole point is, how do you create a unique, specialized, personalized experience for your own users that heavily relies on your own data? So what everybody ends up doing is building lots of data pipelines to feed those models with their own proprietary data and make those models useful in the context of their own business and their own users.
[00:15:52] Tobias Macey:
There are a lot of different pieces that you talked through there. Some of them are new infrastructure components. Some of them are just new practices in how to actually manage the model and the end user application. I'm curious what you have seen as the ways that teams are thinking about which of those components are the responsibility of the data engineers, which are the responsibility of ML engineers if you have them, which of them belong to the operations and infrastructure teams, which of them belong to application engineers, and just some of the ways that you see the breakdown of who owns which piece.
[00:16:33] Lior Gavish:
Such a good question. It's a melee right now. I wouldn't say there's an established path to building generative AI; every company does it slightly differently based on the specific org structure and talent that exist in that team. To be clear, you need all of them, right? There's an element of software engineering here because you are building an application, typically a web service, typically with some form of user facing application. There's a good element of data engineering here, with data pipelines, as we discussed.
There's some element of ML engineering and data science here when it comes to exploring the models and understanding how to use the data. And also, to be honest, generative AI, at least right now, is most effective when it's combined with more traditional ML and sometimes deterministic approaches. The combination is actually very powerful, and data scientists are actually good at making this whole thing work nicely together. And, of course, there are product managers and product designers involved because, again, it's an application. It needs to work in a way that makes sense for its consumers.
And so what I typically see is all of those teams involved in various capacities, with everybody focusing on their own pieces. And all those teams also employ those different pieces of the stack in different ways. Just to give you an example, a software engineer might use a model API, something like OpenAI, to generate a response, maybe in real time, to a user prompt. A data engineer might use that same API to process text documents in bulk as part of a pipeline. And a data scientist might use it in a third way. So right now, it's a mix. I don't think there's clear ownership of who does what, and we're definitely seeing software engineers building data pipelines and data engineers building user facing applications.
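As a sketch of that bulk pattern (the `complete` helper stands in for a single model API call, as above; real pipelines would add batching, retries, and proper rate limiting):

```python
import time

def complete(prompt: str) -> str:
    # Stand-in for a single model API call.
    raise NotImplementedError

def summarize_in_bulk(documents: list[str]) -> list[dict]:
    # The same API a web app calls once per user request, applied
    # document-by-document as a batch pipeline step.
    out = []
    for doc in documents:
        out.append({"doc_head": doc[:60],
                    "summary": complete(f"Summarize in one sentence:\n{doc}")})
        time.sleep(0.2)  # crude rate limiting for the sketch
    return out
```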
I think over time we'll probably get to some best practice or some understanding of how to split all these different components between the different teams. And what I suspect is that we'll see much more multidisciplinary teams tackling this. This has existed for a while now, teams made up of software engineers, data engineers, and data scientists working together, but it was probably the exception, not the rule. I think with generative AI, it's increasingly going to be the rule, if you will, and we'll see those teams working together to build full solutions and applications.
[00:20:00] Tobias Macey:
Another interesting aspect of the ways that generative AI is turning everything on its head is that for a long time, it was the case that data engineering was there to support data science, and then it turned into machine learning. And now we're seeing it come full circle where these generative AI technologies are also being used in the data engineering workflow of pipeline design, transformation generation, code generation. And I'm wondering how you're seeing data engineers start to bring generative AI into that development flow and into the work of building and maintaining the data pipelines that then go on to feed the generative AI?
[00:20:44] Lior Gavish:
Yeah, absolutely. It's kind of what you mentioned, Tobias. It starts from what a lot of engineers are doing right now, which is using various kinds of copilots to accelerate development. In the case of data engineers, you can very effectively use generative AI to build your pipelines, whether using PySpark or SQL or what have you; generative AI can probably accelerate some elements of it. We're not quite there in terms of replacing data engineers with AI, and maybe we'll never get there, but it can certainly make people more productive.
The other thing that's happening, I think, is that it generally democratizes access to these things. The whole text to SQL thing is working pretty decently, which means that certain things that used to be delegated to data engineers, when there was a need to create a new pipeline or a new analysis, can now maybe be done by someone with less technical skill using a generative AI model. So it's kind of democratizing access to data and to pipelines, and that actually frees up data engineers to do the things that they do best rather than answering ad hoc requests.
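A minimal text-to-SQL sketch under the same assumptions (a hypothetical `complete` model helper and a toy schema); grounding the prompt in the real schema is what makes the generated SQL usable:

```python
SCHEMA = """
CREATE TABLE orders (order_id INT, customer_id INT, amount DECIMAL, ordered_at DATE);
CREATE TABLE customers (customer_id INT, region TEXT);
"""

def complete(prompt: str) -> str:
    # Stand-in for a model call.
    raise NotImplementedError

def text_to_sql(question: str) -> str:
    # Ground the model in the actual schema so it can only name real tables.
    return complete(
        f"Schema:\n{SCHEMA}\n"
        f"Write one ANSI SQL query answering: {question}\n"
        "Return only the SQL."
    )

# e.g. text_to_sql("total order amount by region for last month")
```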
I think the most exciting thing for data engineers, though, is actually that generative AI unlocks access to unstructured data. What I mean by that is, and there are plenty of examples, a lot of enterprises have a lot of very useful unstructured data, documents basically, generated in the business, whether it's legal documents or technical documentation or lots of other corpora of information that are useful. If you wanted to make these things useful for the business in the past, a data engineer couldn't do it alone. They would need a data scientist and a machine learning engineer to come in and process that data and, generally, extract structured data from it. If you wanted to process all your legal documents and extract information from them, you actually needed to hire highly specialized data people who would build NLP algorithms, quite like I used to do 15 years ago. The data engineer would help them string things together and build the pipeline, but still, a lot of the work would have to be outsourced, in a way.
And now data engineers can actually do that on their own, especially with models being available natively in tools like Snowflake or Databricks. A data engineer can take a body of legal documents and extract information from them without getting any help. It's as easy as creating a prompt and applying a function to that dataset. I think that's a force multiplier. It opens up opportunities for data engineers to use enterprise data much more effectively, with much less help from the other teams they might have depended on in the past. So that's a pretty exciting shift and change, in my view.
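That "prompt as a function" idea looks roughly like this in a pandas pipeline; the field names and the `complete` helper are illustrative, and warehouse-native equivalents expose the model call directly in SQL:

```python
import json
import pandas as pd

def complete(prompt: str) -> str:
    # Stand-in for a model call.
    raise NotImplementedError

PROMPT = ('Return JSON with keys "counterparty", "effective_date", '
          '"termination_notice_days" extracted from this contract:\n\n{doc}')

def extract_fields(doc: str) -> dict:
    # One prompt, applied as a plain function over every row.
    return json.loads(complete(PROMPT.format(doc=doc)))

contracts = pd.DataFrame({"text": ["This agreement between Acme Corp and ..."]})
structured = pd.DataFrame(contracts["text"].apply(extract_fields).tolist())
```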
[00:24:40] Tobias Macey:
One of the other major ways that AI and machine learning models have found their way into the data engineering workflow is in the context of retrieval augmented generation, where you need to be able to generate the vector embeddings of the data that you want to use for that context. So you have to have some capacity for running those embedding models in that pipeline environment, and you also need to be thinking about considerations such as how you want to generate the embeddings and how to chunk the content that you're embedding. I'm wondering if you can talk through how those new requirements are being addressed in data teams and some of the new skills and training that are necessary to be able to build pipelines that support that RAG use case effectively.
[00:25:39] Lior Gavish:
Yeah, absolutely. As I said, this is incredibly helpful for data teams; there's a lot of value in being able to process unstructured data. And, to be honest, I would dare say that at this point there isn't yet a best practice. You can't buy a book and understand how to build RAG, although I'm sure someone is selling a book. Nobody has really done it at scale yet, and you can't just hire a RAG expert who will tell you how to do it. It's mostly about getting hands on experience, being curious, experimenting, and finding what's right for your particular use case and your particular need. So I probably couldn't give people advice on how to build RAG pipelines at this point, and I've seen so many different approaches, oftentimes highly dependent on the background of the people building the pipelines. Software engineers attack it one way and data engineers do it in a completely different way, and both are very valid right now. Having said that, there are probably a few things that people need to start thinking about in terms of how to build this effectively.
Essentially, how do you go from that prototype phase of taking a bunch of documents, running them through LangChain, and putting them in a vector database, which is something we're all learning how to do, to the next level? There are higher level questions that we're seeing teams increasingly ask. It's things around security and privacy, for example: how do you make sure that whatever access controls apply to that data, and whatever sensitive information is in there, continue to be governed as the data goes through the pipelines?
Over the years, we've developed various methodologies around that in the structured data world, but applying it to unstructured data is quite different, because it's messy. You take a body of documents and, honestly, you don't even know what's in there and how it should be protected. So there's much more potential for leaks and for people having access to things they're not supposed to see. That's definitely an area where people need to develop knowledge and a set of best practices that, again, apply to their business.
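One pattern that seems to be emerging, sketched here with made-up field names, is carrying the source document's access controls along with each chunk and enforcing them at retrieval time, not just at ingestion:

```python
# Each chunk keeps the ACL of the document it came from.
chunks = [
    {"text": "Q3 board discussion notes ...", "source": "board/q3.pdf",
     "allowed_groups": {"exec"}},
    {"text": "Public API changelog ...", "source": "docs/changelog.md",
     "allowed_groups": {"everyone"}},
]

def retrievable_chunks(user_groups: set[str]) -> list[dict]:
    # Filter candidates against the querying user's groups before
    # they can ever reach the model's context window.
    return [c for c in chunks if c["allowed_groups"] & user_groups]

print([c["source"] for c in retrievable_chunks({"everyone"})])
# -> ['docs/changelog.md']
```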
The other one is cost efficiency. Building a prototype will probably not cost you a lot of money, and nobody will care. But if you process the entire corpus of enterprise data and run it in batch every single day in your pipelines, making who knows how many API calls per document, the bill can run pretty high pretty quickly. So data teams need to figure out, first, how to create visibility into and manage that cost, but also how to optimize it, because a lot of use cases end up being exorbitantly expensive. You need to think about how to minimize the number of calls and how to use faster, cheaper models where that applies.
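Even a back-of-envelope model makes the point; the per-token prices below are placeholders, so substitute the current rate card for whatever model you actually use:

```python
# Placeholder prices in dollars per 1K tokens -- check your provider's rate card.
IN_PRICE, OUT_PRICE = 0.00015, 0.0006

def daily_cost(n_docs: int, in_tokens: int, out_tokens: int) -> float:
    return n_docs * (in_tokens / 1000 * IN_PRICE + out_tokens / 1000 * OUT_PRICE)

# A million documents a day at ~2K input tokens each adds up quickly:
print(f"${daily_cost(1_000_000, 2_000, 200):,.0f} per day")  # -> $420 per day
```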
Cost is a thing. And, of course, reliability and quality. These pipelines tend to have a lot of issues there, first and foremost hallucinations. Everybody that's used ChatGPT has probably been sent to a restaurant that never existed after asking for a dinner recommendation. True story. So you need to figure out how to deal with those things, and how to deal with the fact that it's not deterministic. How do you even understand the quality of the output? And even if you painstakingly go and look at a lot of results and evaluate them in various ways, and there are lots of techniques around that, how do you make sure that remains true as you release changes to the pipeline or changes to the underlying model that you're using?
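One common starting point is a pinned "golden set" of prompts with cheap assertions, rerun whenever the prompt, pipeline, or underlying model changes. A minimal sketch, with the checks and `complete` helper as placeholders:

```python
GOLDEN_SET = [
    {"prompt": "From the attached guide, list three restaurants in Paris.",
     "must_contain": ["Paris"],               # cheap lexical checks only;
     "must_not_contain": ["as an AI model"]}, # real suites add semantic scoring
]

def run_regression(complete) -> list[str]:
    failures = []
    for case in GOLDEN_SET:
        out = complete(case["prompt"])
        ok = (all(s in out for s in case["must_contain"])
              and not any(s in out for s in case["must_not_contain"]))
        if not ok:
            failures.append(case["prompt"])
    return failures  # fail the deploy if this list is non-empty
```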
So that's something where I think there are a lot of skills to be learned. I have more questions than answers around this, but it's exciting times, and I'm sure that over the next couple of years we'll gradually learn how to do all these things effectively.
[00:30:35] Tobias Macey:
Yeah, it's funny how much that is the case right now: I have a lot of questions and I have a lot of ideas, but there aren't any established answers yet because all of it is too new. So everybody's flailing around in the dark and making their best attempt, and eventually we'll converge on a set of best practices, but we're not there yet.
[00:30:56] Lior Gavish:
Yeah. Too early to call, I'd say.
[00:31:01] Tobias Macey:
One of the other interesting side effects of the current stage of AI, and the rise of generative AI in particular, is the growth of interest in vector databases and vector indexing. We're seeing new growth in that particular segment, where it seems like every day I hear of a new vector database that's out there and optimized for some particular use case. Vector databases as a technology predate the current craze around generative models, used in particular for things like semantic and similarity search. I'm wondering if there are other ways that AI applications, and generative models in particular, are bringing these new technologies or newly created categories into the data engineering ecosystem, and how those are being used outside of the context of just supporting generative AI models.
[00:32:04] Lior Gavish:
Vector databases, in addition to being used in RAG scenarios, are getting people really excited about traditional search. I talked to a company in ad tech, one of our customers, and there are certain workflows there where you help marketers find relevant places to place ads. Traditionally, that's been done with basically Elasticsearch, some form of keyword search with a lot of bells and whistles around it, but they've gradually introduced vector embeddings and a vector database into that process. Not in the context of RAG, necessarily, but just: if you want to find a place to put your ad about, I don't know, a fitness subscription or what have you, finding places online that match that. And vector databases actually do that extremely well, apparently much better than traditional search.
So there's a resurgence of search now that vector databases are available. Those powerful language models that create the embeddings seem to produce better, higher relevance results with a deeper semantic understanding. That's, of course, been available on Google or other search engines that have been highly optimized over the years, but now smaller use cases can be built quite rapidly with very powerful semantic search, using vector databases and embeddings.
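The gap over keyword search is easy to see with a toy example: "gym membership" shares no tokens with "fitness subscription", so lexical matching scores it zero, while their embedding vectors would sit close together.

```python
query = "fitness subscription"
page = "Affordable gym memberships and personal training plans"

# Keyword search: no shared tokens, so no match at all.
print(any(tok in page.lower() for tok in query.lower().split()))  # -> False

# Semantic search: with an embed() helper like the one sketched earlier,
# cosine similarity between the two vectors would rank this page highly.
```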
[00:34:08] Tobias Macey:
Now, bringing it back around to the question of reliability and quality: we have been fighting that battle for a few years now in the pure business intelligence, data analyst type use case of how to build more reliable systems. Now there are the added stresses and requirements around AI applications and these very probabilistic, nondeterministic use cases. How are you seeing that impact the way teams are thinking about reliability, data quality, and data observability, and some of the ways that you're thinking about that at Monte Carlo to be able to help support those teams?
[00:34:50] Lior Gavish:
Great question. First of all, like everything here, we're still learning. There are more questions than answers, but I'll call out a few things based on what we're seeing. First of all, it all goes back to basics in a sense. Pipelines are pipelines, and you have to make sure they're working reliably. Whether it's vectors or tables, you have to make sure they're getting updated on time. You want to make sure the whole dataset is there and nothing is missing. You also want to make sure there are no duplications, and duplications are particularly bad in vector databases because they really hamper your ability to get the k nearest neighbors and the most relevant documents: you might get 5 copies of the same document if it's duplicated in the database.
So you want to make sure these things are working. And, of course, all the structural stuff. Do you have the vector dimensions you're expecting? Did you use the right embedding model, and a consistent one? It's very important to use the same embedding model in the pipeline and in retrieval. And, of course, there's the metadata around it. Vectors are always accompanied by metadata that helps trace them back to where they came from or associate them with an account or a user or other things, and you need to make sure that the metadata is there and that it's complete and accurate, in the same way that structured data was tested.
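Those structural checks are mechanical enough to automate. A sketch of pre-load validation for a batch of vectors, with the expected dimension and model name as assumptions:

```python
import numpy as np

EXPECTED_DIM = 1536                        # must match the indexing-time model
EXPECTED_MODEL = "text-embedding-3-small"  # illustrative name

def validate_batch(vectors: np.ndarray, metadata: list[dict]) -> None:
    assert vectors.shape[1] == EXPECTED_DIM, "unexpected vector dimensions"
    assert all(m.get("embedding_model") == EXPECTED_MODEL for m in metadata), \
        "mixed embedding models break retrieval"
    assert all(m.get("source_uri") for m in metadata), "incomplete metadata"
    # Near-duplicates crowd the k nearest neighbors with copies of one document.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, 0.0)
    assert sims.max() < 0.99, "near-duplicate vectors detected"
```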
So all these things are there, and of course lineage too. In order to effectively manage quality and reliability, you need to understand where those vectors came from and how they're consumed downstream, all that good stuff in observability. When you have a problem, you need to understand what its impact is downstream, and to find the root cause, you need to understand what's upstream. All these things are classic practices that translate from structured data to unstructured data pretty directly.
The one thing that's a little bit different is how you measure the quality of the data itself. In the structured world, we've developed a lot of methodologies. We can calculate a lot of quality metrics, like how many nulls you have and how many duplicates and all sorts of these things. And we can have people that understand the pipeline build more precise metrics around it that take into account a deeper understanding of the dataset and the business. That doesn't translate as well to the text or image data that goes into vector databases.
But we're increasingly seeing methods to deal with that. As an example, we're seeing people do things like, well, I can probably use generative AI to calculate quality metrics about unstructured data. I can take all the texts that I have and use a model to, for example, classify them into topics, then track what topics I have in the dataset and make sure that's stable and behaving as expected over time, as the dataset changes and shifts or goes through the pipeline. I can also use generative models to determine whether there's sensitive data or PII inside the dataset, and track that over time, and so on and so forth. You can get really creative with how you use generative AI to create quality metrics on top of unstructured data. I think it's a very promising avenue, but of course there's a lot to be learned there.
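A sketch of that topic-stability idea, with a hypothetical `complete` helper doing the classification: the topic mix becomes a metric you can track run over run, the unstructured analogue of a null-rate check.

```python
from collections import Counter

TOPICS = ["billing", "security", "product", "other"]

def complete(prompt: str) -> str:
    # Stand-in for a model call.
    raise NotImplementedError

def topic_mix(texts: list[str]) -> dict[str, float]:
    labels = [complete(f"Classify into exactly one of {TOPICS}: {t}") for t in texts]
    counts = Counter(label.strip() for label in labels)
    return {t: counts.get(t, 0) / len(texts) for t in TOPICS}

# Alert when today's mix drifts sharply from yesterday's baseline.
```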
It's complicated, and it's early days. If I'm honest, if you talk to data teams building generative AI pipelines today, a lot of it still relies on people eyeballing results and trying to make sure that they make sense. It's not always easy to scale and automate these things, so we're gradually learning how to take bites out of that manual process and turn it into a more consistent and automated approach.
[00:39:40] Tobias Macey:
As you have been navigating the rapidly changing landscape of data in the face of generative AI, the ways that data teams are working to adapt, and the ways that you at Monte Carlo are trying to help support them, I'm curious: what are some of the most interesting or innovative or unexpected ways that you're seeing AI impact those data engineering teams?
[00:40:04] Lior Gavish:
Yeah, there are so many exciting examples here. I'll name a few. One really exciting trend is that the fact that we're unlocking unstructured data is bringing data engineering closer to the forefront of decision making. The example I like to quote here is a professional sports team, one of our customers. In the past, data engineers could provide some structured data about players, certain statistics about how they performed in matches. But that data is pretty sparse, and it's only available in the top leagues.
If you're trying to spot the next generation of talent, the people coming from lower leagues or from high schools, and you're trying to scout the next star, it's really, really hard to get structured data about that, and it can be extremely unreliable. But there is a lot of unstructured data: lots of scouts out there writing reports about players all around the world. And so data engineers were actually able to use AI to parse that unstructured data and, in fact, structure it: measure sentiment, and then use the methodologies of structured data to extract intelligence out of it. You can benchmark the same player over time. You can benchmark the scout. You can really get a lot of good insight, and that brought the data engineers to the forefront for the sporting professionals and made them superstars, essentially, automating something that was extremely hard to do before. Another example is generative AI actually helping data teams get in front of the customers of their organization.
I'm thinking here about a cybersecurity company. In cybersecurity, there is a lot of unstructured data: things like security policies, various documents in the enterprise, and lots of exchanges of documentation between companies trying to deal with each other and buy technology from each other. That data team was able to do a bunch of different cool things with generative AI, whether it's responding to questions based on a body of documents that describe all the security policies or, in compliance, translating policies into controls automatically, things that are really high value for that company's customers and that were built almost exclusively by data engineers.
So it really brought them to the forefront: no longer a platform team working for analysts, but rather serving the company's customers directly. That was pretty exciting. And then I even saw one data engineering team that was able to generate new revenue for the company in a pretty direct manner. It's an energy company. They get requests for quotes all the time, people asking them to provide a certain supply of energy at a certain time and location, and historically they haven't been able to serve all of those requests for quotes.
They get a lot of these, and that's potential revenue: if they're able to respond to a quote, it might turn into a customer. So they took this very manual process, where humans take in request after request and try to respond, and automated it using generative AI, and the data engineers were suddenly able to generate a whole lot of new revenue just from automating that process and taking care of the manual labor attached to it. So I think the point is, data engineering is becoming even more valuable and impactful in the business, more than it's ever been. And that's exciting news.
[00:44:43] Tobias Macey:
And in your own work of building a product that is supporting people in this space, trying to understand the impact that AI is having on data and the ways that teams are building these systems, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:45:01] Lior Gavish:
I think there have been a couple of big challenges. One is understanding where and how generative AI can be applied to serve our customers. We brainstormed, I want to say, over 70 use cases in data observability that generative AI could support, and it's pretty challenging to understand where it works well, where it can be applied to maximum effect, where it can make the most impact, and where it can work as reliably as we need it to. Which leads to the other challenge: it's not incredibly hard to get to really cool demos with generative AI.
For almost any use case you can think of, you can find a few examples where generative AI really blows your mind in terms of what it can do, which I think is why the world is so excited about it. It's a whole other ballgame to bring it into production, to make it work in an environment where you can't predict all the inputs, or maybe you can predict them, but the models are nondeterministic, so they work only part of the time. There's a lot of art and science that goes into figuring out how to make these generative models work well enough that they can serve a human trying to accomplish a task in their day to day life. And I think the biggest learning there was really the combination between generative AI and more deterministic approaches: for example, using generative AI to get an answer, but then validating it against heuristics or statistical models in order to make sure that the response makes sense and is valid for a human to use. Those are probably the biggest challenges I've experienced personally while building with generative AI.
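That generate-then-validate combination can be as simple as gating every model response behind a deterministic check. A sketch, with `complete` and the validator as placeholders:

```python
def guarded_answer(prompt: str, complete, is_valid, retries: int = 3):
    # Generative step plus a deterministic gate, e.g. "does this restaurant
    # actually exist in our places table?" implemented as is_valid().
    for _ in range(retries):
        candidate = complete(prompt)
        if is_valid(candidate):
            return candidate
    return None  # fall back to a deterministic path or a human
```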
[00:47:20] Tobias Macey:
For people who are trying to navigate this current stage of the data ecosystem, what are some of the cases where you're seeing AI as just being the wrong choice, and not something that is worthwhile to invest in or try to incorporate into your data stack?
[00:47:41] Lior Gavish:
Good question. It's not that it's the wrong choice outright, but the common pitfall I see is the Microsoft approach of, let's throw a copilot or a chat interface on every problem, and just assume that what people want is to interact in natural language with whatever they were doing before. That's sometimes valid, and it can be useful in certain scenarios. To use an example from data engineering, text to SQL can be very effective, but you really have to think about the end user. Natural language questions are not the answer to every problem.
Sometimes people would just rather interact with interfaces the way they are. Sometimes that's easier and simpler and works more reliably. So I'd say it's important to think about where generative AI really allows people to do something they haven't been able to do before, and where generative AI has an unfair advantage. We talked about some of these things: extracting structured information from unstructured data is really exciting, something people couldn't do before. But interacting in natural language with a dashboard is maybe not as groundbreaking as you might think. Sometimes it is, but oftentimes it's not. So I would just think about where generative AI can really stand out and have a competitive advantage, quote, unquote.
And the answers are not always obvious.
[00:49:40] Tobias Macey:
What are some of the trends that you're keeping an eye on or some of the predictions that you have on the ways that AI is going to impact data engineering in the medium to long term?
[00:49:53] Lior Gavish:
Good question. I think we'll continue to see some of the things that we've already talked about. It'll get easier to build pipelines, it'll get easier to democratize data, and it'll get easier to put data engineers to work on the hardest, highest value problems. That's definitely happening and will continue to happen. We'll also continue to see more unstructured data being processed and fully put to use by data engineers, which I think is probably the most exciting element of generative AI.
And then, as a result of those two things, I think it'll bring data engineers closer to the customer, closer to the revenue, closer to decision making, in a way that data engineering teams have never been before. So we'll continue to see that in the long term.
[00:50:56] Tobias Macey:
Are there any other aspects of this topic of AI and the impact that it's having on data engineers and data engineering teams that we didn't discuss yet that you'd like to cover before we close out the show?
[00:51:08] Lior Gavish:
I'll probably, again, kind of go back to the basics. Generative AI is cool and shiny, but in order to make it work in the real world, we have to do all the things that we know we have to do. We need to build solid pipelines. We need to make sure they're cost effective, that they're properly governed from a security and compliance perspective, and, of course, that they're reliable and high quality, because at the end of the day, those models, as smart as they are, can't overcome a security breach.
They can't overcome garbage data, and they won't be successful if they're cost prohibitive. So we need to make sure all these fundamentals work well. At Monte Carlo, we're excited to tackle the reliability and quality aspects of it, and we're also excited to see how other vendors and stacks will help with the two other challenges. So, yeah, back to the basics.
[00:52:15] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:52:38] Lior Gavish:
I think it's not necessarily one tool, but with the themes that we've spoken about today in mind, I think we're seeing a consolidation of the data engineering stack and the software engineering stack. And I'm curious to see, and would like to see, easier ways to marry together those data pipelines and those customer facing applications in the same platform, working nicely together. I think there are some interesting opportunities in stitching those two worlds together and making the whole system work nicely for a team where there's a software engineer, a data engineer, and a machine learning engineer all working together to build a single platform, a single application.
[00:53:33] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the ways that you're seeing AI have an impact on, and for, the data engineering teams out there. I appreciate all of the time and energy that you and your team are putting into helping to support those folks. So thank you again for taking the time today, and I hope you enjoy the rest of your day.
[00:53:55] Lior Gavish:
Thank you, Tobias. Super fun.
[00:54:05] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Lior Gavish
Evolution of AI and Its Impact on Data Engineering
New Requirements for Data Platforms with Generative AI
Components of the Generative AI Stack
Roles and Responsibilities in AI-Driven Data Teams
Generative AI in Data Engineering Workflows
Challenges and Considerations in RAG Pipelines
Vector Databases and Their Growing Importance
Ensuring Reliability and Quality in AI Pipelines
Innovative Uses of AI in Data Engineering
Challenges in Implementing Generative AI
When AI is Not the Right Choice
Future Trends in AI and Data Engineering
Final Thoughts and Closing Remarks