How Generative AI Is Impacting Data Engineering Teams

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

Data lakes are notoriously complex.

For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end to end data lake has platform built on Trino, the query engine Apache Iceberg was designed for.

Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake.

And Starburst is trusted by teams of all sizes, including Comcast and DoorDash.

Want to see Starburst in action? Go to data engineering podcast.com/starburst

today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey. And today, I'm welcoming back Lior Gavish to talk about the impact of AI on data engineers. So, Lior, can you start by introducing yourself?

Hi, Tobias. Thanks for having me today. I'm Lior. I'm the cofounder of of a company called,

Monte Carlo.

We're the

data observability company, which means we we help data teams, data engineers, data analysts, data scientists

create reliable

and and trusted data for for whatever they're using, whether it's analytics or machine learning or increasingly AI.

We've been around for about 5 years. We're now serving over 400

data teams and

and growing quickly. And, you know, we love

everything data, so it's it's always exciting to to be to be on the show.

As I mentioned, you've been on a couple of times before, but for anybody who hasn't heard your past appearances, which I'll link in the show notes, can you just refresh our memories to how you got started in data?

Oh, absolutely. I got started in data, I I wanna say over 15

years ago

as what you would typically call today, a machine learning engineer. I was actually building

NLP models to help

summarize and and and classify,

news articles.

That's how I started. I then

went on to start my own company in in in cybersecurity

and used

analytics and machine learning to, solve certain kinds of of of fraud

use cases.

Company got acquired by a company by by a a bigger, larger than public

cybersecurity

firm, and I went on to to lead the,

the data and engineering teams. There are about a 100 people,

you know, building various kinds of of real time protection from fraud and and cyber attacks

and heavily relied on on analytics and machine learning. So kind of a,

my background is a combination of of engineering

and data,

and therefore also my my interest in in how do you operationalize

data and and how would you make it actually

as reliable and as trusted as it could be when it serves, know, real time applications

and and sometimes millions of users. So

that's that's my, background and gist.

It's always interesting

in this world that we're in now of large language models and generative AI hearing the terms natural language processing and the fact that you were working so hard to be able to summarize those news articles when now people just say, oh, I'll just throw it at chat GPT, and it does it for me. But there are also, of course, the the issues around quality and accuracy in those summarizations.

So

Yeah. Scratches on my back. What I,

spent, you know, months on end trying to to build with very custom and and and and specialized algorithms you can now do with

an API call that that costs,

you know, fractions of of a cent. So, there you go. There's some progress in the world.

And

nowadays, whenever somebody says the word AI,

the assumption is that we're talking about chat TPT and the like. But for purposes of this conversation,

when we're talking about the impact that the current phase of AI is having on data teams and data engineers in particular. I'm wondering if you can

give some clarifying detail about what you mean when we're saying AI for the purposes of this conversation.

It's a great point. I think,

you you know, AI the the debate of AI and ML and what's AI has been going on for for, for as long as I can remember.

And it's always been there's been a lot of opinions on it.

I I I think for to or, you know, for for the sake of this discussion, we could probably think of AI as, generally speaking, the the generative AI models that emerged in the last,

you know,

almost 2 years now. So think, you know, ChatGPT,

OpenAI, and now more recently,

you know,

various kinds of of lamas and

and and, mistral and mistropic and whatever.

I think those

models introduce kind of a a a foundational shift in how we think about using

AI. And

I'm not sure we're super close, but it's it's probably probably the the most promising avenue to

true,

you know, machine intelligence

potentially getting closer and closer to human level intelligence. And so for, you know, for right now, I'm thinking about AI mostly as those large language models that have been introduced pretty recently.

Machine learning

went through a bit of a resurgence

5 years ago thereabouts

with the idea of ML engineers and

the, the work of actually bringing machine learning into the

real time application

experience for end users and MLOps. So there's been a lot of that conversation happening for a little while now.

With the advent

of generative

AI and the new demands that they're placing, particularly thinking in terms of things like prompt engineering, retrieval augmented generation,

the fact of the models themselves being so much bigger. I'm curious if you can start by

giving a bit of an overview of what you're seeing as the new requirements and the new features that are required in data platforms

for being able to support these new categories of model in an operational environment?

Absolutely.

You're right. About 5 or 10 years ago, there's been a a a an explosion of tools to bring,

you know, machine machine learning into production, whether it's,

feature stores and models serving frameworks and

and and model versioning and,

all kinds of different technologies.

Journey of AI has brought in, you know, a new stack and and, I'm happy to go through the different

components of that stack. I think 1st and foremost, and and and and the thing that that probably many of our listeners have have used is or or the model APIs. Right? And OpenAI did something

very cool, which didn't exactly exist in the ML world. It basically made

models

a commodity by serving them,

as an API. Right? You can make a very simple

HTTP request to OpenAI

and and now to many other providers

and get access to the

latest and greatest,

model that's been

trained by, I mean, 500 PhDs

and and a $1,000,000,000

worth of GPUs. Right?

And and and that's a really core part of of the stat

that that everybody's familiar with. Having said that, over the last 2 years, there's there's been a lot of other components,

that kind of emerged,

to help teams build with with generative AI. 1st and foremost,

Rag, you mentioned it. It's this idea of, well, how do we add

long term memory

and other capabilities to to these large language models?

Retrieval,

augmented generation is is the

the probably the most dominant,

way to do so.

And it's this idea of, you know, let's take a lot of,

data, typically unstructured data,

but but sometimes structured data, put it in a database, and make it available to the model,

while,

it is,

responding to to to user prompts.

Right? So for example,

if I a very simple example,

if I want, to use a model to answer,

a lot of questions about,

my, you know, documentation, my developer documentation,

I can put all of these documents that have existed, that they've been built over the years into,

a database.

Specifically, I might use a vector database.

And then,

when the model gets a new question from a user about how to do this or that,

it might use the database to retrieve

relevant documents to the user's prod

and then create, and then, create an answer to the to that question using those documents.

So, you

know, typically by some form of of summarization,

or extraction.

And so RAG is kind of a new

has prompted,

if you will,

a bunch of different technologies, most prominently the the vector databases that that probably many have heard of,

but also,

you know, frameworks and libraries that help build those RAD applications.

Next in line is probably,

various tools for fine tuning.

It's, fine tuning is this idea of,

you know, let's take a large language model that's been trained over maybe the entire

Internet and all of the books ever written and, and whatever it is. And then maybe let's customize it to, you know, to a more specific need. Maybe I want to have Yeah. Specific knowledge about some topic that's,

that's very relevant to to to my business problem.

You can basically use,

a set of of documents as a way to fine tune the model and and add a certain,

a certain specialization to it.

There's a bunch of tools that that will help you

do that effectively.

Most prominently,

you know, APIs

coming, you you know, coming from the model providers

that allow you to to to add a dataset,

train a model, and then serve that model.

But then

also,

again,

other software tools to to make that easier.

The the 4th,

technology that's that's emerging in terms of of generating AI applications are,

or what I broadly call

orchestration technologies.

So think frameworks like langchain,

but there's increasingly more and more frameworks,

agent frameworks,

prop management tools, all kinds of different tools that help you,

basically create an application

using generative AI. It allows you to, to make a series of, or to orchestrate a series of calls to to models,

you know, using the the outputs of a call to, to trigger another call, using,

RAG in in in the mids,

and combining models and and prompts and and and and

and and information from a database and kinda interesting ways to create to create higher level abstractions, if you will,

of of these models.

So that's that's been, pretty exciting too.

And then,

those 4 probably

allow,

from what I've seen, allow users to create pretty

sophisticated

applications using Journey of AI.

But then as you take these things to production,

and expose them to, you know, to a broader set of users, potentially

external users outside of your own company, there's there's a couple more,

sets of tools that come up.

1st and foremost,

security.

Journey of AI opens

a lot of

possibilities in terms of things that could go wrong in terms of security and privacy. And so we're seeing more and more

tools that,

help you manage that in real time.

Generally speaking, AI firewalls,

that's kind of interesting.

And then, of course,

you know, the topic that that's that's near and dear to my heart,

the the reliability and quality tooling,

whether it's, you know, observability

of the type that that we work on in Monte Carlo. It's, you know, the idea of monitoring

the quality of of of the Gen AI system in production,

but also,

obviously, preproduction tools that help with

evaluating new versions of an application or new versions of the model,

etcetera, etcetera. And so all these kinda different tools are are coming up. Lots of companies that are trying to build those. Lots of teams are trying to build those in house.

Very exciting.

I would generally categorize,

you know, I think it's just helpful to remember

how you might get those tools. I think there's kinda 3 big options

here. You can either buy those tools from cloud providers,

like AWS, GCP, Azure,

and and increasingly OpenAI.

And that that's what most most teams do as far as I can tell.

Then,

increasingly,

there's good,

you know, end to end offerings from from the data clouds, from Snowflake and Databricks, both

offer a pretty robust set of tools across,

all these categories,

to help data teams build with AI.

And then you'll, of course, find specialized solutions for different different parts of the stack. So whether it's, you know, Pinecone,

for vector databases on on the Rag side or or Monte Carlo on the observability side.

But of course, when when you need to upgrade from the from the basic version to the to advanced use cases,

you you can definitely find, you know, highly

specialized and and professional tools for for each part of the stack now.

So, yeah, it's,

the the the there's a lot that's coming into the stack to enable,

generative AI,

and and lots of new technologies, lots of new tools.

Having said that,

the,

I do still think the foundation is is, you know, the classic data pipelines. Right?

The the the one thing that that that I think people are realizing is that

no matter how you go about Genov AI,

the the core piece is,

marrying those models with

with the data that you manage

anyways. Right? If you don't combine your own data with the models,

you're basically building

a a commodity. Right? Something that ChatGPT could do.

The whole point is how how do you create a a unique, specialized, personalized experience for your own users that,

that heavily relies on your own data? And then,

what everybody ends up doing is building

lots of data pipelines,

to feed those models with with their own proprietary data and and and

and make those models useful in the context of their own business and and and, you know, their own users.

There are a lot of different pieces

that you talked through there.

Some of them are new infrastructure components. Some of them are just new practices and how to actually manage the model and the end user application.

I'm curious what you have seen as the

ways that teams are thinking about which of those components are the responsibility

of the data engineers, which are the responsibility of ML engineers, if you have them, which of them belong to

the operations and infrastructure teams, which of them belong to application engineers, and just some of the ways that you see the the breakdown of who owns which piece?

Such a good question.

It's a mele right now.

It's so hard. There there isn't I I wouldn't say there's an established

path to building generative AI in every company.

Does it slightly differently based on,

you know, the the specific,

you know, org structure and talents that that exist in that team.

To be clear, you know, there's

you need all,

Right? There's an element of of software engineering here because you are building an application, typically a a web service,

typically with some form of of user facing application.

There's a good element of of data engineering here,

with data pipelines as as we as we discussed.

There's some element of of machine learning engineering,

of

ML engineering and

and and data science here when it comes to

exploring the models and understanding how to use the data. And and and also,

you know, to be honest, generative

AI, at least right now is most effective when it's combined with more,

traditional

ML and and sometimes deterministic approaches.

The the combination is actually very powerful and data scientists are actually good at at at and making this whole thing,

work nicely together. And, of course, there's, you know, product managers and product designers involved because, again, it's it's it's an application.

It needs to work in in a way that makes sense for for its,

consumers.

And so what I typically see is

all of those teams involved,

in various capacities

and and and everybody focusing on their, you know, on their own pieces.

I think

the

and and and and and all those teams also employ

those different pieces of the stack in in different ways. Just to give you an an example,

a a software engineer might use,

you know, a a a a a model API, right, something like OpenAI,

to,

to generate

a

response sometimes maybe in in real time to to a user prompt.

And a data engineer might use that same API to

process

text documents in bulk as part of a pipeline. Right? And a data scientist might might use it in in, you know, in a in a third way. And so

right right now, it's it's a mix. I don't think there's clear

ownership of who does what, and we're definitely saying, you know, software engineers is building data pipelines

and data engineers building

user facing applications.

I think over time, you know, we'll probably get to to some,

you know, some best practice or some understanding of of how

to, you know, how how to split all these different components,

between the between the different teams. And and what I suspect is, you know, we'll definitely see

much more

or multidisciplinary

team tackling this. Right?

And,

you know, and and and this existed for a while now, teams that are made up of software engineers, data engineers, data scientists

working together,

but it was probably the the exception, not the rule.

I think with generative AI, it's increasingly going to be the the rule, if you will, and and we'll we'll see those teams kind of working together to to build, you know, solutions and and full applications.

Another interesting aspect of

the ways that generative AI is turning everything on its head is that

for a long time, it was the case that data engineering was there to support data science, and then it turned into machine learning.

And

now we're seeing it come full circle where these generative AI technologies are also being used in the

data engineering

workflow of pipeline design,

transformation generation, code generation.

And I'm wondering how you're seeing data engineers start to bring generative AI into that development flow and into the work of building and maintaining the data pipelines that then go on to feed the generative AI?

Yeah. Absolutely. I think the

it's kinda what you mentioned, Tobias. Right? Like, it starts from from what a lot of engineers are are doing right now, which is use, you know, various kinds of copilots,

to accelerate

development. And in in in the case of data engineers,

you you can

very effectively use

generative AI to to build, you know, your pipelines, whether using

PySpark,

or SQL,

or or what have you,

generally,

I can probably accelerate some some elements of it.

We're not quite there in terms of, you know, replacing data engineers with AI. We may we would never get there,

but it can certainly make people more more productive.

I think it can even or or the other thing that's happening a little bit,

I think it it

generally, I democratizes

access to to these things. Right? So,

you know, it it the the whole,

text to SQL

thing is working pretty

pretty decently,

which means that maybe certain things that,

used to be delegated to to data engineers,

when there's a need to create an, you know, a new pipeline or an or a new

analysis.

Now maybe someone with less technical skills can do it using generative the ad model. So it's kinda democratizing access to data and to pipelines, and and that actually frees up data engineers to to,

to do the things that that they do best rather than, you know, answering to ad hoc,

requests.

I think the

most exciting thing though for

for data engineers

is actually that,

generative AI unlocks,

access to unstructured data. And what I mean by that is

imagine a lot or,

and and there's plenty of examples, but a lot of enterprises, a lot have a lot of very useful

unstructured

data

documents, basically,

generated in the business,

you know, whether it's legal documents or,

you know, technical documentation

or,

or lots of other corpus of information that that are useful.

If you wanted to to make these things useful for the business in the past, you know, a data engineer can do it. They would need, you know, a data scientist and a machine learning engineer

to come in and and process that data and extract,

generally, extract structured data from it. Right? If you wanted to process all your legal documents and and and extract information from there,

you actually needed to hire, you know, highly specialized data people that would build,

you know, NLP algorithms,

quite like I used to do

15 years

ago in order to do that. And the data engineer would help them kinda string things together and build the pipeline, but but still

a lot of the work would would have to be,

outsourced in a way.

And and now data engineers can actually do that on their own. Right?

Especially with with models being available,

natively in in in tools like Snowflake or Databricks.

A data engineer can take, you know, a body of legal document

documents and extract

information from there without getting any help. It's as easy as as creating a prompt and and applying a

a function

to that dataset. Right? And so

I I think that's a that's a force multiplier. Right? It it opens up

opportunities for data engineers

to use enterprise data much more effectively,

and with much less help from

from from other teams that that they might have depended on in in the past. So that's

that that's a pretty exciting shift and change, you know, in in in my view.

One of the other major ways that

AI and machine learning models have found their way into the data engineering workflow

is in particular in that context of retrieval augmented generation where you need to be able to

generate the vector embeddings of the data that you want to use for that context. And so you have to have some capacity being able to run those embedding models in that pipeline workflow and pipeline environment,

and you also need to be thinking about what are the

considerations around how I want to generate the embeddings, what do I need to to be thinking about as far as chunking of the different sizes of embeddings that we're creating. And I'm wondering if you can talk to

how those new requirements

are being addressed if in data teams and some of the

new skills

and new training that's necessary to be able to

build pipelines to support that RAG use case effectively.

Yeah. Absolutely. I think

the

so

as I said,

incredibly helpful for data teams, right? There's a lot of value in being able to, to process unstructured data.

And, you know,

to be honest,

I would dare say that at this point,

there isn't yet, you know, a best practice. Like, you you can't,

you know, buy a book and understand how to how to build brag. Although I'm sure someone is selling a book.

Nobody really has done that a ton. You can hire,

you know, a rag expert that will tell you how to do it. It's mostly about

getting hands on experience

and being curious and experimenting and finding,

you know, what's what's right for your particular use case and your particular,

need. And so, you know, I'm I I probably couldn't give people advice on on on how to build RAG pipelines at this at this point, and and I've seen

so many different approaches,

oftentimes highly dependent on the background

of the people building the pipelines. Right? Software engineers stack it in a certain way

and then data engineers do it in a completely different way and, you know, and both are very valid right now. I think that the the

having said that, there's there's probably a few things that people need to start

thinking about,

in terms of how to build that effectively.

Right? And, essentially, like, how to go from that prototype phase of, you know,

take a bunch of documents, run them through, lang chain

and and create a and and put them in a vector database. That's, you know, that's something that that we're all learning how to do,

to take it to the next level.

There's kind of a higher level questions that need to be

that that that we're seeing

teams increasingly ask. It's things around

security and privacy,

for example. Like, how do you make sure,

that, whatever

access controls

apply to that data and, you know, whatever sense of information is there

continues to be governed

as it as it goes through data pipelines.

Over the years, we've we've developed various methodologies

around

around that in the in the structured data world, but

applying it with unstructured data is is quite different,

because it's messy, because you take a, you know, a body of documents and you, honestly, you don't even know

what's there and and how it should be protected.

So there's much more potential for leak and for people having access to things they're not supposed to see. So,

that's definitely an area where where

people need

to to to develop, you know,

a a a knowledge and a set of best practices around it again that that apply to their business.

The other one is cost,

efficiency.

You know, again, building a prototype will probably not cost you a lot of money and nobody would care.

But now,

you know, if you process

the the entire corpus of of enterprise data

and and run-in batch

every single day in your pipelines and and, make

I don't know how many API calls per document.

The bill can run pretty high pretty quickly.

And so data teams need to

to figure out how to,

first, how to create visibility and manage that cost, but also how to optimize it because a lot of use cases end up being

exorbitantly expensive. And so you need to think about how to how to minimize the number of calls and then how to potentially use faster, cheaper models where where it applies.

Cost is a thing. And, of course,

reliability and quality. Right?

These these these pipelines tend to,

you know, have have a lot of issues with that. Right? It's first and foremost hallucinations. Right? Everybody that use chat GPT is probably

gone to a restaurant that that never existed.

They asked for for a recommendation for dinner and true story. And

so so you need to figure out how how to deal with those things, how to deal with the fact that it's not deterministic.

Right? Like, how how do you even understand

the the quality of the output? And even if you painstakingly go

and and and look at a lot of results and evaluate them in in various ways, and there are lots techniques around that. How do you make sure that that remains true as you release changes to the pipeline or changes to the underlying model that you're using and and things like that?

And and and so that's,

that's

something that I think, you know, there's a lot of skills to be learned around.

So, yeah, lots of things to learn. I have more questions than answers,

around this,

but but, yeah, it's exciting times. And then I'm sure that that over the next couple of years, we'll, you know, we'll we'll gradually learn how to how to do all these things effectively.

Yeah. The it's

it's funny how much that is the case right now of I have a lot of questions and I have a lot of ideas, but there aren't any established answers yet because all of it is too new. And so everybody's

flailing around in the dark and making their best attempt, and eventually, we'll convene on a set of best practices, but we're not there yet.

Yeah. Too too early to call, I'd say.

One of the other interesting side effects of the current

stage of AI and the the rise of generative AI in particular is the growth and interest of

vector databases, vector indexing.

So we're seeing a new growth

of that particular segment where it seems like every day I hear of a new vector database that's out there

and optimized for some particular use case.

Vector databases

as a technology

predate the current AI craze of generative models

for, in particular, things like semantic and similarity searching.

And I'm wondering

if there are any other aspects of

the ways that

AI

applications and generative models in particular

are

bringing some of these new technologies

or newly created

categories into

the data engineering ecosystem

and how those are being used outside of that context of just supporting generative AI models?

A vector databases

generally use or in addition to being used in kind of rag scenarios,

I think

people are really excited about,

you know, the traditional search.

I've, you know, I've talked to this company that that,

works you know, that's an ad tech,

one of our customers, and

and, you know, there there's certain

workflows there where you help

marketers

find relevant

places to,

to place ads. And,

you know, traditionally

that's been done with basically elastic search, right? Some some form of of keyword

search with with a lot of bells and whistles around it, but and and they've gradually introduced

vector embeddings and and and then, vector database that process. Right? And and not not in the context of rag,

necessarily,

but just, you know, if you wanna find a a place to

to put your ad about, I know,

fit fitness subscription or or what have you finding, you know, places online that that match that with and and and and vector databases actually do that

extremely well,

apparently

much better than

than than traditional search.

And so

there's a a resurgence of of of search

now with with with vector databases available. And and, of course, those powerful

language models that that create the embeddings,

it seems to create, you know, better, higher relevance results with kinda deeper a deeper semantic understanding.

That's, of course, been available, you know, on on Google or, you know, in other,

search engines that that have been highly optimized over the years. But now,

you know, probably, you know, smaller use cases,

can be built quite rapidly,

and have very, very powerful semantic search,

using vector databases and embeddings.

Now bringing it

back around to the question of

reliability,

quality.

We have been

fighting that battle for a few years now in just the pure

business intelligence,

data analyst type

use case of how do we build more reliable systems.

Now with the added stresses and requirements around AI applications and these

very probabilistic and not deterministic

use cases? How are you seeing that impact the way teams are thinking about

reliability, data quality, data observability, and some of the ways that you're thinking about that at Monte Carlo to be able to help support those teams?

Yeah.

Quick question.

First of all, like everything, you know, we're still learning.

There there's,

there there's more question than answers, but I will call out kind of based on on what we're seeing.

First of all, I go it it it all goes back to to basics in a sense. Right?

Pipelines are pipelines, and you have to make sure they're they're working reliably. Right? You have to make sure that

vectors or or tables,

you have to make sure they're getting updated on time.

You wanna make sure that

you have all the dataset there that's that nothing is missing. You also wanna make sure there's no duplications, and duplications are particularly bad in in vector databases because they really

hamper your your ability to get, you know, the the k nearest,

neighbors and and and get the most, you know, the most relevant doc documents there because you you might get 5 copies of the same document if if you have it in the database.

And so you you wanna make sure these things are working. And, of course, all the all the structural stuff. Right?

Do you have the

the right vector dimensions that you're expecting, and

did you use the right embedding model and a consistent embedding model, right? That

that, it's it's very important to use, the same embedding model in the pipeline and

engineering retrieval. Right?

And of course, the metadata around it. Right? Vectors are always accompanied by metadata that helps, you know, trace them

back,

to to where they came from or or associate them with with, you know, with with an account or a user or or other things. And you need to make sure that the metadata is there and it's it's complete and and it's accurate in in the same way that structured data,

was tested.

So so all these things are there,

and and of course,

lineage too. Right? In order to effectively manage the quality and reliability, you need to understand,

where where the date where those vectors came from,

how they're consumed downstream,

all that good stuff in in observability. Right? When you have a problem, you you need to understand what its impact downstream is and and

and what and and and to find the root cause, you you need to understand what's,

what's upstream. So all all these things are kind of classic things that,

translate from from structured data to unstructured data pretty directly.

The one thing that's a little bit different is, well, how you how you measure the the quality of the data itself.

And and in the structured world,

you know, we we we develop a lot of methodologies.

Right? Like, we can,

calculate a lot of quality metrics, like, you know, how many nodes you have and how many duplications

and,

and and all sorts of these things.

And we can have people that

understand the pipeline build, you know, more more precise metrics around it that maybe take into account, you know, a deeper understanding of of the dataset and and the business.

And that

doesn't translate

as well to text or image data that goes into vector databases.

But we're we're seeing,

increasingly methods to deal with that. So as an example,

you know, you

we're we're seeing people doing things like,

well,

I could probably

use generative AI

to calculate quality metrics about

unstructured data.

As an example,

I could take all the texts that I have,

and I could use

a model to,

for example, classified into topics, right, and track what topics I have in the dataset and make sure that's

stable and that's behaving

as expected,

over time as the dataset changes and shifts or goes through the pipeline.

So that's an example. I could also use generative models to determine whether there's

sensitive data or PII

in inside the set, right, and and track it over time and so on and so forth. You can get really creative with with how you

use generative AI to create quality metrics on top of unstructured data. And and I think it's a very promising

avenue,

but of course, a lot to be learned there.

It's

complicated and it's early days.

And, you know, if I'm honest,

if you talk to,

data teams building generative AI pipelines today,

You know, a lot of it still relies on people

eyeballing results,

and trying to make sure that that they make sense. It's it's not always easy to to to scale and and and and automate these things. And so we're gradually learning how to take bites off of that manual process and and turn it into

a more consistent

and automated approach.

As you have been

navigating

the rapidly changing landscape of data in the face of generative AI and the ways that data teams are

working to adapt and ways that you at Monte Carlo are trying to help support them. I'm curious. What are some of the most interesting or innovative or unexpected ways that you're seeing

AI impact those data engineering teams?

Yeah.

So many exciting

examples

here.

I'll name a few.

I think

one,

one really exciting trend is

I think that the fact that we're unlocking structured data is bringing data engineering

closer to the forefront of of decision making.

And and the example I I I I like to quote here is,

you know, talking to this

sports team professional sports team. It's one of our customers.

And,

you know, in the past, you know, data engineers could provide,

some structured data out there about players.

Right, certain statistics about how they performed in matches. But but that data is pretty sparse. Right? And it it's only available

in the in the top

leagues.

But if you're trying to spot the next, you know, the next generation of talent,

the people coming from lower leagues or from high schools, and you're trying to

scout,

you know, the next star,

it's it's really, really hard to get structured data about that, and it can be extremely unreliable.

But there is a lot of unstructured data.

Lots of scouts out there writing reports about players, you know, all around the world.

And so data engineers were actually able to use

AI to parse out that unstructured data

and

in fact, structure it. Right? Measure sentiment there,

and then user use, methodologies

of structured data to extract

intelligence out of it. Right? You can benchmark the same player over time. You can benchmark

the scouter. You can really get a lot of good insight, and

that really brought the the data engineers to the forefront of the, you know, of the of the sporting professionals,

and, you know, made them made them superstars,

essentially. Right? And automated something that was extremely hard to do prior. The you know, and and another example

where

generally they actually have data teams to get in front of the

customers of that organization.

And and I'm thinking here about

a a cybersecurity

company.

And in cybersecurity, there there is a lot of unstructured data, things like

security

policies and various documents in the in the enterprise

and lots of exchanges

of documentation

between

companies trying to deal with each other and and and and buy technology from each other.

And that data team was able to

do a bunch of different cool things with with generative AI, whether it's

responding

to to questions based on,

you know, a a body of of documents that that that describe all security policies or whether it's,

you know, in compliance,

translating

policies to controls

automatically,

things that are really, really high value for that company's

customers

that have been built almost exclusively by data engineers.

So it kinda really brought them to the to the forefront. No no longer,

you know, a platform team working for analysts, but,

rather serving the company's,

customers directly.

That was pretty exciting.

And then

even saw this one data engineering team that was able to actually generate new revenue for the company in a pretty direct manner.

It's an energy company. They

they get, requests for quotes all the time. People asking them like to,

you know, provide

a certain supply of energy

at a certain time and location, and they were able to,

and and historically, they haven't been able to serve all those requests for quotes.

They get a lot of these and and that's potential revenue. Right? If they're able to to respond to a quote that might turn into a customer. Right? And so they basically turn this from this

very manual

process

where humans,

you know, take in, you know, request by request and try to respond to it. They they automated it and and and and using generative AI and data engineers

were able to suddenly generate a whole lot of new revenue,

just from from automating that process and

and and solving the the the manual labor, attached to it. So

I I I think the point is, you know,

data engineering is is is becoming

even more valuable

and and impactful in the business,

you know, more than it's ever been. So and and that's exciting news.

And in your own work of building a product that is supporting people in this space, trying to understand the impact that AI is having on data and the ways that teams are building these systems? What are some of the most interesting or unexpected or challenging lessons that you've learned personally?

I think probably the biggest challenge,

a couple of of big channel. One is understanding

where and how generative AI can be applied, you know, to serve our customers.

You know, we we

we brainstormed, I wanna say, over 70 use cases

in data observability

that the generative AI can support.

And it's it's

it's it's pretty challenging to understand, well,

where does it work well,

where can it be applied to the maximum effect, where can it make the most impact,

and where can it work as reliably as we need it, The which leads to the other challenge.

It's

extremely

or it's it's it's not incredibly hard to get to, like, really cool demos with generative AI.

Right? It's all it's you you can find,

almost any use case you can think of.

You can find a few examples where Genovir really

blows your mind in terms of what you can do,

which I think is why why the world is is so excited about

it. It's a whole other ballgame to bring into production. Right? It's a whole other

ballgame to do it in an environment where you can't expect to you you you can't predict all the inputs or maybe you can't predict them, but,

but models are undeterministic,

so they work only part of the time. And and there's a lot of

art and science that goes into figuring out

how to to make these generative models,

work

well enough

so that they can serve a human

trying to to to to accomplish a, you know, a a a a a task in the in the their day to day life. And I think the biggest learning there was really that

combination

between,

you know, generative AI and and more deterministic

approaches. For example,

you know, using generative AI to

to get an answer, but then

validating it against, you know, heuristics or statistical models

in order to make sure that that, you know, that response makes sense and is and is valid

for a human to use. And so there's

those are probably the the biggest

challenges I've experienced personally while while building with generative AI.

For people who are trying to navigate this

current stage of the data ecosystem,

what are some of the cases where where you're seeing AI as just being the wrong choice and not something that is worthwhile to

invest in or try to incorporate into your data stack?

Good question.

It's not the wrong choice, but I I see a lot of

or or the common pitfall I see is,

is is the Microsoft approach

of,

let's throw a a copilot or a chat interface

on on on every problem. Right? And just

assume that what people want is to interact with in natural language with whatever they were doing before.

And that's sometimes valid, and and it can be useful in certain scenarios. Right? And and,

you know,

in certain cases,

yeah, I'll use an example from data engineer. Text to SQL can be very effective,

but you really have to think about,

about the end user. Right? They may not

or

natural language questions is not the answer to any problem.

And sometimes people would just rather

interact with with with interfaces the way they are. Right?

Sometimes that's easier and simpler and works more reliably.

And so I'd I'd I'd probably say

it's it's important to think about where generative AI really allows people to do something that they've

that they haven't been able to do before and where generative AI has an unfair advantage. Right? And we talked about some of these things, but, yeah, extract extracting

structured information

from

unstructured data

is is really exciting. It's something that people couldn't do before,

but interacting in natural language with a dashboard,

maybe not as

groundbreaking

as as you might think. Right? Sometimes it is, but but oftentimes it's not. And so I would just

think about that that kind of idea of of where where can Generative AI really, really

stand out and and and have a a competitive advantage,

quote, unquote.

And and the answers are not always obvious.

What are some of the

trends that you're keeping an eye on or some of the predictions that you have on the ways that AI is going to impact data engineering

in the medium to long term?

Good question. I I think we'll continue to see

some of the things that we've already talked about.

You know, it'll get easier to build pipelines, it'll get easier to democratize data. It'll get easier to put,

you know, data engineers to work on the

on the hardest, most high value problems.

So that's,

that's definitely happening and will continue to happen.

You know, we'll continue to see more unstructured data being processed

and and being fully

put to use by data engineers.

I think it's probably the most exciting

element of generative AI.

And then

also,

you know, as a result of these two things, I think it'll, you know, bring data engineers closer to the customer, closer to the revenue, closer to decision making

in a way that that data engineer

data engineering teams have have never been before. So

we'll continue to see that in the long term.

Are there any other aspects of this topic of

AI and the impact that it's having on data engineers and data engineering teams that we didn't discuss yet that you'd like to cover before we close out the show?

I'll I'll probably,

again, kind going back to to to the basics.

Right? General AI is cool and shiny,

but in order to make it work in the real world, we have we have to do all the things that we know we have to do. Right?

We need to build

solid pipelines. We need to

make sure they're cost effective, that they're properly governed from a security and compliance perspective, and

and, of course, make sure they're they're reliable and and high quality because at the end of the day,

you know, those models as smart as they are, can't overcome

a security breach.

They can't overcome

garbage data,

and and and they won't be successful if if they're cost prohibitive. Right? And so

we need to make sure all these fundamentals,

work well. And and at Monte Carlo, we're we're excited to tackle the the reliability and quality aspects of it.

And, you know, we're we're also excited to see how how other vendors and and Saks will help with with with the 2 other challenges. So,

yeah, back to the basics.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information

to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or

think it's,

it's it's not necessarily one tool, but,

you know, with the theme themes that we've spoken about today in mind,

I think we're seeing

a consolidation

of the the data engineering stack and the

software engineering stack.

And

I'm kinda curious to see how we're going to marry or

like to see more,

easier ways to

marry together those

data pipelines and those customer facing,

applications,

you know, in the same platform working nicely together. I think there's

some interesting opportunities in in stitching those two worlds and

and making the whole,

system work nicely together for, you know, for a team where there's,

you know, a software engineer, a data engineer, and the machine learning engineer, all

all working together to build a single platform, a single application.

Alright. Well, thank you very much for taking the time today to join me and share the ways that you're seeing AI

be an impact

on and for the data engineering teams out there. Appreciate all of the time and energy that you and your team are putting into helping to support those folks. So, thank you again for taking the time today, and I hope you enjoy the rest of your day.

Thank you, Tobias. Super fun.

Thank you for listening, and don't forget to check out our other shows. Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at data engineering podcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Data Engineering Podcast