Summary
Artificial intelligence applications require substantial volumes of high-quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models, it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register at Neo4j.com/NODES.
- Your host is Tobias Macey and today I'm interviewing Jay Mishra about the applications for generative AI in the ETL process
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the different aspects/types of ETL that you are seeing generative AI applied to?
- What kind of impact are you seeing in terms of time spent/quality of output/etc.?
- What kinds of projects are most likely to benefit from the application of generative AI?
- Can you describe what a typical workflow of using AI to build ETL workflows looks like?
- What are some of the types of errors that you are likely to experience from the AI?
- Once the pipeline is defined, what does the ongoing maintenance look like?
- Is the AI required to operate within the pipeline in perpetuity?
- For individuals/teams/organizations who are experimenting with AI in their data engineering workflows, what are the concerns/questions that they are trying to address?
- What are the most interesting, innovative, or unexpected ways that you have seen generative AI used in ETL workflows?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ETL and generative AI?
- When is AI the wrong choice for ETL applications?
- What are your predictions for future applications of AI in ETL and other data engineering practices?
Contact Info
- @MishraJay on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Astera
- Data Vault
- Star Schema
- OpenAI
- GPT == Generative Pre-trained Transformer
- Entity Resolution
- Llama
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png) NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks: - Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript - Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI) - Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation) Don’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to [Neo4j.com/NODES](https://Neo4j.com/NODES) today to see the full agenda and register!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey. And today, I'm interviewing Jay Mishra about the applications for generative AI in the ETL process. So, Jay, can you start by introducing yourself?
[00:01:32] Unknown:
Absolutely. Thanks for having me, Tobias. This is Jay Mishra. I am the chief operating officer at Astera. I have been in this field for over 2 decades, specifically with data management over 13 years, and I've been party to a lot of implementations at Fortune 500 companies, for ETL implementations, for data warehousing implementations, for various other use cases under the umbrella of data management. I have participated from the beginning all the way to the end, including implementations using our toolset.
[00:02:04] Unknown:
And do you remember how you first got started working in data? Interesting story.
[00:02:09] Unknown:
We had a small module in our product that did user friendly data mapping. So it was basically a very simple ETL tool designed for non programmers. Back then, it was a novel idea to give a GUI based tool to the people who are doing ETL. It was mostly about schema mapping and transformations. We presented it to 1 of our largest customers and they liked it. And that's how our journey started. And, of course, over the years, we got a lot of feedback from customers and the market, kept adding features, took the feedback of our customers very seriously, and kept building upon it. And, yeah, 12 years later, here we are. We have a full data stack that is able to ingest data, transform, and, of course, load into any architecture of your choice, whether it is a data vault or star schema or any other choice that you have for your data warehousing architecture, and then publish it to your end users using no code, low code APIs.
So we have at this point a full platform that is able to ingest data all the way to publish your data and everything in between.
[00:03:28] Unknown:
And in the context of ETL, what are some of the different ways that you're seeing generative AI applied and some of the types of impact that you would expect for practitioners who want to be able to just say, give me this data, bring it over there. I don't wanna have to care about the details. Excellent question. So this is something that we are seeing
[00:03:50] Unknown:
being asked of us, actually, as an ETL vendor, very frequently. It started about a couple of years ago. The impact of AI on the entire data space, I would say, not only ETL, overall data management. That started to happen about a couple of years ago. We also jumped in around the same time. And we see various areas getting impacted by AI, specifically generative AI. So starting with the data ingestion itself, the E in ETL is extraction. And we see that the extraction piece is heavily impacted by AI. Wherever you have unstructured data, data that is text based, data that has images, all those areas are getting helped quite a bit by AI.
And the extraction piece, if it is structured data, not so much, because with structured data AI still has inaccuracies. But unstructured data is where this is helping in a big way. Now coming to transformations, there also we see that some teams of data engineers are actually using AI to generate code for transformations, data quality as well. And schema mapping is another 1 where AI is impacting. And then overall automation. In my opinion, this is the area where AI has made the biggest impact, where you can look at repetitive tasks in the entire solution design and development and use AI to automate it. So the usability has gone up. User experience has changed significantly.
So the user interfaces, what used to be GUI, now it is going 1 level up. We are seeing actually chatbot style interfaces to various applications, including some areas in ETL. For example, Dataprep. In fact, our own Dataprep functionality now has an interface where, you know, you can speak, or you can just chat in plain English and give instructions about what you need to do with your data. And AI is able to generate the right script or right metadata for it and do the work for you. So this is how we are seeing AI impacting various areas. But to me, the 2 standouts would be usability and helping repetitive task automation. So these 2 stand out, and, of course, other areas are also getting impacted by decision making that is done by typical users.
AI is helping, actually, in parts, to make the decision for you. And for
[00:06:30] Unknown:
the integration of these AI capabilities in the ETL process, how does that shift the intended user of that technology where in a straight ETL environment, typically, you would see that be the responsibility of a data engineer. But as you were mentioning, there are also tools or scenarios where there are nontechnologists who are domain experts or business experts who want to be able to do that work. How does the application of AI shift that equation of who is responsible for actually doing this data integration work? Yeah.
[00:07:07] Unknown:
That is also changing rapidly, actually. That whole distribution of responsibilities is changing. We see that more and more business facing people, people who have no background in coding, no technical background, are able to take some responsibilities off the shoulders of, for lack of a better term, the data people. So people who are responsible for data, they are delegating some of the responsibilities. So the cross functional teams, the nature of those teams is changing as well. And in terms of the data engineers and the ETL developers, that's the original term for the people who are actually developing the solution, the role is shifting a little bit. And actually, it is shifting in the right direction.
In my opinion, they should not be tasked with doing the same task over and over again for, I don't know, we have seen in implementations, several weeks, several months of very similar tasks. So for example, if you're doing a data integration task and you have dozens of tables on the source side and you're building a pipeline, you are building similar pipelines for each of these entities and it takes days, sometimes weeks. So this is the kind of task where you don't want to spend your data engineers' time. Rather, they should be focusing on the tasks that are really interesting and adding more value. So any task that is repetitive is being automated. And data engineers are able to focus on more interesting and more valuable tasks. So that's how we see the role shifting.
And, the subject matter experts, they are also coming into the picture. So they're working closely with the data engineers. So that's how we see the dynamics changing, in the teams that are implementing, the data solutions.
[00:08:58] Unknown:
You mentioned a little bit as to the specific types of projects or specific types of data where generative AI is going to provide the most impact. But I'm wondering if we can dig a bit deeper into that where you were saying that for highly structured use cases, it's, you know, maybe an incremental win, but with unstructured data is where you're going to see the largest gains. I'm wondering if you can talk to some of the reasons that that is and some of the ways that teams should be thinking about their initial forays into applying AI to their ETL use cases.
[00:09:31] Unknown:
Right. So structured versus unstructured, that debate has been going on for some time. And we see that with unstructured data, when you're extracting data from it, it's mostly insight. You're looking at taking a portion of it, and a little bit of approximation is okay. So for example, if you have a document and you want to get a summary of that document, that summary doesn't have to be exact. Whereas, if you're looking at a table where you have structured rows and columns, in aggregation, even a small difference is not okay.
So that's how I look at it, and we see that AI by nature is going to be nondeterministic. It sometimes has errors, seemingly. But if it is 95% accurate, is that acceptable if you are dealing with structured data? Whereas with unstructured data, 95% accurate is pretty good. So that's the key difference between structured versus unstructured data. And in unstructured data scenarios, in fact, we saw that in the recent past, all the rule based solutions are being replaced completely. So rule based solutions used to be that, hey, I'm looking, maybe using NLP, for this keyword, and in proximity of this keyword, look at these other keywords. And if they're matching with the context, give me this information.
That's how it used to be. But now with the AI, specifically generative AI, you do the similarity search: you put your data in your vector DB, then you take, like, the top 5 matching ones and send them to the AI. Let's say, for example, OpenAI, GPT, and the results are pretty good. So we did experiments, and the insights coming out of those calls are really good for unstructured data. Now we did the same experiment with structured data, sending a table and asking for certain calculations and all that, and it will have hallucinations. And those are kind of indicators that, with structured data, that approximation is not going to work.
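What is being described here is essentially a retrieval pattern: embed the document chunks, find the closest matches to the question, and hand only those to the model. A minimal sketch in Python, assuming hypothetical `embed()` and `ask_llm()` placeholders for whatever embedding model and LLM you use, and a plain in-memory list standing in for a real vector DB:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder: call the generative model (OpenAI, Llama, etc.) here."""
    raise NotImplementedError

def answer_from_documents(question: str, chunks: list[str], top_k: int = 5) -> str:
    # Embed every document chunk (in practice these vectors live in a vector DB).
    chunk_vectors = [embed(c) for c in chunks]
    query = embed(question)

    # Cosine similarity between the question and each chunk.
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = [cosine(query, v) for v in chunk_vectors]

    # Keep the top-k matching chunks and hand only those to the model as context.
    best = sorted(zip(scores, chunks), reverse=True)[:top_k]
    context = "\n---\n".join(chunk for _, chunk in best)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)
```

In a production pipeline the chunk vectors would be stored and queried in an actual vector database rather than scored in memory as above.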
So with structured data, where we see AI helping is where AI can be used to configure the existing ETL solutions and automate them. So instead of x number of hours, you're spending only half of x hours, or maybe 1 quarter, to configure the solution, and your savings are there. So on unstructured data, AI is involved completely. With structured data, it is helping in configuring the solution and making the usability
[00:12:06] Unknown:
or user experience go up. To your point of needing to bring your data into the context of the vector DB for doing that similarity search, not quite yet, but a little further along, I wanna talk through some of the architectural aspects of being able to integrate AI into the ETL process. But another aspect of that similarity search brings up the question of entity extraction and entity resolution. I'm curious what you're seeing as the impact of these AI models for being able to simplify or accelerate the process of doing that entity resolution or master data management for the ETL end result?
[00:12:45] Unknown:
Yes. The metadata. You bring up a very important point. That is something that we see where AI is helping in a big way. Not the data, but the metadata itself. Because metadata, again, is a decision making process. So when someone, let's say a data architect, is looking at the data, doing the initial stages of data discovery, looking at what I'm dealing with, a lot of the concepts of metadata are coming from that stage where you're looking at your data and trying to figure out what you're dealing with. And a lot of decision making is involved there. So not only in MDM, we are seeing it in all the different areas where you're looking at the data, letting AI do at least the initial cut for you, and then reviewing it. And if you like it, you move forward with that. That is a model that we see being applied at design time, specifically applied to metadata. 1 example that comes to my mind is data modeling. So related to MDM, of course, if you're looking at your source data and you want to create your entities, the ER framework, you're looking at building a model where you want to decide what makes sense and what kind of entities you want to create. And also moving forward, let's say that you want to design a data warehouse.
In a data warehouse, what could be a candidate for your facts, for your dimensions? If you're dealing with a data vault, what could be your hubs and satellites and links? All those things basically take time. So practitioners spend quite a bit of time in making those determinations and then curating them. And that's how the design process works. And we are seeing that AI applied can do the first cut for you in a matter of a few seconds. And if the first cut is 90% there, your work is reduced by 90%. So that's the gain that we are seeing with AI in all the metadata based decisions. So wherever you're trying to handle metadata related decision making,
[00:14:45] Unknown:
AI is helping in a big way. And now as far as the architectural aspects and the workflow, you mentioned needing to have some of these contextual cues to the language models. I'm wondering if you can just talk to some of the architectural and workflow aspects of being able to bring AI into the process of ETL rather than ETL being used to feed the AI. So the ETL workflows,
[00:15:12] Unknown:
they don't change much. AI comes into the ETL flow in different stages, at least for now. It's changing rapidly, and we'll come to that question a little later. But at least at this point, what we are seeing is that the ETL workflow stays the same. And for each of the steps, we see AI being applied. So let's have a look at a typical ETL workflow. The first step is data extraction. And in data extraction, we do have traditional connectors that will go into your source and try to figure out the metadata. And if you have the metadata, based on that it will do the parsing of your data, bring the right data into your pipeline, and then the mapping starts. So the first step is where you are trying to figure out the layout of your source data, when you're trying to read the data. For reading, of course, AI is not helping as much, but figuring out the structure of your data, that part is metadata building, and we see AI being used. In fact, we have released a feature addressing exactly that point, where it is automatically able to figure out what source you're dealing with, what is the structure of it, what are the columns and their data types, and so on and so forth. It can handle all of that. So this is AI being applied to the reading part. Now if you have unstructured data, sometimes you are dealing with, for example, let's say, tons of PDF documents containing unstructured data that has paragraphs and tables hidden inside it. And you have a specific prompt based on which you want to get certain data from each of these documents. Now that becomes your source. So this is your ingestion. You can apply maybe a pretrained model or, let's say, a fine tuned LLM.
And we have done experiments with those as well, and the results are beautiful. So you can apply those at the ingestion stage itself to get more quality data, more meaningful data. So that is the ingestion stage. Now we go to the next step, that is data mapping. In data mapping, again, it is a task that is meant for a combination of subject matter experts and, of course, the people who are developing the solution. It takes a lot of time, a lot of trial and error about, hey, this field, does it go here versus here? How do we combine it, and so on and so forth? And there also, we are seeing AI helping in a big way. You can give the context to AI about what the subject matter is, here is a list of my source fields, here is a list of my destination candidates, and let it figure it out. And we have seen that the results, again, are pretty good. It can do the work in a few seconds instead of going through a few iterations, coming up with the map, and then verifying it with your teams and all that. The first cut, again, is done very quickly. Of course, the verification step is going to be there. But the mapping can be done in, let's say, 1 tenth of the time that was required earlier. Then also in transformations, we are seeing that sometimes there are tools, of course, like ours, where the transformations are drag and drop.
And you can map and you can get going. But there are some tools that use coding. And in those tools, of course, AI can help you with generating the transformation code automatically. So that is the transformation side, and loading as well. If you're loading into, let's say, a data warehouse, loading is not easy. You have to write very complicated SQL code, with the inserts and updates and so on and so forth. And some tools do have a rule based solution that can automatically generate the code for you. But if you don't, you can use AI to generate the code for you. So AI is being used as kind of an assistant or a helping hand in all the different stages at this point, keeping the same workflow for the ETL. Documentation is the last step, I would say, where, again, AI is helping, where you can generate the documentation by giving it the context, your models, and other pipelines. And it can describe your pipelines in a pretty, I would say, reasonable way, where you can generate a document that describes your ETL pipeline. So that's how we see AI getting plugged into the different stages of the ETL workflow at this point. Lately, we are seeing this area advance pretty quickly.
So we are expecting it to not only help in all the different areas, but also the decision making that is, at this point, kind of limited to a localized area may grow from there and make bigger decisions such as, hey, now I am looking at these 5 sources. Can I automatically join these 5 and give you meaningful data in my destination? So dynamic ETL, real time ETL, and all that, that's where the future is going. But then again, we are not there yet, but we see that AI can help in those areas as well in future.
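The schema-mapping step described above, giving the model the subject matter plus the two field lists and asking for a first cut that a human then reviews, might look roughly like this sketch. `call_llm()` is a placeholder for whichever model is being used, and the field names in the usage comment are made up:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to whichever model you have chosen."""
    raise NotImplementedError

def suggest_field_mapping(source_fields: list[str],
                          destination_fields: list[str],
                          subject: str) -> dict[str, str]:
    """Ask the model for a first-cut source-to-destination mapping.

    The result is only a draft: a person still reviews and overrides it
    before the pipeline is deployed.
    """
    prompt = (
        f"You are mapping fields for a {subject} data pipeline.\n"
        f"Source fields: {source_fields}\n"
        f"Destination fields: {destination_fields}\n"
        "Return a JSON object mapping each source field to the most likely "
        "destination field, or null if there is no good match."
    )
    draft = json.loads(call_llm(prompt))
    # Drop anything the model invented that is not actually a destination field.
    return {src: dst for src, dst in draft.items() if dst in destination_fields}

# Example usage (hypothetical field names):
# mapping = suggest_field_mapping(["cust_nm", "dob"], ["customer_name", "birth_date"], "CRM")
```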
[00:20:10] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. And as far as the model selection, that is an area that is constantly moving, with the OpenAI GPT models grabbing a lot of the headlines and attention, but the open source models are also very quickly leapfrogging each other and catching up with the OpenAI models, as well as some of the ones available from Anthropic, etcetera. Has there been any feedback that you've seen as far as which models perform best for which use cases or for particular technology stacks? And I'm wondering what you see as some of the useful benchmarks or metrics for teams who are starting that evaluation process of, I want to bring in AI, but I need to be aware of some of the platform risk of depending on this for all of my day to day operations.
[00:21:39] Unknown:
So this area is changing rapidly. When OpenAI started out and we got access to OpenAI's APIs, we did some experiments. And followed by that quickly, we started to see a lot of these open source LLMs coming out where, of course, the parameters are not as high as OpenAI, but they're reasonable and you can have a local copy. That was a huge deal because, with OpenAI, even if you are getting the best results, there are many issues where you have to send everything to OpenAI APIs, and the performance sometimes was not great. Whereas we started to play with Llama and other models, and, at this point, we have 5 or 6 different models that we offer in our toolset. And this is given to the users, where they can experiment with any of them, and they can fine tune it with their data. And once it is fine tuned to their data, they can use any of them. So that's how we present it. We are agnostic to which 1 you use. But now coming back to your question about which 1 works better for which kind of data. So 1 pattern that we notice is the performance: if you're dealing with unstructured data, ChatGPT still stays at the top. The performance or the quality is about the best from there for unstructured data. We did 1 more experiment with creating a natural language interface for our own expression language. So in our product, we have an expression language where you can write formulas to do calculations, and it looks like the formulas that you see in Microsoft Excel. That kind of learning, not a whole lot, but still for some users, it is a bit difficult that, hey, I have to write these formulas and expressions to do some calculations.
So we did an experiment over that scenario where we gave a natural language interface to generate those formulas. And there we are seeing that Llama is a pretty good option. It does pretty well. Llama 2 now. So, depending on the scenario, it changes which 1 is gonna be performing better. So what I would recommend is to experiment with all of the top 4 or 5. It doesn't take much. And there are tools like ours available now that will give you a playground where you can experiment with fine tuning the 3 or 4 models and see which 1 works the best. Because, at least in our experience, it is not that winner takes all. In certain areas, we see that OpenAI is doing much better than others, but there are some areas where others are doing better than OpenAI. So I would suggest to experiment and see which 1 is the best fit for your kind of data.
And the process and the tooling, they're getting there already. There are many solutions available out there that are going to let you experiment.
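One way to run that kind of side-by-side evaluation is a small harness that sends the same prompts to each candidate model and scores the outputs. This is only a sketch: `generate()` and `score()` are placeholders, and the model names in the example are illustrative rather than references to any specific API:

```python
def generate(model_name: str, prompt: str) -> str:
    """Placeholder: route the prompt to the named model (hosted or local)."""
    raise NotImplementedError

def score(output: str, expected: str) -> float:
    """Placeholder scoring: exact match here, but a diff metric or human rating works too."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def compare_models(models: list[str], test_cases: list[tuple[str, str]]) -> dict[str, float]:
    """Run the same prompts through each candidate model and average the scores."""
    results = {}
    for model in models:
        scores = [score(generate(model, prompt), expected) for prompt, expected in test_cases]
        results[model] = sum(scores) / len(scores)
    return results

# Example (illustrative names and test case):
# compare_models(["gpt-4", "llama-2-70b"],
#                [("total sales by region", "SUM(sales) GROUP BY region")])
```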
[00:24:36] Unknown:
Now the other fun piece of working with AI is that it is nondeterministic, and so there is the potential for logical errors, logical bugs to come in, and you're not even necessarily going to get the exact same output for the exact same input, especially if you're dealing with successive generations of models. I'm wondering what are some of the ways that teams need to be thinking about error handling, error identification, validating the outputs of the AIs before they put it into production, etcetera?
[00:25:06] Unknown:
Great question. This is something that we have been dealing with even in our internal implementations, and even in our own coding that we did for the product. So, of course, the stochastic nature of the predictions makes it suitable for certain things, but not suitable for others. We tend to recommend using AI, where it is not going to be deterministic, only for design time. Don't let it be in the runtime. That is actually 1 of the principles that we have agreed upon: if it is at runtime, it is going to cause issues. And the only way possibly to use it at runtime is to have a strong layer of data validations that will reject certain things done by AI if it is not meeting your standards.
And then it throws it back to you saying that, okay, hey, have a look at it. So that is the only way to use AI at runtime. Otherwise, design time is where we see a lot of value. At design time, we see that you are making decisions and even implementations. The first cut that is created by AI is way faster than what even an expert of, let's say, 20 years or 25 years, 30 years, will come up with. So we see a lot of gains on that front. And the biggest benefit, I would say, is that once the recommendation is in front of you, you can review it and override it.
So that capability you have to have. In your process, make sure that you have that built in capability to review the work that is done by AI and override it if need be. So look at AI as your assistant who's helping you do some work. And you're not going to trust it blindly. You're going to look at the work done by AI and review it. And then if it looks good, sure, no problem at all, it goes to the next step. But if it doesn't, then you have a way to fix it, and then it goes to the next step. So mostly at design time we are using it. And we do have some places where we let AI handle certain pieces at the runtime.
But there, we make sure that you are using some kind of rule based checking of the results. So data validation, the module that we have, is a must that you have to apply after the AI step. Any step that involves AI, after that, you have to have data quality checks and data validation to make sure that, at runtime, you have your eyes on AI.
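A rule-based gate of the kind described, applied after any AI step so that only records meeting the standard move on and the rest are thrown back for review, could be as simple as this sketch. The required-field check is just one example of a rule; real validations would cover types, ranges, referential checks, and so on:

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    accepted: list[dict] = field(default_factory=list)
    rejected: list[tuple[dict, str]] = field(default_factory=list)  # (record, reason)

def validate_ai_output(records: list[dict], required: list[str]) -> ValidationResult:
    """Rule-based gate applied after an AI step: accept only records that meet
    the standard, and collect the rest for human review."""
    result = ValidationResult()
    for rec in records:
        missing = [col for col in required if rec.get(col) in (None, "")]
        if missing:
            result.rejected.append((rec, f"missing required fields: {missing}"))
        else:
            result.accepted.append(rec)
    return result

# Only the accepted rows continue down the pipeline; the rejected ones are
# surfaced to a person ("hey, have a look at it").
```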
[00:27:57] Unknown:
That's how I put it. And you mentioned that you wisely don't incorporate the AI into the actual runtime behavior, but just in the design and implementation phase. And so for ongoing maintenance purposes, what do you see as the ongoing role of the AI as you maintain and evolve the different pipelines or try to implement new pipelines that maybe feed off of some of the ones that are already implemented, things like that. Great. So on that front,
[00:28:28] Unknown:
the role is increasing. Actually, I'll take that analogy of an assistant. Your assistant is being trained. They are doing certain tasks now, but once they are trained, they can do bigger tasks. So that role is also evolving, and we do see already some tool sets or some teams working on it. That is where we increase the responsibility of AI, as our assistant, to do monitoring of existing flows. So monitoring is another part where we see a huge role that AI can play. So here, it is not impacting your real data, but it is helping you as a user with what would otherwise be your responsibility: to look into your data pipelines, how they're behaving, what kind of data you're getting, any errors that you are receiving, what the frequency is. So it is pretty good on that front. It can tell you that, hey, most likely this is what is going wrong, and you can go and fix it. So data anomalies inside your metadata at runtime, the information that is coming out from runtime.
If we can have an eye into that process, that is very useful. And AI can do that for us. So AI is being used to help us monitor the entire workflow that we have deployed at runtime. So that is 1 area. And also for the areas of design or implementation that we were talking about earlier, there, the role is getting more and more advanced, more complex. So we are seeing that if you have designed certain flows in the past and you were seeing certain errors or certain issues with those flows, it can detect them. And, also, it can generate a recommendation about, hey, instead of using this flow, how about you try this flow?
So that is, again, for design time, but a huge help, because that is kind of troubleshooting. Troubleshooting takes time, and AI can do that work for you. So in these 2 areas, I see that AI's role is gradually increasing.
[00:30:40] Unknown:
As more people start using AI for projects, 2 things are clear. It's a rapidly advancing field, and it's tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26th featuring some of the brightest minds in tech. Check out the agenda and register today at neo4j.com/nodes.
That's n, e, o, the number 4, j.com/nodes. For teams and individuals and organizations who are considering this introduction of AI into their critical data flows, what are the typical motivators that you see and the types of questions and concerns that they need to address before they can feel comfortable with actually putting the results into production?
[00:31:44] Unknown:
Yeah. So on this front, we are seeing that our customers ask us this question frequently now: how should we approach it? How should we incorporate AI into our solution design and implementations? And, of course, they have some reservations about it as well: should we be using AI or not? So from our side, we always suggest that you take your time. Come up with a strategy for how you want to incorporate AI into your data solutions, your architecture, or overall, organization wide, how you want to start using AI.
And there are a few things that are important. The first 1 is that you have to identify the areas or the scenarios in which AI is applicable. Then there are the issues with the tooling: you need to have the right tool set available, because that can make or break it. And then, of course, the training of your resources. That is also very important. They need to be trained properly in using AI, because if it is not done properly, you're not gonna get any value out of it. Now coming to the reservations or objections to AI, there are many reasons for those. The first 1 that comes up is compliance and explainability.
The problem with AI is that it's like a black box in many scenarios. And there are many industries where you have to have complete visibility into anything that is happening with your data. And if you're using AI in certain scenarios, it is not going to give you that information. So you must identify the areas where you can use it without being able to explain what is happening inside this black box. So it's kind of a design problem where you look at your scenario and figure out where you can fit in this black box so that you are still going to be okay from the compliance perspective, from the explainability perspective.
So that is 1 issue. And, with the right design, of course, it can be addressed. And then the second part would be about the data itself that you have. Sometimes data is suitable, sometimes it is not suitable for AI. For simple scenarios, AI is overkill. If you have a simple source database that you want to move to your data warehouse, you don't need AI for that. You can simply plug in a standard ETL, and it's going to be much more cost effective. And it is going to be able to bring your data from source to destination much quicker compared to using AI.
So that's what we recommend: look at your scenario. Do the evaluation of how you want to use AI, if it is even applicable. If it is, then where does it go inside your use case? And that's how we go about designing a tailor made solution for each of our customers.
[00:34:45] Unknown:
In your experience of working with generative AI in the ETL context and onboarding people into these workflows, what are the most interesting or innovative or unexpected ways that you have seen the AI used in the context of ETL?
[00:35:01] Unknown:
Yeah. Interesting question. So this is, I would say, all of it that we have seen, or what I have said so far. When it started out, pretty much every other week we'd see something that we had never thought about being addressed by generative AI. So that was a phase for about 6 months where every couple of weeks, we'd see something that we never thought could be done, and it was there. So, of course, it is still going on, but we see this pace kind of slowing down on the new innovations using generative AI in ETL. But the most common 1, I would say, is the use of AI to get insight from unstructured data.
And I'll give you 1 example of that. So we had a use case where we had to get certain answers from our documents, as if you're asking a question that, okay, hey, read this document and tell me what is the answer to this question. And this is a very common pattern in data insight gathering. And we had a solution that, of course, worked pretty reasonably. But to implement this scenario, it took us about a good 6 man months to write the right solution, build it, test it, and all that. And we started to experiment with OpenAI's APIs. And, of course, we had to experiment with the prompts, and when the data comes in, we ran into some issues on that front as well. But, eventually, once it was done, the results were better than what we had earlier, and it was done within 1 week.
So that's the extent of savings we are talking about: 6 months versus 1 week. And the solution, and the results, are better. So when it comes to unstructured data, text, images, and all that, it's beautiful. The solutions based on generative AI are way better than what we had earlier. And then the next trend was semantic mapping. This is another 1 where we saw that our users struggled. Back in the day, we had, basically, auto map features where we would try to figure out, for a column name in the source, what it should belong to in the destination. We used to call it smart mapping.
And the smart mapping was okay. It used to be, I would say, about 60, 70 percent accurate, but still there were a lot of errors. Coming into the picture, AI again. And we started to use the semantic mapping and give the context, and now the accuracy goes up to about 95%. So we have seen almost like magical results in certain areas. 1 more that is a very interesting experience, at least personally for me, was creating the data models for data warehousing. Data warehousing, data modeling is not easy. For practitioners who have been in the field for decades, it still takes time. Figuring out what should be a fact, what should be a dimension, how they should be connected, and what kind of configurations you wanna do for facts and dimensions, and also for other architectures such as data vault and all that.
So we did another experiment where we took our transactional database and created a data model out of it. So we have a reverse engineering functionality that can create the model automatically. No problem on that front. Once we have the data model, we let the AI decide for us how to convert it by denormalizing and creating a star schema. And, again, the results were amazing. It was almost magical. We get the right prompts, we get the right information, and out comes a data model that looks like a perfect star schema. What would otherwise have taken several days of iterations and back and forth with your subject matter experts and the data architect was done in a matter of a few seconds again. So these are a few interesting usages or scenarios that I can think of, but there are many.
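A first-cut star schema suggestion of this kind could be prompted from table metadata alone, along the same lines as the earlier mapping sketch. `call_llm()` is again a placeholder, the example tables are invented, and the draft still goes to a data architect for review:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: same stand-in as before for whatever model is in use."""
    raise NotImplementedError

def draft_star_schema(tables: dict[str, list[str]]) -> dict:
    """Ask the model for a first-cut star schema from table/column metadata only.

    `tables` maps table names to column names (metadata, not data). The output
    is a draft that a data architect reviews before anything is built.
    """
    prompt = (
        "Given these transactional tables and their columns:\n"
        f"{json.dumps(tables, indent=2)}\n"
        "Propose a star schema: which tables (or combinations) become facts, "
        "which become dimensions, and which columns are the join keys. "
        "Answer as JSON with keys 'facts' and 'dimensions'."
    )
    return json.loads(call_llm(prompt))

# Example with made-up tables:
# draft_star_schema({"orders": ["order_id", "customer_id", "amount", "order_date"],
#                    "customers": ["customer_id", "name", "region"]})
```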
To summarize, I would say that wherever you have patterns, and the patterns have been applied in the past, it is known to the practitioners, and it is repetitive, let AI do the decision making for you, and you will not be disappointed. It is going to do a pretty good job. At least the first cut is going to be amazing. And then, of course, you know, as we talked about earlier, you can take that and override it if need be. And in your experience
[00:40:07] Unknown:
of working in this space of ETL for so long and the introduction of generative AI as a solution in this process, what are the most interesting or unexpected or challenging lessons that you've learned?
[00:40:19] Unknown:
Yes. Early on, I would say, we, actually, like anyone else, thought that it could do much more, in terms of even looking at the data and being more deterministic, I would say. The solutions, or the results, were not that satisfying. So we had to kind of take a step back and redo all of that work. So it will sometimes make mistakes about simple things. Like, you know, I'm asking for the location of a specific field in my file, and it'll make a mistake in that. Something you can basically do with the naked eye. You can see that, hey, it is in line number 2 at the 10th character. It can't even figure that out. It will make mistakes. And if you run it twice, 3 times, 4 times, maybe once it'll give you the right results.
So anything where you're relying on it giving deterministic answers, forget about it. So we decided that wherever precision is required, it is not a good fit, because we tried with pretty much everything that is out there. We even tried figuring out the locations. So we have actually an unstructured data extraction module that can, based on templates, extract data points from your document. And we were trying to build a template using AI. And for that, you need to get the locations of certain fields and patterns with which you can build the template. And it was not a good experience. We would try to do that and, every single time you run it, you get a different result.
And the whole algorithm will be messed up. So anything that is nondeterministic, it's okay to use it for those scenarios. But where you expect precise results, it's not gonna be a good fit. So barring that, I think it has been pretty useful. But just keep in mind whenever you're designing something, make sure that you look at it carefully and see: are you expecting precise results? If yes, then be careful. Otherwise, approximate results, if you're okay with 95%, if you're okay with 90%, you're good. But if you want 100%, please do not use it.
[00:42:49] Unknown:
You already mentioned this a little bit, but what are the cases where AI is the wrong choice for ETL applications and you should just do it manually, write the code, use the low code tool?
[00:43:02] Unknown:
Yes. So apart from what I said already, that is, for simple scenarios. So if you have just a handful of sources, structured data mostly, and your destination is also neatly defined, ETL is still going to be more cost effective, and it is going to be actually a better choice. Whereas if you have more complex data, you have unstructured data, you have scalability issues, as in your scale can grow, the volume of data can grow, and then your overall data ecosystem or the entire data management platform that you're looking at has to be able to handle more complexity in the future.
Then we would recommend that you use AI based solutions or start using AI. But if it is a standard, simple data pipeline that is going to be building your data warehouse, or a data integration use case, it is not gonna be cost effective. Also, it depends on other aspects of implementation, that is, how well trained is your team, how big is your team, what kind of resources you have at your disposal. If you have a small team, again, it is not going to be that applicable to your scenario. And, also, it depends on your strategy. That is 1 more thing that I would like to add.
We have seen some smaller teams where their leadership wants to bring in AI. If that is the scenario, definitely go for it. Even if it is overkill in the beginning, it's going to help you in the future. So in that scenario, we do recommend that you start using it from the beginning itself so that when the time arrives, you're ready for it. And as you continue to invest in this area, work with customers,
[00:45:00] Unknown:
work with your own technology stack, what do you see as the future applications for AI either in ETL specifically, but also in just the broad application of data engineering as a role?
[00:45:14] Unknown:
Yeah. This is a question that we discuss routinely. In our design meetings, this is what we talk about: where the market is going, what we should be doing. And 1 topic that is particularly fascinating is to let the AI do the real time or dynamic ETL. That is, the decision making that we do at design time, can that happen in real time? Of course, it comes with all its caveats and all that, but it is most definitely going in the direction where AI can make some decisions about what data streams or what datasets can be merged and how they can be transformed.
So transformation tools and all that, maybe it can use your existing tools. But decision making about how to build the pipeline, what pipeline makes sense, that can be put in the hands of AI. So, kind of, dynamic ETL pipelines. This is where I think the future is going, where it can automatically generate those pipelines. And it is going to be more declarative, where you can declare the instructions given to the AI: hey, if you look at any new datasets coming in for my customers, you must apply this system of record and put this into this destination. This is the instruction given to AI. Now it has to do the design with the internal tool sets. It may be using a low code solution, but it is the user of that tool set. And it can make the decision for you about how to build that pipeline automatically.
So this is, something that we we see that it is going to happen. And, of course, so, at we are still in early stages. We have done some experiments too in that on that front. There are some, obvious issues on that front. Again, we talked about how the decision making, if it goes wrong, what do you do about it? So there's still some questions we have to answer on that front, but it is definitely going in the direction where, AI can do a little bit more decision making at a higher level, and it can, dynamically build the ATL flows for you.
[00:47:36] Unknown:
Alright. Are there any other aspects of the applications for generative AI in the ETL workflow or just the overall ecosystem of data engineering and the ways that AI is going to impact it that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:55] Unknown:
So 1 area where we see that AI is particularly good is anomaly detection. And it has been, not only generative AI, even the predecessors of generative AI. In the past, I would say almost a decade, they have been used in certain scenarios like fraud detection. So if you give it time series data, it can figure out where something is wrong, and it can help you with that. So that specific use case now is being applied to ETL as well. That is something that I see being used more frequently. And, briefly, we touched upon it in the context of the runtime errors and monitoring of your ETL workflows.
So this is, of course, just 1 part of it, but we can use AI to detect anything that is going wrong. So here's my pattern, here is how we use it. If you see something wrong, let us know. So it can be your eyes into the runtime. And, also, of course, for the datasets themselves. So not only the runtime, you can have a side process that is looking at your incoming datasets. And beforehand, in the profiling stage, if you see something drastically wrong or different from what you expect, you can kind of short circuit the entire pipeline and handle it in a different way. So anomaly detection in the data itself and in the metadata, these are the 2 areas where I see that AI can be a huge help.
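The simplest version of that profiling check can be done statistically before layering a model on top. A sketch of the idea, flagging an incoming batch whose row count is far outside historical runs; the z-score here is a plain statistical stand-in, and the same hook is where an anomaly-detection model would plug in for null rates, column distributions, or runtime error frequencies:

```python
import statistics

def profile_anomaly(history: list[int], incoming: int, threshold: float = 3.0) -> bool:
    """Flag the incoming batch if its row count is far outside past runs.

    A simple z-score over historical row counts; the same idea applies to
    null rates, column distributions, or error frequencies at runtime.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return incoming != mean
    return abs(incoming - mean) / stdev > threshold

# If the check trips, short-circuit the pipeline and route the batch for review
# instead of loading it:
# if profile_anomaly(previous_row_counts, len(new_batch)):
#     quarantine(new_batch)  # hypothetical handler
```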
[00:49:41] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:57] Unknown:
This area is changing rapidly, and a lot of tools at this point are at early beta stage, sometimes alpha stage. And new tools are coming out pretty much every week; we see something new being launched. But 1 thing that I would love to see is more natural language based UX. So as of now, the standard is, again, the drag and drop UI. That is the golden standard for any ETL platform, or all data management tools, basically. The graphical user interface based tools, the drag and drop, is the standard. I would like to see natural language being implemented, where you should be able to speak with your toolset, and it does the work for you.
So that is something that, of course, in some products, including our product, is starting out in certain areas. But that can be taken to the next level, and I'm expecting that within the next few months, it should be there. That will be the bridge between the AI, specifically generative AI, and ETL toolsets. So that is going to be kind of connecting the users with the tool sets in a much more meaningful way.
[00:51:13] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the experiences you've had with bringing AI into the ETL workflow. It's definitely a very interesting topic area, definitely 1 that is constantly moving. So I appreciate you sharing your perspective on that and helping people get a leg up on that journey. So, thank you again, and I hope you enjoy the rest of your day. My pleasure. Thank you for having me.
Introduction and Overview
Guest Introduction: Jay Mishra
Generative AI in ETL: Applications and Impact
Shifting Responsibilities with AI
Structured vs. Unstructured Data
Architectural Aspects of Integrating AI
Model Selection and Evaluation
Error Handling and Validation
Ongoing Maintenance and AI's Role
Motivations and Concerns for AI Adoption
Innovative Uses of AI in ETL
Lessons Learned from AI in ETL
When AI is the Wrong Choice
Future Applications of AI in Data Engineering
Anomaly Detection and AI
Biggest Gaps in Data Management Tooling
Closing Remarks