Summary
Artificial intelligence applications require substantial volumes of high-quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models, it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register at Neo4j.com/NODES.
- Your host is Tobias Macey and today I'm interviewing Jay Mishra about the applications for generative AI in the ETL process
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the different aspects/types of ETL that you are seeing generative AI applied to?
- What kind of impact are you seeing in terms of time spent/quality of output/etc.?
- What kinds of projects are most likely to benefit from the application of generative AI?
- Can you describe what a typical workflow of using AI to build ETL workflows looks like?
- What are some of the types of errors that you are likely to experience from the AI?
- Once the pipeline is defined, what does the ongoing maintenance look like?
- Is the AI required to operate within the pipeline in perpetuity?
- For individuals/teams/organizations who are experimenting with AI in their data engineering workflows, what are the concerns/questions that they are trying to address?
- What are the most interesting, innovative, or unexpected ways that you have seen generative AI used in ETL workflows?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ETL and generative AI?
- When is AI the wrong choice for ETL applications?
- What are your predictions for future applications of AI in ETL and other data engineering practices?
Contact Info
- @MishraJay on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Astera
- Data Vault
- Star Schema
- OpenAI
- GPT == Generative Pre-trained Transformer
- Entity Resolution
- Llama
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png) NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks: - Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript - Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI) - Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation) Don’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to [Neo4j.com/NODES](https://Neo4j.com/NODES) today to see the full agenda and register!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey. And today, I'm interviewing Jay Mishra about the applications for generative AI in the ETL process. So, Jay, can you start by introducing yourself?
[00:01:32] Unknown:
Absolutely. Thanks for having me, Tobias. This is Jay Mishra. I am the chief operating officer at Astera. I have been in this field for over 2 decades, specifically with data management over 13 years, and I've been party to a lot of implementations at Fortune 500 companies, for ETL implementations, for data warehousing implementations, for various other use cases under the umbrella of data management. I have participated from the beginning all the way to the end, including implementations using our toolset.
[00:02:04] Unknown:
And do you remember how you first got started working in data? Interesting story.
[00:02:09] Unknown:
We had a small module in our product that did user friendly data mapping. So it was basically a very simple ETL tool designed for non programmers. Back then, it was a novel idea to give a GUI based tool to the people who are doing ETL. It was mostly about schema mapping and transformations. We presented it to 1 of our largest customers and they liked it. And that's how our journey started. And, of course, over the years, we got a lot of feedback from customers and the market, kept adding features, took the feedback of our customers very seriously, and kept building upon it. And, yeah, 12 years later, here we are. We have a full data stack that is able to ingest data, transform, and, of course, load into any architecture of your choice, whether it is a data vault or star schema or any other choice that you have for your data warehousing architecture, and then publish it to your end users using no code, low code APIs.
So we have at this point a full platform that is able to ingest data all the way to publish your data and everything in between.
[00:03:28] Unknown:
And in the context of ETL, what are some of the different ways that you're seeing generative AI applied and some of the types of impact that you would expect for practitioners who want to be able to just say, give me this data, bring it over there. I don't wanna have to care about the details. Excellent question. So this is something that we are seeing
[00:03:50] Unknown:
being asked of us, actually, as an ETL vendor, very frequently. It started about a couple of years ago. The impact of AI on the entire data space, I would say, not only ETL, overall data management. That started to happen about a couple of years ago. We also jumped in around the same time. And we see various areas getting impacted by AI, specifically generative AI. So starting with the data ingestion itself, the E in ETL is extraction. And we see that the extraction piece is heavily impacted by AI. Wherever you have unstructured data, data that is text based, data that has images, all those areas are getting helped quite a bit by AI.
And the extraction piece, if it is structured data, not so much, because with structured data AI still has inaccuracies. But unstructured data is where this is helping in a big way. Now coming to transformations, there also we see that some teams of data engineers are actually using AI to generate code for transformations, data quality as well. And schema mapping is another 1 where AI is impacting. And then overall automation. In my opinion, this is the area where AI has made the biggest impact, where you can look at repetitive tasks in the entire solution design and development and use AI to automate it. So the usability has gone up. User experience has changed significantly.
So the user interfaces, what used to be GUI, now it is going 1 level up. We are seeing actually chatbot style interfaces to various applications, including some areas in ETL. For example, Dataprep. In fact, our own Dataprep functionality now has an interface where, you know, you can speak, or you can just chat in plain English and give instructions about what you need to do with your data. And AI is able to generate the right script or right metadata for it and do the work for you. So this is how we are seeing AI impacting various areas. But to me, the 2 standouts would be usability and helping repetitive task automation. So these 2 stand out, and, of course, other areas are also getting impacted by decision making that is done by typical users.
AI is helping, actually, in parts, to make the decision for you. And for
[00:06:30] Unknown:
the integration of these AI capabilities in the ETL process, how does that shift the intended user of that technology where in a straight ETL environment, typically, you would see that be the responsibility of a data engineer. But as you were mentioning, there are also tools or scenarios where there are nontechnologists who are domain experts or business experts who want to be able to do that work. How does the application of AI shift that equation of who is responsible for actually doing this data integration work? Yeah.
[00:07:07] Unknown:
That is also changing rapidly, actually. That whole distribution of responsibilities is changing. We see that more and more business facing people, people who have no background in coding, no technical background, are able to take some responsibilities off the shoulders of, for lack of a better term, the data people. So people who are responsible for data, they are delegating some of the responsibilities. So the cross functional teams, the nature of those teams is changing as well. And in terms of the data engineers and the ETL developers, that's the original term for the people who are actually developing the solution, the role is shifting a little bit. And actually, it is shifting in the right direction.
In my opinion, they should not be tasked with doing the same task over and over again for, I don't know, we have seen in implementations, several weeks, several months of very similar tasks. So for example, if you're doing a data integration task and you have dozens of tables on the source side and you're building a pipeline, you are building similar pipelines for each of these entities and it takes days, sometimes weeks. So this is the kind of task where you don't want to spend your data engineers' time. Rather, they should be focusing on the tasks that are really interesting and adding more value. So any task that is repetitive is being automated. And data engineers are able to focus on more interesting and more valuable tasks. So that's how we see the role shifting.
And, the subject matter experts, they are also coming into the picture. So they're working closely with the data engineers. So that's how we see the dynamics changing, in the teams that are implementing, the data solutions.
[00:08:58] Unknown:
You mentioned a little bit as to the specific types of projects or specific types of data where generative AI is going to provide the most impact. But I'm wondering if we can dig a bit deeper into that where you were saying that for highly structured use cases, it's, you know, maybe an incremental win, but with unstructured data is where you're going to see the largest gains. I'm wondering if you can talk to some of the reasons that that is and some of the ways that teams should be thinking about their initial forays into applying AI to their ETL use cases.
[00:09:31] Unknown:
Right. So structured versus unstructured, that debate has been going on for some time. And we see that with unstructured data, when you're extracting data from it, it's mostly insight. You're looking at taking a portion of it, and a little bit of approximation is okay. So for example, if you have a document and you want to get a summary of that document, that summary doesn't have to be exact. Whereas, if you're looking at a table where you have structured rows and columns, in aggregation, even a small difference is not okay.
So that's how I look at it, and we see that AI by nature is going to be nondeterministic. It sometimes has errors, seemingly. But if it is 95% accurate, is that acceptable if you are dealing with structured data? Whereas with unstructured data, 95% accurate is pretty good. So that's the key difference between structured versus unstructured data. And in unstructured data scenarios, in fact, we saw that in the recent past, all the rule based solutions are being replaced completely. So rule based solutions used to be that, hey, I'm looking, maybe using NLP, for this keyword, and in proximity of this keyword, look at these other keywords. And if they're matching with the context, give me this information.
That's how it used to be. But now with the AI, specifically generative AI, you do the similarity search: you put your data in your vector DB, then you take, like, the top 5 matching ones and send them to the AI. Let's say, for example, OpenAI, GPT, and the results are pretty good. So we did experiments, and the insights coming out of those calls are really good for unstructured data. Now we did the same experiment with structured data, sending a table and asking for certain calculations and all that, and it will have hallucinations. And those are kind of indicators that, with structured data, that approximation is not going to work.
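What is being described here is essentially a retrieval pattern: embed the document chunks, find the closest matches to the question, and hand only those to the model. A minimal sketch in Python, assuming hypothetical `embed()` and `ask_llm()` placeholders for whatever embedding model and LLM you use, and a plain in-memory list standing in for a real vector DB:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder: call the generative model (OpenAI, Llama, etc.) here."""
    raise NotImplementedError

def answer_from_documents(question: str, chunks: list[str], top_k: int = 5) -> str:
    # Embed every document chunk (in practice these vectors live in a vector DB).
    chunk_vectors = [embed(c) for c in chunks]
    query = embed(question)

    # Cosine similarity between the question and each chunk.
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = [cosine(query, v) for v in chunk_vectors]

    # Keep the top-k matching chunks and hand only those to the model as context.
    best = sorted(zip(scores, chunks), reverse=True)[:top_k]
    context = "\n---\n".join(chunk for _, chunk in best)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)
```

In a production pipeline the chunk vectors would be stored and queried in an actual vector database rather than scored in memory as above.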
So with structured data, where we see AI helping is where AI can be used to configure the existing ETL solutions and automate them. So instead of x number of hours, you're spending only half of x hours, or maybe 1 quarter, to configure the solution, and your savings are there. So on unstructured data, AI is involved completely. With structured data, it is helping in configuring the solution and making the usability
[00:12:06] Unknown:
or user experience go up. To your point of needing to bring your data into the context of the vector DB for doing that similarity search, not quite yet, but a little further along, I wanna talk through some of the architectural aspects of being able to integrate AI into the ETL process. But another aspect of that similarity search brings up the question of entity extraction and entity resolution. I'm curious what you're seeing as the impact of these AI models for being able to simplify or accelerate the process of doing that entity resolution or master data management for the ETL end result?
[00:12:45] Unknown:
Yes. The metadata. You bring up a very important point. That is something that we see where AI is helping in a big way. Not the data, but the metadata itself. Because metadata, again, is a decision making process. So when someone, let's say a data architect, is looking at the data, doing the initial stages of data discovery, looking at what I'm dealing with, a lot of the concepts of metadata are coming from that stage where you're looking at your data and trying to figure out what you're dealing with. And a lot of decision making is involved there. So not only in MDM, we are seeing it in all the different areas where you're looking at the data, letting AI do at least the initial cut for you, and then reviewing it. And if you like it, you move forward with that. That is a model that we see being applied at design time, specifically applied to metadata. 1 example that comes to my mind is data modeling. So related to MDM, of course, if you're looking at your source data and you want to create your entities, the ER framework, you're looking at building a model where you want to decide what makes sense and what kind of entities you want to create. And also moving forward, let's say that you want to design a data warehouse.
In a data warehouse, what could be a candidate for your facts, for your dimensions? If you're dealing with a data vault, what could be your hubs and satellites and links? All those things basically take time. So practitioners spend quite a bit of time in making those determinations and then curating them. And that's how the design process works. And we are seeing that AI applied can do the first cut for you in a matter of a few seconds. And if the first cut is 90% there, your work is reduced by 90%. So that's the gain that we are seeing with AI in all the metadata based decisions. So wherever you're trying to handle metadata related decision making,
[00:14:45] Unknown:
AI is helping in a big way. And now as far as the architectural aspects and the workflow, you mentioned needing to have some of these contextual cues to the language models. I'm wondering if you can just talk to some of the architectural and workflow aspects of being able to bring AI into the process of ETL rather than ETL being used to feed the AI. So the ETL workflows,
[00:15:12] Unknown:
they don't change much. AI comes into the ETL flow in different stages, at least for now. It's changing rapidly, and we'll come to that question a little later. But at least at this point, what we are seeing is that the ETL workflow stays the same. And for each of the steps, we see AI being applied. So let's have a look at a typical ETL workflow. The first step is data extraction. And in data extraction, we do have traditional connectors that will go into your source and try to figure out the metadata. And if you have the metadata, based on that it will do the parsing of your data, bring the right data into your pipeline, and then the mapping starts. So the first step is where you are trying to figure out the layout of your source data, when you're trying to read the data. For reading, of course, AI is not helping as much, but figuring out the structure of your data, that part is metadata building, and we see AI being used. In fact, we have released a feature addressing exactly that point, where it is automatically able to figure out what source you're dealing with, what is the structure of it, what are the columns and their data types, and so on and so forth. It can handle all of that. So this is AI being applied to the reading part. Now if you have unstructured data, sometimes you are dealing with, for example, let's say, tons of PDF documents containing unstructured data that has paragraphs and tables hidden inside it. And you have a specific prompt based on which you want to get certain data from each of these documents. Now that becomes your source. So this is your ingestion. You can apply maybe a pretrained model or, let's say, a fine tuned LLM.
And we have done experiments with those as well, and the results are beautiful. So you can apply those at the ingestion stage itself to get more quality data, more meaningful data. So that is the ingestion stage. Now we go to the next step, that is data mapping. In data mapping, again, it is a task that is meant for a combination of subject matter experts and, of course, the people who are developing the solution. It takes a lot of time, a lot of trial and error about, hey, this field, does it go here versus here? How do we combine it, and so on and so forth? And there also, we are seeing AI helping in a big way. You can give the context to AI about what the subject matter is, here is a list of my source fields, here is a list of my destination candidates, and let it figure it out. And we have seen that the results, again, are pretty good. It can do the work in a few seconds instead of going through a few iterations, coming up with the map, and then verifying it with your teams and all that. The first cut, again, is done very quickly. Of course, the verification step is going to be there. But the mapping can be done in, let's say, 1 tenth of the time that was required earlier. Then also in transformations, we are seeing that sometimes there are tools, of course, like ours, where the transformations are drag and drop.
And you can map and you can get going. But there are some tools that use coding. And in those tools, of course, AI can help you with generating the transformation code automatically. So that is the transformation side, and loading as well. If you're loading into, let's say, a data warehouse, loading is not easy. You have to write very complicated SQL code, with the inserts and updates and so on and so forth. And some tools do have a rule based solution that can automatically generate the code for you. But if you don't, you can use AI to generate the code for you. So AI is being used as kind of an assistant or a helping hand in all the different stages at this point, keeping the same workflow for the ETL. Documentation is the last step, I would say, where, again, AI is helping, where you can generate the documentation by giving it the context, your models, and other pipelines. And it can describe your pipelines in a pretty, I would say, reasonable way, where you can generate a document that describes your ETL pipeline. So that's how we see AI getting plugged into the different stages of the ETL workflow at this point. Lately, we are seeing this area advance pretty quickly.
So we are expecting it to not only help in all the different areas, but also the decision making that is, at this point, kind of limited to a localized area may grow from there and make bigger decisions such as, hey, now I am looking at these 5 sources. Can I automatically join these 5 and give you meaningful data in my destination? So dynamic ETL, real time ETL, and all that, that's where the future is going. But then again, we are not there yet, but we see that AI can help in those areas as well in future.
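The schema-mapping step described above, giving the model the subject matter plus the two field lists and asking for a first cut that a human then reviews, might look roughly like this sketch. `call_llm()` is a placeholder for whichever model is being used, and the field names in the usage comment are made up:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to whichever model you have chosen."""
    raise NotImplementedError

def suggest_field_mapping(source_fields: list[str],
                          destination_fields: list[str],
                          subject: str) -> dict[str, str]:
    """Ask the model for a first-cut source-to-destination mapping.

    The result is only a draft: a person still reviews and overrides it
    before the pipeline is deployed.
    """
    prompt = (
        f"You are mapping fields for a {subject} data pipeline.\n"
        f"Source fields: {source_fields}\n"
        f"Destination fields: {destination_fields}\n"
        "Return a JSON object mapping each source field to the most likely "
        "destination field, or null if there is no good match."
    )
    draft = json.loads(call_llm(prompt))
    # Drop anything the model invented that is not actually a destination field.
    return {src: dst for src, dst in draft.items() if dst in destination_fields}

# Example usage (hypothetical field names):
# mapping = suggest_field_mapping(["cust_nm", "dob"], ["customer_name", "birth_date"], "CRM")
```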
[00:20:10] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. And as far as the model selection, that is an area that is constantly moving, with the OpenAI GPT models grabbing a lot of the headlines and attention, but the open source models are also very quickly leapfrogging each other and catching up with the OpenAI models, as well as some of the ones available from Anthropic, etcetera. Has there been any feedback that you've seen as far as which models perform best for which use cases or for particular technology stacks? And I'm wondering what you see as some of the useful benchmarks or metrics for teams who are starting that evaluation process of, I want to bring in AI, but I need to be aware of some of the platform risk of depending on this for all of my day to day operations.
[00:21:39] Unknown:
So this area is changing rapidly. When OpenAI started out and we got access to OpenAI's APIs, we did some experiments. And followed by that quickly, we started to see a lot of these open source LLMs coming out where, of course, the parameters are not as high as OpenAI, but they're reasonable and you can have a local copy. That was a huge deal because, with OpenAI, even if you are getting the best results, there are many issues where you have to send everything to OpenAI APIs, and the performance sometimes was not great. Whereas we started to play with Llama and other models, and, at this point, we have 5 or 6 different models that we offer in our toolset. And this is given to the users, where they can experiment with any of them, and they can fine tune it with their data. And once it is fine tuned to their data, they can use any of them. So that's how we present it. We are agnostic to which 1 you use. But now coming back to your question about which 1 works better for which kind of data. So 1 pattern that we notice is the performance: if you're dealing with unstructured data, ChatGPT still stays at the top. The performance or the quality is about the best from there for unstructured data. We did 1 more experiment with creating a natural language interface for our own expression language. So in our product, we have an expression language where you can write formulas to do calculations, and it looks like the formulas that you see in Microsoft Excel. That kind of learning, not a whole lot, but still for some users, it is a bit difficult that, hey, I have to write these formulas and expressions to do some calculations.
So we did an experiment over that scenario where we gave a natural language interface to generate those formulas. And there we are seeing that Llama is a pretty good option. It does pretty well. Llama 2 now. So, depending on the scenario, it changes which 1 is gonna be performing better. So what I would recommend is to experiment with all of the top 4 or 5. It doesn't take much. And there are tools like ours available now that will give you a playground where you can experiment with fine tuning the 3 or 4 models and see which 1 works the best. Because, at least in our experience, it is not that winner takes all. In certain areas, we see that OpenAI is doing much better than others, but there are some areas where others are doing better than OpenAI. So I would suggest to experiment and see which 1 is the best fit for your kind of data.
And the process and the tooling, they're getting there already. There are many solutions available out there that are going to let you experiment.
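One way to run that kind of side-by-side evaluation is a small harness that sends the same prompts to each candidate model and scores the outputs. This is only a sketch: `generate()` and `score()` are placeholders, and the model names in the example are illustrative rather than references to any specific API:

```python
def generate(model_name: str, prompt: str) -> str:
    """Placeholder: route the prompt to the named model (hosted or local)."""
    raise NotImplementedError

def score(output: str, expected: str) -> float:
    """Placeholder scoring: exact match here, but a diff metric or human rating works too."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def compare_models(models: list[str], test_cases: list[tuple[str, str]]) -> dict[str, float]:
    """Run the same prompts through each candidate model and average the scores."""
    results = {}
    for model in models:
        scores = [score(generate(model, prompt), expected) for prompt, expected in test_cases]
        results[model] = sum(scores) / len(scores)
    return results

# Example (illustrative names and test case):
# compare_models(["gpt-4", "llama-2-70b"],
#                [("total sales by region", "SUM(sales) GROUP BY region")])
```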
[00:24:36] Unknown:
Now the other fun piece of working with AI is that it is nondeterministic, and so there is the potential for logical errors, logical bugs to come in, and you're not even necessarily going to get the exact same output for the exact same input, especially if you're dealing with successive generations of models. I'm wondering what are some of the ways that teams need to be thinking about error handling, error identification, validating the outputs of the AIs before they put it into production, etcetera?
[00:25:06] Unknown:
Great question. This is something that we have been dealing with even in our internal implementations, and even in our own coding that we did for the product. So, of course, the stochastic nature of the predictions makes it suitable for certain things, but not suitable for others. We tend to recommend using AI, where it is not going to be deterministic, only for design time. Don't let it be in the runtime. That is actually 1 of the principles that we have agreed upon: if it is at runtime, it is going to cause issues. And the only way possibly to use it at runtime is to have a strong layer of data validations that will reject certain things done by AI if it is not meeting your standards.
And then it throws it back to you saying that, okay, hey, have a look at it. So that is the only way to use AI at runtime. Otherwise, design time is where we see a lot of value. At design time, we see that you are making decisions and even implementations. The first cut that is created by AI is way faster than what even an expert of, let's say, 20 years or 25 years, 30 years, will come up with. So we see a lot of gains on that front. And the biggest benefit, I would say, is that once the recommendation is in front of you, you can review it and override it.
So that capability you have to have. In your process, make sure that you have that built in capability to review the work that is done by AI and override it if need be. So look at AI as your assistant who's helping you do some work. And you're not going to trust it blindly. You're going to look at the work done by AI and review it. And then if it looks good, sure, no problem at all, it goes to the next step. But if it doesn't, then you have a way to fix it, and then it goes to the next step. So mostly at design time we are using it. And we do have some places where we let AI handle certain pieces at the runtime.
But there, we make sure that you are using some kind of rule based checking of the results. So data validation, the module that we have, is a must that you have to apply after the AI step. Any step that involves AI, after that, you have to have data quality checks and data validation to make sure that, at runtime, you have your eyes on AI.
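A rule-based gate of the kind described, applied after any AI step so that only records meeting the standard move on and the rest are thrown back for review, could be as simple as this sketch. The required-field check is just one example of a rule; real validations would cover types, ranges, referential checks, and so on:

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    accepted: list[dict] = field(default_factory=list)
    rejected: list[tuple[dict, str]] = field(default_factory=list)  # (record, reason)

def validate_ai_output(records: list[dict], required: list[str]) -> ValidationResult:
    """Rule-based gate applied after an AI step: accept only records that meet
    the standard, and collect the rest for human review."""
    result = ValidationResult()
    for rec in records:
        missing = [col for col in required if rec.get(col) in (None, "")]
        if missing:
            result.rejected.append((rec, f"missing required fields: {missing}"))
        else:
            result.accepted.append(rec)
    return result

# Only the accepted rows continue down the pipeline; the rejected ones are
# surfaced to a person ("hey, have a look at it").
```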
[00:27:57] Unknown:
That's how I put it. And you mentioned that you wisely don't incorporate the AI into the actual runtime behavior, but just in the design and implementation phase. And so for ongoing maintenance purposes, what do you see as the ongoing role of the AI as you maintain and evolve the different pipelines or try to implement new pipelines that maybe feed off of some of the ones that are already implemented, things like that. Great. So on that front,
[00:28:28] Unknown:
the role is increasing. Actually, I'll take that analogy of an assistant. Your assistant is being trained. They are doing certain tasks now, but once they are trained, they can do bigger tasks. So that role is also evolving, and we do see already some tool sets or some teams working on it. That is where we increase the responsibility of AI, as our assistant, to do monitoring of existing flows. So monitoring is another part where we see a huge role that AI can play. So here, it is not impacting your real data, but it is helping you as a user with what would otherwise be your responsibility: to look into your data pipelines, how they're behaving, what kind of data you're getting, any errors that you are receiving, what the frequency is. So it is pretty good on that front. It can tell you that, hey, most likely this is what is going wrong, and you can go and fix it. So data anomalies inside your metadata at runtime, the information that is coming out from runtime.
If we can have an eye into that process, that is very useful. And AI can do that for us. So AI is being used to help us monitor the entire workflow that we have deployed at runtime. So that is 1 area. And also for the areas of design or implementation that we were talking about earlier, there, the role is getting more and more advanced, more complex. So we are seeing that if you have designed certain flows in the past and you were seeing certain errors or certain issues with those flows, it can detect them. And, also, it can generate a recommendation about, hey, instead of using this flow, how about you try this flow?
So that is, again, for design time, but a huge help, because that is kind of troubleshooting. Troubleshooting takes time, and AI can do that work for you. So in these 2 areas, I see that AI's role is gradually increasing.
[00:30:40] Unknown:
As more people start using AI for projects, 2 things are clear. It's a rapidly advancing field, and it's tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26th featuring some of the brightest minds in tech. Check out the agenda and register today at neo4j.com/nodes.
That's n, e, o, the number 4, j.com/nodes. For teams and individuals and organizations who are considering this introduction of AI into their critical data flows, what are the typical motivators that you see and the types of questions and concerns that they need to address before they can feel comfortable with actually putting the results into production?
[00:31:44] Unknown:
Yeah. So on this front, we are seeing that our customers ask us this question frequently now: how should we approach it? How should we incorporate AI into our solution design and implementations? And, of course, they have some reservations about it as well: should we be using AI or not? So from our side, we always suggest that you take your time. Come up with a strategy for how you want to incorporate AI into your data solutions, your architecture, or overall, organization wide, how you want to start using AI.
And there are a few things that are important. The first 1 is that you have to identify the areas or the scenarios in which AI is applicable. Then there are the issues with the tooling: you need to have the right tool set available, because that can make or break it. And then, of course, the training of your resources. That is also very important. They need to be trained properly in using AI, because if it is not done properly, you're not gonna get any value out of it. Now coming to the reservations or objections to AI, there are many reasons for those. The first 1 that comes up is compliance and explainability.
The problem with AI is that it's like a black box in many scenarios. And there are many industries where you have to have complete visibility into anything that is happening with your data. And if you're using AI in certain scenarios, it is not going to give you that information. So you must identify the areas where you can use it without being able to explain what is happening inside this black box. So it's kind of a design problem where you look at your scenario and figure out where you can fit in this black box so that you are still going to be okay from the compliance perspective, from the explainability perspective.
So that is 1 issue. And, with the right design, of course, it can be addressed. And then the second part would be about the data itself that you have. Sometimes data is suitable, sometimes it is not suitable for AI. For simple scenarios, AI is overkill. If you have a simple source database that you want to move to your data warehouse, you don't need AI for that. You can simply plug in a standard ETL, and it's going to be much more cost effective. And it is going to be able to bring your data from source to destination much quicker compared to using AI.
So that's what we recommend: look at your scenario. Do the evaluation of how you want to use AI, if it is even applicable. If it is, then where does it go inside your use case? And that's how we go about designing a tailor made solution for each of our customers.
[00:34:45] Unknown:
In your experience of working with generative AI in the ETL context and onboarding people into these workflows, what are the most interesting or innovative or unexpected ways that you have seen the AI used in the context of ETL?
[00:35:01] Unknown:
Yeah. Interesting question. So this is, I would say, all of it that we have seen, or what I have said so far. When it started out, pretty much every other week we'd see something that we had never thought about being addressed by generative AI. So that was a phase for about 6 months where every couple of weeks, we'd see something that we never thought could be done, and it was there. So, of course, it is still going on, but we see this pace kind of slowing down on the new innovations using generative AI in ETL. But the most common 1, I would say, is the use of AI to get insight from unstructured data.
And I'll give you 1 example of that. So we had a use case where we had to get certain answers from our documents, as if you're asking a question that, okay, hey, read this document and tell me what is the answer to this question. And this is a very common pattern in data insight gathering. And we had a solution that, of course, worked pretty reasonably. But to implement this scenario, it took us about a good 6 man months to write the right solution, build it, test it, and all that. And we started to experiment with OpenAI's APIs. And, of course, we had to experiment with the prompts, and when the data comes in, we ran into some issues on that front as well. But, eventually, once it was done, the results were better than what we had earlier, and it was done within 1 week.
So that's the extent of savings we are talking about: 6 months versus 1 week. And the solution, and the results, are better. So when it comes to unstructured data, text, images, and all that, it's beautiful. The solutions based on generative AI are way better than what we had earlier. And then the next trend was semantic mapping. This is another 1 where we saw that our users struggled. Back in the day, we had, basically, auto map features where we would try to figure out, for a column name in the source, what it should belong to in the destination. We used to call it smart mapping.
And the smart mapping was okay. It used to be, I would say, about 60, 70 percent accurate, but still there were a lot of errors. Coming into the picture, AI again. And we started to use the semantic mapping and give the context, and now the accuracy goes up to about 95%. So we have seen almost like magical results in certain areas. 1 more that is a very interesting experience, at least personally for me, was creating the data models for data warehousing. Data warehousing, data modeling is not easy. For practitioners who have been in the field for decades, it still takes time. Figuring out what should be a fact, what should be a dimension, how they should be connected, and what kind of configurations you wanna do for facts and dimensions, and also for other architectures such as data vault and all that.
So we did another experiment where we took our transactional database and created a data model out of it. So we have a reverse engineering functionality that can create the model automatically. No problem on that front. Once we have the data model, we let the AI decide for us how to convert it by denormalizing and creating a star schema. And, again, the results were amazing. It was almost magical. We get the right prompts, we get the right information, and out comes a data model that looks like a perfect star schema. What would otherwise have taken several days of iterations and back and forth with your subject matter experts and the data architect was done in a matter of a few seconds again. So these are a few interesting usages or scenarios that I can think of, but there are many.
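A first-cut star schema suggestion of this kind could be prompted from table metadata alone, along the same lines as the earlier mapping sketch. `call_llm()` is again a placeholder, the example tables are invented, and the draft still goes to a data architect for review:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: same stand-in as before for whatever model is in use."""
    raise NotImplementedError

def draft_star_schema(tables: dict[str, list[str]]) -> dict:
    """Ask the model for a first-cut star schema from table/column metadata only.

    `tables` maps table names to column names (metadata, not data). The output
    is a draft that a data architect reviews before anything is built.
    """
    prompt = (
        "Given these transactional tables and their columns:\n"
        f"{json.dumps(tables, indent=2)}\n"
        "Propose a star schema: which tables (or combinations) become facts, "
        "which become dimensions, and which columns are the join keys. "
        "Answer as JSON with keys 'facts' and 'dimensions'."
    )
    return json.loads(call_llm(prompt))

# Example with made-up tables:
# draft_star_schema({"orders": ["order_id", "customer_id", "amount", "order_date"],
#                    "customers": ["customer_id", "name", "region"]})
```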
To summarize, I would say that wherever you have patterns, and the patterns have been applied in the past, it is known to the practitioners, and it is repetitive, let AI do the decision making for you, and you will not be disappointed. It is going to do a pretty good job. At least the first cut is going to be amazing. And then, of course, you know, as we talked about earlier, you can take that and override it if need be. And in your experience
[00:40:07] Unknown:
of working in this space of ETL for so long and the introduction of generative AI as a solution in this process, what are the most interesting or unexpected or challenging lessons that you've learned?
[00:40:19] Unknown:
Yes. Early on, I would say, we, actually, like anyone else, thought that it could do much more, in terms of even looking at the data and being more deterministic, I would say. The solutions, or the results, were not that satisfying. So we had to kind of take a step back and redo all of that work. So it will sometimes make mistakes about simple things. Like, you know, I'm asking for the location of a specific field in my file, and it'll make a mistake in that. Something you can basically do with the naked eye. You can see that, hey, it is in line number 2 at the 10th character. It can't even figure that out. It will make mistakes. And if you run it twice, 3 times, 4 times, maybe once it'll give you the right results.
So anything where you're relying on it giving deterministic answers, forget about it. So we decided that wherever precision is required, it is not a good fit, because we tried with pretty much everything that is out there. We even tried figuring out the locations. So we have actually an unstructured data extraction module that can, based on templates, extract data points from your document. And we were trying to build a template using AI. And for that, you need to get the locations of certain fields and patterns with which you can build the template. And it was not a good experience. We would try to do that and, every single time you run it, you get a different result.
And the whole algorithm will be messed up. So anything that is nondeterministic, it's okay to use it for those scenarios. But where you expect precise results, it's not gonna be a good fit. So barring that, I think it has been pretty useful. But just keep in mind whenever you're designing something, make sure that you look at it carefully and see: are you expecting precise results? If yes, then be careful. Otherwise, approximate results, if you're okay with 95%, if you're okay with 90%, you're good. But if you want 100%, please do not use it.
[00:42:49] Unknown:
You already mentioned this a little bit, but what are the cases where AI is the wrong choice for ETL applications and you should just do it manually, write the code, use the low code tool?
[00:43:02] Unknown:
Yes. So apart from what I said already, that is, for simple scenarios. So if you have just a handful of sources, structured data mostly, and your destination is also neatly defined, ETL is still going to be more cost effective, and it is going to be actually a better choice. Whereas if you have more complex data, you have unstructured data, you have scalability issues, as in your scale can grow, the volume of data can grow, and then your overall data ecosystem or the entire data management platform that you're looking at has to be able to handle more complexity in the future.
Then we would recommend that you use AI based solutions or start using AI. But if it is a standard, simple data pipeline that is going to be building your data warehouse, or a data integration use case, it is not gonna be cost effective. Also, it depends on other aspects of implementation, that is, how well trained is your team, how big is your team, what kind of resources you have at your disposal. If you have a small team, again, it is not going to be that applicable to your scenario. And, also, it depends on your strategy. That is 1 more thing that I would like to add.
We have seen some smaller teams where their leadership wants to bring in AI. If that is the scenario, definitely go for it. Even if it is overkill in the beginning, it's going to help you in the future. So in that scenario, we do recommend that you start using it from the beginning itself so that when the time arrives, you're ready for it. And as you continue to invest in this area, work with customers,
[00:45:00] Unknown:
work with your own technology stack, what do you see as the future applications for AI either in ETL specifically, but also in just the broad application of data engineering as a role?
[00:45:14] Unknown:
Yeah. This is a question that we discuss routinely. In our design meetings, this is what we talk about: where the market is going, what we should be doing. And 1 topic that is particularly fascinating is to let the AI do the real time or dynamic ETL. That is, the decision making that we do at design time, can that happen in real time? Of course, it comes with all its caveats and all that, but it is most definitely going in the direction where AI can make some decisions about what data streams or what datasets can be merged and how they can be transformed.
So transformation tools and all that, maybe it can use your existing tools. But decision making about how to build the pipeline, what pipeline makes sense, that can be put in the hands of AI. So, kind of, dynamic ETL pipelines. This is where I think the future is going, where it can automatically generate those pipelines. And it is going to be more declarative, where you can declare the instructions given to the AI: hey, if you look at any new datasets coming in for my customers, you must apply this system of record and put this into this destination. This is the instruction given to AI. Now it has to do the design with the internal tool sets. It may be using a low code solution, but it is the user of that tool set. And it can make the decision for you about how to build that pipeline automatically.
So this is, something that we we see that it is going to happen. And, of course, so, at we are still in early stages. We have done some experiments too in that on that front. There are some, obvious issues on that front. Again, we talked about how the decision making, if it goes wrong, what do you do about it? So there's still some questions we have to answer on that front, but it is definitely going in the direction where, AI can do a little bit more decision making at a higher level, and it can, dynamically build the ATL flows for you.
[00:47:36] Unknown:
Alright. Are there any other aspects of the applications for generative AI in the ETL workflow or just the overall ecosystem of data engineering and the ways that AI is going to impact it that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:55] Unknown:
So 1 area where we see that AI is particularly good is anomaly detection. And it has been, not only generative AI, even the predecessors of generative AI. In the past, I would say almost a decade, they have been used in certain scenarios like fraud detection. So if you give it time series data, it can figure out where something is wrong, and it can help you with that. So that specific use case now is being applied to ETL as well. That is something that I see being used more frequently. And, briefly, we touched upon it in the context of the runtime errors and monitoring of your ETL workflows.
So this is, of course, just 1 part of it, but we can use AI to detect anything that is going wrong. So here's my pattern, here is how we use it. If you see something wrong, let us know. So it can be your eyes into the runtime. And, also, of course, for the datasets themselves. So not only the runtime, you can have a side process that is looking at your incoming datasets. And beforehand, in the profiling stage, if you see something drastically wrong or different from what you expect, you can kind of short circuit the entire pipeline and handle it in a different way. So anomaly detection in the data itself and in the metadata, these are the 2 areas where I see that AI can be a huge help.
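The simplest version of that profiling check can be done statistically before layering a model on top. A sketch of the idea, flagging an incoming batch whose row count is far outside historical runs; the z-score here is a plain statistical stand-in, and the same hook is where an anomaly-detection model would plug in for null rates, column distributions, or runtime error frequencies:

```python
import statistics

def profile_anomaly(history: list[int], incoming: int, threshold: float = 3.0) -> bool:
    """Flag the incoming batch if its row count is far outside past runs.

    A simple z-score over historical row counts; the same idea applies to
    null rates, column distributions, or error frequencies at runtime.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return incoming != mean
    return abs(incoming - mean) / stdev > threshold

# If the check trips, short-circuit the pipeline and route the batch for review
# instead of loading it:
# if profile_anomaly(previous_row_counts, len(new_batch)):
#     quarantine(new_batch)  # hypothetical handler
```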
[00:49:41] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:57] Unknown:
This area is changing rapidly, and a lot of tools at this point are at early beta stage, sometimes alpha stage. And new tools are coming out pretty much every week; we see something new being launched. But 1 thing that I would love to see is more natural language based UX. So as of now, the standard is, again, the drag and drop UI. That is the golden standard for any ETL platform, or all data management tools, basically. The graphical user interface based tools, the drag and drop, is the standard. I would like to see natural language being implemented, where you should be able to speak with your toolset, and it does the work for you.
So that is something that, of course, in some products, including our product, is starting out in certain areas. But that can be taken to the next level, and I'm expecting that within the next few months, it should be there. That will be the bridge between the AI, specifically generative AI, and ETL toolsets. So that is going to be kind of connecting the users with the tool sets in a much more meaningful way.
[00:51:13] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the experiences you've had with bringing AI into the ETL workflow. It's definitely a very interesting topic area, definitely 1 that is constantly moving. So I appreciate you sharing your perspective on that and helping people get a leg up on that journey. So, thank you again, and I hope you enjoy the rest of your day. My pleasure. Thank you for having me.
Introduction and Overview
Guest Introduction: Jay Mishra
Generative AI in ETL: Applications and Impact
Shifting Responsibilities with AI
Structured vs. Unstructured Data
Architectural Aspects of Integrating AI
Model Selection and Evaluation
Error Handling and Validation
Ongoing Maintenance and AI's Role
Motivations and Concerns for AI Adoption
Innovative Uses of AI in ETL
Lessons Learned from AI in ETL
When AI is the Wrong Choice
Future Applications of AI in Data Engineering
Anomaly Detection and AI
Biggest Gaps in Data Management Tooling
Closing Remarks