Summary
Gleb Mezhanskiy, CEO and co-founder of Datafold, joins Tobias Macey to discuss the challenges and innovations in data migrations. Gleb shares his experiences building and scaling data platforms at companies like Autodesk and Lyft, and how these experiences inspired the creation of Datafold to address data quality issues across teams. He outlines the complexities of data migrations, including common pitfalls such as technical debt and the importance of achieving parity between old and new systems. Gleb also discusses Datafold's innovative use of AI and large language models (LLMs) to automate translation and reconciliation processes in data migrations, reducing the time and effort required for migrations.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
- Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about Datafold's experience bringing AI to bear on the problem of migrating your data stack
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the Data Migration Agent is and the story behind it?
- What is the core problem that you are targeting with the agent?
- What are the biggest time sinks in the process of database and tooling migration that teams run into?
- Can you describe the architecture of your agent?
- What was your selection and evaluation process for the LLM that you are using?
- What were some of the main unknowns that you had to discover going into the project?
- What are some of the evolutions in the ecosystem that occurred either during the development process or since your initial launch that have caused you to second-guess elements of the design?
- In terms of SQL translation there are libraries such as SQLGlot and the work being done with SDF that aim to address that through AST parsing and subsequent dialect generation. What are the ways that approach is insufficient in the context of a platform migration?
- How does the approach you are taking with the combination of data-diffing and automated translation help build confidence in the migration target?
- What are the most interesting, innovative, or unexpected ways that you have seen the Data Migration Agent used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI powered migration assistant?
- When is the data migration agent the wrong choice?
- What do you have planned for the future of applications of AI at Datafold?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Datafold
- Datafold Migration Agent
- Datafold data-diff
- Datafold Reconciliation Podcast Episode
- SQLGlot
- Lark parser
- Claude 3.5 Sonnet
- Looker
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time right at the source. Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production?
Learn more at dataengineeringpodcast.com/datafold today. Your host is Tobias Macey, and today I'm welcoming back Gleb Mezhanskiy to talk about Datafold's experience bringing AI to bear on the problem of migrating your data stack. So, Gleb, for folks who haven't heard you previously, can you give a refresher on some of your background and how you got started in data? Yeah. Absolutely.
[00:01:06] Gleb Mezhanskiy:
Great to be back, Tobias. Thanks for hosting me. Yeah. I am Gleb, CEO and cofounder of Datafold. Datafold is a data quality platform. We build tools to help data teams manage various problems around data quality, data migrations being one of them. I started in data over 10 years ago as a data practitioner, first as a data engineer, but I also wore the hats of data scientist and data PM. And it so happened that in my career as a hands-on data practitioner, I got to build and scale 3 data platforms at 3 very different companies. One was Autodesk, obviously a very large corporation.
One was Lyft, where I joined as one of the first data team members. And what was very interesting about that experience is, we started as a very small data team, and we had to scale it from about 10 people to over 200 and built lots of tools and advanced processes just to support that growth. And then I also built a data platform from scratch for a Series A company called Phantom Auto, and I think that gave me a very interesting perspective on the challenges of building data platforms and scaling them at different stages of companies.
But, ironically, a lot of the problems that I saw firsthand as a data practitioner were the same across all of them. And that was the big inspiration for starting Datafold: to help data teams, regardless of the size of their data or the size of their team, solve data quality issues.
[00:02:42] Tobias Macey:
And in that experience, in your career and in your work at DataFold, what is it about the problem of data migrations that caught your attention and made you decide to invest some time and energy into trying to make them better and less painful?
[00:02:57] Gleb Mezhanskiy:
Yeah. So I personally had a big, I would say, failure experience and a very steep learning experience with data migration, and that happened when I was at Lyft. So soon after I joined the data team, we decided to migrate from a warehouse at the time based on Amazon Redshift to a data lake that would be supported by multiple engines, including Spark, Trino, and Hive at the time. And it was roughly a 2 petabyte scale migration, and I volunteered to be kind of the main architect and project manager behind the whole migration effort. So as a young and ambitious data engineer, I thought it was a really exciting, impactful, high visibility project I could tackle to grow in my career. And it was extremely fruitful in terms of learnings, and it was also extremely painful. And in general, migrations are hard because usually by the time that you decide to migrate, you already are in pain as a data team because you reach the ceiling of the scalability of the current system.
You probably accumulated a lot of technical debt in terms of the code that's written. You have a lot of dependencies built by the business on the legacy system. And a lot of success in migration projects depends on how you approach it. And I made 2 big mistakes leading that migration, and one of the mistakes was related to the process and one was related to tooling, which I'm happy to talk about. And, basically, long story short, I spent probably about two and a half years doing the migration and made probably 10% of the progress that I would want to make and the business wanted me to make. And I think, overall, it took about 5 years for this project to complete, which was originally scheduled to complete within probably about 9 months. And so having that scar, I was really excited to work on tooling to improve migrations for other teams and also help build the framework for how to address migrations better, not just from the tooling perspective, but from the process perspective as well. That's kind of my personal mission. I wanna make migrations right, learning from past mistakes, but also, you know, running a company that's focused on building data quality tools, I think migrations is a big opportunity to help data teams move faster and improve data quality. Because if you are in a legacy system or if you are stuck trying to move from a legacy system to a new system, be that a better warehouse, a faster, more scalable data lake, or a better orchestration framework, there are different types of migrations. You know, it's really hard to be successful in other things because you are distracted as a team. You have to run two systems.
A lot of your resources and attention go into the migration and not into building data products and not into, you know, advancing the business and building exciting things like BI and AI. And so there's a huge cost to running migrations and really a lot of value that you can create if you can help teams move faster with those projects. And as far as the scope of what types of migrations we're talking about, you mentioned data warehouses or query engines. You mentioned some of the tooling and orchestration.
[00:06:14] Tobias Macey:
In that overall process of migrations, you mentioned going from legacy systems to newer architectures or faster warehouses. What do you see as the general motivation for deciding to undertake this process, since it can be so fraught with the peril of accidentally breaking things that have been working? And what are some of the potential gains that companies are theorizing they're going to attain, maybe some of the ways you've actually seen that come to fruition, or some of the mistaken gains that they think they're going to get but aren't really feasible when they get to the other side of that process? It's a great question. I think it ultimately comes down to why does a data team exist for a given business, and what does the business need from the data team. And, typically,
[00:07:03] Gleb Mezhanskiy:
the output of data teams is insights and data products. And data products can take various forms. It can be dashboards. It can be self-serve analytical experiences. It could be automation in the form of machine learning, AI, and just operationalizing data. And, typically, the big motivator for doing a migration is that the team feels that they can no longer sustain and deliver the products and insights that the business needs because of the limitations of the infrastructure that they have in place. The limitations that exist could be related to scale. So in order to provide analytics and products that the business needs, we need to integrate more data sources from various systems as well as, you know, logs and events, and our current system just cannot handle all of the raw data that we need to accumulate. That's one. Or we cannot serve all the analytical workloads, queries from BI, and transformation loads that we need to run on the current system. For example, for us back at Lyft, those were the 2 primary motivators. We could not fit all the data we needed in the warehouse we had, and we could not run concurrently the query loads that the growing business required. There are also another two big reasons. One of them is cost. So a lot of the times, legacy systems have way lower ROI in terms of the scale and throughput of analytical workloads that they provide relative to some of the more modern systems that were architected later and are way more efficient. And the 4th one would be interoperability.
So if you operate on a legacy stack, a lot of the times, you can't take full advantage of the data ecosystem. So if you have a really old database or, you know, sometimes we even see mainframe databases, you just can't leverage the latest BI tools or other analytical tools and infrastructure that actually would help you build what the business needs. And so the fact that you're on a legacy stack actually compounds into being way behind in terms of the ability to deliver. And the challenge is that a lot of the times teams choose to migrate kind of too late because the pain becomes unbearable, and that means that there's a lot to tackle while doing a migration. And the other thing I'll say in terms of the motivations for migrating, and that's the trend that I've started seeing probably over the past couple of years, is before, maybe, like, in the last 5, 10 years, migrations were kind of 0 to 1, take everything and move everything and shut down the old system type of projects. Today, with the commoditization of compute, right, we have great warehousing technologies, and there are pros and cons of using different vendors and different clouds, and a lot of teams operate in multi-cloud environments. There's actually an emerging need to move workloads freely between engines depending on what engine provides the best compute for a particular type of workload. Like, for example, maybe some engines are better at ingest, but some are better to serve BI queries, and some are really well positioned to help with geospatial analysis or machine learning. And so moving the workloads around can help you unlock the best performance, the best cost. But that means you kind of have to do these migrations of your workloads all the time, and that still is challenging. I would say that is another motivator for doing data migrations.
[00:10:34] Tobias Macey:
In terms of the time sinks involved in the process of doing this migration, you mentioned that in your experience of doing the migration at Lyft, you spent the 9 months that you thought it was going to take and maybe got 10% of the way there. From your experience of going through that process, and also working with teams at Datafold as customers who you are trying to help work through that process, what are the aspects of the migration that end up taking so much time or moving slower than they would anticipate?
[00:11:05] Gleb Mezhanskiy:
Yeah. Absolutely. So I would say, to describe the time sinks, it's important to be clear about what the success criteria are for a migration. And if we take an example of moving from a legacy system, let's say from a warehouse to a new warehouse, or moving from a legacy transformation framework to a new transformation framework that runs on top of the warehouse, and sometimes doing both at the same time, usually the success of a migration project is defined by: we completely moved our analytical workloads to a new system, and we shut down the old system. And if we walk back from that, what do we need for that to happen? Well, we need the business to fully adopt the new system. And what do we need for that to happen? Well, we need the business to trust the new system, the data that is provided there, and the numbers. And then what do we need to achieve that trust? And it usually comes down to having parity. Right? So demonstrating that whatever KPIs and reports and datasets and feature sets we were producing on the legacy system are now fully replicated and rebuilt from scratch on the new system, and there is parity. There is nothing, you know, that's changed. And that is where I personally made a mistake leading a migration.
I talked about the process mistake, and the biggest process mistake is if you don't aim for parity and you try to improve things in the process, especially if improvement means changing the data model and naming conventions and definitions, which is very, very tempting because typically when you do a migration, you already accumulated sometimes decades' worth of technical debt and spaghetti code and duplicate sources of truth and definitions. It's very tempting to fix that while you're doing a migration. And the biggest problem is that if you do that, then you can't simply say, oh, we have parity between old and new. You're saying new is better, and better is different, and different means that you have to drive a lot of consensus with the users of data. You have to ensure that they buy into learning new definitions, and they agree that the definitions are correct, and they are fine using a new schema and new structure. And that is the biggest killer for migration projects in terms of the timeline, because instead of actually doing the work, you end up having conversations and driving consensus. That was my first big mistake. So the biggest learning is: do lift and shift. Meaning, take everything as is and do the least possible effort and least possible changes to make the whole system run on the new platform. And that doesn't mean you wouldn't maybe fix some, you know, things here and there and maybe rewrite some code to run more efficiently on the new system, but it means that you don't try to change the interface that the data users have. Because otherwise, it's extremely hard to drive consensus and achieve parity and ultimately declare success. And so if we agree that, okay, this is the definition of success for a migration project, then the typical workflow and the path for getting there usually consists of 5 stages. The first one is planning. So we try to understand what assets we have, tables, views, SQL queries that we need to move over. And that, in my experience, takes maybe up to 5% of the entire project. It's basically accounting for everything, sequencing the project. Then you also need to move the data. And moving the data usually is about moving the source or raw data, because we ultimately need to recompute everything from scratch and all the derivative transformations on the new system. So we have to make sure that there's parity in raw data. All events, replicas from business systems, from transactional systems, and vendors, we have to have parity in the raw input data. And that, in my experience, is about 10% of the effort. It is definitely a significant project, but usually, it's not that hard because a lot of times you have certain firehoses like your event streams, and you can simply point them to the new system and have your data lake populated with the same data. And then you start the hard part, which is the translation of your business logic and not just the translation, but also reconciliation of the outputs, making sure that you can run the new code on the new system. You take, let's say, stored procedures running on MySQL, and you move them over to a new warehouse, and you have to make sure that the code runs, and not only runs but produces the same output.
And that's where, basically, in the translation and the reconciliation stage, I would say, teams spend about 80% of the time and effort, because you have to repeat that almost like a for loop for every asset that you have, translation, reconciliation, and you have to repeat it over and over again until you reach parity. And that is the hardest part. Once you have that and you completed this process for every asset, then you are at the final stage, which is getting sign-off from the business and getting them to adopt the new data. And sometimes that means sending emails, sharing reports about the migration progress, switching sources from legacy to new, and then just getting a thumbs up from whoever your data user is. Sometimes it's the data science team. Sometimes it's the operational team, but making sure that they are fully comfortable and officially signed off on using the new system. An interesting aspect too of data migrations,
[00:16:29] Tobias Macey:
once somebody is using a data lake style environment, is that so many of the other pieces become not quite interchangeable, but you don't have to replace everything wholesale. You can replace the query engine or you can replace the table format, and that requires a different approach to what it actually means to move the data or to translate the data. And I'm wondering how you've seen that factor into the ways that teams are addressing the new architecture that they're targeting. I see that as being one of the main motivators for the lakehouse style architecture that has been gaining a lot of ground and just some of the ways that teams are thinking about that composability of data stacks that is one of the features of the so-called modern data stack or the way that people are starting to build systems today as compared to the early 2000s?
[00:17:18] Gleb Mezhanskiy:
Yeah. I think composability is great, and that enables teams to have better ownership and control of their data and bring in whatever are the better ways to transform and operate on this data. And that's why I think we're also seeing some of the major warehousing players like Snowflake embrace and adopt this paradigm, right, investing in supporting the open table formats, whereas previously they were more of a closed ecosystem. So there's definitely a trend toward, like, a more open lake architecture. And I think that actually makes migrations in a way more ubiquitous, because you move data between formats, or you have the same format, but then you want to swap the engines that operate on the same lake. And that just means you are doing these migrations, you know, maybe more incrementally, but more frequently within your workflow. And no matter what you do, maybe you just move data from one format to another, or we actually have some customers that are leveraging the same vendor for transformations, but they are swapping the underlying cloud provider, like, for example, moving from Azure to Google Cloud and vice versa. In these cases, no matter what you do, no matter what you migrate, you still, as a success criterion, need to prove to the business that you have the same data once you've completed the migration. And that's kind of both easy and hard at the same time, because no matter what, you cannot escape the user acceptance testing as part of the migration project.
[00:18:47] Tobias Macey:
And so to reduce the time and effort burden on teams who are undergoing these projects, you've invested in building a migration agent at DataFold, bringing to bear the capabilities of LLMs, because that is what is driving a majority of the new development in the technology ecosystem today. And I'm wondering if you can talk to some of the core objectives that you had going into this project, some of the ways that you approached the design, and how you decided whether and when it was ready to actually unleash on people's data stacks given their sensitivity
[00:19:25] Gleb Mezhanskiy:
to letting anything touch their data? Yeah. Absolutely. So we've been supporting teams doing migrations for a few years now. And as I mentioned, the biggest time sinks are the translation of code and the reconciliation. And for a few years, we've been providing reconciliation as a tool called Datadiff that teams could leverage to compare data between the legacy system and the new system at scale, understand the discrepancies, and address them. And Datadiff overall turned out to be a Swiss army knife for data quality, because a lot of the data quality workflows are about understanding differences in data, whether between staging and production to understand the impacts of deployments of new code changes, comparing data today versus yesterday to understand drifts in the data, comparing data source to target when you do CDC replication and trying to see if you have consistency, and, in the context of migration, comparing legacy to new data. Some teams, including Eventbrite, that we talked about with Rob, I think, a few episodes ago on your show, leveraged Datafold for that reconciliation piece in their migration project. And the translation piece, the teams have been doing themselves, mostly relying on manual work. And we saw a lot of opportunity to solve that piece as well, because we knew that reconciliation takes a lot of time, but translation is also very hard. And we also worked on dealing with analytical code and SQL code specifically for years, and we've been leveraging and extending compilers, both open source and in house, in order to support column level lineage. So we've been actually very deep in semantic code analysis for years as a company, but we never saw that technology, that approach for dealing with code, as good enough to support migrations at scale. And the biggest reason there is that the, I would say, previous generation of parsers, including some of the great open source projects like SQLGlot and Lark, rely on static predefined grammars. So they parse SQL code into an abstract syntax tree, and then they apply a grammar to understand what each element of the syntax tree is doing and then convert that into the corresponding syntax of the target dialect. And that is a very deterministic approach. So when it works, it works great, but it requires you to predefine grammars for every dialect you try to convert from and every dialect you try to convert to. And to make things even more complex, a lot of the times, teams don't just migrate from one SQL dialect to another, but they also migrate between frameworks. Like, for example, someone may be using an orchestration framework like Airflow, and then they are migrating to a new framework like dbt.
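As a rough, hedged illustration of the grammar-based approach Gleb describes, the open source SQLGlot library parses a query into an AST under one dialect's grammar and re-emits it in another. The query below and the Redshift-to-Snowflake pairing are invented for illustration; they are not taken from the episode.

```python
# Minimal sketch of AST/grammar-based SQL transpilation with SQLGlot.
# The query and the dialect pair are illustrative assumptions, not from the episode.
import sqlglot

legacy_sql = """
SELECT user_id,
       DATEADD(day, 7, created_at) AS renewal_date,
       LISTAGG(plan_name, ',') AS plans
FROM subscriptions
GROUP BY 1, 2
"""

# Parse under the Redshift grammar, then re-emit as Snowflake syntax.
print(sqlglot.transpile(legacy_sql, read="redshift", write="snowflake")[0])
```

Note that this only rewrites the SQL string itself; the wrapper around it (an Airflow task, a stored procedure, a dbt model file) is untouched, which is the gap Gleb points to next.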
And so you're not just moving your SQL code and translating the code, but you also need to change the wrapper in which the SQL code is built as part of the migration process. And the open source grammar-based approach for translating that just doesn't support that type of translation in a flexible enough manner. Or sometimes you would want to translate from a legacy GUI-based orchestration tool into a code-first orchestration tool, and that is an even harder type of migration to pull off with a grammar-based approach. And so I think what happened over the past few years is obviously the huge development and progress made in large language models that turned out to be very good at generating code, including SQL. And, obviously, you might think, well, LLMs are there, and let's just have them do the migration. Well, it turns out these models are good at generating code, but not necessarily good at generating the correct code. And as we know, for migrations, we need to ensure the correctness of the migration project. And that's why we saw the big opportunity to put that new technology together with the ability to do reconciliation at scale, which we've had for years, and turn it into the data migration agent that essentially performs both the translation and reconciliation together in one go, basically replicating the process that a data engineer would do manually, but doing it at scale. And so in terms of
[00:23:49] Tobias Macey:
the system design and architecture of this application, I'm curious if you can talk to some of the ways that you're thinking about the sources of knowledge that you're bringing into the system, the ways that you're applying the LLM, the model selection, and just some of the overall evaluation and testing that you had to go through to build confidence in the agent.
[00:24:11] Gleb Mezhanskiy:
Yeah. Absolutely. So the reason why it's called an agent is because it doesn't only perform a translation, but it is able to start with a legacy code base and get to the converted code base and ensure the parity. And that does not necessarily happen in one go, but the agent can improve its output and performance as it proceeds with the migration. So if we think about what a data engineer does when they need to migrate a certain piece of transformation logic: they take, let's say, a table that is produced by a SQL query, and they attempt to convert that SQL query into the new dialect, maybe also wrap it into a new wrapper like dbt, Airflow, or Dagster, and then they run it on the new system, and then they see if there are any errors, syntax errors. Does the query even compile? And then once they get that to work, then they say, okay. What's the output of that new query? Does it produce the same data? How do we know it produces the same data? We need to compare it. And if we see any discrepancies, then we try to go back to the code and see, okay, where do those discrepancies come from? Am I using the wrong function? Maybe the function in the new system produces a different output, etcetera, etcetera. And you repeat this process over and over again until you reach parity. And usually, in my experience, and also talking to some of our partners in consulting who work on migration projects, roughly we can say it takes 5 to 25 hours per model depending on the complexity of the code base and how it's orchestrated.
But it's a sizable amount of work, and if we multiply that by the complexity of modern data projects, which means probably thousands of these assets, we talk about, you know, 5 to 10 thousand plus hours of work to do the migration for a human. Now the agent closely replicates that workflow, but in a way more scalable way. Essentially, it takes the legacy code as an input, as well as the data produced by that code in the legacy system in a sampled manner, and then it translates the code, gets the new code, runs it on the new system, and then validates the output of that code against the legacy system. And it also takes into account any errors and discrepancies. And if there are any errors and discrepancies, those are fed back into the translation engine for correcting the translation process. And it runs over and over again until it reaches parity or until we see that it doesn't converge on getting more correct. And, basically, as I mentioned, it closely replicates the process that a human developer would do. But because it's a software agent, it can be scaled and run in parallel and run 24/7. And so the compression of timelines that we see for these projects: something that would take 5 to 10 thousand hours for a team of data engineers can be compressed essentially into single weeks, where the process is completely run by a software system with basically a human in the loop, but not really doing any work for the translation itself. So it could be, you know, a 10-plus-times speedup of the process.
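To make that loop concrete, here is a minimal sketch in Python of the translate-run-reconcile cycle described above. The three callables are hypothetical placeholders standing in for an LLM translator, a runner on the target engine, and a sampled cross-database diff; they are not Datafold's actual internals.

```python
# Simplified sketch of the per-asset translate -> run -> reconcile loop described above.
# The callables passed in are hypothetical placeholders, not Datafold internals.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    error: Optional[str] = None   # compile/runtime error from the target engine, if any
    output: Optional[object] = None  # sampled result set produced by the candidate query


def migrate_asset(
    legacy_sql: str,
    translate: Callable[[str, list], str],          # LLM translation, guided by accumulated feedback
    run_on_target: Callable[[str], RunResult],      # execute candidate SQL on the new engine (sampled)
    diff_against_legacy: Callable[[object], list],  # reconcile candidate output vs. legacy output
    max_attempts: int = 5,
) -> str:
    feedback: list = []
    for _ in range(max_attempts):
        candidate = translate(legacy_sql, feedback)
        result = run_on_target(candidate)
        if result.error:                            # syntax or runtime failure: feed the error back
            feedback.append(f"runtime error: {result.error}")
            continue
        discrepancies = diff_against_legacy(result.output)
        if not discrepancies:                       # parity on sampled data: this asset is done
            return candidate
        feedback.extend(discrepancies)              # otherwise feed the data diff back and retry
    raise RuntimeError("did not converge; escalate to a human in the loop")
```

The key design point is that both runtime errors and data discrepancies flow back into the next translation attempt, which is what distinguishes this loop from a one-shot translation.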
[00:27:25] Tobias Macey:
As far as the model specifics too, I'm wondering if you're able to get away with a smaller model because of the constrained space of the problem or if you're using one of the larger models that's out there and just some of the ways that you're thinking about the performance versus cost versus engineering effort? When we started on this project, we had 3 constraints. One, we needed a model that was very good at generating code. We also needed a model that had a large enough context window because sometimes the code that we're operating on can be very long. I wrote
[00:27:58] Gleb Mezhanskiy:
1,000-line SQL queries in my career, and they definitely still exist. And not only do we need to provide the query itself, but we need to provide the context as well as the errors-and-discrepancies context that I mentioned. And the third requirement was privacy and security. We work with many customers that operate in sensitive verticals such as financial services, health care, and others, where they pay extremely close attention to the privacy and security of their data, and it is not acceptable for them for their data or their code to be leveraged by someone else to train models, or to risk that data or code leaking somewhere.
And having those three constraints, we evaluated a number of models. And in our experience, the better performing models in terms of code generation and overall performance were GPT-4 and Claude 3.5 Sonnet. And we chose Claude because of the privacy and security concerns. So that model was very good at performing the translation and allowed us to satisfy our customers' very strict security and privacy concerns.
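For orientation only, a bare-bones translation request to Claude 3.5 Sonnet via Anthropic's Python SDK might look like the sketch below. The prompt wording, the model string, and the idea of pasting prior error feedback into the prompt are illustrative assumptions; Datafold's actual prompting and context handling are not described in the episode.

```python
# Hedged sketch: a minimal dialect-translation request to Claude 3.5 Sonnet.
# Prompt text, model string, and token limit are illustrative assumptions only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

legacy_sql = "SELECT DATEADD(day, 7, created_at) AS renewal_date FROM subscriptions"
prior_feedback = "Target engine reported a function signature mismatch on the previous attempt."

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": (
            "Translate this Redshift SQL to Snowflake SQL. Return only SQL.\n"
            f"Feedback from the previous attempt: {prior_feedback}\n\n{legacy_sql}"
        ),
    }],
)
print(response.content[0].text)
```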
[00:29:10] Tobias Macey:
And so as far as the outcome of building this agent, you've got it running. You've provided it to your customers. I'm wondering if you can talk to some of the ways that you're thinking about the return on investment for building this new capability, the cost of running the system, the cost of marketing it, etcetera?
[00:29:30] Gleb Mezhanskiy:
Absolutely. I'd say we're still early in the process of seeing migrations being automated. But just in the projects that we've already completed, we're seeing tremendous acceleration. So as an example, we recently helped a data team perform a migration from Redshift to Snowflake. And on both systems, the code has to be packaged in dbt, which is also challenging because, essentially, we're taking SQL with a DSL on top and need to translate that into a different dialect of SQL with a DSL on top. And the scale of the project was about 600 models, and our mutual estimate with the customer was that it would take them at least 2,400 to 3,000 hours to complete this project manually. So we're talking about at least an engineer-year here to perform this migration by hand. And they already considered leveraging some of the open source tooling for the translation, but they realized that it's not as easy because of the need to package the code into dbt as well as perform the reconciliation. So even if they applied some of the automation in the code conversion itself, it would still be a very manual process. And the acceleration that we were able to achieve with the migration agent is essentially getting to 90% parity, in terms of translated, validated code that produces the same output in both systems, within a couple of weeks. And then the remaining 10% of queries, a lot of which were highly specific to data ops operations and needed to be basically looked at by a human to provide the correct input for the translation process, those took another couple of weeks to complete. So, essentially, we're talking about a project that would take a year or over a year, and doing this in a month. So that is a huge acceleration. And that is just the time acceleration, but there's also a big cost component, because having these projects done by humans is very expensive, both in terms of the direct cost of, you know, paying highly trained data engineers to do this work, but also opportunity cost. Because if we think about this, doing this project for a year means that that engineer or engineers do not work on other things that can be directly impactful to the business. We're essentially putting them on this technical debt project. And so the indirect cost of doing a migration manually could be even multiple times higher than the direct cost of their hourly work. So I think that, overall, migrations in the future will be fully automated, and I actually compare it to washing clothes. Right? So, yes, we can wash clothes by hand today, but it is hard, and there is just no reason to do it when we can use a washing machine. And that is a huge unlock, both for our quality of life and for the opportunity cost in terms of what we can do with our time. And I'm incredibly bullish on this approach scaling to all sorts of migrations.
And I would say probably in a few years, no one will be doing migrations by hand. On the
[00:32:22] Tobias Macey:
validation side, you mentioned that you've got your reconciliation capabilities. You have Datadiff, which is very well engineered for focusing on that piece of it specifically. And I'm curious how you have combined those two systems to be able to help build up confidence in the final product as well as maybe some of the types of errors that you're able to catch because you have that reconciliation loop as the guard against invalid outputs from the LLM.
[00:32:50] Gleb Mezhanskiy:
Yeah. Absolutely. So the data diff is a really critical part of the system. Like I mentioned, large language models are good at generating code, but there's no guarantee that that code is correct, in the sense that there's no guarantee that it actually runs. Sometimes these models invent functions that don't exist, and there's no guarantee that it produces the same output, which is ultimately what we're shooting for. And so having Datadiff, the ability to compare the outputs across engines and do this at scale and fast and feed that context back into the translator, is absolutely critical for the agent to work correctly. But the actual translation, as I mentioned before, is not the end of the process, because even after you have the code base that is already translated and validated based on the sample data, that code base still needs to be deployed and run at scale. And that is typically something that our customers do themselves. Once we hand them over the translated code base that's already been pre-validated based on sample data, they would actually deploy it, put it into a repo, run it on their, you know, schedule at scale. So now they have this large project of tables, views, whatever the output is, running in production. And at that point, they can use the cross-database diffing to compare the legacy system versus the new system at full scale, production to production, for some period of time to get full confidence and demonstrate that confidence to their stakeholders that the systems are at full parity. And the reason why this step is important is because even when you've converted your code base and now you have your new DAG of models running on the new system, you can't call it a day, because likely you have a lot of business dependencies, BI dashboards, reports, machine learning models, one-off Excel exports that were running on the legacy system that now need to be switched over. So you need that grace period for the business to transition. And during that period, guaranteeing that you have full consistency is really critical to the process, because you can't tell stakeholders like, hey, can you switch this KPI report that was running on this old system to our new warehouse? And then have them find out or wonder whether the numbers have changed because of that or not. Right? So providing that consistent parity report with Datadiff is really important for the final stage of business acceptance in your migration project. And as far as the
[00:35:10] Tobias Macey:
experience of building the system, working with your customers to put it into production and actually execute some of these migrations, I'm wondering what are some of the interesting or innovative or unexpected ways either that you've seen the system perform or that you've seen people apply these capabilities.
[00:35:27] Gleb Mezhanskiy:
I would say that, in terms of building the system, first of all, it's a different paradigm of building software, because before large language models, most of the software we wrote was deterministic. We ran, you know, we did something, we tested the data. Even when we ran a data diff, it produced outputs. And with using large language models, the output of your software system is stochastic, and that changes the paradigm for you as the software developer in terms of how you handle inputs and outputs and observability and how you handle testing and QA and releases and even write documentation for the end users. So it's definitely been a very interesting paradigm shift and learning experience for our team internally building that agent. And then in terms of the overall experience, I would say the process of doing a migration is relatively simple. It's hard and tedious, but it's simple. And so the algorithm that the agent uses is also simple, but the actual implementation is incredibly sophisticated, because at scale you deal with basically lots and lots of edge cases and the long tail of weird things or quirky things or system-specific things that require special handling. And so the biggest challenge here is making sure that you build a system that is generally applicable to a wide range of inputs in terms of dialects and frameworks and patterns that the teams use. And you constantly have to balance the configurability with automation, because none of these conversion projects can be run completely without the human in the loop. But at the same time, the value that you provide directly depends on the degree of automation that you can achieve. So we developed what I think is a very interesting and powerful configuration framework where the human in the loop of this migration process can provide certain inputs and instructions to the process that make the translation by the agent more successful. And so it's still automated, still runs very fast, but this configurability is essential to successful execution on these migration projects.
[00:37:29] Tobias Macey:
And in your experience of building this system, leveraging AI for these capabilities, coming to grips with which models to use, how they operate, what are their failure modes? What are some of the most interesting or challenging or unexpected lessons that you learned personally?
[00:37:46] Gleb Mezhanskiy:
I would say it's not surprising, but it's very easy to get to something that seems like it's working and extremely hard to get to a system that works well at scale. And it's especially hard for, like I mentioned, the stochastic systems that are based on language models, where the output of the system is, you know, subject to a random distribution. And that's been both the most interesting and the most challenging part of the project. I would say the other unexpected aspect of working on the migration agent was that we started this thinking that we would only be working with SQL. But we quickly realized, mostly from just the feedback of our customers, that this approach scales well to all sorts of migrations, including migrations from and between BI systems. For example, we are working with a customer that has a very large Looker project. Looker is a BI tool that has its own DSL called LookML.
And with any DSL in data, usually, you encode a lot of business logic and engine-specific syntax within it. And so one of the biggest hurdles for their migration was actually porting over a thousand Looker views from one dialect to another. And we also initially were quite challenged by this, but then we quickly realized that the same principle that the data migration agent operates on can be applied to these types of migrations as well, because, ultimately, most of the data DSLs take a dataset as an input, apply some code, and produce a dataset on the other end, even if it's a BI tool. And so that was an interesting realization. And then we had customers who were interested in migrating from GUI systems for defining transformations, like, you know, Informatica, MicroStrategy, Talend, and others, into software-first frameworks like dbt, where a lot of the times there is no code exposed to the user at the source system, or that code is completely unreadable by a human. You only operate with whatever blocks and squares and functions that you drag and drop in the UI. But it turns out that those types of migrations are also perfect candidates for the migration agent, and leveraging the large language model allows us to generalize very, very well for various types of inputs and outputs. So it's definitely been a very exciting and intellectually stimulating project, working on all sorts of different data systems. But I'm sure that as we tap more and more into various migrations, there'll be more insights that we'll unlock, especially as we go into some of the older mainframe systems where we'll see some kind of archaeological code discoveries that,
[00:40:27] Tobias Macey:
will be quite novel. In terms of people who are starting down the path of some sort of migration project, what are the cases where you would say that they shouldn't use your migration agent and they should go down that manual route or some other automation path? Absolutely. I would say the data migration agent
[00:40:46] Gleb Mezhanskiy:
works really well for a lift and shift type of migration where you define success as having parity between the new system and the old system. And, obviously, there are occasional changes. Like, sometimes you need to change the schema. Sometimes you need to, like I mentioned, apply different performance optimizations, but you're not trying to change the interface to your data during the migration. And you can always do this after the fact, but for the migration itself, this is not the goal. If you are tackling rearchitecture as part of the migration project and you say, well, we just don't like the data model, we have to change it, we have to go from this Wild West to a dimensional model or data vault, or, you know, we need to change all the definitions of how our KPIs are defined, then it becomes much harder to reach parity, and maybe parity is just not the objective you're optimizing for. Maybe you optimize for the new definitions that you create. And in this case, by definition, you'll be applying a lot of manual interventions and manual conversion. And DMA, meaning the data migration agent, may still be helpful for parts of this migration where you actually have parity. Maybe your, you know, bronze and silver layers stay the same, but your gold layer is where you apply those rearchitecture changes, or vice versa. Maybe you really wanna change the way that the structuring of the raw sources works. Maybe you change the sources, but you still wanna have the same interface for the gold layer. You can selectively apply the migration agent to deal with moving the assets where you actually have parity as the objective function.
[00:42:24] Tobias Macey:
And as you continue to build and invest in this agent, you start to expand the types of migrations and transformations that you're able to support. I'm wondering if you can talk to what you have planned for the near to medium term and any particular capabilities or projects you're excited to dig into there. Yeah. I would say the overall goal for us with the migration agent is to strive for
[00:42:48] Gleb Mezhanskiy:
compressing the timelines even further. Like I mentioned, compressing a year into weeks. So this is already great. But, obviously, now that we're doing many of these projects, we are very interested in running this even faster, and we'll probably be investing a lot into creating an even more powerful interface for the human in the loop to be directing this process, so that it scales even further. And I think the other frontier we know we wanna tackle is how we can tackle performance together with the migration itself. Because for a lot of data teams, performance and cost efficiency is top of mind in the current environment. And a lot of times, when you're moving from a legacy system that's expensive to a scalable data warehouse, they can incur unexpected cost just because, okay, you can run everything and you can collect all the data you want, but it actually turns out pretty expensive. And so very quickly, they realize the need to optimize performance. And because we have a really great insight into the code and we also have this very important forcing function, which is the parity, we can work on optimizing the performance of the output of the migration process. So not only give you the code that reaches parity, but also give you the code that runs as efficiently as possible on the new engine. And I think that it doesn't stop there, because, again, with a system that really understands the code semantically and is able to guarantee quality, you can take a step further and maybe even simplify some of the code from a semantic standpoint and maybe turn a transformation chain that took 10 steps, 10 tables, and, you know, 5,000 lines of code into something way simpler that is more manageable for data engineers to handle down the road while still maintaining parity of the output and consistency of the data interface for the data users. So I think that's kind of where this is moving to, and I'm very optimistic about the abilities of modern technologies, and Datafold in particular, to really help data teams get the most out of not only the migration projects but their code bases. Are there any other aspects of the work that you're doing on the data migration agent or the overall problem space that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think that I'm not sure exactly how the future looks, but I think that we will be seeing way more data liquidity in the data team workflow in terms of teams moving workloads around the engines, the lakes, the formats all the time, and the portability of workflows.
We tried to tackle that by developing abstractions. So there were multiple attempts to develop abstractions over SQL that could be compiled into different engines. They haven't really taken off as we thought they would. They're still very well engineered, but we just haven't seen that pattern be adopted. But still, I think there's a big demand and a big trend in terms of making the workloads portable. So I'm really excited about making a dent here and helping data teams get the most out of the infrastructure
[00:45:54] Tobias Macey:
by reducing the vendor lock-in, reducing the technical debt, and being able to leverage the best technologies they can get for their work. Alright. Well, for anybody who wants to get in touch with you, follow along with the work that you're doing, or try out the agent, I'll have you add your preferred contact information to the show notes. And as the typical final question,
[00:46:14] Gleb Mezhanskiy:
I'm interested to see how your perspective on this has changed, but I'm wondering what you see as being the biggest gap in the tooling or technology that's available for data management today. I think there are a lot of exciting technologies out there. I think that data teams and data engineers are still operating at a fairly low level of abstraction. We still worry about a lot of the basic things, and I'm not sure exactly what the future looks like, but what I'm very excited about is all of the data professionals, and especially data engineers, moving to higher levels of abstraction where, just like with software, most software engineers these days don't need to worry about, you know, memory allocation or other low-level concepts that they had to deal with, you know, 10, 20 years ago, unless they really choose to, right, for their task. And I think that we'll see the same for data, where we'll spend most of our time thinking about the business problem and semantics and human collaboration versus dealing with these low-level problems. And migration is one of those aspects where I think we can definitely steer people away from doing, like, low-level work to doing more productive high-level work. There's definitely more to do in the actual day-to-day workflow in terms of creating code and maintaining code and writing it, and I'm already seeing a lot of these advances in IDEs all over the place. They're already very good at helping you write SQL. And I think probably in 5 years, even sooner, maybe 3 to 5 years, the workflow of a data engineer will be very different from what we are used to now. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on the data migration agent. It's definitely a very interesting product and interesting capability. It's great to see how you're helping to
[00:47:52] Tobias Macey:
let teams move faster and experiment with more of these different technology stacks. So I appreciate the time and energy that you've put into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts, and tell your friends and colleagues.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time right at the source. Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production?
Learn more at dataengineeringpodcast.com/datafold today. Your host is Tobias Macey, and today I'm welcoming back Gleb Mezhanskiy to talk about Datafold's experience bringing AI to bear on the problem of migrating your data stack. So, Gleb, for folks who haven't heard you previously, can you give a refresher on some of your background and how you get started in data? Yeah. Absolutely.
[00:01:06] Gleb Mezhanskiy:
Great to be back, Tobias. Thanks for hosting me. Yeah. I am Gleb, CEO and cofounder of DataFold. DataFold is a data quality platform. We build tools to help data teams manage various problems around data quality and data migrations being one of them. I started in data over 10 years ago and as a data practitioner, first as a data engineer, but also were heads of data scientist and data PM. And it so happened that in my career as a hands on data practitioner, I got to build and scale 3 data platforms at 3 very different companies. One was Autodesk, obviously, very large company corporation.
One was Lyft, where I joined as one of the first data team members. And what was very interesting about that experience is, we started as a very small data team, and we had to scale it from about 10 people to over 200 and built lots of tools and advanced processes just to support that growth. And then I also built data platform from scratch for a series a company called Phantom Auto, and I think that gave me a very interesting perspective of the challenges of building data platforms and scaling them at different stages of companies.
But, ironically, also, a lot of the problems that I saw firsthand just solving as data partitioner were the same, And, that was the big inspiration for starting DataFold to help data teams regardless of the size of data and the size of team to help solve data quality issues.
[00:02:42] Tobias Macey:
And in that experience, in your career and in your work at DataFold, what is it about the problem of data migrations that caught your attention and made you decide to invest some time and energy into trying to make them better and less painful?
[00:02:57] Gleb Mezhanskiy:
Yeah. So I personally had a big, I would say, failure experience and a very steep learning experience with data migration, and that happened when, I was at Lyft. So soon after I joined the data team, we decided to migrate from a warehouse at the time based on Amazon Redshift to a data lake that would be supported by multiple engines, including Spark, Trino, and Hive at the time. And it was a 2 petabyte, roughly, scale migration, and I volunteered to be kind of the main architect and project manager behind the whole migration effort. So as a young and ambitious data engineer, I thought it was a really exciting, impactful, high visibility project I could I could tackle to grow in my career. And it was extremely fruitful in terms of learnings, and it was also extremely painful. And in general, migrations are hard because usually by the time that you decide to migrate, you already are in pain as a data team because you reach the ceiling of the scalability of the current system.
You've probably accumulated a lot of technical debt in terms of the code that's written. You have a lot of dependencies built by the business on the legacy system. And a lot of the success in migration projects depends on how you approach it. I made 2 big mistakes leading that migration: one of the mistakes was related to the process and one was related to tooling, which I'm happy to talk about. Basically, long story short, I spent probably about two and a half years doing the migration and made probably 10% of the progress that I would want to make and that the business wanted me to make. I think, overall, it took about 5 years for this project to complete, which was originally scheduled to complete within probably about 9 months. And so having that scar, I was really excited to work on tooling to improve migrations for other teams and also help build the framework for how to address migrations better, not just from the tooling perspective, but from the process perspective as well. That's kind of my personal mission: I wanna help get migrations right, learning from past mistakes, but also, you know, I'm running a company that's focused on building data quality tools, and I think migrations are a big opportunity to help data teams move faster and improve data quality. Because if you are on a legacy system, or if you are stuck trying to move from a legacy system to a new system, be that a better warehouse, a faster, more scalable data lake, or a better orchestration framework, there are different types of migrations, it's really hard to be successful at other things because you are distracted as a team. You have to run two systems.
A lot of your resources and attention go into the migration and not into building data products and not into, you know, advancing the business and building exciting things like BI and AI. And so there's a huge cost to running migrations and really a lot of value that you can create if you can help teams move faster with those projects. And as far as the scope of what types of migrations we're talking about, you mentioned data warehouses or query engines. You mentioned some of the tooling and orchestration.
[00:06:14] Tobias Macey:
In that overall process of migrations, you mentioned going from legacy systems to newer architectures or faster warehouses. What do you see as the general motivation for deciding to undertake this process, since it can be so fraught with the peril of accidentally breaking things that have been working? What are some of the potential gains that companies are theorizing they're going to attain, some of the ways that you've actually seen those come to fruition, and some of the mistaken gains that they think they're going to get but that aren't really feasible when they get to the other side of that process? It's a great question. I think it ultimately comes down to why does a data team exist for a given business, and what does the business need from the data team. And, typically,
[00:07:03] Gleb Mezhanskiy:
the output of data teams is insights and data products. And data products can take various forms. It can be dashboards. It can be self-serve analytical experiences. It could be automation in the form of machine learning, AI, and just operationalizing data. And, typically, the big motivator for doing a migration is that the team feels that they can no longer sustain and deliver the products and insights that the business needs because of the limitations of the infrastructure they have in place. The limitations could be related to scale. So in order to provide the analytics and products that the business needs, we need to integrate more data sources from various systems as well as, you know, logs and events, and our current system just cannot handle all of the raw data that we need to accumulate. That's one. Or we cannot serve all the analytical workloads, queries from BI, and transformation loads that we need to run on the current system. For example, for us back at Lyft, those were the 2 primary motivators. We could not fit all the data we needed in the warehouse we had, and we could not concurrently run the query loads that the growing business required. There are also two other big reasons. One of them is cost. A lot of the time, legacy systems have way lower ROI in terms of the scale and throughput of analytical workloads that they provide relative to some of the more modern systems that were architected later and are way more efficient. And the fourth one would be interoperability.
So if you operate on a legacy stack, a lot of the time you can't take full advantage of the data ecosystem. If you have a really old database or, you know, sometimes we even see mainframe databases, you just can't leverage the latest BI tools or the other analytical tools and infrastructure that would actually help you build what the business needs. And so the fact that you're on a legacy system compounds into being way behind in terms of the ability to deliver. And the challenge is that a lot of the time teams choose to migrate kind of too late, because the pain becomes unbearable, and that means that there's a lot to tackle while doing a migration. And the other thing I'll say in terms of the motivations for migrating, and that's a trend that I've started seeing probably over the past couple of years, is that before, maybe in the last 5, 10 years, migrations were kind of 0 to 1, take everything and move everything and shut down the old system type of projects. Today, with the commoditization of compute, right, we have great warehousing technologies, and there are pros and cons of using different vendors and different clouds, and a lot of teams operate in multi cloud environments. There's actually an emerging need to move workloads freely between engines depending on what engine provides the best compute for a particular type of workload. Like, for example, maybe some engines are better at ingest, but some are better at serving BI queries, and some are really well positioned to help with geospatial analysis or machine learning. And so moving the workloads around can help you unlock the best performance, the best cost. But that means you kind of have to do these migrations of your workloads all the time, and that still is challenging. I would say that is another motivator for doing data migrations.
[00:10:34] Tobias Macey:
In terms of the time sinks involved in the process of doing this migration, you mentioned that in your experience of doing the migration at Lyft, you spent the 9 months that you thought it was going to take and maybe got 10% of the way there. From your experience of going through that process and also working with teams at Datafold as customers who you are trying to help work through that process, what are the aspects of the migration that end up taking so much time or moving slower than they would anticipate?
[00:11:05] Gleb Mezhanskiy:
Yeah. Absolutely. So I would say, to describe the time sinks, it's important to be clear about what the success criteria for a migration are. If we take the example of moving from a legacy system to, let's say, a new warehouse, or moving from a legacy transformation framework to a new transformation framework that runs on top of the warehouse, and sometimes doing both at the same time, usually the success of a migration project is defined by: we completely moved our analytical workloads to the new system, and we shut down the old system. And if we walk back from that, what do we need for that to happen? Well, we need the business to fully adopt the new system. And what do we need for that to happen? Well, we need the business to trust the new system, the data that is provided there, and the numbers. And then what do we need to achieve that trust? It usually comes down to having parity. Right? So demonstrating that whatever KPIs and reports and datasets and feature sets we were producing on the legacy system are now fully replicated and rebuilt from scratch on the new system, and there is parity. There is nothing, you know, that's changed. And that is where I personally made a mistake leading a migration.
I talked about the process mistake, and the biggest process mistake is if you don't aim for parity and you try to improve things in the process, especially if improvement means changing the data model and naming conventions and definitions, which is very, very tempting, because typically when you do a migration you've already accumulated sometimes decades' worth of technical debt and spaghetti code and duplicate sources of truth and definitions. It's very tempting to fix that while you're doing a migration. And the biggest problem is that if you do that, then you can't simply say, oh, we have parity between old and new. You're saying new is better, and better is different, and different means that you have to drive a lot of consensus with the users of data. You have to ensure that they buy into learning new definitions, and they agree that the definitions are correct, and they are fine using a new schema and new structure. And that is the biggest killer for migration projects in terms of the timeline, because instead of actually doing the work, you end up having conversations and driving consensus. That was my first big mistake. So the biggest learning is: do lift and shift. Meaning, take everything as is and make the least possible effort and the fewest possible changes to get the whole system running on the new platform. And that doesn't mean you wouldn't maybe fix some, you know, things here and there and maybe rewrite some code to run more efficiently on the new system, but it means that you don't try to change the interface that the data users have. Because otherwise, it's extremely hard to drive consensus and achieve parity and ultimately declare success. And so if we agree that, okay, this is the definition of success for a migration project, then the typical workflow and the path for getting there usually consists of 5 stages. The first one is planning. So we try to understand what assets we have, tables, views, SQL queries, that we need to move over. And that, in my experience, takes maybe up to 5% of the entire project. It's basically accounting for everything and sequencing the project. Then you also need to move the data. And moving the data is usually about moving the source or raw data, because we ultimately need to recompute everything from scratch, all the derivative transformations, on the new system. So we have to make sure that there's parity in the raw data. All events, replicas from business systems, from transactional systems, and vendors, we have to have parity in the raw input data. And that, in my experience, is about 10% of the effort. It is definitely a significant project, but usually it's not that hard, because a lot of the time you have certain firehoses like your event streams, and you can simply point them to the new system and have your data lake be populated with the same data. And then you start the hard part, which is the translation of your business logic, and not just the translation, but also reconciliation of the outputs, making sure that you can run the new code on the new system. You take, let's say, stored procedures running on MySQL, and you move them over to a new warehouse, and you have to make sure that the code runs, and not only runs but produces the same output.
And that's where, basically, in the translation and the reconciliation stage, I would say teams spend about 80% of the time and effort, because you have to repeat that almost like a for loop for every asset that you have, translation, reconciliation, and you have to repeat it over and over again until you reach parity. And that is the hardest part. Once you have that and you've completed this process for every asset, then you are at the final stage, which is getting sign-off from the business and getting them to adopt the new data. And sometimes that means sending emails, sharing reports about the migration progress, switching sources from legacy to new, and then just getting a thumbs up from whoever your data user is. Sometimes it's the data science team. Sometimes it's an operational team. But making sure that they are fully comfortable and have officially signed off on using the new system. An interesting aspect too of data migrations,
[00:16:29] Tobias Macey:
once somebody is using a data lake style environment, is that so many of the other pieces become, not quite interchangeable, but you don't have to replace everything wholesale. You can replace the query engine or you can replace the table format, and that requires a different approach to what it actually means to move the data or to translate the data. And I'm wondering how you've seen that factor into the ways that teams are addressing the new architecture that they're targeting. I see that as being one of the main motivators for the lakehouse style architecture that has been gaining a lot of ground, and just some of the ways that teams are thinking about that composability of data stacks, which is one of the features of the so-called modern data stack, or the way that people are starting to build systems today as compared to the early 2000s?
[00:17:18] Gleb Mezhanskiy:
Yeah. I think that composability is great, and it enables teams to have better ownership and control of their data and bring whatever are the better ways to transform and operate on this data. And that's why I think we're also seeing some of the major warehousing players like Snowflake embrace and adopt this paradigm, right, investing in supporting the open table formats, whereas previously they were more of a closed ecosystem. So there's definitely a trend toward, like, a more open lake architecture. And I think that actually makes migrations in a way more ubiquitous, because you move data between formats, or you have the same format but then you want to swap the engines that operate on the same lake. And that just means you are doing these migrations, you know, maybe more incrementally, but more frequently within your workflow. And no matter what you do, maybe you just move data from one format to another, or, we actually have some customers that are leveraging the same vendor for transformations but they are swapping the underlying cloud provider, like, for example, moving from Azure to Google Cloud and vice versa. In these cases, no matter what you do, no matter what you migrate, you still, as a success criterion, need to prove to the business that you have the same data once you've completed the migration. And that's kind of both easy and hard at the same time, because no matter what, you cannot escape the user acceptance testing as part of the migration project.
[00:18:47] Tobias Macey:
And so to reduce the time and effort burden on teams who are undergoing these projects, you've invested in building a migration agent at DataFold, bringing to bear the capabilities of LLMs, because that is what is driving a majority of the new development in the technology ecosystem today. And I'm wondering if you can talk to some of the core objectives that you had going into this project, some of the ways that you approached the design, and how you decided whether and when it was ready to actually unleash on people's data stacks given their sensitivity
[00:19:25] Gleb Mezhanskiy:
to letting anything touch their data? Yeah. Absolutely. So we've been supporting teams doing migrations for a few years now. And as I mentioned, the biggest time sinks are the translation of the code and the reconciliation. And for a few years, we've been providing reconciliation as a tool called Datadiff that teams could leverage to compare data between the legacy system and the new system at scale, understand the discrepancies, and address them. And Datadiff overall turned out to be a Swiss army knife for data quality, because a lot of data quality workflows are about understanding differences in data, whether between staging and production to understand the impacts of deployments of new code changes, comparing data today versus yesterday to understand drift in the data, comparing data source to target when you do CDC replication and trying to see if you have consistency, and, in the context of migration, comparing legacy data to new data. Some teams, including Eventbrite, which we talked about with Rob, I think, a few episodes ago on your show, leveraged DataFold for that reconciliation piece in their migration project. And the translation piece the teams have been doing themselves, mostly relying on manual work. And we saw a lot of opportunity to solve that piece as well, because we knew that reconciliation takes a lot of time, but translation is also very hard. And we have also worked on dealing with analytical code and SQL code specifically for years, and we've been leveraging and extending compilers, both open source and in-house, in order to support column level lineage. So we've actually been very deep in semantic code analysis for years as a company, but we never saw that technology, that approach for dealing with code, as good enough to support migrations at scale. And the biggest reason there is that the, I would say, previous generation of parsers, including some of the great open source projects like SQLGlot and Lark, rely on static predefined grammars. So they parse SQL code into an abstract syntax tree, and then they apply a grammar to understand what each element of the syntax tree is doing and then convert that into the corresponding syntax of the target dialect. And that is a very deterministic approach. So when it works, it works great, but it requires you to predefine grammars for every dialect you try to convert from and every dialect you try to convert to. And to make things even more complex, a lot of the time teams don't just migrate from one SQL dialect to another, but they also migrate between frameworks. Like, for example, someone may be using an orchestration framework like Airflow, and then they are migrating to a new framework like dbt.
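To make the deterministic, grammar-based approach concrete, here is a minimal sketch using the open-source SQLGlot library. The query, table name, and the Redshift-to-Snowflake dialect pair are illustrative assumptions rather than anything from a specific migration discussed in the episode.

```python
# Minimal sketch of grammar-based SQL transpilation with the open-source SQLGlot
# library. The query, table, and dialect pair below are illustrative assumptions.
import sqlglot

legacy_sql = """
SELECT user_id,
       DATEDIFF(day, signup_date, GETDATE()) AS days_since_signup
FROM analytics.users
"""

# SQLGlot parses the query into an abstract syntax tree using the Redshift grammar,
# then regenerates equivalent syntax from that tree using the Snowflake grammar.
translated = sqlglot.transpile(legacy_sql, read="redshift", write="snowflake", pretty=True)[0]
print(translated)
```

This illustrates why the approach is powerful when both grammars are predefined, and also why it stops short of the framework-level translation described next, such as rewrapping the query as a dbt model or an Airflow task.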
And so you're not just moving your SQL code and translating the code, but you also need to change the wrapper in which the SQL code is built as part of the migration process. And the open source grammar-based approach for translating just doesn't support those types of translations in a flexible enough manner. Or sometimes you would want to translate from a legacy GUI-based orchestration tool into a code-first orchestration tool, and that is an even harder type of migration to pull off with a grammar-based approach. And I think what happened over the past few years is obviously the huge development and progress made in large language models, which turned out to be very good at generating code, including SQL. And, obviously, you might think, well, LLMs are there, so let's just have them do the migration. Well, it turns out these models are good at generating code, but not necessarily good at generating the correct code. And as we know, for migrations we need to ensure the correctness of the migration project. And that's why we saw the big opportunity to put that new technology together with the ability to do reconciliation at scale, which we've had for years, and turn it into the data migration agent that essentially performs both the translation and reconciliation together in one go, basically replicating the process that a data engineer would do manually, but doing it at scale. And so in terms of
[00:23:49] Tobias Macey:
the system design and architecture of this application, I'm curious if you can talk to some of the ways that you're thinking about the sources of knowledge that you're bringing into the system, the ways that you're applying the LLM, the model selection, and just some of the overall evaluation and testing that you had to go through to build confidence in the agent.
[00:24:11] Gleb Mezhanskiy:
Yeah. Absolutely. So the reason why it's called an agent is because it doesn't only perform a translation, but it is able to start with a legacy code base and get to the converted code base and ensure parity. And that does not necessarily happen in one go, but the agent can improve its output and performance as it proceeds with the migration. If we think about what a data engineer does when they need to migrate a certain piece of transformation logic: they take, let's say, a table that is produced by a SQL query, and they attempt to convert that SQL query into the new dialect, maybe also wrap it into a new wrapper like dbt, Airflow, or Dagster, and then they run it on the new system, and then they see if there are any errors, syntax errors. Does the query even compile? And then once they get that to work, they say, okay, what's the output of that new query? Does it produce the same data? How do we know it produces the same data? We need to compare it. And if we see any discrepancies, then we try to go back to the code and see, okay, where do those discrepancies come from? Am I using the wrong function? Maybe the function in the new system produces a different output, etcetera, etcetera. And you repeat this process over and over again until you reach parity. And usually, in my experience, and also talking to some of our partners in consulting who work on migration projects, roughly we can say it takes 5 to 25 hours per model depending on the complexity of the code base and how it's orchestrated.
But it's a sizable amount of work, and if we multiply that by the complexity of modern data projects, which means probably thousands of these assets, we're talking about, you know, 5 to 10 thousand plus hours of work for a human to do the migration. Now, the agent closely replicates that workflow, but in a way more scalable way. Essentially, it takes the legacy code as an input, as well as the data produced by that code in the legacy system in a sampled manner, and then it translates the code, gets the new code, runs it on the new system, and then validates the output of that code against the legacy system. And it also takes into account any errors and discrepancies. If there are any errors and discrepancies, those are fed back into the translation engine to correct the translation process. And it runs over and over again until it reaches parity or until we see that it doesn't converge on getting more correct. Basically, as I mentioned, it closely replicates the process that a human developer would do. But because it's a software agent, it can be scaled and run in parallel and run 24/7. And so the compression of timelines that we see for these projects is that something that would take 5 to 10 thousand hours for a team of data engineers can be compressed essentially into single weeks, where the process is completely run by the software system with basically a human in the loop but not really doing any of the work for the translation itself. So it could be, you know, a 10-plus-times speedup of the process.
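A rough sketch of that translate, run, and reconcile loop might look like the following Python pseudocode. Every helper here (translate_with_llm, run_on_target, sample_diff) is a hypothetical placeholder for illustration only, not Datafold's actual implementation or API.

```python
# Illustrative sketch of the agent's feedback loop. The helper functions are
# hypothetical placeholders, not Datafold's actual implementation or API.

MAX_ATTEMPTS = 5

def migrate_asset(legacy_sql: str, legacy_sample) -> str:
    """Translate one asset and iterate until its output matches the legacy sample."""
    feedback = ""
    for attempt in range(MAX_ATTEMPTS):
        candidate = translate_with_llm(legacy_sql, context=feedback)  # LLM translation
        result = run_on_target(candidate)                             # execute on the new engine
        if result.error:                                              # syntax or runtime error
            feedback = f"query failed: {result.error}"
            continue
        discrepancies = sample_diff(legacy_sample, result.rows)       # value-level comparison
        if not discrepancies:                                         # parity reached
            return candidate
        feedback = f"output mismatch: {discrepancies}"                # fed back into the translator
    raise RuntimeError("did not converge; escalate to a human in the loop")
```

The key design point is that compilation errors and data discrepancies are treated as feedback for the next translation attempt rather than as terminal failures, which is what lets the loop converge toward parity, or stop and escalate when it does not.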
[00:27:25] Tobias Macey:
As far as the model specifics too, I'm wondering if you're able to get away with a smaller model because of the constrained space of the problem, or if you're using one of the larger models that's out there, and just some of the ways that you're thinking about the performance versus cost versus engineering effort? When we started on this project, we had a few constraints. One, we needed a model that was very good at generating code. We also needed a model that had a large enough context window, because sometimes the code that we're operating on can be very long. I wrote
[00:27:58] Gleb Mezhanskiy:
1,000-line SQL queries in my career, and they definitely still exist. And not only do we need to provide the query itself, but we also need to provide the context, as well as the errors and discrepancies context that I mentioned. And the third requirement was privacy and security. We work with many customers that operate in sensitive verticals such as financial services, health care, and others, where they pay extremely close attention to the privacy and security of their data, and it is not acceptable for them for their data or their code to be leveraged by someone else to train models and risk that data or code leaking somewhere.
And having those three constraints, we evaluated a number of models. In our experience, the better performing models in terms of code generation and overall performance were GPT-4 and Claude 3.5 Sonnet. And we chose Claude because of the privacy and security concerns. So that model was very good at performing the translation and allowed us to satisfy our customers' very strict security and privacy concerns.
[00:29:10] Tobias Macey:
And so as far as the outcome of building this agent, you've got it running. You've provided it to your customers. I'm wondering if you can talk to some of the ways that you're thinking about the return on investment for building this new capability, the cost of running the system, the cost of marketing it, etcetera?
[00:29:30] Gleb Mezhanskiy:
Absolutely. So I'd say we're still early in the process of seeing migrations being automated. But just in the projects that we've already completed, we're seeing tremendous acceleration. As an example, we recently helped a data team perform a migration from Redshift to Snowflake. And on both systems, the code has to be packaged in dbt, which is also challenging because, essentially, we're taking SQL with a DSL on top and need to translate that into a different dialect of SQL with a DSL on top. The scale of the project was about 600 models, and our mutual estimate with the customer was that it would take them at least 2,400 to 3,000 hours to complete this project manually. So we're talking about at least one engineer-year here to perform this migration by hand. And they had already considered leveraging some of the open source tooling for the translation, but they realized that it's not as easy because of the need to package the code into dbt as well as perform the reconciliation. So even if they applied some of the automation in the code conversion itself, it would still be a very manual process. And the acceleration that we were able to achieve with the migration agent is essentially getting to 90% parity, in terms of translated, validated code that produces the same output in both systems, within a couple of weeks. And then the remaining 10% of queries, a lot of which were highly specific to data ops and needed to basically be looked at by a human to provide the correct input for the translation process, took another couple of weeks to complete. So, essentially, we're talking about a project that would take a year or over a year being done in a month. That is a huge acceleration. And that is just the time acceleration, but there's also a big cost component, because having these projects done by humans is very expensive, both in terms of the direct cost of, you know, paying highly trained data engineers to do this work, but also opportunity cost. Because if we think about it, doing this project for a year means that that engineer or engineers do not work on other things that can be directly impactful to the business. We're essentially putting them on this technical debt project. And so the indirect cost of doing a migration manually could be even multiple times higher than the direct cost of their hourly work. So I think that, overall, migrations in the future will be fully automated, and I actually compare it to washing clothes. Right? Yes, we can wash clothes by hand today, but it is hard, and there is just no reason to do it when we can use a washing machine. And that is a huge unlock, both for our quality of life and for the opportunity cost in terms of what we can do with our time. And I'm incredibly bullish on this approach scaling to all sorts of migrations.
And I would say probably in a few years, no one will be doing migrations by hand. On the
[00:32:22] Tobias Macey:
validation side, you mentioned that you've got your reconciliation capabilities. You have Datadiff, which is very well engineered for focusing on that piece of it specifically. And I'm curious how you have combined those two systems to be able to help build up confidence in the final product as well as maybe some of the types of errors that you're able to catch because you have that reconciliation loop as the guard against invalid outputs from the LLM.
[00:32:50] Gleb Mezhanskiy:
Yeah. Absolutely. So the data diff is a really critical part of the system. Like I mentioned, large language models are good at generating code, but there's no guarantee that that code is correct, in the sense that there's no guarantee that it actually runs. Sometimes these models invent functions that don't exist, and there's no guarantee that it produces the same output, which is ultimately what we're shooting for. And so having Datadiff, the ability to compare the outputs across engines, do this at scale and fast, and feed that context back into the translator, is absolutely critical for the agent to work correctly. But the actual translation, as I mentioned before, is not the end of the process, because even after you have the code base that is already translated and validated based on the sample data, that code base still needs to be deployed and run at scale. And that is typically something that our customers do themselves. Once we hand over the translated code base that's already been pre-validated based on sample data, they would actually deploy it, put it into a repo, and run it on their, you know, schedule at scale. So now they have this large project of tables, views, whatever the output is, running in production. And at that point, they can use the cross-database diffing to compare the legacy system versus the new system at full scale, production to production, for some period of time to get full confidence and demonstrate that confidence to their stakeholders that the systems are at full parity. And the reason why this step is important is because even when you've converted your code base and now you have your new DAG of models running on the new system, you can't call it a day, because you likely have a lot of business dependencies, BI dashboards, reports, machine learning models, one-off Excel exports, that were running on the legacy system and now need to be switched over. So you need that grace period for the business to transition. And during that period, guaranteeing that you have full consistency is really critical to the process, because you can't tell stakeholders, like, hey, can you switch this KPI report that was running on the old system to our new warehouse, and then have them find out or wonder whether the numbers have shifted because of that or not. Right? So providing that consistent parity report with Datadiff is really important for the final stage of business acceptance in your migration project.
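As a simplified illustration of cross-database reconciliation, a sketch like the following compares row counts and a few aggregates between the legacy and new copies of a table. This is not Datafold's Datadiff implementation, which works at the row and value level; the connection URIs, table, and column names are hypothetical placeholders.

```python
# Simplified cross-database reconciliation sketch: compare row counts and a few
# aggregates between the legacy and new copies of a table. Connection URIs,
# table, and column names are hypothetical; this is not Datafold's Datadiff.
from sqlalchemy import create_engine, text

legacy = create_engine("redshift+psycopg2://user:pass@legacy-host:5439/warehouse")
target = create_engine("snowflake://user:pass@account/analytics/public")

CHECKS = {
    "row_count": "SELECT COUNT(*) FROM {table}",
    "amount_sum": "SELECT SUM(amount) FROM {table}",          # assumed numeric column
    "max_updated_at": "SELECT MAX(updated_at) FROM {table}",  # catches missing late-arriving data
}

def reconcile(table: str) -> dict:
    """Run each check on both engines and return any metrics that disagree."""
    mismatches = {}
    with legacy.connect() as legacy_conn, target.connect() as target_conn:
        for name, query in CHECKS.items():
            legacy_value = legacy_conn.execute(text(query.format(table=table))).scalar()
            target_value = target_conn.execute(text(query.format(table=table))).scalar()
            if legacy_value != target_value:
                mismatches[name] = {"legacy": legacy_value, "target": target_value}
    return mismatches

print(reconcile("analytics.orders"))
```

In practice, aggregate checks like these only signal that something is off; value-level diffing across the full tables, as described above, is what pinpoints which rows and columns diverge during that grace period.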
[00:35:10] Tobias Macey:
And as far as the experience of building the system, working with your customers to put it into production, and actually executing some of these migrations, I'm wondering what are some of the interesting or innovative or unexpected ways either that you've seen the system perform or that you've seen people apply these capabilities.
[00:35:27] Gleb Mezhanskiy:
I would say that, in terms of building the system, first of all, it's a different paradigm of building software, because before large language models, most of the software we wrote was deterministic. We did something, we tested the data, and even when we ran a data diff, it produced deterministic outputs. With large language models, the output of your software system is stochastic, and that changes the paradigm for you as the software developer in terms of how you handle inputs and outputs and observability, how you handle testing and QA and releases, and even how you write documentation for the end users. So it's definitely been a very interesting paradigm shift and learning experience for our team internally building that agent. And then in terms of the overall experience, I would say the process of doing a migration is relatively simple. It's hard and tedious, but it's simple. And so the algorithm that the agent uses is also simple, but the actual implementation is incredibly sophisticated, because at scale you deal with basically lots and lots of edge cases and the long tail of weird things or quirky things or system specific things that require special handling. And so the biggest challenge here is making sure that you build a system that is generally applicable to a wide range of inputs in terms of dialects and frameworks and patterns that the teams use. And you constantly have to balance the configurability with automation, because none of these conversion projects can be run completely without the human in the loop. But at the same time, the value that you provide directly depends on the degree of automation that you can achieve. So we developed what I think is a very interesting and powerful configuration framework where the human in the loop of this migration process can provide certain inputs and instructions to the process that make the translation by the agent more successful. And so it's still automated, it still runs very fast, but this configurability is essential to successful execution on these migration projects.
[00:37:29] Tobias Macey:
And in your experience of building this system, leveraging AI for these capabilities, coming to grips with which models to use, how they operate, what are their failure modes? What are some of the most interesting or challenging or unexpected lessons that you learned personally?
[00:37:46] Gleb Mezhanskiy:
I would say it's not surprising, but it's very easy to get to something that seems like it's working and extremely hard to get to a system that works well at scale. And it's especially hard for, like I mentioned, stochastic systems that are based on language models, where the output of the system is, you know, subject to a random distribution. And that's been both the most interesting and the most challenging part of the project. I would say the other unexpected aspect of working on the migration agent was that we started this thinking that we would only be working with SQL. But we quickly realized, mostly from just the feedback of our customers, that this approach scales well to all sorts of migrations, including migrations from and between BI systems. For example, we are working with a customer that has a very large Looker project. Looker is a BI tool that has its own DSL called LookML.
And with any DSL in data, usually you encode a lot of business logic and engine specific syntax within it. And so one of the biggest hurdles for their migration was actually porting over a thousand Looker views from one dialect to another. We also initially were quite challenged by this, but then we quickly realized that the same principle that the data migration agent operates on can be applied to these types of migrations as well, because, ultimately, most of the data DSLs take a dataset as an input, apply some code, and produce a dataset on the other end, even if it's a BI tool. And so that was an interesting realization. And then we had customers who were interested in migrating from GUI systems for defining transformations, like, you know, Informatica, MicroStrategy, Talend, and others, into software-first frameworks like dbt, where a lot of the time there is no code exposed to the user at the source system, or that code is completely unreadable by a human. You only operate with whatever blocks and squares and functions you drag and drop in the UI. But it turns out that those types of migrations are also perfect candidates for the migration agent, and leveraging the large language model allows us to generalize very, very well for various types of inputs and outputs. So it's definitely been a very exciting and intellectually stimulating project, working on all sorts of different data systems. But I'm sure that as we tap more and more into various migrations, there'll be more insights that we'll unlock, especially as we go into some of the older mainframe systems, where we'll see some kind of archaeological code discoveries that,
[00:40:27] Tobias Macey:
will be quite novel. In terms of people who are starting down the path of some sort of migration project, what are the cases where you would say that they shouldn't use your migration agent and they should go down that manual route or some other automation path? Absolutely. I would say the data migration agent
[00:40:46] Gleb Mezhanskiy:
works really well for a lift and shift type of migration, where you define success as having parity between the new system and the old system. And, obviously, there are occasional changes. Like, sometimes you need to change the schema. Sometimes you need to, like I mentioned, apply different performance optimizations, but you're not trying to change the interface to your data during the migration. You can always do this after the fact, but for the migration itself, this is not the goal. If you are tackling rearchitecture as part of the migration project and you say, well, we just don't like the data model, we have to change it, we have to go from this Wild West to a dimensional model or data vault, or, you know, we need to change all the definitions of how our KPIs are defined, then it becomes much harder to reach parity, and maybe parity is just not the objective you're optimizing for. Maybe you optimize for the new definitions that you create. And in this case, by definition, you'll be applying a lot of manual interventions and manual conversion. And the DMA, meaning the data migration agent, may still be helpful for parts of this migration where you actually have parity. Maybe your, you know, bronze and silver layers stay the same, but your gold layer is where you apply those rearchitecture changes, or vice versa. Maybe you really wanna change the way the structuring of the raw sources works. Maybe you change the sources, but you still wanna have the same interface for the gold layer. You can selectively apply the migration agent to deal with moving the assets where you actually have parity as the objective function.
[00:42:24] Tobias Macey:
And as you continue to build and invest in this agent and start to expand the types of migrations and transformations that you're able to support, I'm wondering if you can talk to what you have planned for the near to medium term and any particular capabilities or projects you're excited to dig into there. Yeah. I would say the overall goal for us with the migration agent is to strive for
[00:42:48] Gleb Mezhanskiy:
compressing the timelines even further. Like I mentioned, compressing a year into weeks. That is already great. But, obviously, now that we're doing many of these projects, we are very interested in running this even faster, and we'll probably be investing a lot into creating an even more powerful interface for the human in the loop to be directing this process, so that it scales even further. And I think the other frontier we know we wanna tackle is how we can tackle performance together with the migration itself. Because for a lot of data teams, performance and cost efficiency is top of mind in the current environment. And a lot of the time, when you're moving from a legacy system that's expensive to a scalable data warehouse, teams can incur unexpected costs, just because, okay, you can run everything and you can collect all the data you want, but it actually turns out pretty expensive. And so very quickly they realize the need to optimize the performance. And because we have really great insight into the code, and we also have this very important forcing function, which is the parity, we can work on optimizing the performance of the output of the migration process. So not only give you the code that reaches parity, but also give you the code that runs as efficiently as possible on the new engine. And I think it doesn't stop there, because, again, with a system that really understands the code semantically and is able to guarantee quality, you can take a step further and maybe even simplify some of the code from the, you know, semantics of the code standpoint, and maybe turn a transformation chain that took 10 steps, 10 tables, and, you know, 5,000 lines of code into something way simpler that is more manageable for data engineers to handle down the road, while still maintaining parity of the output and consistency of the data interface for the data users. So I think that's kind of where this is moving to, and I'm very optimistic about the ability of modern technologies, and DataFold in particular, to really help data teams get the most out of not only their migration projects but their code bases. Are there any other aspects of the work that you're doing on the data migration agent or the overall problem space that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think that I'm not sure exactly how the future looks, but I think that we will be seeing way more data liquidity in the data team workflow, in terms of teams moving workloads around, the engines, the lakes, the formats, all the time, and the portability of workflows.
We tried to tackle that by developing abstractions. There were multiple attempts to develop abstractions over SQL that could be compiled into different engines. They haven't really taken off as we thought they would. They're still very well engineered, but we just haven't seen that pattern be adopted. But still, I think there's big demand and a big trend in terms of making the workloads portable. So I'm really excited about making a dent here and helping data teams get the most out of the infrastructure
[00:45:54] Tobias Macey:
by reducing the vendor lock-in, reducing the technical debt, and being able to leverage the best technologies they can get for their work. Alright. Well, for anybody who wants to get in touch with you, follow along with the work that you're doing, or try out the agent, I'll have you add your preferred contact information to the show notes. And as the typical final question,
[00:46:14] Gleb Mezhanskiy:
I'm interested to see how your perspective on this has changed, but I'm wondering what you see as being the biggest gap in the tooling or technology that's available for data management today. I think there's a lot of exciting technology out there. I think that data teams and data engineers are still operating at a fairly low level of abstraction. We still worry about a lot of the basic things, and I'm not sure exactly what the future looks like, but what I'm very excited about is all of the data professionals, and especially data engineers, moving to higher levels of abstraction, where, just like with software, most software engineers these days don't need to worry about, you know, memory allocation or other low level concepts that they had to deal with, you know, 10, 20 years ago, unless they really choose to, right, for their task. And I think we'll see the same for data, where we'll spend most of our time thinking about the business problem and semantics and human collaboration versus dealing with these low level problems. And migration is one of those aspects where I think we can definitely steer people away from doing, like, low level work to doing more productive high level work. There's definitely more to do in the actual day-to-day workflow in terms of creating code and maintaining code and writing it, and I'm already seeing a lot of these advances in IDEs all over the place. They're already very good at helping you write SQL. And I think probably in 5 years, or even sooner, maybe 3 to 5 years, the workflow of a data engineer will be very different from what we are used to now. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on the data migration agent. It's definitely a very interesting product and an interesting capability. It's great to see how you're helping to
[00:47:52] Tobias Macey:
let teams move faster and experiment with more of these different technology stacks. So I appreciate the time and energy that you've put into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and colleagues.
Introduction to Data Migrations with Gleb Mezhanskiy
Challenges and Lessons in Data Migrations
Motivations and Benefits of Data Migrations
Time Sinks in Data Migration Projects
DataFold's Migration Agent and AI Integration
Model Selection and System Design
Unexpected Insights and Applications
When Not to Use the Migration Agent
Future Plans and Enhancements for the Migration Agent
Closing Thoughts and Future of Data Engineering