Summary
All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, so an eventual migration is inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!
- Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation?
- Is it possible to completely avoid having to invest in a migration?
- What are the signals that point to the need for a migration?
- What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one)
- What are some signals that a migration is not the right solution for a perceived problem?
- Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution?
- What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)?
- What are some of the ways that a migration effort might fail?
- What are the major pitfalls that teams need to be aware of as they work through a data platform migration?
- What are the opportunities for automation during the migration process?
- What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations?
- What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations?
Contact Info
- Gleb
- Rob
- RobGoretsky on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Datafold
- Informatica
- Airflow
- Snowflake
- Redshift
- Eventbrite
- Teradata
- BigQuery
- Trino
- EMR == Elastic MapReduce
- Shadow IT
- Mode Analytics
- Looker
- Sunk Cost Fallacy
- data-diff
- SQLGlot
- [Dagster](https://dagster.io/)
- dbt
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Hex: ![Hex Tech Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zBEUGheK.png) Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at [dataengineeringpodcast.com/hex](https://www.dataengineeringpodcast.com/hex) and get 30 days free!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. Modern data teams are using Hex to 10x their data impact.
Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy to read format to the whole org. In Hex, you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain, and document your code. The best data teams in the world, such as the ones at Notion, AngelList, and Anthropic, use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company.
Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team. Your host is Tobias Macey. And today, I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack. So, Gleb, you've been on a few times, but for folks who haven't listened to any of your past appearances, if you'd like to give a brief introduction.
[00:01:54] Unknown:
Hi, Tobias. Great to be back. My name is Gleb. I am CEO and cofounder of Datafold. We automate data testing. Before starting Datafold, I was a data engineer pretty much my entire career, at companies including Autodesk and Lyft. And the topic of migrations is near and dear to my heart, as I spent 2 years helping Lyft migrate from 1 data warehouse to a more scalable data lake, and there were definitely lots of hard lessons learned in the process.
[00:02:28] Unknown:
And, Rob, how about yourself?
[00:02:30] Unknown:
I've been working in the field of data for over 15 years. I got my start working in big finance, for Lehman Brothers, actually, when I began my career. I began working as a data integration developer, an ETL developer, a data architect. Lots of different titles over the years, but I found my footing and grew in a path working for Major League Baseball Advanced Media for over 15 years, where I got to grow out a data team there and go from being the first hands on ETL developer using graphical tools like Informatica to moving all the way forward to being able to realistically call myself a data engineer working with cloud based data warehouses, orchestration platforms, and the like. So migrations have become a fact of life. It's not something that anybody particularly looks forward to, but if you are at the same company long enough or in this field long enough, you're going to encounter 1 sooner or later.
It's a necessary evil, and it's certainly not the thing that anybody is specifically excited to do, but it is a means to an end. And I have found the end is often where there's the real excitement, when you've moved forward by orders of magnitude. So, thanks for having me on. Excited to chat about it further today. Absolutely.
[00:03:47] Unknown:
And, Gleb, again, for folks who haven't listened to your past appearances, if you wanna give a brief recap on how you first got started working in data.
[00:03:54] Unknown:
Yeah. My first start was back in 2014 when I joined Autodesk's consumer group. At the time, it was a division of Autodesk that was tasked with creating a portfolio of apps targeted at consumers for creative work, such as modeling 3D objects for 3D printing and other exciting things for creativity. And I was pretty much tasked with creating a data platform from scratch for this portfolio of apps. And it was a really exciting time because I believe that was the year when Airflow was just released out of Airbnb as an open source project. And Snowflake, I believe, came out of stealth and we talked to them very early. And Redshift was the hottest technology for warehousing.
So a lot of things that are currently considered extremely foundational were cutting edge at the time. And starting a data platform from scratch, I was able to explore and try out all these different technologies. That was very exciting. And then I moved to Lyft where we really tested a lot of these technologies, but at a very large scale, and quickly learned that we had to find more scalable solutions, which is pretty much how I got into doing data migrations.
[00:05:18] Unknown:
And, Rob, do you remember how you first got started working in data? I think like many things in life, it was probably a bit of happenstance, as it was sort of where I began my career right after college, working on a project that involved some back end purchasing systems but ultimately required data integration to be done, and it was an area that I found myself really enjoying. And so when the opportunity presented itself to start a team at Major League Baseball, I jumped on that, and after 15 years there really building up the muscle for it, I recently got an opportunity to jump on as a principal engineer at Eventbrite. And sort of the thesis that I had was that all of the hard lessons learned over those years doing several migrations, seeing several evolutions of the stack, would allow us to move even more quickly. And jumping into Eventbrite, that's certainly been the case. I've been able to really accelerate the timeline around a migration and really see the effects of what moving to a modern data stack can offer an organization in what feels like record time. So I think the journey has been consistently about kind of knowing that change is constant and then being ready to adapt to it as needed.
[00:06:27] Unknown:
And so bringing us now to the subject of data migrations, a migration can mean anything from I need to rename this 1 column to be something a little bit more understandable, to I need to move petabytes and petabytes of data from 1 architecture to a completely different paradigm, to I need to move all of my batch data into streaming workflows. And I'm wondering if you can just start by describing, for the purposes of this conversation, what constitutes a migration.
[00:06:57] Unknown:
Yeah. I think, like Tobias said, there are multiple flavors and facets of data migration, and I think data teams are constantly moving something or evolving their systems. I think the interesting 1 to explore today, and I know that that's kind of the scope where both Rob and I have spent a lot of time thinking, is when you migrate your data preparation workloads and your analytical consumption from 1 data warehousing system to another, or when you are adopting a new framework for data preparation. For example, you would be migrating from a system like Airflow or Informatica to modern frameworks like dbt, or sometimes you would do both at the same time to modernize your data platform.
I think if we start there, there's definitely plenty of challenges and know-how to discuss just within that particular scope. I don't know, Rob. Would you agree?
[00:08:02] Unknown:
Yeah. I think on sort of defining what is a migration versus what is not, I think you have to consider that you'll be replacing 1 system with another. I think that ultimately when you roll this up to the c suite, they're gonna be looking at this in terms of cost. Right? In terms of, are we able to replace a system that is costing us a certain amount, in either engineering effort to keep the thing alive, which is often the case, or in actual spend with a vendor, and what are we replacing that with. And I feel like a migration is certainly not done until you've cut the cord on whatever the legacy system is. So I think it's defined sort of by the sense that it is not an incremental change so much as it is a wholesale replacement of 1 system with another, usually 1 data platform for another in the case of data warehouse migrations.
[00:08:54] Unknown:
And for people who are considering embarking on 1 of these migration projects, what are some of the signals that would suggest that that is necessary, and what are some of the cases where it is actually possible to avoid having to do that migration and you actually just need to invest a little bit more engineering effort into making your current platform maintainable and sustainable?
[00:09:16] Unknown:
Yeah. I mean, I guess I can jump in here first and say that, as I said before, nobody looks forward to doing a migration. Right? I don't know of any engineers, myself very much included, who say, oh, what I really want to do right now for the next 6, 9, 12, 18 months is a migration project. It is something that is hard to get buy-in for from executives because you're essentially going to be considerably slowing down the velocity of your team during this time frame. So what signals point to it really are about, is there an order of magnitude improvement available since the system that you are currently working on was created? That could be because there's a new emerging technology. For me, it was that we went from having, at 1 point, an on premise Teradata server that sat in a big server room to moving to cloud based data warehouses like BigQuery or Snowflake, where it's a complete paradigm shift in even how you run the platform. It could be that there are major production outages that are becoming so frequent that your team is constantly fighting fires with a system instead of being able to be productive with it. It could be the emergence of something brand new that we haven't even thought about yet. And I think that's what's so interesting is that I've seen this pattern over my career. Basically, every 5 to 7 years, there seems to be this evolution.
And I feel like right now we're in this sweet spot where I don't think any of us can necessarily see past what the world of Snowflake and BigQuery and what's next looks like, but I'm pretty confident there'll be something. I just don't know what it is just yet. Yeah. I think if we also try to kind of come up with particular
[00:10:46] Unknown:
reasons or, like, dimensions of the decision making, what I've seen myself, in my career and also working with other teams, is it typically comes down to probably 3 factors. 1 is scalability. So can your current system support the scale of your organization in terms of the size of data, obviously, the complexity of data, and also the performance around consumption of data. So can, for example, the users query the datasets, and can, you know, something as simple as dashboards get refreshed in a meaningful amount of time. And for us at Lyft, when we embarked on the migration from Redshift to the Hive based data lake, scalability was by far the strongest reason. Back in the day, Redshift was a system that had strong coupling between storage and compute, and that meant that whenever we had to increase the capacity of the cluster to just have more data, we'd have to resize the entire cluster. That was a very scary operation that the data ops team had to perform over the weekend, and that took multiple hours to, like, reshuffle the data between the nodes.
And we got to a point where we were operating 1 of the largest Redshift clusters in the world. And the other facet of scalability where we faced really, really strong pain was around consumption. So, again, back in the day, Redshift had a very concrete limit on how many queries could be executed concurrently. And at the time, Lyft was growing rapidly, and we had massive adoption of data analysis across the company, primarily through tools like Looker and Mode at the time. And what happened was we'd have extremely high surges of queries starting, like, 9 AM when people came into the office and were refreshing the dashboards and reports. And that completely brought the system underwater, to the point where the data team would have days where everyone was like, let's just go get boba because Redshift will not be able to run any queries for another few hours. So that's the scalability part. And then the other, I think, important 1 would be interoperability and the general capabilities of systems. I think Rob mentioned that sometimes you would see a completely new set of use cases and capabilities that a new system would bring, as well as the ability to just have multiple systems interact with your data. For example, with Redshift, it was hard for us to use other systems to query that data because the system was, like, very closed. With a more data lake architecture, or even modern warehouses like Snowflake, that's much, much better. And then the final 1 is cost, but that I think is a very tricky 1 because, in my experience, it's really easy to get misled by how much your system costs, and that could be a really misleading factor if that is your primary driver for migrations.
[00:13:39] Unknown:
Yeah. I wanna plus 1 a little bit on that feature set rationale there. I think that's something that my team certainly saw as we moved from Teradata to BigQuery at Major League Baseball. You know, we did it primarily because of performance. I mean, as you mentioned, scalability again too. Just the ability to not necessarily have to worry about 1 stakeholder running a bad query that brings down the entire system, as is the case with a lot of systems that do couple the storage and compute today. So that was our initial onus for looking at a migration, but then the feature set is really what ended up making it as successful as it was. Meaning that BigQuery offered us certain things.
Most notably, button press data sharing. At MLB we used this to set up a system that we called Wheelhouse, where we basically kept our data pipeline centralized. So we would ingest data from ticketing vendors like Ticketmaster and Tickets.com, bring it all together in 1 place, and then we wanted to share that back out with the 30 MLB Clubs individually. At 1 point, that was a system backed by good old FTP. And as we moved forward we started moving that towards S3 storage, but being able to move to BigQuery meant that we just clicked share on the dataset and they had access to it. And so this was like a paradigm shift in how we would work. There was never gonna be a question anymore of why is my data out of sync with your data, which was such a common problem with the previous system. So, it really was, yes, performance, cost, scalability. But also, like, what are the features that this system is capable of that the other 1, for whatever reason, just is not. And I think data sharing was a big 1 for us on BigQuery.
You know, certain cloud data warehouses now have the ability to do time travel, which again is something that some others may not have. And that's a huge difference when you're able to not think about backups in terms of something that you have to go and ask a database administrator to pull for you, but you can just write a SQL query for. So those features completely change the way you work with the system.
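To make that time travel point concrete, here is a minimal, hypothetical sketch of what a "backup via SQL" query can look like. The table names are made up; the syntax shown is Snowflake's AT clause and BigQuery's FOR SYSTEM_TIME AS OF, and the retention window you can reach back into depends on the warehouse's settings.

```python
# Illustrative only: "time travel" as plain SQL, instead of asking a DBA to restore a backup.
# Table names are made up; retention/offset limits depend on the warehouse configuration.

SNOWFLAKE_RESTORE = """
CREATE TABLE orders_restored AS
SELECT *
FROM orders AT (OFFSET => -86400);  -- the table as it looked 24 hours ago
"""

BIGQUERY_POINT_IN_TIME = """
SELECT *
FROM `analytics.orders`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);
"""
```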
[00:15:36] Unknown:
And going back to 1 of the things that you said, Rob, about how you don't want to bother with doing a migration unless you're going to get an order of magnitude improvement. You listed some of the types of features that will contribute to that measurement, but often it's not possible to fully comprehend the impact of the migration until after it's complete. And I'm wondering what are some of the ways that you can do some early signals gathering to understand, 1, whether your intended target is actually going to have the impact that you want it to, and 2, what the actual overall cost difference is going to be. That's often the hardest 1 to figure out. And just some of that early work that can be done to prove out that, yes, this is a good idea, and get some buy-in from the people who need to actually do the approvals.
[00:16:20] Unknown:
Yeah. I think it's about finding the right proof of concept use cases early. And I mean, the challenge there is identifying a properly representative sample of workloads. At Eventbrite here, when I joined, the team was already underway with their evaluation of Snowflake and had already identified some key parts of our current data pipelines that were lagging, taking a certain amount of time. And we were able to essentially port subsets of those over quickly. Right? With these cloud data warehouses now, at least in this generation, what's nice is you can kind of commit to very little. In fact, you know, when you first start out with BigQuery and Snowflake, you don't need to have a signed contract and a commitment for 5 years. You can just go ahead and do it on demand. And so I think that allows you to really experiment very cheaply at the beginning. Right? You're just using the compute resources that you bring for trying out a workload, and you can hold those workloads up against their current workload counterparts. And you can really bring almost the full size of a subset of all of your datasets over. You don't wanna, of course, bring over all your datasets. You're not actually doing the migration yet. But by finding that representative sample and bringing over the full size of that data, you can very quickly observe, hey, what is the cost of this? What is the speed of this? But, you know, as you hinted at too, actually anticipating what the cost will be is very challenging early on, and you have to kinda keep adjusting that as you continue to port workloads over to see if you're within the expected range.
[00:17:43] Unknown:
Going back to the question of cost, you mentioned the financial outlay of running this new system, figuring out what are the cost differences of actually operating it, but there's also the harder to account for aspects of time and opportunity cost. And I'm curious how you have approached that process of doing the accounting to figure out how much is it actually costing me right now, how much do I think it's going to cost me once I'm done with this, and how do we manage that differential as we move through the progress of the migration?
[00:18:13] Unknown:
I think what is interesting is that the cost accounting kinda changes between before you embark on the migration, when you're thinking about whether it's worth doing, and then as you go. Probably, when you arrive at the realization that you need to migrate, it's probably because the current system inflicts a lot of pain on your day to day work. And, for example, in the case of Lyft, our losses for not doing the migration were pretty clearly expressed in the productivity of the entire analytics team, as well as in the ability of the team to deliver insights and have the stakeholders query those insights in time.
So 1 example is that, you know, if there are meetings scheduled to review dashboards and metrics at 9 AM and you can't deliver those because your system is just not able to perform all the necessary computations, then you have, you know, a huge business cost that essentially means the company cannot make decisions in time. For a business like Lyft, that cost was extremely high. When we were embarking on the migration, it was clear that there is an enormous opportunity cost of having the actual analytics and data engineering team internally perform the migration, because then they would not be building data products that would actually propel the business forward in terms of creating new insights and new models. And so I think, in general, 1 of the most effective levers here data teams can pull is to try to outsource that work as much as they can. Because otherwise, you may lock down your entire, you know, analytical team in a project that doesn't necessarily have an immediate user facing impact, and that could be challenging both, you know, business wise, but also politically for the team. You don't wanna be in a world where for 6 months, you're just working on this infrastructure project.
And I think the third 1 is, once you are on the new system, or as you anticipate being on the new system, how do you make sure that this is ultimately economically feasible? And I think that is something that is easier to quantify with more modern warehouses, if we're talking about warehouse migrations to something like Snowflake or BigQuery, because their pricing is transparent and ultimately it's fairly easy, like Rob said, to quantify what it would cost to move workloads between those systems. Because you know how many queries you run, and you know even how much data those queries are scanning.
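As a rough illustration of that kind of estimate, here is a minimal back-of-the-envelope sketch; the workload numbers and the per-terabyte rate are placeholder assumptions, not actual vendor pricing.

```python
# Hypothetical back-of-the-envelope estimate for a scan-priced warehouse.
# All inputs are illustrative assumptions; pull real numbers from the legacy
# system's query logs and the vendor's current price sheet.

monthly_queries = 250_000          # queries per month observed on the legacy system
avg_tb_scanned_per_query = 0.02    # average TB scanned per query (from query logs)
price_per_tb_scanned = 5.00        # USD per TB scanned -- placeholder rate

monthly_tb_scanned = monthly_queries * avg_tb_scanned_per_query
estimated_monthly_cost = monthly_tb_scanned * price_per_tb_scanned

print(f"~{monthly_tb_scanned:,.0f} TB scanned/month -> ~${estimated_monthly_cost:,.0f}/month in compute")
```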
Where these things get really opaque is when the system you're trying to migrate to is sort of like a data lake, which is supported by some open source systems like Trino or Spark that you try to run internally. Those can be extremely complex and high cost projects, not only because you can't really estimate the cost of a query, because you're running the infrastructure yourself, but also because you now bake into that the cost of ownership in terms of, like, the data ops teams and dev ops teams that now have to maintain those systems. That's definitely 1 of the mistakes, I think, we made at Lyft: we didn't really count the true cost of moving to such an architecture.
Or what the alternative would have been for us in terms of, for example, going with a vendor that would provide a more complete solution, like BigQuery, for example, or Snowflake or other warehouses. So that's probably where the cost accounting fallacy can be the strongest.
[00:21:42] Unknown:
Yeah. Doubling down on that a bit, I think that the personnel cost of keeping the legacy platform up and running is 1 that could be very easily missed in the overall calculations, but it is 1 that was certainly felt in my current role at Eventbrite when I joined, given that the team was self managing a Hadoop based stack, based on Spark, Hive, Presto. Running it on AWS, so it's not like we were trying to run it on our own bare metal, and using tools like EMR, which, you know, do simplify some of the management of Hadoop. But there's still a lot there. And over time, it accumulates more and more technical debt as well in terms of, you know, sort of being pinned onto an older version of various tool sets because of certain dependencies you may have. And so, ultimately, for us, what we saw was that the team was just constantly burdened with on call issues. Right? On call issues that would arise day and night due to the system not performing the things that we just simply expected it to do every day or every hour, on a regular basis. So I think that cost was probably the most felt.
And so I think everyone at that point was ready to welcome a managed service taking that burden off of our hands. And I think that's 1 of those costs that may be a little trickier to quantify financially, because it's not necessarily that system y that you move to is costing that much less than system x. But when you sort of try to bake in, well, it's basically taking 2 or 3 full time engineers' time to just keep this thing's lights on, let alone try to even improve it, which would be another set of engineers, you know, can we get that for free by just migrating? And that's where, again, that math starts to come into play.
[00:23:18] Unknown:
And another aspect of this question of migrations, when to do them, how to do them, what to migrate to, is what are some of the early decisions, in terms of the architectural design, the overall workflows that are built on top of these, the dependency chains, and the hard requirements that you build into the overall architecture, that contribute to the eventual need for that migration? And what are some of the lessons that you've learned in the process of going through these migration paths that have helped you, as you reach that eventual destination, remove some of those potential pitfalls and architect in some escape hatches to be able to get some further life out of the system architectures that you design, where those things were previously major constraints on the sustainability of the platforms?
[00:24:14] Unknown:
I think while there are multiple dimensions to plan here, personally, 1 of those that probably had the biggest, and unfortunately negative, impact on my success in running a migration is the decision of whether you are doing a lift and shift migration or you're trying to improve things as you go. And just to double click into that, lift and shift essentially means that you have, let's say, 100 workloads, like 100, you know, transformation jobs running on your legacy system, and they can be very ugly. There could be some, like, really big performance bottlenecks. The data model probably makes you wanna cry because it was built over the last 10, 15 years. But you say, you know what? I don't care. I'm just going to replicate those exact transformations on my new system.
I'm just going to literally copy paste the SQL, or, you know, the Spark code or whatever, and adjust it. Like, adjust my code, make sure it runs on the new system, address any bugs. But, ultimately, I'm just aiming for the same parity, and I will take care of refactoring later. That's the lift and shift pattern. The second pattern is you look at the system and you realize it is extremely inefficient. It is very badly architected because, again, it's just accumulated all your legacy work from the early days of your business. And you say, well, this is a great opportunity for us to improve things. And I think for data practitioners, it is very tempting to embark on the latter route, and this is what I attempted to do at Lyft when I was tasked with architecting the migration.
And that is, in my opinion, a complete road to disaster, because even if you're doing lift and shift, it is still a very expensive and complex project. If you're trying to rearchitect things in the process, you are now essentially bottlenecking all these efforts on the ability to kind of reengineer, refactor things. But most importantly, in larger companies, generating consensus about the, you know, new data models or spending time optimizing things just exponentially increases the complexity of the project. So having been burned by that, I strongly recommend everyone to do lift and shift, and I've seen other data teams, including ones we support at Datafold, do lift and shift and be extremely successful with getting through the migration way, way faster than I was able to at Lyft.
And I think the other 1 that I would also say is important is whether to outsource your work or not, which we touched on. And I think if you're doing lift and shift, that actually makes the migration workload very outsourceable, because you're just saying, look, I have this system on the left. I want the same thing on the system on the right, and you can outsource that to people who don't have context on your business. Again, my mistake was that we brought in consultants, and I tried to employ them to make our data model better because, obviously, they're experts and they have seen a lot. And that was also a disaster, just because having someone externally come in and work with your stakeholders to create better data models is just not going to work.
[00:27:27] Unknown:
Couple of different threads that I wanna go down. And so 1 of them is in terms of that dependency chain of, oh, we can't retire this old system because x, y, and z relies on it. 1 of the paths that leads to those, you know, very obtuse dependency chains is giving data consumers too much access to the internals of the system to do whatever it is that they want to do, or making the initial system too clunky, so people start embarking on these shadow IT projects where they'll just bring in their own tool, and they have this, you know, deep dependency on something that's in your stack that you don't even know that they're touching. And so when you turn it off, everybody starts yelling because it's broken. So 1 of the questions is, how do you identify those situations, 1, but also how do you prevent them from occurring so that you don't have these deep dependencies and you can reduce the overall surface area of what you have to maintain and refactor during the migration?
And then maybe as a second question we can move on to from there is, we've been talking very broadly about this concept of migrations. You have to make sure that everything works effectively. But I'm curious what are the architectural components that you have typically seen as the major targets for these migrations, whether that's the ingest and load, the transformation layer, the warehousing layer, business intelligence. I think that makes sense as a second question to move on to after we've addressed: how do you identify and possibly prevent these deep dependencies that have grown organically because of the fact that the system was so clunky to begin with?
[00:29:01] Unknown:
Yeah. I think 1 of the first things you need to do when embarking on a migration project is spend some time really examining your system. And it's something that will unearth a lot of surprises, including, Tobias, as you referred to, that shadow IT that's occurred over time. You know, tapping back into what Gleb mentioned earlier, by doing lift and shift, you're at least not having to contend with the fact that you're going to have to communicate changes to downstream stakeholders at the time. In fact, the pattern that I've seen, and I'm 100% with you there on lift and shift, it's how I've successfully gotten through 3 of these in my career already, is by sticking to that. In fact, my current group at Eventbrite very much uses it as a mantra that is regularly chanted whenever anybody suggests even changing a column name. Lift and shift. What are you doing? So I think sticking with that allows you, even if you do have a nested web of dependencies, to minimize how much you need to really navigate that, because you're going to be basically exposing a system that, essentially from the outside, from the interface of SQL at least, assuming that we're sticking with a SQL based system, looks and feels about the same to the end user. It's not to say that they won't have work to do, but the work can be minimized by the fact that the table names look and smell and feel the same. It's why 1 of the first steps that I've usually taken on any migration I've done is to set up basically a replication of the legacy system into the new 1. And when we do that, again, sticking with the lift and shift pattern, 1 of the really big benefits there, you know, we had talked about opportunity cost earlier on, comes as soon as you have a complete replication of the core sets of tables that are in your legacy system replicating into the new system. And that can be at a daily cadence or an hourly cadence, whatever makes sense, but you set that up first.
You can switch all your consumers over to the new system. And this is what we've done at Eventbrite here, where we basically set up Snowflake last year and set up replications from our legacy Hadoop based platform to it. And then we basically gave all of our consumers a 3 month window to say, listen, you've got 2 systems that look and feel the same right now, but in 3 months, we're gonna cut you off from the old 1. So you've got some time, raise your concerns, raise your complaints now, or forever hold your peace. And that is when the shadow IT emerged. And that is when we found out about plenty of teams that had their own 1 off Python processes running on a Windows scheduler somewhere that actually were responsible for huge pieces of our business, and we just didn't know about them. And so it sort of forced those things to come to the forefront, because we at least then needed to talk to them about how they were gonna switch their connection over to use Snowflake now instead. So it was sort of, you know, shouting from the rooftops to our organization that, like, hey, this change is coming. It should be minimally impactful because things look and feel the same, except things will be faster on the new system.
But we need you to self identify now because we're gonna turn the lights off for you as a consumer on the old system early. Now at that point, we're not done with the migration. We haven't even talked about moving the transformation or ingestion layers over to the new system yet. But by doing things in that order, you make sure that you identify very early on in the process who are some of those dependency chains further on down your path. And then you can kind of focus on the ingestion and transformation rewrites, in the background while all your data consumers have already moved over to the new system. And in fact, the biggest challenge we have now is that everybody thinks we're done with the migration already because they're already using the new system, and they're happy on it. And we're like, actually, the data is still going through the old system and then copying to the new system.
But from your end user perspective, you get the fast, you know, improvements and new features of a new system right away. So it becomes more of a carrot than a stick of, like, don't you wanna move on to the new thing? It's faster. It's better. And then eventually the stick comes of, like, alright, we're kicking you out of the old system now.
[00:32:52] Unknown:
Another element of these platforms, 1 that I'm currently dealing with, is the question of access control and governance, and those can be implemented very differently depending on what the underlying technologies are, the specific access patterns, where, if you're piping everything through the warehouse, you can say, okay, I'm going to implement my control here, or is it all at the business intelligence layer, or do I have to do this at some more granular level? Am I relying on some sort of single sign on, whether that's via OAuth or Kerberos or SAML? I'm curious what you have seen as the role that governance and access control have played in the migration projects you've been involved
[00:33:35] Unknown:
with? Well, you know, I can kinda jump in first and just say lift and shift. Right? Like, I mean, whatever it was before is what we're doing on this 1. And I think that mantra helps you get past some of this inertia and say, yes, like, we know that at Eventbrite moving to Snowflake gives us new opportunities for things like role level access control, PII redaction, things that we couldn't do on our previous system. We're excited about those features, but we want to maintain that parity right now. So that sort of becomes not part of the migration project then. Right? It becomes whatever the access patterns were before. This set of users had access to this set of tables, let's keep that for now, because we wanna make sure that the migration worked successfully for those users. And then we can kind of embark on follow ons to further whittle that down. I mean, the authentication piece, as far as how they get into the system, might change. Right? But the authorization piece should not. Yeah. I think the question of governance
[00:34:29] Unknown:
actually is, in my experience, less important in terms of can we implement the governance on the new system, but rather how does the data governance affect the complications of migration? For example, typically, the vast majority of workloads on the system are not related to transformations, which are, let's say, more or less in the control of the data engineering team, but in the consumption layer. Right? So notebooks, 1 off exports, BI, and dashboarding. And the patterns that are used by those workloads, and the capabilities and the choice of those systems, actually dictate how costly it would be to move those workloads over. For example, at Lyft, we had 2 systems at once.
We had Mode, which is a very notebook driven workflow, and we had Looker for the primary dashboarding, kind of standardized reporting workloads. And moving Looker workloads over was extremely simple because it's mostly templated and it's controlled by the internal semantic layer. So, yes, you had to make some adjustments, but those were minor and easy to do for the majority of the workloads and the majority of the consumption, you know, dashboards and other nodes. With Mode, I remember we had a question of, well, analysts love it, so I can't really figure out, like, how many reports we need to migrate. Is it, like, dozens of reports or hundreds of reports?
And I did, like, a kind of hacky lineage project where I did analysis of the API and tried to figure out just the scale of dependencies. And I uncovered 13,000 notebooks that were created and actually actively running on the system. And those notebooks had pretty much, like, no governance, in the sense that if you wanted to switch them over to a new system, you had to pretty much go into every notebook and every cell in that notebook and rewrite the SQL to, you know, change the dialect from, like, Redshift to Snowflake. So that was not at all possible to do in any kind of centralized fashion, and so we had to pretty much resort to what Rob said. Right? Tell the users to migrate those over, and that was obviously a very high friction exercise for the stakeholders and 1 that actually really slowed down the consumption adoption of the new system.
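For a sense of what that kind of hacky lineage analysis can look like in practice, here is a minimal, hypothetical sketch: it assumes you have already exported the SQL text of each report or notebook (for example via the BI tool's API), and uses SQLGlot to pull out which tables each query touches. The report names and queries below are made up.

```python
# Sketch: inventory which warehouse tables each saved report/notebook query touches,
# to size the consumption-layer migration. Assumes the SQL text was exported elsewhere.
from collections import Counter

import sqlglot
from sqlglot import exp

reports = {  # illustrative report name -> SQL text
    "daily_rides_dashboard": "SELECT region, count(*) FROM rides GROUP BY 1",
    "driver_pay_notebook": "SELECT d.id, p.amount FROM drivers d JOIN payouts p ON p.driver_id = d.id",
}

table_usage = Counter()
for name, sql in reports.items():
    parsed = sqlglot.parse_one(sql, read="redshift")          # legacy dialect
    tables = {t.name for t in parsed.find_all(exp.Table)}     # tables referenced by the query
    table_usage.update(tables)
    print(f"{name}: depends on {sorted(tables)}")

print("Most-referenced tables:", table_usage.most_common(5))
```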
[00:36:59] Unknown:
Another aspect of migrations is deciding when they are done, to your point, Rob. Some people might say, oh, it's done because I'm already using the new thing. But you say, oh, I've actually still got months' worth of work to do, so I can't actually fulfill the request that you just sent me that you think is only going to take me a few minutes. And I'm wondering what your strategy has been at the outset of the migration to be able to define, these are all the things that need to be done, these are the ways that we're going to validate completion, and this is what we're going to use to measure overall success. And then also in that path, what are some of the early, maybe break glass, situations that you have identified to say, actually, this is not going to work out the way I think it's going to, I'm going to cut bait, I'm not going to fall victim to the sunk cost fallacy and just keep pulling forward even though it's not gonna work, and maybe, you know, step back and reconsider the overall migration effort.
[00:37:54] Unknown:
Yeah. I think in terms of, like, knowing when you're done, the definition of done, it really is that the legacy system is gone. Right? Gone from your budget, gone from your management, gone from being a thought in your mind. Right? And I think, like, anything short of that means that you're not done. However, again, as I described earlier, there are different perspectives that different stakeholders can have on that that help relieve that pressure. I mean, again, by having the replication in place early, there's not this built up expectation of oh, that team is working on a migration.
It's taking forever and I can't see any results. Right? You want to minimize the time that it is something that is invisible to others, and you want to start building that internal advocacy for it early, or, what you described, I haven't encountered this on my own, but get that internal feedback that it's not working well for a use case early on. Right? And make decisions about that. So I think it really is about, by first doing this really kind of simple lift and shift replication, you can let all of your stakeholders and all of their workloads onto the new system and derisk that a ton, because you then find out right away about any challenges, any problems they may be seeing. And you also then relieve a ton of pressure of, alright, I've invested in this new thing. I'm now paying a bill to Snowflake every month. How come I can't see anything from it yet? Right? How come nobody's using it but the data team right now? What is that? Right? So, like, you try to minimize that time that you're doing that replication stuff, which I think really helps and also gives you opportunities to, like, throw different types of workload at it and navigate them. I think I remember at MLB, for example, we still had some of our data science team using a set of tools from SAS. I mean, this was kind of stuff they had used prior to really Python and R becoming as widespread as they were. And as we moved to BigQuery, we kind of learned, oh wait a minute, the connector from SAS to BigQuery is not really what we want it to be. And we identified a pretty big risk there of, like, alright, well, all of these workflows really are not working the way we intend them to. Can we find a fix for that? Right? Can we get past that? You know, is there an additional cost to license something that will get us past that? So, again, I think it's about just getting your data stakeholders in there as early as possible to at least derisk this.
[00:40:05] Unknown:
Yeah. I fully agree with Rob in terms of setting, let's turn off the old system, as the actual goal we're aiming for. Because as long as the old system is running, it probably still incurs high cost in the business, and it's hard to say that, yeah, we've completed the project. I also think that because that kind of 0 to 1 can take months and sometimes years, it's helpful to also kind of set the goals in terms of how do we measure completion, or even our progress within the migration. And if you wanna do that, probably the atomic element of migration is, you know, let's say we're migrating between warehouses or migrating between transformation frameworks.
It's probably migrating the tables that are produced in the transformation process, right, as part of our transformation DAG in Airflow or dbt. So a single job that produces a single table kind of becomes this atomic unit of work that we can count as, you know, either done, in progress, or yet to do. And then some of the BI workloads as well. Right? Like, have you migrated the exact KPI dashboard or not? And what I found is that if you are doing a migration of all these atomic blocks within a larger organization, the process that becomes really critical is user acceptance testing, which is kind of well known in software. But in data, what that means is, if you're migrating a workload, let's say, you probably have, like, a set of tables that are really, really important. Right? And you try to get the business to use them. First, let's say, like Rob said, we just replicated the legacy-produced datasets to the new system. When I try to have people switch their consumption to those tables, we have to convince them, assuming the lift and shift pattern, that that's the same data. Right? They can trust it. There are no regressions. There is no kind of, like, we optimized the definition of conversion in the process. Right? It's still the same data. And that kind of user acceptance, creating a process for that and making sure that stakeholders explicitly give you a thumbs up, or you have a certain way to signal to your stakeholders, yes, this is the same data, you can use it, is quite important. Back at Lyft, we were writing a lot of custom built software for that, because, like, no 1 would trust that we produced good data. We actually had to present this as, like, almost a dashboard that said, yes, this table is exactly the same as your same table but on the old system. So now can you please switch your report? Can you please switch your machine learning job to it? And that was actually 1 of the inspirations for me to build a tool called data-diff at Datafold, which essentially produces the same analysis, but fully automatically for the user. And you can do this within a database or across databases to actually answer the question, is your data the same, for the purposes of the user acceptance testing. Also, I'm curious to hear how you think about the UAT, Rob, given that you are at a large company with, like, a lot of stakeholders you have to convince that you've actually completed the work.
[00:43:15] Unknown:
Yeah. I mean, in my current role at Eventbrite, you know, data-diff, both the open source and then, you know, the commercial product of Datafold, have been instrumental in something that, like Gleb mentioned, you otherwise end up engineering yourself. At Major League Baseball, we did exactly that. We didn't have Datafold or data-diff, so we built tooling, you know, because we're engineers, so we always wanna build a tool to try to get through anything that we might have to do manually. We built tools that generated SQL that attempted to compare the way the tables look. Really, we built a very primitive version of what data-diff does. And we did this because we needed to be able to be confident in what we had been replicating. So in my current migration project, we've used the data-diff and Datafold tools in 2 different places. 1 is when we're first doing those initial replications, as Gleb described, to build confidence that the old and the new systems are matching, that your replication processes themselves are valid. Eventually you gain confidence in that because your replications start to look a lot the same. If you're replicating 1,000 tables, once you've got something set up for it, you know that your process is working well. But it does unearth lots of problems along the way that you would have. Time zones are an old favorite. They always seem to give you an issue. Things like that, you know, character encodings, you might have missed otherwise.
Once you're in the midst of the migration and you're actually rewriting your transformation logic, the ability to, again, have a replicated copy of what the output was from the old system and then compare that with the output of the newly written transformation logic, and basically look for row by row matching because you're doing a lift and shift, is huge. And it's something that, honestly, and I don't want to go on record and say that I love doing migrations, but there is 1 part of it that I like, which is that there is a ground truth to hold yourself to. If the old system is assumed to be accurate, which by the way, that assumption can be wrong sometimes, we found plenty of bugs in the old system as we moved forward, but if you assume the old system is accurate, then you actually have this ground truth to compare your results against. And I feel like that's something that, working in data, you very much don't always have. If you're building a brand new data pipeline, the best you can do is do heuristics and metrics on it to see if you feel like it looks like it's correct. But in this case, you have an exact copy of what that data should look like to tell you that it's correct. So it's actually a really nice place to be when you're in migrations, and, to the point you raised earlier, why it's easier to sort of spread out this workload to others who might not even be so familiar with the domain of the data, because they can use tools like this to do an exact comparison.
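For readers who want to picture the workflow being described, here is a rough sketch of pointing the open-source data-diff package at the same table in a legacy and a new warehouse. The connection strings, table names, and key column are placeholders, and the exact options may differ by package version, so treat this as illustrative rather than a definitive recipe.

```python
# Rough sketch of cross-database validation with the open-source data-diff package
# (pip install data-diff). All connection details below are made-up placeholders.
from data_diff import connect_to_table, diff_tables

legacy = connect_to_table(
    "presto://analyst@legacy-host:8080/hive/analytics",  # placeholder legacy URI
    "orders",
    "order_id",  # primary key used to align rows
)
new = connect_to_table(
    "snowflake://user:pass@account/ANALYTICS/PUBLIC?warehouse=WH",  # placeholder URI
    "ORDERS",
    "ORDER_ID",
)

# diff_tables yields ('+', row) / ('-', row) tuples for rows that exist or differ
# on only one side; no output means the compared columns match across systems.
for sign, row in diff_tables(legacy, new):
    print(sign, row)
```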
[00:45:41] Unknown:
And speaking a bit more to the automation question, you talked about being able to use data-diff to validate that the old system and the new system are in agreement and that all of the same data is present. What are some of the other places where automation can be useful, and what are some of the ways that you have seen automation employed where it is actually a net negative?
[00:46:04] Unknown:
I can speak to 1 other place where I've seen it be incredibly useful, which is in the translation of a SQL dialect, for example, from 1 system to another, assuming we're talking about a migration from 1 SQL based data warehouse to another. At Major League Baseball, we had hundreds of SQL scripts of thousands of lines of code each, sometimes, that were written in Teradata's dialect of SQL. And at that point, we were working with Google Cloud, and they ended up actually acquiring a company called CompilerWorks. I believe they've been on the podcast before as well. And they had a tool that would understand and parse the SQL syntax tree and then turn it into another version of SQL for you. And it was pretty advanced. We were always blown away by how far it could get, because there were certain features, like the SQL QUALIFY clause that Teradata had that BigQuery just didn't have at the time, where it would then completely rewrite your query into, like, a nested CTE in order to get the same results.
And that's just a game changer when you're not going through the tedium of, like, trying to find SQL differences. At Eventbrite, I was able to kind of engage in a much simpler version of this, because we were moving from Presto SQL syntax to Snowflake SQL, which were actually a lot more alike than they were different. And so here, I basically built an internal tool that used mostly regular expressions just to change out certain things that were slightly different. It certainly wasn't as advanced as something that would parse the entire SQL syntax tree, but it also didn't seem to be necessary. We got 90% of the way there with that. I know there's also a tool and framework out there, SQLGlot, and I believe he's also been on the podcast, that does the same sort of thing. Being able to transpile SQL from 1 dialect to another can save you, you know, hours of frustration in having to do something that would otherwise be very tedious.
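As a concrete illustration of that transpilation approach, here is a minimal sketch using SQLGlot to convert a made-up Presto query to Snowflake syntax; the exact output SQL will vary with the SQLGlot version, so the rewrite noted in the comment is an expectation rather than a guarantee.

```python
# Sketch: mechanically converting a legacy query between SQL dialects with SQLGlot,
# as an alternative to hand-written regexes. The query below is illustrative.
import sqlglot

presto_sql = """
SELECT event_date,
       approx_distinct(user_id) AS daily_users
FROM events
WHERE event_date >= date_add('day', -30, current_date)
GROUP BY 1
"""

snowflake_sql = sqlglot.transpile(presto_sql, read="presto", write="snowflake")[0]
print(snowflake_sql)
# Dialect-specific functions should be rewritten for the target dialect
# (e.g., approx_distinct -> APPROX_COUNT_DISTINCT); exact output depends on version.
```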
[00:47:46] Unknown:
Yeah. Plus 1 to SQLGlot, I think. If I had had this available in the early days of my data engineering career, that would have been a game changer, but I did get to use CompilerWorks at Lyft, which was definitely helpful. And, yeah, I think the other aspect of automation that can be helpful in the planning phase is data lineage. Because ultimately, you know, whether you're doing lift and shift or whether you try to refactor things, you have to plan and sequence the migration, and you have to decide, of 100 and sometimes 1,000 of your jobs, which ones you're tackling first, which ones you're tackling last, and which ones you may not even need to tackle and can probably deprecate because they were, like, 1 off things that were built.
Having data lineage that lets you see the dependencies within your data stack, both in the transformation layer, how models and tables depend on each other, and into the consumption layer, is incredibly valuable. For example: is this table that gets produced actually queried by any of the BI tools? And if it is, are those tier-one executive reports querying it, or are they one-off reports that haven't been touched in months? Back at Lyft, we did use CompilerWorks, which was at the time very cutting-edge technology, and that saved us, I would say, unimaginable amounts of time compared with migrating over 10,000 models without any good insight into the dependency tree. This is also a capability that we now offer as part of Datafold, and it can be helpful for teams that are planning migrations between warehouses or between transformation frameworks.
[00:49:31] Unknown:
A plus one on the value of lineage, although I think the challenge we've had is that you're then trying to apply lineage onto the same legacy system that you're fighting to get off of. In our case, we looked at whether we could apply a lineage tool to our legacy platform and get value out of it, and that was an example, Tobias, of when we decided maybe it wasn't worth it, which you asked about earlier. We were looking at what it would take to actually deploy the lineage tool on the legacy system and what strain that might add. When you're looking at a legacy system you're trying to move away from, at that point you're afraid to look at it the wrong way or breathe on it the wrong way for fear of the whole thing toppling over. So we started asking whether we could apply lineage there, whether we could use some of these tools to help us untangle it, and we ended up deciding against it. We said the amount of effort it might take, and the risk it poses to the legacy system, isn't worth it. So we essentially had to craft that lineage manually. You definitely need that lineage graph in order to do the migration; it's just a matter of whether or not you can bring in tooling to help you build it. Because once you have that lineage graph, it becomes critical in sequencing your order of operations during the migration.
Most importantly, when you look at your set of data jobs as a web of dependencies, the things you can migrate and turn off on the legacy system first are the edges, the leaf nodes: the things nothing else depends on. If you have table A and table B, and table B depends on table A, you can't turn off table A first because table B will break. But if nothing else depends on table B, you can move it over and actually turn it off first. So you really need to be able to visualize this, and it doesn't have to be a fancy graph. In our case it's a Google Sheet where we keep track of it and constantly look for, alright, now that we've moved this thing off, these two things have become leaf nodes, so we can consider moving them next. You definitely need that; it's just a question of whether tooling can help you with it or not.
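That leaf-first sequencing can live in a spreadsheet, as described, or be scripted. Here is a small, hypothetical sketch that batches tables into migration waves from a hand-maintained dependency map; the table names and the map itself are made up for the example.

```python
# Hypothetical dependency map: table -> tables it reads from.
deps = {
    "table_a": [],
    "table_b": ["table_a"],
    "table_c": ["table_a"],
    "table_d": ["table_b", "table_c"],
}

def migration_waves(dependency_map):
    """Yield batches of tables that nothing remaining depends on --
    the leaf nodes described above -- until every table is scheduled."""
    remaining = dict(dependency_map)
    while remaining:
        depended_on = {d for upstreams in remaining.values() for d in upstreams}
        wave = [t for t in remaining if t not in depended_on]
        if not wave:
            raise ValueError("cycle detected in dependencies")
        yield wave
        for t in wave:
            remaining.pop(t)

for i, wave in enumerate(migration_waves(deps), start=1):
    print(f"wave {i}: {wave}")
```

With the sample map, table_d is migrated and turned off first, then table_b and table_c, and table_a last, since everything else reads from it.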
[00:51:41] Unknown:
And in terms of your experiences of going through these migration projects and working with other people who have gone down similar paths, what are some of the most interesting or innovative or unexpected ways that you have seen these projects approached or implemented?
[00:51:51] Unknown:
You know, I'd say that, honestly, this is one of those places where I'm not sure that interesting is better. When you've been through a few of these, there is almost a process to it. I kind of alluded to it before: replicate, then get consumers over, then work through your ingestions and transformations, and validate along the way. Following that playbook has led me to success on several of these, and I'd be concerned about deviating from it. Even in the early days of discussing how we would approach the migration here at Eventbrite, we had other ideas, but we kept coming back to this: listen, this is a well-trodden path that worked. In terms of surprises we hit while working on one, a very pleasant one is usually how many things you can deprecate.
Gleb talked earlier, I think, about thinking of the atomic units of migration in terms of tables and jobs. You might start by taking an inventory and say, I've got 10,000 tables and 500 jobs. Then you estimate that if each job is going to take a week, that's 500 person-weeks. But one great thing is that you can cut down that denominator by saying, we're not going to migrate this thing; we're going to kill it right now because nobody's actually using it. That kind of self-reflection on a data system is something that, as a data team, you usually don't carve out a whole lot of time for.
If a job isn't broken, it can keep running for years before anybody ever looks at it again, because no one's going to put a Jira ticket on your board to go look at something that's not broken. So the surprise was how much we could cut. In the case of Eventbrite, we had over 10,000 tables in our Hive schema, and we only brought about 1,000 of them over to our Snowflake data warehouse. In the month or two of cutover we talked about previously, we were concerned there might be complaints; I think we maybe got two or three hands raised of, this one table is missing. But that meant the other 9,000 were not providing anybody any value that we were aware of. So it was a surprise just how much is there: data is cheap to store, so you end up storing it more or less forever in a lot of cases. Yeah, I strongly
[00:53:56] Unknown:
agree with Rob on that one. Boring is better. Doing lift and shift, and really having the discipline not to try to solve other problems but just to move things over, is important, with the one exception of deprecating things that are not used, because that can massively reduce the complexity and cost of the project, ten x in Rob's example. From the perspective of building data-diff and seeing multiple teams leverage it for migrations, one thing that has generally impressed me is how effective a well-run migration project can be. If you do lift and shift, if you're very diligent about mapping your dependencies and getting user acceptance testing from stakeholders, and if you use tools to automate the process, I've seen incredible velocity achieved by teams, for example at Eventbrite, velocity that I hadn't been able to achieve earlier in my own career. That's been very inspiring. The second aspect of seeing how data-diff is leveraged in those migrations is one I hadn't anticipated. When we built the tool initially, we thought of it as a very developer-first tool: if I'm a developer working on a particular table, I convert the SQL, I run the table, then I trigger the data diff, and I get results.
We thought of it as a very developer-centric workflow. The interesting use case I've seen was that companies also ran data-diff on a schedule for their entire scope of data models, every day. What that gave them was a pretty good estimate of which tables are currently migrated, which tables are in progress, and how far off we are from full parity between systems, at scale. Then you can say, okay, this table is 99% there, but there's a 1% discrepancy, and we know where the discrepancy is. And pretty much, running this for the entire scope of tables
[00:56:05] Unknown:
is something that I hadn't thought people would actually do, but it makes a lot of sense to implement during the migration to track your progress at the overall project level. Yeah, we're one of those customers doing that, and we're the ones pinging you and your team, Gleb, when we throw 5,000 API requests against Datafold all at once and break the system in some way. But yes, we did bake it in very much as part of the validation and configured alerts if things were off, and hopefully we never hear those alerts. So it becomes almost an ongoing validation, which again is one of those beautiful things you can do during a migration, because you have two different things to compare. Going forward, obviously, there's tons of talk about data observability and testing tools, and that's a whole separate topic to uncover, but there it's much tougher to figure out exactly what the ground truth is. Here it's just correctness by matching at 100% on a column-by-column basis, which Datafold can tell us.
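As a very rough sketch of what that scheduled, project-level tracking might look like (not how any particular vendor implements it), the snippet below compares row counts for every in-scope table across the two warehouses and reports a crude parity score. The `run_query` helper, the table list, and the alerting call are all placeholders.

```python
# Hypothetical nightly parity check across the whole migration scope.
# run_query(warehouse, sql) is a placeholder for however you execute SQL
# against each system and fetch a single scalar result.
IN_SCOPE_TABLES = ["orders", "users", "events"]

def parity_report(run_query) -> dict[str, float]:
    """Return a crude 0..1 parity score per table based on row counts.

    A real diff also compares primary keys and column values; this only
    flags gross mismatches so that alerts can fire when something drifts.
    """
    report = {}
    for table in IN_SCOPE_TABLES:
        old_rows = run_query("legacy", f"SELECT COUNT(*) FROM {table}")
        new_rows = run_query("new", f"SELECT COUNT(*) FROM {table}")
        largest = max(old_rows, new_rows) or 1
        report[table] = min(old_rows, new_rows) / largest
    return report

# Example: alert on any table that is below full parity.
# for table, score in parity_report(my_run_query).items():
#     if score < 1.0:
#         send_alert(f"{table} parity {score:.2%}")
```

Run daily, a report like this doubles as the project-level progress tracker described above and as the ongoing validation that fires alerts when the two systems drift apart.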
[00:56:54] Unknown:
In terms of your own experiences of going through these migration projects, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:57:02] Unknown:
I think for us, and this speaks to the earlier question of how you choose the technology to migrate to, one of the things we struggled with was the user experience of the new system. At the time, we knew we had to migrate from a warehouse where storage was coupled to compute into a data lake, which would allow much greater, theoretically infinite, scalability for both storage and compute. The engine we chose to do the majority of the lift was Hive. That was back in 2016. Right now, some people probably haven't even heard of Hive, but at that time it was one of the more popular data lake engines and very mature. There were also Presto/Trino and Spark, but they were less stable, more cutting edge, still finding their footing. The trouble we ran into is that Hive was fairly good at tackling jobs at scale, but it had a really poor user experience by modern standards. It was very slow, and because it was slow to run queries, it was also slow to fail: sometimes it would take 90 seconds just to return a syntax error. The trouble with that is that when you're doing a migration, you have to be able to iterate really fast. You're converting SQL, running it, maybe running a data diff, and then adjusting as you go. If your system is slow to fail, slow to return errors, slow to produce results, that has a massive impact on the speed of the entire project. The other aspect is that Hive required a lot of expertise to use correctly, for example all of its tuning parameters, and probably no one today would choose to use Hive. So the lesson learned is that today you can choose between awesome systems of different kinds, but you should be cognizant of the persona you're optimizing for: not just who's going to be doing the migration, but who's going to be using the system, and make sure there's a good match between the technology and their technical aptitude.
Because if there's a steep learning curve, for example, maybe you decide to run your transformations and express them in something like Flink because it's a very powerful system, but the majority of your team is accustomed to SQL. Having that learning curve, having them learn a new system, and not being able to iterate quickly can really massively complicate the entire project. So that's definitely a hard lesson learned that I would caution modern data platform architects about.
[00:59:43] Unknown:
We touched on this a little bit, but in your experience of going through these processes, what are some of the design patterns that can help prevent the need for eventual migrations? And what are some of the ways that the tooling and technology in common use can be improved to make these migration projects easier, or to extend the overall runway and lifetime of the technology and prevent the need for these migrations?
[01:00:15] Unknown:
I think sometimes the tooling almost serves the opposite effect of locking you further into what you're currently on. I look at Snowflake, which just announced their dynamic tables feature, and it looks amazing. But the flip side is that the more we use the features specific to that platform, the harder ever moving away from it will be. In using Google Cloud and BigQuery it was similar: we got really obsessed with the ability to use Connected Sheets and have Google Sheets basically be a very simple interface for our analysts to do mappings of data and things like that. If you're not careful, you start leaning too far into a vendor-specific feature set and make yourself very tightly coupled to things specific to that vendor. Luckily, on the flip side, these vendors all seem to be in an arms race where, whether you call it lakehouse or not, everything is converging on a union of all possible features. So maybe it won't be such an issue, but it's something you have to look at at least a little skeptically: as I move away from something that is just SQL on a set of data to anything else, what am I potentially complicating for myself down the road if I ever need to move away from it? So it's just something to think about, you know, how do you sort of decouple
[01:01:25] Unknown:
some of what you need to do from a compute standpoint from what the system is offering you in terms of automation on top of that. Yeah, absolutely. That's a challenge I'm dealing with right now: this vendor is adding all kinds of new features on top of the underlying open source project that led me to choose them in the first place. But the more I lean into all these fancy new capabilities they have, the harder it will be if I ever decide I need to just run the open source project myself.
[01:01:55] Unknown:
Yeah, I think that today, vendor lock-in is almost inevitable, at least at certain levels of your stack. Of course you have vendor lock-in just by using a platform like AWS or Google Cloud Platform, but no one is really saying, I'm going to run my own data centers on Linux so that I don't have this vendor lock-in. At the same time, at the other extreme, you probably don't want to use a vendor with a very closed ecosystem that makes it hard to get data out. So the middle ground is that for critical systems like warehousing or business intelligence, where you can have very deep vendor lock-in, in the sense that the longer you use a platform the higher the cost to migrate out, there are certain dimensions that are really important to consider. For example, interoperability with other systems is huge. Is this technology just a really advanced one that you love as an engineer, or is there a large ecosystem around it? So yes, with Snowflake, Databricks, and BigQuery on Google Cloud Platform, there's a huge risk that if you really invest in one as a business, it's going to be extremely expensive to get out of it.
But at the same time, all of those systems have enormous ecosystems around them. They're also very competitive in closing feature gaps: once one releases a new feature, the others will soon follow. So I almost want to say something corny: you can't go wrong with one of these big players, because you will probably get enough of what you need. I think things get tricky once you deviate from those foundations and have to build the less obvious parts of the stack, and decide which technologies you choose and how they support your business going forward. For example, when you choose a transformation framework, that also has a fairly high migration cost. If you choose dbt or Airflow or Dagster for expressing your transformations, migrating away from those can be as expensive as migrating between warehouses, so when you choose a framework for transformations, make sure it can actually support your business going forward. If you start with dbt and SQL, how does that map onto your longer-term roadmap and the types of workloads you want to run? In dbt's case, they also have a pretty great ecosystem, and a lot of the more advanced tools, for example Dagster, can actually interoperate with dbt, import a dbt project, and then let you do other things. So in many ways the stack assembles itself quite nicely and doesn't let you down, but it's definitely something you have to be careful about when you choose the parts of your stack.
[01:04:50] Unknown:
Absolutely. Are there any other aspects of this subject of data platform migrations, the technical and organizational aspects thereof, and your experiences
[01:05:01] Unknown:
of embarking on these projects that we didn't discuss yet that you would like to cover before we close out the show? Let me think. One thing, and I kind of touched on it before, is how important it is to build internal advocacy for your migration early on. It's so important because you end up riding that wave all the way through the pain of having to validate everything along the way. And I don't know if I'd say it was a surprise, but it was striking how quickly, when you roll out a new system like this and initially start nagging engineers to switch their things over to use it, it becomes less about forcing that behavior and more about them becoming advocates on your behalf. That's really important, because it gets you buy-in not just from senior leadership, but from the ground level up as well. So I think the key to success on these things is just being able to let everyone
[01:05:57] Unknown:
see the value of it early and often. Yeah, couldn't agree more with that. Alright, well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. I feel like for me, it's still about the right development workflow, and I know this is something that everyone has different solutions for depending on the toolset you use.
[01:06:27] Unknown:
And as we move forward, I've seen evolutionary steps in this. Things like Snowflake zero-copy clones make workflows possible that just weren't possible before. But forming that into a common set of development patterns, it feels like we're still a ways off from where the software engineering community is on this: being able to press a button and basically create a branch off of your data. I know there are tools and technologies that do exactly this, but getting that in a standard way that makes development trivial, makes it easy to onboard developers, and makes it easy to try things out without incurring the cost of spinning up a full new environment per developer, I think there's still a lot of room there. My team at Eventbrite has basically had to build our own version of this, and I think many companies are probably doing the same right now, because there isn't a single right way to do this that everyone agrees upon just yet.
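For readers unfamiliar with the zero-copy clone pattern being referenced, here is a minimal, hypothetical sketch of a "branch off your data" helper. The `execute` function is a placeholder for however you run SQL (for example, via Snowflake's Python connector), and the database names are invented.

```python
# Hypothetical helpers for creating and tearing down a per-developer data
# "branch" using Snowflake's zero-copy clone. execute(sql) is a placeholder.
def create_dev_branch(execute, source_db: str, branch_db: str) -> None:
    # CLONE creates a writable copy that initially shares storage with the
    # source, so it is cheap to create and safe to experiment against.
    execute(f"CREATE DATABASE {branch_db} CLONE {source_db}")

def drop_dev_branch(execute, branch_db: str) -> None:
    execute(f"DROP DATABASE IF EXISTS {branch_db}")

# Example usage (assuming an execute() bound to a warehouse connection):
# create_dev_branch(execute, "ANALYTICS", "ANALYTICS_DEV_ROB")
# ... develop and test against ANALYTICS_DEV_ROB ...
# drop_dev_branch(execute, "ANALYTICS_DEV_ROB")
```

The gap Rob describes is less about the underlying capability, which exists, and more about turning it into a standard, push-button development pattern per developer or per branch.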
[01:07:18] Unknown:
Yeah, I strongly agree with Rob. I think the developer experience in data is in general way behind the productivity of software engineers and the tools that support them. Even setting aside the recent advancements in LLMs and Copilot, the basic things that software engineers can take for granted are mostly not available to data engineers. Even syntax correction for SQL or dependency analysis just doesn't really exist for the vast majority of workflows. I think it's getting better with the standardization on certain frameworks like dbt, where the framework is fairly opinionated, which allows for more opinionated developer experiences and allows vendors and other players in the ecosystem, including Datafold, to plug into that standard and offer developer support in particular parts of the workflow. For example, we brought the concept of data diffing to dbt developers, allowing them to diff as they go and see how their code changes impact the data; it's available as an open source package and as a VS Code extension.
But I think, as a community, we're still only scratching the surface of how great the data engineering workflow can be with the right tools and support. We're just starting on that journey.
[01:08:45] Unknown:
Alright, well, thank you both for taking the time today to join me and share your pains and successes in going through these data migration projects. As you said, it's something that all of us are inevitably going to have to deal with, so I appreciate you sharing some of your hard-won lessons and wisdom, and I hope you enjoy the rest of your day. Thank you so much. Thanks so much, Tobias.
[01:09:12] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Guests and Topic
Gleb's Background and Experience
Rob's Background and Experience
Defining Data Migration
Signals for Migration Necessity
Early Signals and Proof of Concept
Cost Accounting and Opportunity Cost
Architectural Decisions and Pitfalls
Identifying and Preventing Deep Dependencies
Access Control and Governance
Defining Completion and Measuring Success
Automation in Migration
Interesting and Unexpected Approaches
Challenging Lessons Learned
Design Patterns and Tooling Improvements
Building Internal Advocacy
Biggest Gaps in Data Management Tooling