Summary
Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer, the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.
- Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing some of your experiences with data migration projects?
- As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?
- How would you categorize the different types and motivations of migrations?
- How does the motivation for a migration influence the ways that you plan for and execute that work?
- Can you talk us through one or two specific projects that you have taken part in?
- Part 1: The Triggers
- Section 1: Technical Limitations triggering Data Migration
- Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure
- Legacy compatibility: Difficulties integrating with modern tools and cloud platforms
- System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade)
- Section 2: Types of Migrations for Infrastructure Focus
- Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.)
- Data center migration: Physical relocation or consolidation of data centers
- Virtualization migration: Moving from physical servers to virtual machines (or vice versa)
- Section 3: Technical Decisions Driving Data Migrations
- End-of-life support: Forced migration when older software or hardware is sunsetted
- Security and compliance: Adopting new platforms with better security postures
- Cost Optimization: Potential savings of cloud vs. on-premise data centers
- Part 2: Challenges (and Anxieties)
- Section 1: Technical Challenges
- Data transformation challenges: Schema changes, complex data mappings
- Network bandwidth and latency: Transferring large datasets efficiently
- Performance testing and load balancing: Ensuring new systems can handle the workload
- Live data consistency: Maintaining data integrity while updates occur in the source system
- Minimizing Lag: Techniques to reduce delays in replicating changes to the new system
- Change data capture: Identifying and tracking changes to the source system during migration
- Section 2: Operational Challenges
- Minimizing downtime: Strategies for service continuity during migration
- Change management and rollback plans: Dealing with unexpected issues
- Technical skills and resources: In-house expertise/data teams/external help
- Section 3: Security & Compliance Challenges
- Data encryption and protection: Methods for both in-transit and at-rest data
- Meeting audit requirements: Documenting data lineage & the chain of custody
- Managing access controls: Adjusting identity and role-based access to the new systems
- Part 3: Patterns
- Section 1: Infrastructure Migration Strategies
- Lift and shift: Migrating as-is vs. modernization and re-architecting during the move
- Phased vs. big bang approaches: Tradeoffs in risk vs. disruption
- Tools and automation: Using specialized software to streamline the process
- Dual writes: Managing updates to both old and new systems for a time
- Change data capture (CDC) methods: Log-based vs. trigger-based approaches for tracking changes
- Data validation & reconciliation: Ensuring consistency between source and target
- Section 2: Maintaining Performance and Reliability
- Disaster recovery planning: Failover mechanisms for the new environment
- Monitoring and alerting: Proactively identifying and addressing issues
- Capacity planning and forecasting growth to scale the new infrastructure
- Section 3: Data Consistency and Replication
- Replication tools - strategies and specialized tooling
- Data synchronization techniques, e.g., pros and cons of different methods (incremental vs. full)
- Testing/Verification Strategies for validating data correctness in a live environment
- Implication of large scale systems/environments
- Comparison of interesting strategies:
- DBLog, Debezium, Databus, GoldenGate, etc.
- What are the most interesting, innovative, or unexpected approaches to data migrations that you have seen or participated in?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data migrations?
- When is a migration the wrong choice?
- What are the characteristics or features of data technologies and the overall ecosystem that can reduce the burden of data migration in the future?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- DagKnows
- Google Cloud Dataflow
- Seinfeld Risk Management
- ACL == Access Control List
- LinkedIn Databus - Change Data Capture
- Espresso Storage
- HDFS
- Kafka
- Postgres Replication Slots
- Queueing Theory
- Apache Beam
- Debezium
- Airbyte
- [Fivetran](https://www.fivetran.com)
- Designing Data Intensive Applications by Martin Kleppmann (affiliate link)
- Vector Databases
- Pinecone
- Weaviate
- LAMP Stack
- Netflix DBLog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Red Hat Code Comments Podcast: ![Code Comments Podcast Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/A-ygm_NM.jpg) Putting new technology to use is an exciting prospect. But going from purchase to production isn’t always smooth—even when it’s something everyone is looking forward to. Code Comments covers the bumps, the hiccups, and the setbacks teams face when adjusting to new technology—and the triumphs they pull off once they really get going. Follow Code Comments [anywhere you listen to podcasts](https://link.chtbl.com/codecomments?sid=podcast.dataengineering).
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard won lessons in implementing new technologies. I listened to the recent episode, Transforming Your Database, and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics.
There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for Code Comments in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end to end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake.
And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino. Your host is Tobias Macey, and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he's learned in the process. So, Sriram, can you start by introducing yourself?
[00:01:51] Unknown:
Yeah. I'm Sriram. I'm currently the CTO at DagKnows. We build the tooling for automating runbooks for SREs and dev teams, to kind of get them out of all the manual repetitive work. Right now I'm focused on scaling our platform and building its control plane. But my background has been in data for a while, especially in large scale systems like, you know, Google and LinkedIn. And one of the topics that, you know, I'm passionate about is data migrations and data movement at scale. Right? Yeah. So hey, how are you doing?
[00:02:25] Unknown:
I'm doing well. Thanks. And do you remember how you first got started working in data?
[00:02:30] Unknown:
Yeah. Purely randomly. My career began, you know, as a network designer a long, long time ago. I guess what's more data than packets on a wire, I think. But to me, almost any useful facet of computing is about moving and processing data. And I guess my real journey was about 15 years ago when I started doing my first startup back in Sydney. It was called StreetHawk. It was a location based marketing platform for matching retailers and shoppers, you know, as they are walking past, based on location. And it was a really interesting, you know, evolution for me because prior to that, I was an embedded systems kind of engineer, you know, working on device drivers and games. Right?
So the closest thing to data was how do you load really large asset files for games. When I started getting into web, you know, even though it looks like, hey, a user comes in, gets a page, downloads something, goes away, there's so much behind the scenes in terms of how data is managed, loaded, moved, you know, kept consistent, kept clean despite bugs happening all around. And that was kind of the start of my world of startups, into data, into large scale systems, and moving into the US around 12 years ago just made that more fun and accelerated for me.
Later on, I was at LinkedIn working on the profile and feed platforms. Again, a lot of data problems, a lot of fun stuff there. And then, you know, going from a user of these platforms to building the platforms themselves, I moved on to Google where I was working on Dataflow, where I was, I guess, selling a platform. Yeah, that's kind of my journey into data and, you know, continual learning as I go through.
[00:04:15] Unknown:
Yeah. It's interesting when people start from that very low level, in your case the network, dealing with packets, how that colors their overall experience with other technologies. Myself, I started as a systems administrator. It was my first job in tech, and so that has definitely colored the way that I think about software and programming, because I wanna make sure that things are reliable and aren't gonna break and wake me up in the middle of the night, because that's where I started my career.
[00:04:45] Unknown:
I mean, I saw that you did some really, you know, interesting things. And I can see that, you know, those experiences, even though they may sound very different and offbeat, they really bring a different perspective for you when you solve these problems. Right? For example, in the networking world, you know, back then, 20 something years ago, it was all about network and switch design. Right? Everything was about how do I relieve load on some processing cluster on some rack on some network. Right? And then if you think about data, it's about, you know, okay, I have this giant blob of data, like petabytes of data. I can't process it, you know, on a single core. Right? So how do I route it? How do I switch it? How do I break it down? How do I shard it? Like, all that comes in as part of this. Right? And I think there is a certain bit of gleeful joy, you know, when you bring in those completely different, unrelated topics into something else you're working on. Right?
[00:05:42] Unknown:
Absolutely. Yeah. And in terms of the topic that we're covering, migration projects, obviously data gets moved around all the time, but when we say migration, that usually means that there's some specific body of work that hopefully has an end, but definitely has a beginning. And I'm wondering if you can just give some of your perspective on what you see as the kind of broad categories of data migration as a project, and maybe some of the ways that people might think about it that we are explicitly excluding from this conversation.
[00:06:23] Unknown:
Yeah. Yeah. Now, if you are a Seinfeld fan, or even if you haven't seen any episodes, there's a really good episode where George Costanza talks about risk management, and how there's a really boring introduction, you know: to understand risk management, you have to understand risk. Alright, nothing can be drier and more boring than that. And if you look at data migrations or database migrations, it's very much like risk management. Right? If you look at it purely as, you know, migrating a database to something else, it can get very dry and very boring, as you called out. Right? There's a start, there's an end, and voila, off you go. But so much happens when you bring in other constraints.
Right? If you think of data migration as, look, I have this big system, I need to move it somewhere else, that's it. Right? The easiest, dumbest thing you can do is, you know, first cut off your customers, take a backup, load it on the new database, take the downtime at 3 AM in the morning, move your customers, and restart it. That's kind of not how today's systems work, or more importantly, it's not how customers would be happy with you if you did that today. Right? They want, you know, near 100% uptime or zero downtime. They want all kinds of other functionality in between. There's so much diversity in the kinds of data in terms of cardinality, the structures and schemas and so on. And, you know, you're talking about global scale, with global distributions. Right?
You might have data for customers in America being served out of India. Right? So as you start looking at migrations as a data systems problem, right, things get very interesting. Yes, there is a start and there's an end, but it's a very, very interesting middle. Right? That's what makes data migration tools and strategies and systems really shine and, you know, takes it away from being anything of a boring topic. Right? For example, isn't migration just a form of replication, at the nth degree? Right? Yes, you might have one database to another database, or one MySQL to another MySQL, or SQL to SQL, or then you can go SQL to NoSQL.
Think about building indexes, you know, as a form of migration. You might not wanna migrate the whole database. You can just look at partial migrations. A good example is when you have startups or any company, right, building a product that's a v0 or v0.1: everything gets stored in one database, served through some kind of ORM, off you go. But as you grow to, you know, a billion, ten billion kind of records, one database is not gonna do. Right? You will have to shard it. Suddenly, it becomes a set of partial migrations. Right? We talked about, you know, different kinds of databases to and from.
And then, you know, there's things like cardinality, where, you know, are you going from one-to-many relationships to many-to-many, or even a redistribution of things. Right? So there's a lot of these interesting cases that come in in the middle, when you look at migrations and transformations and moves.
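To make the "partial migration" idea a bit more concrete, here is a minimal, hypothetical Python sketch of routing traffic while only some users have been moved to a new sharded store. The store interfaces and the `migrated_users` set are illustrative assumptions, not a system described in the episode.

```python
# Hypothetical sketch: route reads and writes per user while a partial migration
# is in flight. Routing is keyed on the same attribute used as the shard key,
# so a single user's data never straddles both systems.

class MigrationRouter:
    def __init__(self, old_store, new_store, migrated_users):
        self.old_store = old_store              # legacy monolithic database client
        self.new_store = new_store              # new sharded store client
        self.migrated_users = migrated_users    # users whose rows already live in the new store

    def _store_for(self, user_id):
        return self.new_store if user_id in self.migrated_users else self.old_store

    def get_follows(self, user_id):
        return self._store_for(user_id).fetch_follows(user_id)

    def add_follow(self, user_id, target_id):
        self._store_for(user_id).insert_follow(user_id, target_id)
```

Cutting users over shard by shard like this keeps the blast radius of any single step small, at the cost of running both systems (and the routing logic) for the duration of the migration.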
[00:09:27] Unknown:
In terms of migrations, usually, there's some sort of impetus or some hopeful improvement in your overall state at the end of it. I'm curious if you can talk to some of the typical motivations for starting on a migration project, and then we can dig into some of the strategic elements of how to think about it, how to plan for it, how to execute it. Yeah. Yeah. So
[00:09:53] Unknown:
there can be different triggers for migration. Right? You might just be at a point where, you know, you put everything in a single data store, and you kinda reach limits: storage limits, QPS limits, like how many customers are accessing it, you know, what kind of patterns you're seeing, they can change. You might have, you know, as your application or platform or API or the client facing thing morphs from day 1 to day 100, your schemas change, your use cases change. Right? As pure as we wanna be, you know, your front end or your application is a reflection of your data, and your data organization is a reflection of your use case. Right? So when the use case changes, your data organization is gonna change. So a good example is, you know, you might suddenly need a whole bunch of indexes just to make certain queries faster. Right?
So that might trigger it: you know, how often those use cases are exercised, what kind of SLOs you want on that access. Do you want the same SLOs for viewing my LinkedIn feed versus viewing my LinkedIn profile? Right? So those might mean that you're okay with certain sacrifices and certain trade offs, which might mean that, hey, look, why am I keeping everything in the same database? Move it around, break it up, break it into multiple microservices that other teams can manage, you know, change ownership, and that results in more reasons for migration.
You might even have things like compliance and privacy requirements that might change how your data is stored and, you know, retrieved and processed, and what kind of ACLs you might wanna have on your data. Right? And applying those access controls might have different, you know, different costs. Right? That may not be homogeneous to who's paying for it. Right? So again, you might wanna break that up to serve different needs and different SLOs. And also, you might have things like, you know, team expertise and team limitations, where certain data stores could be useful for certain kinds of features that a team would be good at managing, whereas some teams might be more oriented towards a different class of problems. Right?
Then you have your whole offline, online, nearline use cases where, you know, you go dump data to some data lake for analytics purposes and a lot more.
[00:12:18] Unknown:
In the intro, I also had that keyword of large scale, because if you're just dealing with the data migration of a 50 megabyte Postgres database to MySQL, there's not a lot of planning that has to happen. You just need to make sure both databases are running, you do a SQL dump, a SQL load, and then you cut over. And I'm wondering what are some of the ways that gradations of scale impact the overall time and effort needed to be able to actually plan for and implement one of these migration projects?
[00:12:50] Unknown:
Right. So, you know, looking at the gradations and scales, right, I would look at it as, look, you know, what kind of consistency, what kind of freshness, what kind of, you know, latency requirements your access needs. Right? Now take something like a 50 megabyte, you know, database that you wanna move from one to another. It could be SQL to SQL or SQL to NoSQL. The size is a big factor here, and in this case it's small. So, you know, it is fine: even if you were to kind of do a manual, physical, one time cutover, the downtime itself is gonna be pretty minimal. Right? You might have more effort in testing the migration and so on, but the overall downtime is kinda cheap. It's almost a lift and shift, or a lift and transform and shift.
And fairly manual, with downtime for users probably on the order of minutes. Right? As the data sizes kinda grow, right, you might say, look, I have a petabyte, you know, like, a few terabytes of data in some data store that I need to break up and move and so on. I could do it from a streaming perspective, but we'll come to that later. Or I can do a full, you know, dump and reload. Right? We're talking about a few hours, if not days, of downtime for the customers. Right? Again, if all you care about is the ease of lift and shift, that's fine. You might be okay with your customers having downtime for a few hours in a very bespoke use case, and it's perfectly fine. In today's systems, right, it's not just one aspect of the data that a customer demands. Right? It's not just that, hey, I have a bunch of profiles in this database, I wanna move them to that database and start using it. What customers wanna do is: I have my profiles that I'm using, I also wanna be able to post feeds, I wanna create content, I wanna like and, you know, follow other people's content. How can I do all that, all the time, without a downtime? And now I need to start thinking about how I can reduce my downtime as much as possible while making this process less complex. The other aspect is, we talked about indexes. Right? And if you think about all these secondary, you know, third degree, fourth degree views that you can get from your data. For instance, you might have, you know, a Twitter timeline, which is a collection of the posts that you can see from all your friends. Right? I mean, this is a classic system design problem. Right? You know, I'm posting, I wanna see all my tweets, but somebody else wants to see mine as quickly as possible. I could be an influencer. What are the consequences of that data getting around, where, and at what point in time? So when you're migrating, if you look at that as a replication problem, how quickly I can see it, or how much lag I'm okay to tolerate in terms of the wait time, can affect, you know, what choices I make in my migration process. So one example is you can use something like a change data capture process where any changes to a database, right, are extracted, processed, transformed, and written to a target database.
This is the proverbial middle that we're talking about. Right? You're reading from the source database, it goes through a whole pipeline of changes, and it gets written somewhere. And there are many tools out there, right, that can do this too. But if you say that I want my freshness to be as fresh as possible, if I write to one database, I want that to be visible in the second system immediately, then you can use something like a dual write approach, which has its own kind of caveats and trade offs. So the planning is important. You've gotta plan what kind of experience the user is okay to tolerate, what that means for your business metrics, and how it will impact your engagement and other KPIs. And then look at how expensive and costly it is to pick which way you go.
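As a rough illustration of the dual-write approach and one of its main caveats (the two writes are not atomic), here is a hedged Python sketch; the client classes, the `transform` mapping, and the error-handling policy are assumptions for illustration, not the exact system discussed in the episode.

```python
# Minimal dual-write sketch: writes go to both stores while reads stay on the
# source until validation passes. A real rollout also needs a backfill
# (snapshot + CDC) for rows written before dual writes were enabled, plus a
# reconciliation job, because a failed target write leaves the stores diverged.

import logging

class DualWriter:
    def __init__(self, source_db, target_db):
        self.source_db = source_db
        self.target_db = target_db

    def write(self, key, record):
        # The source write stays authoritative: if it fails, the whole write fails.
        self.source_db.put(key, record)
        try:
            # The target write is best effort; failures are logged and repaired
            # later by backfill/reconciliation rather than rolled back.
            self.target_db.put(key, transform(record))
        except Exception:
            logging.exception("target write failed for %s; repair via backfill", key)

def transform(record):
    # Placeholder for the schema mapping between the old and new systems.
    return record
```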
[00:16:30] Unknown:
In terms of the end goals of a migration, there is obviously a large variety. Sometimes you want to move from one database technology to another because it has some new shiny feature you wanna take advantage of. Maybe it's for reasons of scale, as we mentioned, where your single node database can't handle the amount of data that you're dealing with anymore, so you need to move to something that has sharding and replication built in, or maybe you need to move to a completely different style of database, maybe moving from a relational database to something like Cassandra for those reasons of scale. Or it might be for cost purposes, where you start off with a cloud data warehouse, but then you start spending $1,000,000 a year on it, and you aren't getting all the return on investment that you hoped for, and so you decide to move to more of a data lake or a lakehouse style architecture.
Or maybe it's a continual migration process where you need to keep a subset of your data in a preproduction environment for development and testing purposes. I'm wondering if you can just talk to some of the ways that those motivations and end goals can color the overall appetite and the organizational processes necessary to be able to get buy in and support for that migration process.
[00:17:52] Unknown:
Yeah. So, you know, your costs play a huge factor in this. Right? And cost can come in a bunch of ways. Like, there's obviously the raw cost. As you called out, putting something in a data warehouse, you know, is gonna be expensive from a compute perspective and not just storage. Right? You're dumping everything in, and you need a lot of, you know, transformational compute there to make sense, for the consumers, of all the data stored in your warehouse. Right? And depending on who it's available for, that could be an acceptable cost. Now, what these warehouses and data lakes give you, I mean, the promise there, is that, you know, you can do all that with very little effort, and there's a whole bunch of ingestion pipelines that give you things out of the box. There's a lot of connectors from all kinds of data sources that you can just dump in, and then people can take advantage of it. If the ROI is high enough and the return exceeds the cost, it's okay for a certain point of time. But if you find that you're doing all that ingestion of data just for 3 or 4 bespoke use cases, you're unnecessarily spending all the resources just for that; it doesn't quite need that kind of heavy ingestion. Right? And the other thing is, you know, you also have technical aspects like who's gonna be managing this, who's gonna manage the start of it, you know, the actual process of continuous or one-off migration.
And post migration, who's gonna manage the system? Those costs would also have to be looked at in the overall cost of doing this migration from A to B. One thing we kind of glossed over was security and compliance. Right? Like, you know, when you're doing this, what is the current state of how secure and compliant your data is? What will be the final state in the target, and during the migration itself? So I think these play a decent role in how you choose, you know, when and why and how you do the migration.
[00:19:50] Unknown:
Now, it's very easy to talk about a lot of these aspects of data migrations in the abstract, where you maybe have various motivations, different organizational structures. But bringing it to a more concrete reference point, I'm wondering if you can pick one or two experiences that you've had of actually going through this process: understanding the motivations for why and how fast and where you need to migrate data from and to, the overall technical and project management process of that, and then the validation and final sign off to say, yes, this is done, I have achieved my overall objective, and now I'm gonna go and get a big bonus because of it.
[00:20:34] Unknown:
Well, I don't think I got that big bonus, but, yeah. So at LinkedIn, I joined LinkedIn in 2015, right? And initially, I was part of the LinkedIn feed platform. And part of the feed was, you know, your user posts, your users' social actions around them, other kinds of shares and comments and likes and, you know, social activity. I mean, LinkedIn itself has gone through several stages of data evolution. They started off with, you know, I believe it was Oracle first for a long, long time. And then LinkedIn built its own, like, extremely high performing, you know, Espresso storage, which was a very, very, you know, highly sharded MySQL.
It was based on MySQL, and sharded. Right? It was essentially a key value store. And the feed also had many aspects, many components. In one of my platforms, the social actions, you know, how a user likes or comments, we had to move that from one Espresso store to a different Espresso store, right, for a lot of performance reasons. You know? And two aspects there were really important. Right? Like, how do you ensure that these actions were generalizable? Previously, they were very much tied to the feed itself. So we wanted to build a platform where we could take these actions and generalize them across any other entity in LinkedIn for the user. Effectively, what we called it was an edge store or follow store. We called it unified follows, or UFO. So the simplest thing was, I mean, we couldn't do a lift and shift at the time because doing so would have impacted about 100 billion edges at the time, and that was just not gonna happen in any reasonable amount of time. We, you know, saw a few options. Obviously, one of them was lift and shift, and the other one was to see if we could somehow do a bulk dump initially, followed by some form of CDC, or change data capture.
LinkedIn already had what's called Databus. It's a very, you know, high performance CDC system, coupled with Espresso as well. Right? Essentially, CDC for Espresso, the way it worked, was it would write both to the MySQL table as well as the change log in a transactional manner, and that gets picked up by, you know, Databus, and it gets listened to by a bunch of other systems. So our goals for this were quite straightforward. Right? We wanted to make sure that when these edges in the source database were migrated, they were all migrated. We had to make sure that there weren't missing edges. The last thing we wanted was customers, you know, liking something and finding that those likes were missing.
We wanted to make sure that the migration was repeatable, just in case we did something wrong; we wanted to kinda go back and, you know, fix parts of it as opposed to going through a full migration process all over again. Our original target was to make sure the client did not have to change, as in, the LinkedIn user facing clients don't have to change. And I'll get to how we worked around that, or worked through parts of it. We also wanted some testability on it, and to see how it's affecting any impact for the customer. Load times are critical for, you know, engagement on LinkedIn, or to improve engagement on LinkedIn, so we had to make sure that those were monitored. And, obviously, we wanted to make sure that they were consistent. We didn't wanna have out-of-order writes affect any of the consistency and correctness of the edges before and after. By the way, the schemas were different, so that's a key part of it. It wasn't a homogeneous transfer from the old to the new. Right?
So we took a two part approach to this. One was, at any point in time, you would take a dump of the source database and load it into the target DB. Now, taking a snapshot wasn't your typical snapshot process. We would have hourly and/or daily snapshots saved in HDFS from our Espresso stores. Right? So we would use that, and the LSN or sequence number in those snapshots, as the initial load at any point in time. The CDC would store these changes from the database into, I believe it was Kafka at the time. Right? So you can just read those Kafka topics and write back to these systems. Now, because our follow source was the user, right, it could be any entity, but we chose the source as the sharding key, we could pick, for any source, only the changes for that source in the last 1 day, 7 days, 30 days, and so on, and replay that when we needed it. Now, once we set up this pipeline, we wanted to see if we could improve the freshness of writes, because we wanted users to have their own writes available to them on both systems. So we did take on a dual write approach in this.
Now, dual writes can be, I mean, they look simple. You know, you have a single client, the client writes to both the source and target database, and you also do a backfill of data from the source to the target through some kind of CDC or some kind of replication method. Now, it looks simple, but there are so many things that can go wrong here. Right? If you look at the traditional dual write, you know, you do this write to both, read from both, validate, do transformations, and off you go. So when you do this, if you had distributed transactions, that can be a problem. Thankfully, you know, on the client, or at least on the API facing layer, we knew that writes to a single user's follows we could send to the right shard. We were routing at that level. We didn't worry too much about backups at that point, because we were depending on the hourly/daily HDFS dumps to, you know, take care of those backups for us. We also didn't focus too much on disaster recovery, because those mechanisms existed on the individual databases. We could always fall back on these snapshots to do a recovery rather than falling back on the DR aspect. Yeah. So this process took about, you know, a month and a half all up. The initial load was the longest, at about a week, and then a bunch of incremental testing of the CDC and the other combinations to make sure that our consistency was in place. Now, one challenge was how do you handle deletes, because sometimes CDC may not give you the right kind of delete events, and that's something to be careful of. But, actually, Databus had a mechanism to store the write delta on deletes as well. So we were able to use that to ensure that deletes were properly taken care of, and there weren't any follows or edges on the target system that had been deleted from the old one. On the client side, there was a change of schema to spread. So we ensured that clients were migrated in a way that the schemas themselves were compatible, so that both old clients and new clients would continue working. The old clients would come to the old system, and any new fields that were added to the new schema would have both the new and the old values in them. So even if the old clients were reading this one, they could have the old fields in there, until the old database was fully deprecated and removed. This whole process, I mean, the actual migration step, took about 6 to 8 weeks roughly, with a whole bunch of testing following for a few months. I mean, it was a reasonable success; I don't think we had any major downtime or similar issues. Now, the bonus, I don't know. I should talk to somebody about that.
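The "bulk snapshot plus CDC replay" pattern Sriram describes can be sketched roughly as below. The event shape, the sequence-number field, and the store API are hypothetical stand-ins for the Espresso/Databus/Kafka specifics rather than their actual interfaces.

```python
# Sketch of initial load + change replay. Assumes each snapshot row and each CDC
# event carries a monotonically increasing sequence number (LSN-like), so changes
# already reflected in the snapshot can be skipped, and deletes arrive as
# explicit tombstone events rather than being silently dropped.

def load_snapshot(snapshot_rows, target):
    max_seq = 0
    for row in snapshot_rows:                    # e.g. an hourly/daily HDFS dump
        target.upsert(row["key"], transform(row))
        max_seq = max(max_seq, row["seq"])
    return max_seq                               # replay changes from here onward

def replay_changes(cdc_events, target, start_seq):
    for event in cdc_events:                     # e.g. a Kafka topic of change events
        if event["seq"] <= start_seq:
            continue                             # already captured by the snapshot
        if event["op"] == "delete":
            target.delete(event["key"])          # tombstones must be applied too
        else:
            target.upsert(event["key"], transform(event["row"]))

def transform(row):
    # Placeholder for mapping the old schema to the new one.
    return row
```

Keying both the snapshot and the change stream on the same sequence number is what makes the replay idempotent and repeatable, which matches the "repeatable migration" goal mentioned above.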
[00:27:51] Unknown:
I wonder if there's a statute of limitations there. I hope not. And in terms of these styles of projects, there's the inherent risk that it starts, you get underway, you're in the thick of it, you get to what you think is the end, and then the goalposts keep moving or interruptions come along, and so it becomes a perpetual migration. I'm wondering if you could just talk to some of the other risk factors that are involved in embarking down the path of executing a migration, and some of the other failure modes that you've experienced.
[00:28:27] Unknown:
I don't know if I can talk about the politics of this, but, like any project, right, I mean, when we started off doing this, the initial goal was, hey, a user has a bunch of things they acted on. And I use that very vaguely: a user can like, share, do X on another entity. And initially, this was: a user can just like, comment, or share on feed posts. Now, being engineers, you know, both from our own kind of ambition as well as, you know, trying to get buy in on this from other parts of the organization, it ended up becoming, you know, any entity to any other entity, you know, action. Effectively an edge store. Like, a very, very general type of edge store. I mean, you can do this, but, you know, it becomes a very large scale project: pretty much migrating every entity in LinkedIn, and each and every one of its relationships to every other entity in LinkedIn. That would have been a multiyear project disrupting many, many projects, sorry, features and products, within LinkedIn. And we obviously wanted to start with a smaller scope. So we settled on a user to a select half a dozen entities, and a few other actions along the way. Because there is a temptation to kind of do everything, because it is obviously glorious, but you have to kind of, you know, put a line around that v0 or v1 before you start taking more things on. You might find that the other things are nice to have and may never actually materialize due to resourcing and prioritization constraints.
And I know I'm being intentionally kind of vague there because it'd be giving away too much. No, that's... I totally understand.
[00:30:07] Unknown:
No worries. And through your work of interacting with these systems, planning for and implementing some of these data migration projects, I'm wondering how that has influenced your overall thought process and approach to the design and implementation of data systems, data infrastructure, and also maybe the modeling architecture that's involved, and some of the overall initial planning process that you have factored into the early stages of building these types of systems because of that? You know,
[00:30:43] Unknown:
when you do a lot of these, you learn a few best practices, and you learn along the way. And I feel like despite that, we will still make these mistakes and repeat them every now and then, just as a sanity check, just for a reality check. But regardless, right, I mean, like anything else, you know, there are a few things that can help reduce the risk of any migration. And I think, you know, having a really, really good plan, and seeing who the stakeholders are, and knowing the current state of all your metrics and all these systems, is very important. Right?
In the case of, you know, the follows migration, or the edge migration, we kinda had an idea of who the customers were. It was the LinkedIn feed. Yes, later on we found a few good use cases for how we could apply that to companies and pages and so on. But having that initial plan and knowing what a rough endpoint looks like was kinda key. I mean, even with that, we had to analyze what the starting point was, what the starting data store and the type of data and the cardinality were. All that was key. And, I mean, we gave it the right project management rigor like anything else. We came up with a migration plan.
I'd like to think it was detailed, you know. And looking back now, clearly we missed a few things and it wasn't detailed enough. Right? But starting off with what we thought was a detailed plan was a good point. Right? Getting that plan reviewed and vetted by more senior folks, more experienced folks, or even talking to other teams that had done something like this, was helpful. Again, having the timelines and how much headcount this would need, all that helped in getting buy in too. Right? And, again, you know, understanding what going to this target data store would mean for the product. Previously, I mean, from memory, the data access on the previous data store, I believe, was a low single digit number of queries for a certain use case. Our analysis showed that this would double in the new use case. So we had to talk to our product folks: hey, is that latency hit acceptable?
How can we parallelize, slash, you know, make it manageable? And are there other ways to handle the same user experience that, you know, can be improved in other cases? So these impacts can be mitigated. Right? The other one was, how do we do this in multiple stages? We didn't want this to be a, you know, full blown, one giant migration where we start today, there's silence from the team for, like, 9 months, and then suddenly, hey, here's the new system, go use it. Right? We didn't want that to be the case. We wanted it to be something more iterative, so we could bring on the front end teams to also A/B test the new target system at different points in time, and use that feedback and the impact of those changes on metrics. Right? The other one was kind of data quality, and testing that the migration actually is going the right way. When we got up to x percent, we marked a few aspects of the original data store that we wanted to be reflected in the target data store, or the target database. Right? Things like: how many users have how many follows on this kind of entity type? Does the same x percent hold in this smaller sample as well? Things like that. Same thing for data quality. Right? Like, you know, how many entries are missing? How many entries have been wrongly transformed?
This should ideally be known before you start, but knowing what that criteria is helps you better test, right, as part of the transformation. And really, you know, having a process around it; again, there's the usual process around any other service that you run, like monitoring and, you know, getting feedback and getting all the logs, and the operational observability side of things applies to data migrations too. You find that a data migration is also an opportunity to see if you can redesign your entire system better. Take a very simple example of going from SQL to NoSQL. In SQL, you have one-to-many. What does that map to in Cassandra? There are, you know, I mean, there are no foreign keys where you can join. I mean, you might have the keys, but there are no joins. So how do you store your list objects in the document store? Right? Now, does that change your UI and your backend picture? So when you start writing these index entries, do you have to do it manually, or do you wanna do it asynchronously through CDC again? So those decisions kind of surface, and you can view them from: okay, if I do that, I'm gonna incur a freshness lag of maybe 1 second or half a second, but I can have 10x the throughput because I'm not blocked on, you know, taking locks across tables to do an index update. So you can rethink these things as part of your architecture. Right? So when you design systems, I would say start with that mindset and see how you can be flexible, how you can be scalable, how you can start putting checks for data quality in place.
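For the data-quality checks described above, a minimal reconciliation pass between source and target might look like the following hedged Python sketch; the store interfaces, queries, and sampling strategy are illustrative assumptions rather than the tooling used at LinkedIn.

```python
# Sketch of a reconciliation check during a migration: compare aggregate counts
# per entity type, then spot-check a random sample of keys for transform correctness.

import random

def reconcile(source, target, entity_types, sample_size=1000):
    report = {}
    for etype in entity_types:
        src_count = source.count(entity_type=etype)
        tgt_count = target.count(entity_type=etype)
        report[etype] = {"source": src_count, "target": tgt_count,
                         "missing": src_count - tgt_count}

    # Spot-check: sampled records must transform to exactly what the target holds.
    keys = source.all_keys()
    sampled = random.sample(keys, min(sample_size, len(keys)))
    mismatches = [k for k in sampled if target.get(k) != transform(source.get(k))]
    report["sampled_mismatches"] = len(mismatches)
    return report

def transform(record):
    # Placeholder for the source-to-target schema mapping under test.
    return record
```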
If you go really, really old school, queueing theory is your best friend. Ultimately, everything is, you know, pipes and nodes. But some pipes get bloated, some lots of data are really gigantic, and how you break them up, route them, and switch through them is key. And modern systems are no different. Right? You just wanna think about it.
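As a nod to the queueing-theory point, Little's law gives a quick back-of-the-envelope check on replication lag during a migration; the numbers below are purely illustrative, not from the episode.

```latex
L = \lambda W
\quad\Rightarrow\quad
W = \frac{L}{\lambda}
  = \frac{10{,}000 \text{ queued change events}}{2{,}000 \text{ events applied per second}}
  = 5 \text{ seconds of replication lag}
```

In other words, if the backlog of captured changes (L) grows faster than the apply rate (λ) can drain it, the expected lag (W) grows without bound, which is exactly the "bloated pipe" situation described above.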
[00:35:44] Unknown:
Absolutely. And another aspect of data migrations is that some technologies, and it seems like it's been getting better over the years, but there are definitely some technologies that are very resistant to facilitating that type of project, where they really just wanna lock everything into their little ecosystem or into their little box, and trying to get anything out is like pulling teeth. Whereas other systems are designed to be interoperable, and they're designed to be maybe composable, in particular things like modern lakehouse architectures where you have these column oriented storage formats and you can bring different query engines to bear for different use cases.
And I'm wondering what your experience has been, in terms of the progression of years and progression of technologies, as to how that maybe simplifies that work of data migrations, and some of the areas that you see as ripe for further investment to improve the end user experience, or maybe even reduce the need for migrations because those technologies can be more easily composed together.
[00:36:50] Unknown:
You know, I'm gonna say something controversial here.
[00:36:54] Unknown:
Alright. Let's hear it.
[00:36:56] Unknown:
And I'm now scared to be quoted on this. Nope, I like controversial. Hey, it's really, really amazing that there are so many tools and so many frameworks and so many platforms that are coming up to address different parts of data migrations. Right? Again, we talked about how, you know, in the beginning, there was system A that had to become system B. So you could cut the customers off, do a snapshot, reload, oh, and restart the customers, and be done with it. And because that's no longer acceptable today, we need to do migrations, you know, at zero downtime, with the highest levels of consistency and the highest guarantees of privacy and compliance and disaster recovery and everything else. So in that context, right, that stuff between A and B is the really complex, hairy bit. Right? Now, you can make the middle bit really simple and say, I'm gonna tie in any source you have via some kind of CDC or a pull based mechanism and directly dump it to a target like S3 or your warehouse or some, you know, target. Right?
Which gives you, you know, batch kind of freshness. Right? And then, as you start digging in: I want flexibility of which source I can pull from, I wanna do a whole bunch of transformations in the middle so I can take n sources and target m destinations. Right? So things like, for example, Dataflow, I mean, Dataflow powered by Beam, for example, right, is a really amazing pipelining, you know, engine for building the actual transformations from many sources to many targets. It comes with its own cost and complexities. Right? Then you have, you know, the people like, you know, Debezium and, I believe, Airbyte, that give you an open source but quick and simple model for configuring these pieces. And then you have, you know, folks like Fivetran, who have a much more Datadog-like, comprehensive ecosystem where they, you know, give you, I think, a thousand connectors and, you know, a thousand targets and, you know, many, many pipeline options and tunable things. Right? I mean, these are getting complicated because there are so many configurations to pick from. So, like, any one you pick, there's gonna be some configuring that you have to do. There's gonna be some maintenance. There'll be cost on the compute and storage side. Right? Now, for some, you know, they might say, well, we'll give you full control, but then you gotta manage your cost on your side. Tools where everything is taken care of for you can become very expensive, right, because there's a lot of, you know, opaque blobs that you have to deal with. So I like to go back to fundamentals, and this is where my controversial thing comes in. Right? If I don't understand it, that's always my problem. I'm not saying I shouldn't understand it. But if I don't understand it, I'd like to know what the trade off it is proposing and giving me. Unless I know those trade offs, I'm not able to decide which one I wanna use. And maybe, I was gonna say, maybe I'm pampered for having been at LinkedIn and Google, for having those tools available to me. But I think what I consider myself lucky for is understanding how they work and why they work and when they don't work. Or at least, you know, what I have to pay for making them work. I think when I have that, I'd rather fall back to cobbling things together myself, because I'm okay with it; I can see why things are going wrong and where things are going wrong. That's not for everybody, though. Maybe data movement and data migration is not your bread and butter, it's not your, you know, core business.
So starting off with one of these tools is a perfectly good option. Right? Finding the right Upwork talent to manage these tools is not such a bad option. But I would recommend understanding them, knowing what the trade offs are, and not being surprised when you do get hit by the bill.
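As an example of the "quick and simple" configuration model these CDC tools offer, here is a hedged sketch of registering a Debezium Postgres connector through the Kafka Connect REST API. The hostnames and credentials are made up, and some property names (for example, `topic.prefix` versus the older `database.server.name`) vary across Debezium versions, so treat this as illustrative rather than copy-paste configuration.

```python
# Illustrative only: registers a Debezium Postgres CDC connector with Kafka Connect
# so that row changes stream into Kafka topics for a downstream migration pipeline.

import json
import urllib.request

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",               # logical decoding plugin on the source
        "database.hostname": "source-db.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "secret",
        "database.dbname": "orders",
        "table.include.list": "public.orders",
        "topic.prefix": "migration",             # topics look like migration.public.orders
    },
}

req = urllib.request.Request(
    "http://kafka-connect.internal:8083/connectors",
    data=json.dumps(connector).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```

Even with a tool handling the capture side, the trade-offs Sriram calls out still apply: you are paying for the Kafka and Connect infrastructure, and the transformation and reconciliation logic downstream remains yours to own.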
[00:40:36] Unknown:
And in your experience of working in this industry, being involved in data migration projects, and particularly in your current role of helping organizations automate their runbooks, which I imagine involves some measure of data replication or data migration, resolutions, fixing things when they break in the process: what are some of the most interesting or innovative or unexpected ways that you've seen teams approach this question of data migrations, whether from understanding whether and how to do it to the technical elements of how they actually executed on it? Interesting.
[00:41:12] Unknown:
So our customers are not your traditional tech companies. Our customers are more on the L1, L2 support side, on the NOC and SRE side. Right? So they don't deal as much with data as my previous teams have. Right? So in a sense, we are building the tooling so that they don't have to worry about these migrations. Now, the scale of our migrations is much, much lower, by the way, in this company. We actually, again, it's surprising that we haven't used this word till now, we use AI to come up with the right runbooks, to look at their data to see what kind of transformations we should create for them. And these are, you know, low scale, low frequency data migrations.
I would always call them data fixing, because, because of their current engineering practices, data may not always be consistent. So instead of fixing it upstream, they are falling back on tools like ours to apply certain rules on what that data should look like. We had one company that was managing bookings for transportation, and a lot of the deletions were done using our tool, right, deleting bookings that were older than, you know, x days. Instead of doing it as an offline or a nearline job running in the background, they would just use the tool every x minutes or hours to clean up, to do the garbage collection. So again, given it's a low scale thing, we could almost get away with a lift and shift, by the way.
But we're somewhere between a series of small batch lift and shifts on much more partial datasets, rather than a full blown migration.
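The kind of periodic "data fixing" job described here, purging bookings older than some cutoff in small batches, could look roughly like the hedged sketch below; the table, column names, and the SQLite stand-in database are made up for illustration.

```python
# Hypothetical recurring cleanup: delete bookings older than a cutoff in small
# batches so the job never holds long locks or overwhelms the database.

import time
import sqlite3  # stand-in for whatever operational database the team actually uses

def purge_old_bookings(conn, max_age_days=90, batch_size=500):
    total = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM bookings
            WHERE id IN (
                SELECT id FROM bookings
                WHERE created_at < datetime('now', ?)
                LIMIT ?
            )
            """,
            (f"-{max_age_days} days", batch_size),
        )
        conn.commit()
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total          # backlog drained; run again on the next schedule
        time.sleep(0.5)           # brief pause between batches to limit load

if __name__ == "__main__":
    conn = sqlite3.connect("bookings.db")   # hypothetical local database
    print("purged", purge_old_bookings(conn), "rows")
```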
[00:42:50] Unknown:
And in your experience of working on these projects and going through these various generational shifts in the tech industry, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:06] Unknown:
You know, the challenges around data migrations: there are the technical ones, but a big majority of them are actually nontechnical. They're about understanding the end to end ecosystem of where that data or that data store fits in. And that means that, like any good engineer in any other domain, right, you have to be aware of, and interested in being aware of, the broader impact that changing or affecting that piece of data is gonna have. Right? Which means understanding the customer experience, understanding the business use case, understanding the priorities of, you know, who wants this done when and how and where, and how much you wanna spend for it.
So, yes, there are technical challenges. There are, you know, we can do synchronization, we can do all of these. You can read Martin Kleppmann's book on how to do the best in replication and sharding, I mean, how you can apply the best sharding and replication approaches. But when you get to the real world, that's like maybe 5%, 10%. The why, the what, who it impacts, those are the key. And just like that, you know, how you manage cost and be efficient, how you project manage this, how you communicate this, how you deal with real world surprises, those are all kind of, you know, there. Right? I don't see them going anywhere. And it's fine. Right? That's okay.
[00:44:25] Unknown:
And in your experience too, there have, I'm sure, been cases where somebody says, we need to move this data from over here in this system to over there in that system. What are some of the cases, or what are some of the heuristics, that you use to say, nope, that is the wrong choice, don't do it? Man, you're gonna get me in trouble now.
[00:44:45] Unknown:
Wrong can mean different things. Right? I like numbers. So when I look at a system design, I enjoy going into the numbers about how it affects latencies, how it affects cost, how it affects your SLOs, your throughputs, your user metrics. So your question can mean why and how. The first thing is why. Why do you want to do this? I know it's a shinier, newer system that might get you promoted and get you a bonus, but what is the value of it? Now as to the how: are the components architecturally sound? Are they just better technologies, or are there also impacts on how you're going to maintain them, who's going to maintain them, what the TCO, the total cost of ownership, is of both the migration process and the final system? And that's where I think being a manager kind of helps. You kind of see that intrinsically, you ask it a lot more naturally. And I would say that almost every senior engineer in almost any area will be trained to ask these questions. And, again, to me, the how is a small part relative to the why.
[00:45:59] Unknown:
And as you continue to keep an eye on the data industry, and as you work in your current role as a CTO where you have a lot of decision making power over what systems to use and how to use them, what are some of the things that you are keeping in mind as you build out your technology stack and as you work with your customers and help them in their decision making process to reduce the current and future pain that they will experience due to these migration projects?
[00:46:32] Unknown:
I mean, these days there are so many technologies out there that I find myself getting drawn to. And I love technology. I love reading about things, I love learning about things. And there are just so many that it's very hard to make them out through all the noise. Right? And to your earlier point, we talked about how there are so many systems to help in the middle of a migration, and each one promises a lot of delta in certain aspects. And there is the marketing material, and there is the reality.
These days, it's not that practical to try everything out. So you have to go back to your intuition on what you think worked, what might work, what the foundation is. Before I go there: different people have different ways of assessing the strengths and weaknesses of an approach. Mine is basically, can I understand it, at least from a theoretical foundation perspective? If I see that it makes sense, then I'm a bit more willing to take a punt on giving it a go. Right?
Even for us, we played around with a lot of vector databases in our product. I think about a year or a year and a half ago, there was such a frenzy about so many vector DBs on the market, like Pinecone and Weaviate, because they were getting lots and lots of VC money. Right? And, again, for me, what I didn't quite get was, I get that vector databases are very optimized for putting and getting vectors. What I didn't quite understand at the time was, look, Postgres is an amazing engine for which people have written some of the most sophisticated indexing libraries. Now, yes, being that generic is not going to make Postgres the fastest DB in the world on every benchmark.
But it's good enough for our use case, where we might put in a thousand or a billion documents a month. Again, we're not Google scale. Right? So do I really need something that complicated? We've been going just fine with Elastic and Postgres for a while. I see the same thing with our customers. Our customers get a bit starry eyed when they hear about any of this. There's a bit of FOMO around it: if I don't use it, will I miss out? Will I be seen as uncool? But really, going back to what your business is: if your business is about selling widgets, why are you trying to use the fanciest database out there? If your customer is not struggling today and isn't asking for the next 10x use case, what does this advancement give you from a business perspective or from a customer experience perspective? So I think tying it back to business and customer KPIs is always helpful. I mean, I still feel dumb that I don't understand how awesome these specific vector databases are. But, you know, I think I'm okay.
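As one concrete illustration of that "Postgres is good enough" argument, here is a rough sketch of vector search on plain Postgres, assuming the pgvector extension (a recent version with HNSW support); the `documents` table, embedding size, and function are illustrative, not the product's actual schema.

```python
# A sketch of vector similarity search on plain Postgres with pgvector.
# Requires psycopg 3 and a Postgres server with the pgvector extension available.
import psycopg

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(384)
);
-- Approximate nearest-neighbour index; plenty for workloads well below "Google scale".
CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
"""

def top_k(conn: psycopg.Connection, query_embedding: list[float], k: int = 5):
    """Return the k documents closest to the query embedding by cosine distance."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return conn.execute(
        "SELECT id, body FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec, k),
    ).fetchall()
```

The appeal of this route is that the same table can still be joined, filtered, backed up, and migrated with the tooling a team already runs, which is usually the point of staying boring.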
[00:49:45] Unknown:
So the long and short of this conversation is: data migrations are hard, just use Postgres.
[00:49:53] Unknown:
No. No. No. I mean, you know what? Back at StreetHawk 15 years ago, I think LAMP was the biggest thing: Linux, Apache, MySQL, PHP. Right? And MySQL was the most popular startup DB, for lack of a better word. Again, I was not a database person at the time; I was an embedded guy. And when I read about the two, I couldn't tell why one was better than the other. My switch to Postgres for StreetHawk was that it was more SQL compliant. That was the only thing I had to go by. 15 years down, you know where Postgres is today. Right?
Again, it's not about Postgres, but I think if you understand the fundamentals and you follow your heuristics, it takes you a long way. You could be wrong, and you'll learn something out of it. But don't be scared of missing out on the shiny thing.
[00:50:55] Unknown:
Absolutely. Use boring technology. You sleep better at night.
[00:51:00] Unknown:
I mean, to me, boring is sexy. Right? To me, boring that has a lot of good details and doesn't make things complicated is sexy.
[00:51:10] Unknown:
Are there any other aspects of this overall question of large scale data migrations, your experience of working through them, and some of the challenges that people are likely to face in the process that we didn't discuss yet that you'd like to cover before we close out the show? You know,
[00:51:29] Unknown:
I mean, there's always going to be more. Right? A few things that I keep thinking about. I've been reading a lot about data virtualization these days. And I think if you look at data lakes and data warehouses, they kind of give you an aspect of virtualization. The idea being, you have all this data, so why don't you just create views over all these sources without physically moving anything anywhere? How practical is it? I'm still not sure; my numbers aren't adding up. There's always going to be a use case where it works. If you're okay with the latency guarantees, say for creating expensive, one off, infrequent, detailed reports that need to join a thousand data sources, sure, that works. If you need a very live user dashboard being served to millions of customers, it's going to struggle. Right? The other one I've been hearing about, and I'm trying to get past the labeling aspect of it, is data federation. Again, how do you bring different kinds of data, in different transformed ways, to different customers? It seems like a very broad problem, but I think the details are where it's going to be interesting.
We spoke about replication. Right? I mean, in a sense, everything is replication, everything is a migration, everything is a move. One thing I really wanted to talk about, at least briefly, was this replication scheme that I came across from Netflix, called DBLog. Right? I mean, you have Debezium, you have Databus, you have HVR, and all these things that do replication for you. DBLog was so simple and beautiful. It works by creating watermarks in your change capture stream so that you can do incremental replication in a very live manner instead of waiting for a whole dump or snapshot. Definitely worth reading; I'm happy to share that link as well. And, yeah, I don't want to focus only on the data side. We didn't talk that much about how the application teams and, in double quotes, the front end teams should react to all of this. I think that's an interesting thing as well, which we could jump into. There's just so much here, really. Absolutely.
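For readers who want the flavor of that watermark idea without reading the paper, here is a toy, in-memory sketch of the mechanism as described in the DBLog write-up: a chunk of rows is selected between a low and a high watermark, and any row that also changed inside that window is dropped from the chunk, because the change log already carries its newer state. The dict standing in for a table and the list standing in for a log are illustrative assumptions, not Netflix's code.

```python
# Toy illustration of DBLog-style watermark chunking (editor's sketch).
table = {1: "a", 2: "b", 3: "c"}   # primary key -> value
log = []                           # append-only change-capture log

def apply_write(pk, value):
    table[pk] = value
    log.append(("change", pk, value))

def emit_watermark(name):
    log.append(("watermark", name, None))

def chunk_with_watermarks(keys, concurrent_writes=()):
    """Select one chunk bracketed by watermarks, then drop any chunk row that
    also changed inside the window: the log already carries its newer state."""
    emit_watermark("low")
    chunk = {pk: table[pk] for pk in keys}      # the keyed chunk select
    for pk, value in concurrent_writes:         # writes racing with the select
        apply_write(pk, value)
    emit_watermark("high")

    changed, inside = set(), False
    for kind, key, _ in log:
        if kind == "watermark" and key == "low":
            inside = True
        elif kind == "watermark" and key == "high":
            inside = False
        elif inside and kind == "change":
            changed.add(key)
    return {pk: v for pk, v in chunk.items() if pk not in changed}

# Row 2 changes while the chunk is read, so only rows 1 and 3 survive from the
# chunk; row 2's newer value reaches the target through the change log instead.
print(chunk_with_watermarks([1, 2, 3], concurrent_writes=[(2, "b2")]))
# -> {1: 'a', 3: 'c'}
```

Because log events always pass through untouched, the target converges to the source state without ever locking the source table for a full dump, which is what makes the incremental, "very live" behavior possible.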
[00:53:37] Unknown:
And in particular, I like you calling out the inclusion of the application teams in the process, because if you're trying to do it all from a single point solution perspective, then you're going to have everybody in the organization, their whole weight bearing down on you, waiting for this to be complete, because either you're breaking things or you're making them move slower, when really the entire organization should be moving in lockstep to achieve that end goal. And in a lot of cases, the application developers can simplify a lot of that work, because maybe you don't necessarily have to have all of the data completely synchronized in both systems.
Maybe they can write to one system and accept latency in the read from the other one, or maybe they can handle the double writes so that you don't have to handle as much of it in the back end. Or just incorporating them in the overall planning steps can help reduce the actual complexity of the move.
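One minimal way an application team can take on that work is a thin dual-write wrapper like the sketch below. The class name and store interfaces are hypothetical, and a real rollout would pair this with a backfill job and read verification before flipping reads to the new system.

```python
# Editor's sketch of an application-level dual-write wrapper during a migration.
import logging

logger = logging.getLogger("migration")

class DualWriteStore:
    """Writes go to both stores; reads stay on the old store until the new one
    has been backfilled and verified, so read latency never regresses."""

    def __init__(self, old_store, new_store, read_from_new: bool = False):
        self.old = old_store
        self.new = new_store
        self.read_from_new = read_from_new

    def put(self, key, value):
        self.old.put(key, value)        # old store stays the source of truth
        try:
            self.new.put(key, value)    # best effort; the backfill repairs misses
        except Exception:
            logger.warning("new-store write failed for %r; backfill will repair it", key)

    def get(self, key):
        store = self.new if self.read_from_new else self.old
        return store.get(key)
```

Flipping `read_from_new` only after the backfill has been verified is what lets the read path tolerate the new store lagging while the migration is in flight.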
[00:54:36] Unknown:
Yes, with a double exclamation mark. So, I mean, I know I missed this in the triggers earlier. Sometimes data migrations happen because you're trying to save toil for certain teams. Now, in a very, very small startup, that is just one team. There, your problem is, should I prioritize this migration now, or, to the earlier point, can we wait for 3, 4, 5 customers to come on board and pay for this migration? In really large organizations, the problem is similar, but it's framed in a different way.
You have a thousand teams with their own priorities, their own road maps, and their own monthly, quarterly, yearly OKRs. It becomes a prioritization problem: why now, and when, and how, and so on. So the teams coming together to walk in lockstep, that's ideal. But often it's a bit more coarse, in the sense that the teams have to see value for themselves first. Application teams might say, look, I was using API 1 before, my application is working fine, why do I care if you migrate this data now? Now, if the answer is purely that it will help us manage it better, that's not a good enough answer. But if the answer is that doing so will help you unlock those 5 things on your road map that you can target now rather than in 2 years, isn't that good for you? So I think that's where addressing the user concerns, the business priorities, and the roadmaps, all these nontechnical people problems,
[00:56:01] Unknown:
are important. Definitely. Yeah. And as technologists, it's definitely easy to fall into the trap of, I'm just going to engineer my way through this, when just talking to people can save you a lot of time. That's right. You know, be on the latest and greatest, baby. That's right. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know,
[00:56:31] Unknown:
there are a few, actually. Right? And, again, I know I keep saying this: the silos in an organization are problematic. As organizations grow, as companies grow, for better or worse, they find their own ways of doing things, and that changes how they manage data, access data, control data, process data, and so on. So, effectively, you do have data silos that have to be managed and worked across. Tooling and frameworks have to somehow bridge that gap and make it easy to work across those silos. Also, over time, for better or worse, organizations end up with a lot of bespoke, manual, nonstandard processes as they grow. A very knee jerk reaction is to say, hey, let's have a central data org. That, again, brings its own problems; you're into functional versus feature teams and hybrid teams and so on. So tools need to ensure that you can take whatever manual process you have today and onboard it into the tooling, rather than the tooling forcing everyone to change their process just for the tool. Again, I know it sounds like a buzzword, but AI actually can go a long way in transforming those manual things into these tools. The other thing is collaboration. A friend of mine was talking about data usability, data accessibility, and data collaboration.
Like, how are different roles and responsibilities in your organization coming together on a single, unified data fabric or whatever you want to call it? How are they able to make sense of data in a way that's secure and compliant? And I think here's where the idea of virtualization kind of makes sense: if it is performant, it will go a long way, because you can actually avoid the data moves. Cloud helps here because, again, cloud is still a vehicle, it's not the final point, but it can help you. And I guess standard tooling and standard APIs are the other thing to look forward to. But again, you have a lot of friction in the organization that is more people and process oriented, not necessarily technically oriented. I think as these areas evolve, there's definitely going to be a lot more innovation, a lot more tools, and a lot more frameworks, because we all want to see that world of seamless data management and integration.
[00:58:46] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experiences of going down these hard paths of data migrations and some of the lessons that you've learned in the process. It's definitely always great to hear from people who have been in the trenches and done the hard work. So I appreciate all the time and energy you've put into that in your various roles, and your time sharing it with us today, and I hope you enjoy the rest of your day. Thank you, Tobias. It was really, really fun being here and, you know, a great walk down memory
[00:59:16] Unknown:
lane.
[00:59:19] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Data Lakes and Starburst
Interview with Sriram Panyam
Understanding Data Migrations
Triggers and Motivations for Data Migrations
Case Study: LinkedIn Data Migration
Risk Factors and Failure Modes in Data Migrations
Impact of Data Migrations on System Design
Technological Progress and Data Migrations
Innovative Approaches to Data Migrations
Future Considerations and Best Practices