Summary
All of the advancements in our technology are built on principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked, and some observations on how to deal with that situation in a data platform architecture.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm sharing some thoughts and observations about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow
Interview
- Introduction
- impact of community tech debt
- hive metastore
- new work being done but not widely adopted
- tensions between automation and correctness
- data type mapping
- integer types
- complex types
- naming things (keys/column names from APIs to databases)
- disaggregated databases - pros and cons
- flexibility and cost control
- not as much tooling invested vs. Snowflake/BigQuery/Redshift
- data modeling
- dimensional modeling vs. answering today's questions
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform?
- When is ELT the wrong choice?
- What do you have planned for the future of your data platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- dbt
- Airbyte
- Dagster
- Trino
- ELT
- Data Lakehouse
- Snowflake
- BigQuery
- Redshift
- Technical Debt
- Hive Metastore
- AWS Glue
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost effective solution. Plus, it gives you more technical controls so you can fully unlock the power of your customer data. Visit dataengineeringpodcast.com/rudderstack today to take control of your customer data. Your host is Tobias Macey. And today, I'm sharing some thoughts and observations about the abstractions and impedance mismatches that I've seen in my experience of building our data lakehouse on top of an ELT workflow at my day job. To introduce myself, if this is your first time listening, I'm Tobias Macey. I have been running this show for about 6 years now. I've been working in technology for over 10 years, starting as a sysadmin, working as a software engineer, DevOps engineer, and doing data engineering. Now I manage a team focused on infrastructure and data platform engineering.
And in terms of the topic today, I have been building out our data platform for about the past year or so. Nothing massive scale, but a lot of variety in the different data sources that we're working with and the different stakeholders that we're trying to work for. As far as an overview, I've gone into a bit more detail on the specific architecture in some past episodes, which I'll link in the show notes. But at the top level, we're building a data lakehouse using an ELT paradigm. So Airbyte for ingest and load, S3 for storage, AWS Glue for the table metadata, Trino as the query engine, dbt for transformations, and Dagster to wire it all together.
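To give a rough sense of what "wiring it all together" looks like, here is a minimal, hypothetical sketch of a Dagster job that triggers an extract-and-load sync and then runs dbt. The host name, connection ID, and project path are invented for illustration; a real pipeline would more likely use the dagster-airbyte and dagster-dbt integrations and wait for the sync to finish rather than just firing it off.

```python
# Hypothetical sketch of a load-then-transform flow in Dagster: trigger an Airbyte
# sync, then run dbt once the raw tables are in place. The URL, connection ID, and
# project path are placeholders, not the actual platform configuration.
import subprocess

import requests
from dagster import job, op


@op
def sync_raw_data() -> str:
    # Kick off an Airbyte connection sync over HTTP; the endpoint shape depends on
    # your Airbyte version and deployment. A real implementation would also poll
    # for the sync job to complete before returning.
    resp = requests.post(
        "http://airbyte.internal:8000/api/v1/connections/sync",
        json={"connectionId": "00000000-0000-0000-0000-000000000000"},
        timeout=30,
    )
    resp.raise_for_status()
    return "sync-triggered"


@op
def run_dbt_models(sync_status: str):
    # Run the dbt project that builds models on top of the freshly loaded tables.
    subprocess.run(["dbt", "run", "--project-dir", "analytics"], check=True)


@job
def elt_pipeline():
    # Passing the upstream op's output downstream enforces the ordering.
    run_dbt_models(sync_raw_data())
```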
That has been a good technology stack for us and offers a lot of flexibility and room to grow and experiment. And today, I'm just gonna talk through some of the edge cases that we've run up against, some of the challenges that we've had in terms of building out that platform, figuring out what are the right workflows, abstractions, what are some of the cases where those abstractions break down, what are some of the cases where we're dealing with tech debt, some of that from our own work, but also tech debt at the community and ecosystem level.
And an interesting element of that is that the tech debt that we are dealing with isn't something that we consciously adopted. It's just something that has been an outgrowth of the history of how the data community has evolved over the past decade plus. So for instance, the way that tables are defined if you're dealing with this lakehouse paradigm is largely using the Hive metastore or something compatible with it, which in our case is AWS Glue. And Hive came about as an outgrowth of the Hadoop ecosystem, which came about as an outgrowth of the era of big data where everybody wanted to collect as much data as they could. And Hadoop itself grew out of Google's work on MapReduce.
And so it's just interesting that decisions that were made a couple of decades ago led to where we are today, where if you want to be able to run SQL workloads on data that lives in commodity storage, then you're dealing with a technology that was developed over a decade ago. There is work going on to update the way that the metastores are architected and implemented, particularly with some of these new table formats such as Iceberg and Hudi and the Delta format from Databricks. But by and large, if you wanna be able to use something that is supported by a wide variety of query engines and data engines, you're gonna have to use the Hive metastore. And so in order for us to be able to take advantage of that, we are actually using a connector in Airbyte where it will load the data from the source systems into S3 in JSON format and automatically create the different tables in the Glue catalog, or update them as necessary.
And that's some work that we actually had to do, which I believe I covered in a previous episode: we had to work with the Airbyte community to add a new connector on top of the existing S3 capability to build that connection to Glue, because otherwise there was a bit of a chasm between being able to extract and load the data and being able to actually query it. What a lot of people would do instead is use the AWS Glue crawler, but then you're introducing latency between when you first load data and when you can actually query it. If you're doing manual analytics, that's okay; you can deal with that latency. But if you're trying to orchestrate everything together and load the data and then execute transformations on it all in one flow, then it's harder to manage that latency.
And so that's why we wanted to have that direct connection from when the data gets loaded to when the table information is created, so that we can query it right away. Whereas if you're dealing with something like Snowflake or BigQuery or Redshift, those are fully vertically integrated databases, so the table creation and the data loading both have to happen at the same time and you don't have that disconnect. But because the idea of the data lakehouse, where it's a data lake technology stack that you're using in a warehouse format, is still a fairly new paradigm, there isn't as much tooling and capability built up around it. That's why we had to do some custom engineering to be able to bridge that divide. So that's one of the cases where the abstractions have broken down a little bit. There's a bit of impedance mismatch between the set of tools that are out there, the ways that you want to use them, and the ways that different areas of the community have grown organically.
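As a rough illustration of that "load, then register the table" step, here is a hypothetical boto3 sketch of creating or updating a Glue table over a freshly landed S3 prefix. The database, bucket, column names, and SerDe choice are placeholders for illustration, not what the Airbyte connector actually emits.

```python
# Hypothetical sketch: after newline-delimited JSON lands under an S3 prefix, register
# (or refresh) a matching table in the AWS Glue Data Catalog so the query engine can
# see it immediately. Database, bucket, and columns are made-up placeholders.
import boto3

glue = boto3.client("glue")

table_input = {
    "Name": "raw_orders",
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {"classification": "json"},
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "order_total", "Type": "decimal(38,2)"},
            {"Name": "payload", "Type": "string"},  # nested JSON kept as a raw string
        ],
        "Location": "s3://example-data-lake/raw/orders/",
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
    },
}

try:
    # Create the table the first time a stream is synced...
    glue.create_table(DatabaseName="raw", TableInput=table_input)
except glue.exceptions.AlreadyExistsException:
    # ...and update it on subsequent syncs if the schema has drifted.
    glue.update_table(DatabaseName="raw", TableInput=table_input)
```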
So I don't know if I would necessarily call that tech debt per se, because it's new development that's happening, but it's an area of investment that hadn't been made yet. So one of the things that you wanna think about as you're designing and developing a platform is: what are the places where I want to, or have the capacity to, do that custom engineering to be able to form the platform in the way that I envision it? When I first came into this, I was assuming, okay, well, we can just extract and load with a tool like Airbyte, and then I'll be able to query it in a tool like Trino, and all the pieces are gonna be there for me. And it wasn't until I started to go down that road, dig a bit deeper, and actually start working through that process that I understood that there were these gaps. And that's one of the challenges that exist in most areas of technology: from a surface level, it can appear as though all of the abstractions are in place and all of the implementation is there, and you just need to pick and choose which tools you want to work together. It isn't until you actually make those decisions and really start trying to build something in earnest that you find out where those abstractions break down, which pieces are missing, and which pieces you actually have to do some custom engineering on. And that's also one of the places where we need to be aware of the sunk cost fallacy of: oh, well, I already chose these things, I already started down this path and got things working.
Now I have to actually keep pushing through until I can get this done. Whereas once you hit one of those roadblocks, it's a signal that you need to stop and reevaluate your plans and figure out: is this actually going to be worth the effort? What are some of the potential next roadblocks that I need to consider, so that I can decide whether I actually do need to take this as a loss and say, I've learned some things, but this isn't the technology stack that's gonna get me to where I need to be, and I actually do need to reimplement or rearchitect and go in a different direction?
And that's also where having those clean abstraction boundaries is useful, if you're able to say this one piece of the stack isn't doing what I need it to do. So in my case, if I had said, I actually don't wanna build this connector to be able to work with Glue, I need to be able to write data and query it in a single flow, then I could still use Airbyte, but instead of investing further in this lakehouse paradigm of S3 and Glue and Trino, I could go with the Snowflake option, because there is enough technology built around that and it is a well worn path. Or, if I still wanted that lakehouse capability, maybe I would go with Databricks, because again, there is a lot of investment in that space.
Personally, I wanted to have more control over the entire stack, so that's why I wanted to stick with the fully open source set of options, recognizing, of course, that AWS Glue isn't actually open source, but it does implement the spec of the Hive metastore, which I could run on my own if I wanted to. Part of the reason I invested in solving this problem of being able to directly create the tables as part of the data integration flow is that I wanted to invest in automation. I have a small team, and so I need to be able to ensure that we're able to iterate quickly without having to spend a lot of cycles on data exploration, data discovery, and data transformation, and to realize the promise of ELT: you can get the data in, you can get it loaded, and then you can do the exploration and transformation after the fact, without having to worry about losing information or incorrect assumptions until you've actually done more of that exploration.
What I see as the complement to, or tension with, automation is correctness, where correctness often requires a lot of time and energy and engineering effort to ensure that you are fully capturing context and business knowledge and domain expertise in the final data product as it's landed. And it's not to say that you can't achieve correctness with automation, just that they are in tension, because automation requires that you have a certain set of baseline assumptions, a certain set of presumptions about the way that the data is landing and the way that the data is formed. And particularly when you're dealing with the data integration flow, even without doing transformation explicitly in that integration path, there is still the likelihood of losing some information and losing some context. For a concrete example of that, what I've been seeing with something like Airbyte, and this will be true for Meltano and Fivetran, etcetera, is that where you have this interface from the source into the destination, you have to do some sort of mapping, because the sources aren't going to have the same set of ideas and assumptions about the structure of the data and the capabilities that they provide versus the destinations and the capabilities that they are assuming.
And if you try to capture the entire possible range of mappings and transformations from one data type to another, then it eliminates the promise that these integration tools provide, where you want to be able to say: I just want my Google Analytics or my Salesforce or my Postgres database to be extracted and loaded directly into my S3 files or my Snowflake tables, etcetera. And in our case, with the Glue connection and the Hive metastore presumptions, because it was a new connector, we needed to understand what those mappings are. So, for instance, a Postgres database has a certain set of columns and types.
Airbyte has a certain set of types that it will support, which the sources need to map into. And then in that intermediary layer, these are the types that are able to be represented, so you need to tell it what you want those to be represented by in the destination. For people who aren't familiar, Hive supports all of the basic types, integers, strings, booleans, etcetera, as well as some complex types such as arrays and structs. Figuring out how to represent these different things properly goes back to the point of correctness. If I'm doing these transformations manually, I can investigate what all of the source datasets look like, map a field into a struct that I manually create if I'm writing my own DDL SQL, and then ensure that that type information is correct. But because I'm trying to do this in an automated fashion, I have to give up a little bit of that information at the ingest point and say: from the source, this is a nested JSON object, for instance, coming from an API, and in the destination I need to just represent it as a string, because the source isn't giving me enough information about what those nested structures look like. Or, in the case where I do have that information, for instance the Mailgun API, the connector does have that information for some of the structs, and so they will get created and mapped properly, but the keys in that nested document use hyphens, which are invalid for keys within a struct, so then you have to do some late transformation there anyway. And so we've recently been going through an iterative cycle of finding some of those edge cases of incorrect assumptions in the type mappings. One of them was that in the timestamp field that Airbyte emits to say when the sync was run, the numerical value was too large for the basic integer type in Hive, because that integer type is only a 4-byte integer and so it gets overflowed fairly easily.
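To make the kind of mapping being described here concrete, the sketch below shows a simplified source-type-to-Glue-type lookup with the lossy fallbacks mentioned above. This is purely illustrative; the real connector's rules are more involved, and the schema shapes and names are invented.

```python
# Illustrative only: a simplified mapping from JSON-schema-style source types (what an
# extract/load tool advertises) to Hive/Glue column types, showing the kind of lossy
# fallbacks discussed above. The real connector's mapping is more involved.
from typing import Any, Dict

SIMPLE_TYPE_MAP: Dict[str, str] = {
    "string": "string",
    "boolean": "boolean",
    # Map integers to decimal(38) rather than int: Hive's INT is a 4-byte signed
    # integer and overflows on values like millisecond epoch timestamps.
    "integer": "decimal(38)",
    "number": "decimal(38)",
}


def glue_type_for(field_schema: Dict[str, Any]) -> str:
    """Pick a Glue/Hive column type for a single source field schema."""
    json_type = field_schema.get("type", "string")
    if json_type == "array":
        item_type = glue_type_for(field_schema.get("items", {"type": "string"}))
        return f"array<{item_type}>"
    if json_type == "object":
        # Nested objects: the source often can't describe their shape, or uses keys
        # that are invalid as struct field names (for example hyphens), so store the
        # raw JSON text and unpack it later with the query engine's JSON functions.
        return "string"
    return SIMPLE_TYPE_MAP.get(json_type, "string")


# A sync timestamp in epoch milliseconds maps to decimal(38), not int.
assert glue_type_for({"type": "integer"}) == "decimal(38)"
# A nested object with hyphenated keys falls back to a plain string column.
assert glue_type_for({"type": "object", "properties": {"user-id": {}}}) == "string"
```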
So we actually needed to expand the maximum possible value that those numeric fields could take. We did some investigation and found that Hive has a decimal type which supports accurate numerical representations of both integers and decimal values, which is what we would want rather than just using floats, because of their potential for inaccuracies and rounding errors. But if you only say that a field is a decimal, it turns out that the Hive engine, and Glue holding true to that specification, will presume that you want a decimal of precision 10, which means that it can actually only be 10 digits long. So this is one of those back and forth things where we had to do some exploration and understanding. Decimal fields in Hive can take a precision and a scale value, and both of those max out at 38. So a decimal can be at most 38 digits long, and you can also say that the scale can be from 0 to 38, which means how many digits are on the right hand side of the decimal point.
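Here is a small worked example of those precision and scale semantics, mimicked with Python's decimal module under the assumption that a DECIMAL(p, s) column holds at most p digits total with s of them reserved for the fractional part. It previews why the first mapping attempt described next runs into trouble.

```python
# Worked example of Hive/Glue DECIMAL(precision, scale) semantics, mimicked in Python:
# a DECIMAL(p, s) holds at most p digits, with s of them reserved for the fractional
# part, so the integer part can use at most p - s digits.
from decimal import Decimal


def fits(value: Decimal, precision: int, scale: int) -> bool:
    """Return True if `value` fits in a DECIMAL(precision, scale) column."""
    _sign, digits, exponent = value.as_tuple()
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fractional_digits <= scale and integer_digits <= precision - scale


sync_ts = Decimal(1697040000000)  # an epoch-milliseconds timestamp emitted by a sync

print(fits(sync_ts, 38, 38))          # False: every digit is reserved for the fraction
print(fits(sync_ts, 38, 0))           # True: with the scale left at its default of 0,
                                      # all 38 digits can sit left of the decimal point
print(fits(Decimal("0.25"), 38, 38))  # True: purely fractional values still fit
```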
And so I said, okay, well, let's just make that mapping decimal(38, 38) so we have the maximum compatibility. But it turns out that the second 38 says we could take a 38 digit number, but all 38 of those digits have to be on the right hand side of the decimal point. So when we were trying to parse even simple integers, it would throw an error. The end result was that you actually only want to specify the precision, decimal(38) for the maximum number of digits, and leave the scale undefined so that it can be whatever it ends up needing to be. The other instance was for those nested documents and rich objects, where we said, actually, we just want those to be converted to a string in the JSON files so that we can treat them as a string in the table definition, and then we will just use JSON functions in the Trino engine to parse those values back out and transform them into the structures that we need. So this is another case of being willing to sacrifice some information in that automation flow for greater flexibility, so that we can do our own work at the point where we have the time to do it.
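As an illustration of that approach, here is roughly what unpacking a string-encoded JSON column with Trino's JSON functions could look like from the Python client. The connection details, catalog, table, and field names are all made up for the sketch.

```python
# Illustrative sketch: querying a string-encoded JSON column back out with Trino's
# JSON functions via the Python client. Host, catalog, table, and fields are made up.
import trino

conn = trino.dbapi.connect(
    host="trino.internal",
    port=8080,
    user="analytics",
    catalog="hive",
    schema="raw",
)

sql = """
SELECT
  order_id,
  -- payload is stored as a plain string; parse it and pull fields out at query time
  json_extract_scalar(json_parse(payload), '$.customer.email') AS customer_email,
  -- bracket notation handles keys that are not valid struct field names (hyphens)
  CAST(json_extract_scalar(json_parse(payload), '$["total-amount"]') AS decimal(38, 2))
    AS total_amount
FROM raw_orders
"""

cursor = conn.cursor()
cursor.execute(sql)
for order_id, customer_email, total_amount in cursor.fetchall():
    print(order_id, customer_email, total_amount)
```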
One of the questions that I've been asked, and have been asking myself throughout this whole process, is: what if I just gave up on this data lakehouse approach and used one of the big data warehouse engines such as Snowflake? Would that solve all of my problems? Maybe it would have solved a couple of them, where the connector in Airbyte, for instance, didn't have the right type mappings, but we still would have run into some of these edge cases, and we still would have run into some of these issues of being able to source and land this data and understand it. Most of the effort isn't in the loading of the data; it's in the understanding of the data and figuring out how to structure it and how to manage the transformations so that I can use it in multiple different scenarios. And so that's another area we've been exploring, from the data modeling perspective: do we have the time and context to be able to build the proper dimensional modeling of this underlying information? Or do we just say we're gonna do some simple transformations and create the structures that we know we need today, so that we can answer the questions that we have today? For now, we've been focusing on just saying, okay, well, we understand these coarse-grained dimensions that we want, taking the source tables and aggregating them a little bit into some of the contextual domain objects that we know we need to talk about, and then being able to write some reports. And I think that that is a useful exercise to understand how the data is actually being used, versus doing a lot of upfront investment in building that dimensional model before you have any consumers.
And so, again, this is that tension of automation versus correctness, and the willingness to take on technical debt while being cognizant of the fact that you will have to do some refactoring and pay down that debt down the road. Pontificating for a little bit now about some of the things that we've learned and the impact that this has on the broader ecosystem: there are definitely platforms out there that focus very heavily on automation, making sure that you have a seamless experience all the way through, and there's a lot to be said for that. It's definitely possible to build those systems internally, but it's a matter of the engineering effort that you have the time and capability to put into it, as well as the assumptions that are being made about those workflows, about what capabilities people want, and about the cases where you do hit an edge case or a dead end and need to do that custom engineering anyway. And so I think that there's something to be said for getting the simple things done quickly so that you can understand, yes, I can do this thing, but also, along the way, making sure that you have some hard problems or hard data that you try to load early, so that you can figure out what those edge cases are and what problems you're going to run into on the platform before you're fully invested in it, and before you hit the point of being beyond the sunk cost fallacy of: I have to stay with this platform because I don't have the time or capability to re-engineer onto something else. And so, again, make sure that you have those proper seams in your platform design so that you don't have to rebuild the entire thing from the ground up if you hit one of those edge cases and need to evolve in a separate direction.
From the protocol perspective, as far as how we think about maintaining context and supporting automation, it's really hard, because it's something that requires everybody all the way up and down the technical stack, across all of the different industries and platforms, to invest in: what are some of the ways that we can maintain this contextual metadata so that we can map those different data types from source to destination cleanly and effectively, and understand the domain context of where data is coming from and where it's going to, without having to reconstruct it after the fact?
Because that, I think, is where a lot of the wasted effort comes in in the data engineering space: rehydrating data after you have done some manual effort and detective work to figure out where it came from, why it came from there, how it's getting to where it needs to be, and what you're supposed to do with it. That is something that I've experienced personally, as well as something that I've talked to a lot of people in the industry about. And so, in terms of my experience building this platform, doing some of these deep dives on figuring out how to manage these type mappings, and balancing that tension of automation versus correctness and reliability, one of the most interesting, unexpected, and challenging lessons that I've learned is that technical debt isn't just something that an individual team takes on. It's something that, particularly in open source, is accrued over time throughout the community, and it's not always clear when it's being adopted. And as you bring on new tools, it's not always clear what that long term technical debt will mean. So, for instance, in the case of Hive, that is communal technical debt that is being addressed in some fashion, but it's not always clear as you start to use a particular platform that it is technical debt, or what those edge cases are going to be. And so I just think that for people who are working in open source and who are contributing to these ecosystems, it's important to be cognizant of that fact and of the long term impacts that your decisions can have, and to maybe try to at least document those shortcomings, help popularize them, and figure out some of the ways that we as a community can move a little bit slower in aggregate but have a more solid long term impact.
And in terms of this overall experience, when is ELT the wrong choice? I'd say it's the wrong choice when you need to guarantee correctness and the maintenance of context from the source system the data originated in to the downstream analysis, where it's very easy to have information loss that is implicit. It's not even necessarily clear when it's happening, but it's possible to lose that information just from the extract and load process. And so maybe you need a more deliberate and manual process of building the context around the data as it's extracted, to ensure that it doesn't get lost downstream. So again, it's that trade off of how quickly do I need it versus how correct do I need it to be, and at what point. Looking forward in terms of the work that we are doing for our data platform, some of the things that are coming up next are definitely investing more heavily in data validation and data quality checking, and in particular building a more robust metadata platform around all of our information, so that we have a better means of building and maintaining that context as data traverses our overall platform, particularly as we start to add new downstream consumption use cases. So, for anybody who wants to get in touch with me and follow along with the work that I'm doing, I'll add my contact information to the show notes. And for the final question, of where I see the biggest gap in the tooling or technology for data management today, I think it harks back to this concept of how we manage that information about typing and context during the data integration flow: how do I understand where my data is coming from, what type information is in there and why, and how do I maintain that without losing it because of the fact that I need to conform to a particular interface to be able to load it into that destination?
So thank you for listening. Hopefully, this was helpful and informational. Definitely feel free to continue listening to the show. I appreciate all of the goodwill and attention and kind words that you have all given me over the years that I've been running this show, and I look forward to continuing to run it into the future. Thank you, and have a good rest of your day.
[00:26:01] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Host Background
Building a Data Platform
Challenges and Edge Cases in Data Lakehouse
Tech Debt and Community Evolution
Custom Engineering and Abstraction Boundaries
Automation vs. Correctness
Data Lakehouse vs. Data Warehouse
Lessons Learned and Broader Ecosystem Impact
Future Work and Data Quality
Closing Remarks