Summary
Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone, Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Your host is Tobias Macey and today I'm being interviewed by Scott Hirleman about my work on the podcasts and my experience building a data platform
Interview
- Introduction
How did you get involved in the area of data management?
Data platform building journey
- Why are you building, who are the users/use cases
- How to focus on doing what matters over cool tools
- How to build a good UX
- Anything surprising, or anything you discovered that you didn't expect at the start
- How to build so it's modular and can be improved in the future
General build vs buy and vendor selection process
- Obviously have a good BS detector - how can others build theirs
- So many tools, where do you start - capability need, vendor suite offering, etc.
- Anything surprising in doing much of this at once
- How do you think about TCO in build versus buy
- Any advice
Guest call out
- Be brave, believe you are good enough to be on the show
- Look at past episodes and don't pitch the same as what's been on recently
- And vendors, be smart, work with your customers to come up with a good pitch for them as guests...
Tobias' advice and learnings from building out a data platform:
- Advice: when considering a tool, start from what you are actually trying to do. Yes, everyone has tools they want to use because they are cool (or some resume-driven development). Once you have a potential tool, ask whether the capability you want to use is an unloved feature or a main part of the product. If it's a feature, will they give it the care and attention it needs?
- Advice: lean heavily on open source. You can fix things yourself and better direct the community's work than just filing a ticket and hoping with a vendor.
- Learning: there are likely going to be some painful pieces missing, especially around metadata, as you build out your platform.
- Advice: build in a modular way and think of what is my escape hatch? Yes, you have to lock yourself in a bit but build with the possibility of a vendor or a tool going away - whether that is your choice (e.g. too expensive) or it literally disappears (anyone remember FoundationDB?).
- Learning: be prepared for tools to connect with each other but the connection to not be as robust as you want. Again, be prepared to have metadata challenges especially.
- Advice: build your foundation to be strong. This will limit pain as things evolve and change. You can't build a large building on a bad foundation - or at least it's a BAD idea...
- Advice: spend the time to work with your data consumers to figure out what questions they want to answer. Then abstract that to build to general challenges instead of point solutions.
- Learning: it's easy to put data in S3 but it can be painfully difficult to query it. There's a missing piece as to how to store it for easy querying, not just the metadata issues.
- Advice: it's okay to pay a vendor to lessen pain. But becoming wholly reliant on them can put you in a bad spot.
- Advice: look to create paved path / easy path approaches. If someone wants to follow the preset path, it's easy for them. If they want to go their own way, more power to them, but not the data platform team's problem if it isn't working well.
- Learning: there will be places you didn't expect to bend - again, that metadata layer for Tobias - to get things done sooner. It's okay to not have the end platform built at launch, move forward and get something going.
- Advice: "one of the perennial problems in technology is the bias towards speed and action without necessarily understanding the destination." Really consider the path and whether you are creating a scalable and maintainable solution instead of pushing for speed to deliver something.
- Advice: consider building a buffer layer between upstream sources so if there are changes, it doesn't automatically break things downstream.
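The buffer-layer advice can be sketched in a few lines of Python. This is a hypothetical illustration (the field names and schema are invented, not from Tobias' platform): a thin mapping layer owns the translation from upstream field names to a stable internal schema, so an upstream rename becomes a one-line change here instead of a breakage in every downstream model.

```python
# Minimal sketch of a buffer/staging layer between an upstream source
# and downstream consumers. All field names are hypothetical.

# The only place in the codebase that knows upstream's field names.
UPSTREAM_TO_INTERNAL = {
    "usr_id": "user_id",      # upstream's name -> our stable name
    "evt_ts": "event_time",
    "evt_type": "event_type",
}

def to_internal(record: dict) -> dict:
    """Translate an upstream record into the stable internal schema.

    Unknown upstream fields are dropped rather than leaked downstream,
    and missing fields surface as None so consumers can handle them
    explicitly instead of crashing on a KeyError.
    """
    return {
        internal: record.get(upstream)
        for upstream, internal in UPSTREAM_TO_INTERNAL.items()
    }

# If upstream renames "usr_id" to "user_ident", only the mapping above
# changes; every downstream consumer keeps reading "user_id".
```

In practice this same idea often lives in a dbt staging model rather than application code, but the principle is identical: downstream assets depend on the buffer, never on the raw source.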
Tobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, dbt
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png) Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit [dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo) to learn more.
- Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png) Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to [dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan) and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg) Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: [dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode) today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a-t-l-a-n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, we're flipping the script, and I'm being interviewed by Scott Hirleman about my work on the podcast and my experience building a data platform. So, Scott, before I hand the reins over to you, can you just share a brief introduction about yourself?
[00:01:42] Unknown:
Yeah. So thanks, Tobias, for setting this up. I have a kind of crazy background, so let's not get into all of the things I've done. But right now, I'm helping a lot of folks in the data mesh community understand kind of what data mesh is and how to get going on that. I have a podcast around that. I've helped create a community that's being moved into a foundation around data mesh. I'm also gonna be creating something, a working group around data contracts, so I don't have to focus on data contracts. So just, I've got my hands in a lot of different pies.
[00:02:14] Unknown:
That's opportune timing, because I actually just released an episode with Abe Gong about his opinion on data contracts from his work as the CEO and 1 of the cocreators of Great Expectations, so opportune timing there. And, in terms of your experience of working in data, do you remember how you first got involved in that space?
[00:02:34] Unknown:
It was kind of how I get involved in a lot of things, which is randomly. So I was hired on by a company to do semiconductor venture capital in about 2010. And then they realized about 3 months in that they didn't really wanna do that kind of capital intensive venture capital. So they said find a new space, and this was right when, you know, early 2011, there was the big explosion around Hadoop. And so I looked at, you know, tens or maybe even a 100 plus of the kind of, quote, unquote, big data startups, helped the VC firm invest in DataStax, moved over there, and have just been kind of bouncing around in data ever since.
[00:03:14] Unknown:
And so now since you're actually here to interview me, I will hand the reins over to you, and I will let you carry forward.
[00:03:21] Unknown:
Awesome. Well and I think before we jump in, we're gonna be talking about a lot of things around, like you said, building your data platform and how do you think about, you know, you you've talked with so many different companies and vendors and all these people. You've done a ton of the research for a number of people in your audience. But now that you're applying it to your own work, I'm gonna be really excited to dig into that. But before we do get to that, I'm sure most people know at least some of your background, but would you mind giving people a bit of an introduction to yourself, and then we can jump into the conversation at hand? Sure. So I have been working in tech for,
[00:03:58] Unknown:
I think, going on 12 years now. I actually started off I got my degree in computer engineering. I liked that particular area of study because it was a nice hybrid between electrical engineering and computer science. So you get everything from how the transistors are built from the electrical components perspective up through the kind of computer science algorithms and data structures piece. I think that really set the tone for my career of kind of spanning everything from the silicon to the user interface. I actually started in tech as a systems administrator, so working on managing Linux hardware.
My first day on the job, I didn't really know what I was doing, but I kind of had the tenacity to stick with it and figure it out. And so that's what carried me to where I am now. I actually got involved in data largely because of working on systems and having to maintain databases and infrastructure. I spent a while as a software engineer, carrying some of my operations experience into that role and building systems in a way that brought the operations focus into my software work, and then kind of naturally fell into the role of a DevOps engineer because that was around the time that it was picking up steam. So, ostensibly, I've been a DevOps engineer for a number of years, but 1 of my jobs was actually as the kind of DevOps engineer and the sole back end engineer for a product that had a lot to do with clickstream data.
And I came in with a project that was half implemented, and we had to get it over the finish line. And that project was actually capturing all of the user events from a platform that was a JavaScript layer on top of video players. So we were dealing with all of the kind of video events and user interactions on different quizzes and overlays that our platform put up, and they were piping all of that into BigQuery. And this was around the 2015 time frame. So anybody who worked with BigQuery in that time can understand some of the challenges that I was dealing with, because BigQuery is not meant to be an event store. It is great for processing, you know, lots of data, but it is not an event store. It is not intended that way.
So I had a Python app, you know, a Flask app, that would buffer data into a Redis queue so that it could flush it up to BigQuery. And then on the other side, we had to try and build a UI on top of that data that was end user facing. And so I was doing kind of embedded user facing analytics powered by a BigQuery database. So I ended up having to actually pre aggregate data into a Postgres database and then do some more kind of views on top of that to be able to try and get this to work. So that was, I think, what really got me down the road of data engineering, just dealing with all of that pain of, like, this is not how this is supposed to work. Like, let's pick the right tools for the right job. And so then in 2017, I had already been running my other show, Podcast.__init__, for a couple of years.
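The buffering pattern described here, an endpoint enqueueing events quickly and a separate flusher batching them upstream, can be sketched roughly as follows. This is a from-scratch illustration, not the actual code: a plain in-memory deque stands in for the Redis list and a Python list stands in for BigQuery, so the shape of the pattern is visible without either service.

```python
# Rough sketch of the buffer-and-flush pattern: events land in a
# queue on the hot path, and a flusher ships them in batches.
# A deque stands in for Redis and a list for BigQuery.
from collections import deque

queue: deque = deque()   # stand-in for a Redis list (LPUSH / RPOP)
warehouse: list = []     # stand-in for BigQuery streaming inserts

def ingest(event: dict) -> None:
    """Hot path: just enqueue, so the web request returns fast."""
    queue.append(event)

def flush(batch_size: int = 500) -> int:
    """Drain the queue in batches; returns the number of batches shipped."""
    batches = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        warehouse.append(batch)   # one insert call per batch, not per event
        batches += 1
    return batches
```

The point of the intermediate queue is exactly the one made above: the write path stays fast and durable even when the warehouse is slow to accept data, at the cost of running and monitoring one more moving piece.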
And I just decided, well, let's try another podcast. And so when I started my Python show, that was motivated by my engagement with the Python community. I was really digging deep into the language and its ecosystem, and I really wanted there to be a podcast about Python. And at the time, all the other shows that focused on Python had stopped producing new episodes. And eventually, I said, okay. Fine. I guess it has to be me. So I started that show, ran it for a couple of years, and then at the beginning of 2017, I said, okay. Well, let's start another show. What's interesting? And that was right around the time that Maxime Beauchemin wrote his posts about, you know, the rise and fall of the data engineer.
And I had been working with data, and it was, you know, a very back end oriented area of work, which is where I had spent a lot of my focus. And so it's kind of the interesting Venn diagram of operations and back end engineering and systems. And so I said, okay, well, there aren't any podcasts about data engineering. And at the time, there were dozens of podcasts about data science because that was the new hotness. So, you know, data science was gonna save us all, and it was what everybody wanted to do, but nobody was talking about all of the work that has to be done before you can do the data science. And so that was kind of the impetus for the data engineering podcast. And so I've been running that for 5 years, and it was kind of perfect timing because it was right at the inflection point of when everybody realized, oh, wait. Actually, data engineering is a thing, and we really need to invest in that. And so I have had no shortage of material to discuss on the show.
[00:08:16] Unknown:
Yeah. I think especially that part about the computer engineering, like, I was like, wait. Is that different from computer science and electrical engineering? And what you said ties in really well to where I think a lot of people overlook a lot of the key parts of data engineering: it is kind of all of the systems up and down, and so much of it is spent on just the processing versus, like, how does this all interconnect? And, you know, I'm sure you're familiar with the OSI model, and, like, how do we not have that OSI model for data, of actually working together on that? So, you've been talking on the podcast for a while about building out your data platform, and I would love to hear about it. Let's start with kind of 1 question that some data engineers don't typically like, but why are you doing this? Like, what's the point of this versus let's get to the cool speeds and feeds? But, like, what were you actually trying to accomplish when it came to building this out? Like, what was the reason for doing this now? What was the impetus? And what's the actual goal of doing this?
[00:09:22] Unknown:
Yeah. And so to give a bit of context there, as you said, I've talked about this on the show a bit, but for people who are just tuning in the first time right now, my day job is not actually being a podcast host. This is a kind of a hobby slash side gig despite the fact that I am now at 3 podcasts and producing most of them weekly. So for my day job, I actually work for MIT as the associate director of platform and DevOps for open learning, which is a department within the institute. And our focus is on building products that are mostly externally facing, so bringing the knowledge and educational understanding and pedagogy that MIT develops and making it available to anybody in the world who wants to engage with it and learn and, you know, potentially even come to MIT to extend their education.
So we have, honestly, I don't even remember how many business units at this point because it feels like they always add a new 1. But, you know, some of the more famous ones are OpenCourseWare, which was 1 of the first kind of free online learning resources that was put on the Internet. So that's, I think, over 20 years old now. So that's course material from MIT professors that has been released as Creative Commons licensed information for anybody to use however they want. So that's lecture notes, it's, you know, lecture videos, quizzes, all kinds of different resources out there. We've got MITx, which is the MIT MOOCs that are produced. So it's actually largely course material that is used for MIT students and then slightly remixed and put on the Open edX platform for people to be able to take those courses on their own time. If they want, they can get a certificate of completion for it. We have another product called X Pro, which is similar where it's using the Open edX platform to produce course material for professional education. So that's people who are mid career, they want to stay up to date with different technologies or techniques or trends. So there are courses on things like systems engineering, quantum computing, data science.
And so we have all these different products. As a member of the engineering team, we build systems that support all of these efforts for the department. 1 of the first forays into making data insights available to the department was actually just standing up a Redash server as a quick and dirty business intelligence platform. So for people who aren't familiar, Redash is a BI platform focused on data visualization. It's written in Python and Flask, so it's got a good ecosystem of plugins to be able to connect to various different data sources. And so you can just stand it up and say, okay, I'll just connect directly to my application databases, munge the data to be able to get some reports to people so that they can get their job done.
And we've been using that for a number of years now, and it's always been a bit clunky, some sharp edges, a little painful because it's not designed to be the entire data platform. You know, we made it work, so we kept kicking that can down the road a bit. And so the goal of our work on revisiting this was to build a data platform that is scalable just like all the other systems that we build, that is maintainable and adaptable. And so we want to be able to actually incorporate the information that we generate from all these applications and systems that we maintain and all of the services that we engage with. Another thing that was lacking in the kind of Redash implementation was any sort of cohesive data governance, any sort of cohesive view of what data do we even have and who is generating what data.
Because mostly, it was tied to the systems that we and the engineering team built and maintained, but there are other groups in the department who don't necessarily rely on the engineering team to build their platforms for them. So there are other sources of data that are kind of the shadow IT approach, not in the sense that they're not approved, but just in the sense that not everybody in the department knows that they exist. And so it's hard to be able to actually get a concrete view of what information we have available as a department to be able to further the goals of open learning, of making education available to everyone in the world. And so that was in terms of the timing, you know, we've been building more systems, we've introduced new product offerings, and so we kinda hit the tipping point of where Redash just wasn't really cutting it anymore.
The appetite for data in the department has been growing, and so there have been kind of pain points and friction that we wanted to be able to address by building a holistic data platform that we can use as a solid foundation to build forward for now into the future.
[00:14:07] Unknown:
1 thing that I keep seeing that gets overlooked when people are building these platforms is the user experience. You know, on the data mesh side, when people say the self serve platform, I say: for whom? And it's like, okay. You know, I had somebody on the podcast recently, and they listed out, like, 12 different personas that they're building their, you know, different user experiences for. Are you first starting with getting people to be able to publish easily or being able to get people to consume easily? Because I think if you're trying to focus on the consume side, it can be putting the cart before the horse. And, like, how do you think about that road map? Like, how did you plan out
[00:14:48] Unknown:
when, what, why, and then, you know, how do you start to build that out so that you're not like, okay, I'm going to debut this thing in 4 quarters, and it will be the greatest thing ever, and people haven't gotten to touch it and give you feedback and give you the chance to iterate on it. That was definitely a bit of a stumbling block early in the process, where the overarching goal was we wanted to start with what is the foundational piece that we can build that will allow us to even start that iterative process of figuring out who are the first users, how are they going to engage with it. So initially, it was just, let's figure out how do we get all the data into 1 place so that we can work with it? What are the ways that we're going to work with it as an engineering team that fit with our overall development model? So we were definitely very focused on kind of code oriented software engineering principles first. And we wanted to build a system that was flexible so that we could plug in different interfaces for different stakeholders and use cases.
So the kind of flexibility and adaptability was a key design consideration and how we selected the various components and the way that we wanted to build it out. We aren't jumping straight to, hey, let's build a self-service platform so that everybody will do everything that they want. It's, how do we build a system that lets us say, okay, we can build an initial dataset that is useful for this persona, get that in front of them, let them start working with it, and then we can build from there. And we also didn't want to do a point solution of, hey, we're building for this person, so let's engineer it with that in mind because then we're going to take a bunch of shortcuts that make it harder to reengineer down the line. So we had to say, okay. What are the sampling of some of the use cases that we want to be able to build for?
And let's do a very kind of bottom up first principles approach of let's start with let's get the data in 1 place. Let's be able to interact with the data, let's be able to, you know, transform and model the data so that we can get it into a shape that is actually useful for other people. Because we had that Redash platform in place, that also gave us a little bit of directionality of, okay, well, what are the ways that we are already using the data? What are the questions that people are asking? And so that can help us understand, you know, what are the priorities of which data sources to pull in first, which transformations to build in. So that helps with the overall kind of design process.
As we're talking right now, we haven't actually put it into, quote, unquote, production where we don't have people hammering on the platform yet. We're still building that out to, okay, let's get this to a point where we are happy saying this is production grade, and we will support what we have built so far when people start using it. And right now, we're in this phase of kind of crossing our t's and dotting our i's, and we're getting to where we can say, okay, here's the first use case, here's the data for you to be able to use, go ahead and, you know, start working with it and let us know how we can improve.
Once we hit that point, then we're really gonna be in our stride of iterating and saying, okay, this works, this doesn't work, you know, this is the piece that we're missing that we didn't know we needed or, you know, yes, we knew we needed that. We just didn't do it yet because it wasn't core to the kind of foundational capabilities of the platform.
[00:17:54] Unknown:
1 thing that I've talked about a little bit is the concept of what I'm calling a data shrek just because it's the stupidest name I could think of, and I just love doing that stuff. But, like, a purposeful swamp where you purposely create a data swamp, so you've got kind of a sandbox and people can't immediately start using data from it. But, like, it sounds like what you're saying is your team, the data team, are the ones that are deciding what is useful and what isn't. I've seen this from a couple of companies, where the data consumers don't really think about the art of the possible: oh, I could do this. I could do this. I could do this. Or do we have this data?
Are you finding that you're almost having to generate consumer demand for data as well? Because, you know, the MIT system is quite different in the academic space versus, you know, a lot of corporations where they're like, I am trying to do this to drive my business unit to more profit. Like, you're driving to more good, or something like that. So, like, how do you think about that aspect as well? Are you finding that that's made things a little bit different? Like, because you are doing a lot of the work without the consumers necessarily directly involved.
[00:19:10] Unknown:
We have been working with some of the people in the department, not promising anything or saying, Hey, this is what we're building, but say, you know, just understanding from the ways that they're already doing their work, you know, what are the things that they're trying to do with information, what are the questions that they're asking, and how are they getting their answers, and okay. So they are managed to get answers. Are they good answers? Are they timely answers? How can we make their experience better? And then kind of building towards that so that when we do flip the switch to say, okay. This is production. This is ready for you to use. We can go to them and say, hey. You know that thing that's really hard and annoying for you to have to deal with? Here you go. It's easier now.
[00:19:48] Unknown:
As part of this, you're talking to so many of these people. You know so much of the speeds and feeds, but I think you also have a good focus on the high level of what we are trying to do and why. Like, how do you get yourself from focusing on the cool tools to focusing more on what matters, and what matters to this use case? And like you said, you're building this foundation to build something scalable on top, but, you know, you're like, well, I need it to be able to handle 10,000,000 requests a second. It's like, are you ever gonna get 10,000,000? It's a fun engineering challenge versus what actually matters. How are you keeping yourself in that mode?
[00:20:25] Unknown:
One is just kind of working with other people beyond just myself. So I'm not doing this in isolation, kind of tinkering long hours into the night. I'm trying to figure out, okay, how can I direct my team? How can I work with my manager and some of the people on the development team to be able to build this and make sure that everybody understands what we're building? So that helps with the kind of shiny tool syndrome, where I don't want to just bring in everything under the sun, because I know that I'm gonna have to make sure that everybody understands how it all works together. So it was, what are the core pieces that we absolutely have to have in place to be able to do something? And then also, one of the decision criteria that I've been leaning on is: if this tool or vendor goes away tomorrow, does that completely destroy everything else about the platform? So designing around these kind of composable elements. I don't wanna throw out the term modern data stack, because I wouldn't say that's quite what we're doing, but some of those principles of, you know, build with the inputs and outputs in mind, build so that each unit of functionality is something that, if I have to, I can either engineer a replacement or there are other options out there. I also tend to bias heavily towards open source, because everything that we build is open source, we rely a lot on open source, and we contribute to open source. And also, from a platform maintenance perspective, it's important to be able to understand what's working. You know, if something is broken, I want to have the power to dig in and fix it without having to rely on a support ticket that has a turnaround time of 3 weeks.
Very much being able to own our own destiny around this platform is important as well. And so the overall architecture that we built up is kind of the data lakehouse paradigm with ELT, just kind of throwing out the buzzwords there. But concretely, we ended up using Airbyte as our data ingestion tool, because there's been a lot of community investment around it. It has support for pretty much all of the data sources and destinations that we needed. Because it's open source, we have the ability to create our own plugins for those cases where there isn't something that we can pull off the shelf, but we don't have to reengineer the entire infrastructure around the data movement piece. We just say, this is the API that we need to hit. We already use AWS for all of our infrastructure, so it was kind of a no-brainer to use S3 as the storage layer.
Everything speaks to S3, so that makes it easy. And then we're using Trino as the query engine, again because there's a lot of community support around it. Concretely, we actually used the Galaxy offering from Starburst so that we didn't have to do all of the operations around spinning up Trino clusters on our own. It's kind of a serverless-ish way of interacting with it, where I can just say, okay, just give me a cluster, let me query. I don't have to worry about all the operational aspects of it. But in the event that they decide to kill that offering, or it doesn't do what I need it to do, or I need more customizability, I do have the option of just saying, okay, well, I'm gonna run open source Trino myself, and no harm done. I don't have to re-engineer anything around it. It's just Trino at the end of the day.
And then we're using Dagster as the orchestration engine. That's actually one of the first decisions that I made, just because I really believe in the programming model and the philosophies that they have around data orchestration being very data-aware in terms of how the units of computation are built, and just the fact that it is a very well designed framework that has enough extension and integration points to be able to customize to fit whatever needs we have, without being overly prescriptive and constraining. And then for the metadata layer, for being able to actually schematize the data in S3, we're just using AWS Glue, because it's there, it's relatively easy to use, and it does a good enough job. And honestly, that's actually one of the hardest pieces that we dealt with in the process of trying to get this up off the ground: managing that connection from "okay, I can pull my data out and put it in S3" to "okay, now I can actually query it with Trino." That seems to be one of the missing pieces in the ecosystem in general, being able to actually put something in S3 and then query it without having to do a bunch of work in between.
So AWS Glue has their crawlers, for instance, which work great if you just have a little bit of data. But as soon as you start scaling the number of tables and the variety of data that you're working with, then it starts to become very painful to work with and orchestrate. And if you just try to throw it at a bucket, and you have lots of different data sets within that bucket, then it'll say, hey, they're all the same table, which obviously does not work. And so we found one plugin for Airbyte that used the AWS Lake Formation interface for managing those table schemas, and that worked a bit, but there were some sharp edges, because we didn't actually want to use Lake Formation. We just wanted to use Glue, but that was the most expedient way to do it. And then we ended up actually working with somebody in the Airbyte community to add a layer on top of their existing S3 destination that would talk directly to Glue, introspect the schemas that Airbyte was pulling through, and just write that out as a table definition in Glue, so that we could very seamlessly say, okay, pull it through Airbyte, it's in S3, and I can query it right away. I don't have to worry about, oh, my schema changed in the source, and now I have to wait for the crawler to run before I can query this new column.
So just figuring out, how do we get the experience of writing into a database without the constraints of buying into a BigQuery or a Snowflake or a Redshift, and be able to have the adaptability and flexibility that the lakehouse ecosystem promises?
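To make that "write to S3, then query it right away" step concrete, here is a minimal sketch of registering a Parquet dataset as a table in the Glue catalog with boto3, roughly the kind of table definition the Airbyte destination layer described above writes out so Trino can query it without a crawler run. The database, bucket, and column names are hypothetical, and this is an illustration, not the actual plugin code.

```python
def build_table_input(table_name, s3_path, columns):
    """Build a Glue TableInput for a Parquet dataset so that a query
    engine reading the Glue catalog (such as Trino) sees it as a table
    immediately, without waiting on a crawler."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": dtype} for name, dtype in columns],
            "Location": s3_path,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def register_table(glue_client, database, table_name, s3_path, columns):
    """Create the table in the Glue catalog. `glue_client` would be a
    boto3 Glue client, e.g. boto3.client("glue")."""
    glue_client.create_table(
        DatabaseName=database,  # hypothetical Glue database name
        TableInput=build_table_input(table_name, s3_path, columns),
    )
```

A call like `register_table(boto3.client("glue"), "raw_data", "users", "s3://example-lake/raw/users/", [("id", "bigint"), ("email", "string")])` is all it takes to close the gap between "the data is in S3" and "I can query it."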
[00:26:06] Unknown:
And I've got two potential avenues to go down based on a lot of what you were talking about there. One would be, from microservices, the concept of the anti-corruption layer. Like, we shouldn't have to have that in data; instead it's like we need an OSI networking model where we have to have an anti-corruption layer between the tools rather than between the services. But also, a lot of what you're talking about, you're talking about the data team, the central team, as the consumers. And you just mentioned the schema change, and you were talking about data contracts and things like that. Have you found anything where you can push that up to the source, to say, hey, you're trying to make this change and it's gonna break things? Or are you still in that reactive mode? Because this was something, you had Abe Gong on, and I had Abe Gong on a while back, and he said something that just horrified me, which was that consumers would create expectations, basically contracts, without ever telling the producer that these are the expectations in our contract. There was never that communication. So either of those two things, I think, would be really interesting to go down.
[00:27:12] Unknown:
Yeah. So, kind of managing the interaction with those sources and destinations: right now we're still in the phase of, let's just get something working. And it's not even a matter of, like, the schema change. Like, that's something that I'm guarding against. It's really just a matter of, I'm adding a new data source, and I wanna make sure that the schema is there. That's the hardest part. Like, even that basic capability is what was missing in being able to use just S3 and Trino. So, from what I've been able to understand running the podcast and doing my own research, this idea of the lakehouse is gaining a lot of steam, but it's mostly from people who are using something like Spark, where it has the built-in support with Iceberg and Hudi, etcetera, to be able to just create those table definitions directly.
And if you're not using that, then you're kind of left out in the cold of, well, I guess you should just use Snowflake, because that's the thing that's gonna work for you. And so there didn't really seem to be a lot of appetite for people to say, okay, well, I actually want to do a lakehouse, but I don't wanna have to buy into Spark because I don't need that right now. I just want these two things to be able to communicate the right way. And so that's where we actually said, okay, well, let's just try and make this happen, at least in a way that we can use it, and then we can try and help build on top of it from there across the community. Like, being able to use something like Trino as a lakehouse engine was kind of the missing piece. There was that disconnect between getting the data to a place where you can query it, and then being able to actually understand what you're querying.
[00:28:39] Unknown:
A lot of what you were talking about as well, of the vendor selection and tooling and open source and things like that: you've obviously got a pretty well-honed BS detector, even just from selecting guests, and then from having interviewed, you know, you're on, I think, episode 300-plus or whatever of this podcast, and you're probably nearing a thousand total episodes across all your podcasts. When you think about that kind of build versus buy, a lot of the tools that you selected were open source, but a lot weren't; they were, you know, custom to AWS. So you're kind of locked in, potentially, to AWS, although I think people overestimate the cost of locking into a cloud vendor and things like that. Unless, you know, Google starts doing with GCP what they've done with other Google products, where they just kill all the things off and it breaks everything.
But how do you think about build versus buy? And then, when you think about buy, you talked about that with Trino. How do you think about that, like, open source versus not? And how do you think about kind of protecting yourself versus, I just wanna get the work done, right? I just want this thing, I pay x amount, and it's just going to work and I don't have to care, so I'm going to hit the easy button. You know, Brandon Beidel on your podcast had talked about this a lot, and I asked him to come on mine simply because I really loved the conversation you had with him about this stuff.
[00:30:04] Unknown:
Yeah. So for build versus buy, I think one of the main things I consider when I'm saying, okay, well, I'm just gonna go to a vendor with this, is: what is the escape hatch? You know, if this vendor goes under, if they get acquired, if their product just stops working one day, how do I keep moving past that point? For instance, with Galaxy, as I said, I can pay for it right now and I don't have to worry about all the operations aspects. But if I get to a point where it's too expensive to run it through them and I just wanna do it myself, or for some other reason I need more customization, whatever, then I can just take Trino off the shelf and run with it. AWS Glue, it's a managed service, but effectively it's just a Hive metastore at the end of the day, for the ways that I'm using it. So if I have to, I can just run Hive metastore myself. I don't want to. I want there to be something better than the Hive metastore, but it's what we've got for now.
And for Dagster, I'm actually running it myself, because I already run a lot of applications on EC2, so I've got the capabilities to just deploy and manage that iteration. But if I get to the point where I say, hey, actually, building and deploying this is a giant pain, I don't wanna have to deal with it anymore, well, then I have the option of going to Dagster Cloud. I don't have to rewrite all of my code. It just continues to do the thing. Same thing with Airbyte, where I'm running the open source version myself on my own infrastructure, but if I get to a point of, hey, this is too much time and headache, I don't wanna deal with it anymore, I can just say, hey, Airbyte Cloud, take my workload, and everything else stays the same. It's the, what are the escape hatches in both directions? Where if I build it myself and then I decide, hey, it doesn't make sense for me to run this, how much time and effort is it going to take me to then say, okay, well, I want somebody else to do it for me? And then if I say, hey, I'm just gonna buy from a vendor because it makes my life easier, I can get something up and running faster.
If I then decide that I don't like what the vendor is doing or I need some control that they're not exposing, then I can say, okay, I'll run it myself because at that point, it will be worth the expense in terms of time and cost.
[00:32:08] Unknown:
One thing that you didn't wrap in there, that I'm assuming you look at somewhat, but it sounds like it's not quite as big a factor for you, is the roadmap. With open source it's a lot harder to say what the roadmap is and things like that, at least for most projects. But when you're tying into a vendor, how much are you betting on that vendor? Versus you're saying, I need this capability, and they have the capability now; versus, I've got this gap now, and I'm okay taking the pain of doing this manually, or of just not filling this gap, but they're saying they're gonna meet this in the future. Like, how much does that factor into your current-day decision?
[00:32:48] Unknown:
That doesn't have a lot of play, especially because I'm building around these open source tools. So, particularly for vendors that are built on top of open source, if there is something that they are not doing that I decide I need to have happen, then I have the power to actually just contribute that capability to them. That's another factor of the kind of control aspect: this thing does 90% of what I need it to do, I would really like it if it did this other thing, and I have the knowledge and the capability to be able to make it do that thing if I want it to. So that's where some of the tool and vendor selection comes in as well: how much potential input do I and my team have on making the world the way I want it to be?
[00:33:28] Unknown:
That's funny, because anytime I talk to somebody who has a sysadmin background, that is exactly their approach. It seems that, you know, you've been looking through a ton of these things, so if you don't mind, take us through some kind of capability selection, where you said, I needed this; what was the thing that you actually did? You know, you've talked about doing Pulsar instead of Kafka. I don't know if you're at that point right now, but for any of these things, take us back in your mind and walk us through how it might have happened, so other people can get a sense of your thought process, and then they can either mirror that or augment it for their own process. But I'd love to hear: where do you start when you start to look at this? Was it, I need this capability, what are the tools that are available, and then, okay, I'm gonna go out and make that selection? Or was it like, I have good feelings about this tool, I wanna see where it fits in and if I can slot it in, because I like the way that they talk about it, the way they approach things, and so I wanna make sure that we can put that in in a modular way but build it in? How do you think about that?
[00:34:36] Unknown:
Yeah. So it's definitely from the capabilities perspective. For instance, for the data integration piece: I just need a way to be able to get my information from point A to point B, you know, from my Postgres database into my lakehouse. Or if I decide, hey, the lakehouse isn't working, I'm just gonna say forget it and go with Snowflake, do I have that option? And so I looked a bit at some of the commercial offerings. I didn't bias towards them; I was trying to bias towards open source because of the customizability perspective. And so Meltano was one of the ones that I was digging into, and Airbyte as well. I spent a lot of time comparing them against each other. I even spent a little bit of time saying, well, what about if I just go with Fivetran, because I can just throw a credit card at it and I don't have to do all this work of operationalizing.
And so those were probably the top three contenders. I looked a little bit at some of the other commercial vendors, but it really came down to: what are the data sources and destinations that I'm considering working with, and how well do these different tools function with those connections? And at the end of the day, Airbyte is the one that had the majority of what I needed to be able to get my job done today, versus having to spend the next six months engineering a solution. And in particular, the connection to the data lake layer was the most challenging part, where I was looking for that piece of: okay, I can get it into S3, but is it in the right format? You know, does it support writing to S3 as Parquet, or is it only as JSON? If I write it out, can I get a table definition around it? Or is it just, hey, here's a bunch of data somewhere, now you have to do three more steps to actually make it useful?
Or, you know, when I was looking at Meltano, they've built a great product there, but some of the pieces I was looking for specifically, like at that S3 layer, were: okay, well, you can write it out as a Parquet file, but only to local disk, or you can write it out as a JSON file to S3, but then that's all it does. It doesn't talk to the schema layer. And so I actually tried to dig into the code of Meltano and some of those plugins to say, okay, well, if I really wanted to do what I want, how much work is that? And at the end of the day, I said, okay, well, Airbyte does what I need it to do, and it's easier for people beyond me to be able to use it, because they have a very nice UI for managing connections. That was actually another demerit on their part, though: when I first started looking at it, they were very UI-first. If you wanna create a connection, you have to do it through the web interface. You don't have a code-first way of doing it, which went against the grain of what I'm trying to build: I want everything to be code-first, automated, you know, it goes into GitHub and that's the source of truth. Which is where Meltano has their strengths of, you know, we are a DataOps tool, everything you do is through code. They kind of have a UI, but not really, and then they don't have an API to be able to actually manage the orchestration piece of it as easily. So you end up having to embed it as a command line execution in the Dagster workflow, so it's like, ah.
It was that kind of capabilities matrix of: what are the things I care about most, and what are the things that I can do to make this work the way I want it to, versus things that I have to wait on other people to maybe implement? So Airbyte ended up winning out in that regard, because of the list of connectors that they had available and the growth and support of the community around it. And, actually, as I was working through some of the initial implementations is when they started their work on their command line utility. So I've actually been working with that team to say, okay, it's great that you have the command line, but working with it is a bit of a pain. What if you just made it a Python package so that I can do whatever I want with it? One of the ways that I'm kind of unfairly privileged as the host of the Data Engineering Podcast is that I do have kind of implicit access to the people who are building all of these tools and companies, to say, hey, remember that time you were on my podcast? Well, now I want you to do this for me, because that would be great.
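That capabilities-matrix exercise can be sketched as a simple weighted scoring function. The criteria, weights, and ratings below are made-up numbers purely for illustration, not a real evaluation of these tools:

```python
def score_tools(weights, scores):
    """Weighted capability matrix: `weights` maps criterion -> importance,
    `scores` maps tool -> {criterion: 0-5 rating}. Returns (tool, total)
    pairs ranked by weighted total, highest first."""
    totals = {
        tool: sum(weights[crit] * rating for crit, rating in ratings.items())
        for tool, ratings in scores.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical criteria and numbers, chosen only to show the mechanics.
weights = {"connector_coverage": 5, "code_first_workflow": 4, "ops_burden": 2}
scores = {
    "airbyte": {"connector_coverage": 5, "code_first_workflow": 3, "ops_burden": 3},
    "meltano": {"connector_coverage": 3, "code_first_workflow": 5, "ops_burden": 2},
    "fivetran": {"connector_coverage": 5, "code_first_workflow": 1, "ops_burden": 5},
}
ranking = score_tools(weights, scores)
```

The value of writing it down, even this crudely, is that it forces the "what do I care about most" question to be answered before the tools are compared, rather than after a favorite has already been picked.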
[00:38:22] Unknown:
I just had a conversation with Ananth, who's doing Schemata. I know he was on the show, and we were talking about how important it is to be part of somebody's workflows, and about the multiple panes of glass: everybody wants to be the main pane of glass, and so multiple panes of glass is a major pain in the... you can finish that. But, like, how are you thinking about that user experience for yourselves right now? Because you are the main user. You are the conduit between, we're absorbing this and we're pushing that. Are you building that out so it is just code workflows, and that makes it easier, because then you don't have 15 different places where you're trying to manage it and do all of that? And that's kind of your anti-corruption layer, almost, of, like, hey, we have it in code, so we know where it is. We don't have all of these hidden dependencies that we don't know about.
[00:39:17] Unknown:
Yeah, definitely very focused on code-first, code being the source of truth. I'm building around the orchestration layer being the beating heart of the entire platform, where everything has to get routed through the orchestrator or else it doesn't happen. And in terms of the end user consumption piece, our intent isn't to dictate that this is the only way that you can interact with the data. If you want to hook your Tableau into a data mart that has all of your information in it, go nuts. If you want me to put it into Redash or Superset, here you go. If you want to build a machine learning model off of this transformed data, here's the way that you go ahead and do that. And so it's building it with the orchestration layer as the source of truth of how everything gets to where it needs to be, the piece that makes sure everything does what it's supposed to do, and then saying, okay, this is the boundary that we are maintaining complete control over.
And if you want to do other things beyond that point, here are the interfaces that we are willing to expose for you to get your work done, and here are maybe some useful guardrails to make sure that you don't end up in a ditch somewhere.
[00:40:36] Unknown:
When you think about the orchestration layer, if that's such a core piece, that can be a single point of failure, right? When you think about scalability, you think about a modular approach. Like, how are you trying to make sure that that doesn't become your pain point? And maybe you can wrap in the concept of total cost of ownership when you're thinking about these tools. Right? It's not just the initial purchase price, it's not just the control. You know, you talked about with Airbyte where you're like, I don't wanna go into the UI, this doesn't do exactly what I want it to, this is very frustrating. Do you think about the total cost of ownership as, like, how much pain do I have using this, how much is it gonna add to my day to day, and how much training do I have to give people to use it, and things like that? How do you kinda think about that?
[00:41:27] Unknown:
Yeah. So the orchestration layer, it is the authoritative way of getting things done, but it is not the only way of getting things done. For instance, I'm working towards embedding the Airbyte connection definitions into Dagster so that it is all in code. It manages creating those connections, maintaining them, and keeping the credentials up to date. It manages executing those syncs, and then making sure that the dbt run executes immediately afterwards, making sure that all my tables are up to date. But if I need a dbt run right away, I can go to my command line and say, run this dbt code right now, and I don't necessarily have to go through the orchestrator.
But I understand that that's kind of the paved path approach. I had a good conversation on this show with somebody where they were talking about their paved path approach of, like, if you want to do something, this is the easy way to do it. We can get you from point A to point B as long as you follow our rules. If you wanna do it yourself, great, go for it. Not my problem.
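As a rough sketch of that paved path (trigger the Airbyte sync, then run dbt so downstream tables stay current), decoupled from any particular orchestrator: Airbyte's open source deployment exposes a REST API for triggering syncs, and dbt is driven from its CLI. The host URL, selector, and overall wiring here are illustrative assumptions, not the actual platform code, which in the setup described above lives inside Dagster.

```python
import json
import subprocess
import urllib.request

AIRBYTE_URL = "http://localhost:8000"  # assumed local OSS deployment; adjust as needed


def trigger_airbyte_sync(connection_id, opener=urllib.request.urlopen):
    """Kick off a sync via Airbyte's OSS REST API
    (POST /api/v1/connections/sync with a connectionId payload)."""
    req = urllib.request.Request(
        f"{AIRBYTE_URL}/api/v1/connections/sync",
        data=json.dumps({"connectionId": connection_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return opener(req)


def run_downstream_dbt(select="tag:daily", runner=subprocess.run):
    """After the sync lands new data, rebuild the dependent dbt models.
    The selector is a hypothetical example."""
    return runner(["dbt", "run", "--select", select], check=True)


def paved_path(connection_id):
    """The paved path: sync, then transform, always in that order.
    Ad hoc dbt runs can still bypass this and call the CLI directly."""
    trigger_airbyte_sync(connection_id)
    run_downstream_dbt()
```

The `opener` and `runner` parameters are injected so the sequencing can be exercised without a live Airbyte server or dbt project, which is also a reasonable shape for testing orchestration glue in general.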
[00:42:30] Unknown:
I think I remember somebody talking about that, "Roman road" being the exact phrasing that they used. Yes. So, you know, when you're doing these evaluations, I was expecting us to talk more about vendor BS detector stuff, but a lot of it just seems to be that you already know that you know more than any rep that you're gonna be talking to about the tool. But if somebody else is looking at building out their platform, and you've talked to so many people, what advice would you have for others when they're building out their platform? Is it, you know, start from x, or look at y? Like, if someone were to say, Tobias, please write me the pamphlet on how I build a good and scalable and useful platform, do you have a place to start, and kind of your do's and your don'ts, or your anti-patterns, or things you've learned where you go, I thought I would do this, and it didn't go so well? Yeah. Well, my snarky answer of where to start is episode 1.
[00:43:33] Unknown:
But, realistically, it's really a matter of understanding what it is that you're trying to do and why you're trying to choose a particular tool, not just, hey, that tool looks really cool, let's figure out how I can make it work. You know, taking it from the perspective of what is the end goal, rather than how do I find a reason to use that tool. It's not resume-driven development. And so, figuring out, what am I trying to do? Okay, now how do I do it? You know, for instance, data integration: it might be a feature of a broader offering. And because it's just a feature, they're not investing their whole engineering effort into making that useful and robust.
So is that good enough, because you actually don't care about the data integration so much as the other thing that they do? Or is data integration the thing you care about the most, in which case, go for Fivetran or Airbyte or Meltano? And then also, what are your capabilities as an engineer, as an organization? What are your priorities? Do you care most about being able to move fast and not having to spend all of your engineering time on minutiae? Or do you care about having that control and capability of being able to solve your own problems? So if you just wanna be able to move fast and have something that works, sure, go for Snowflake, go for BigQuery. They're great products. But if your goal is to be very flexible and adaptable and have an escape hatch of, hey, if I decide that I'm paying way too much to Snowflake, do I have a way to get out of it, or is all of my data there and now I have to pay to get it all back out?
[00:45:15] Unknown:
You know, the one question that you always ask people is, was there anything surprising that you really didn't expect going into this? And especially, you're talking about building this platform kind of all at once. I mean, you are gonna add on to it and expand and all that, but it's not the, okay, we're adding a new capability because we need this new capability. You're kind of congealing it all at once, and that can also lead to tool sprawl, because you're like, I wanna have best of breed for all of these things. So, anything surprising, or how do you think about the tool sprawl aspect?
[00:45:54] Unknown:
One of the things that I thought at the outset of this project was, okay, one of the first things I have to do is get the metadata layer in place. Nothing proceeds without metadata. And that's one of the things that I've had to bend on a bit: okay, well, we're able to get some stuff done, and yes, it will be great to get a metadata layer in, but, you know, Dagster has enough metadata to be able to answer some of the questions about lineage and execution. The thing that I found most surprising was really that gap in the connective tissue between data integration and data access and manipulation in the lakehouse architecture paradigm, where I thought it was actually going to be easier to say, I just wanna dump this into S3 and then query it in Trino. But there was just that one link in the chain that nobody bothered to forge.
I was a little surprised at how challenging that ended up being. I helped make it better, which is, again, why I chose open source. It seems like other people are investing in that area as well, so you don't necessarily have to be bought into Spark or Flink or what have you. You don't have to be fully bought into the vendor ecosystem of, you know, the AWS lakehouse stack or Lake Formation, or the Google product suite; I forget all the names that they have out there for it. But that was the other piece. It's like, okay, yes, you can build a lakehouse as long as you're all in on AWS. Everything has to be AWS, which means you need 15 different services.
So that was another aspect of tool sprawl that I didn't wanna get into. It's like, yes, it's all AWS, but now I have to figure out what all the APIs are. Like, I actually only need this one service for one API call, and the rest of the other 50 things that it does, I don't care about. So that's another aspect of it: is this functionality the core capability of what I'm buying it for, or is it just incidental that I have to use this thing because somebody else says that I have to, because that's what they integrated with?
[00:48:39] Unknown:
Well, on that metadata problem, I've been talking about the trapped metadata problem: all of these systems generate all of this metadata. And so, like, where is the primary place for x metadata? Do people have to go to these 15 different places to understand what you're actually looking at and what it means? And then, thinking about observability, I just talked to somebody yesterday who said that they implemented observability, and it seemed like it was gonna be great, except now there's no prioritization. So do we have to start bringing the kind of SRE, reliability engineering concepts more into how we're doing data, so that somebody can actually know, okay, this is important and this isn't? Right? This thing looks a little bit weird, well, we pull that report on a weekly basis, so if it's looking a little weird today, and it's Tuesday, and we're not gonna pull it until Friday, yeah, we wanna keep an eye on that, but it's not crucial. Or it's just something one person uses, and it's kind of helpful, but it's not the biggest thing, versus this other thing. And how do you actually do that prioritization? How does it all cohesively work together, instead of I'm going tool to tool to tool to tool?
It's just like you're almost playing telephone, you know, and you get "purple monkey dishwasher" very, very quickly, because every single tool is like, I integrate with this tool in this specific way, but it doesn't take into account how I integrated with the tool that was further up the chain. And are you just finding that it's "would that I could"? Right? Then it's like, wouldn't it be nice if somebody had actually done this in a holistic way, but you just kinda have to do the best with what you've got? I mean, were you expecting that, or was that something that kinda hit you?
[00:50:25] Unknown:
Yeah. I knew going into it, from having talked to people on the podcast for so many years, that the metadata situation is a mess. There is no kind of perfect solution, despite what everyone might want to tell you. And that's actually 1 of the spaces where I'm seeing the most evolution right now. You know, there are a lot of product categories where the thing that it does is always going to be the thing that it does, because it does it well. So data integration is always about getting data from point A to point B. You know, they might put a shiny interface on it, or they might have different levels of connectors, or, you know, it might be batch versus CDC, but it's still just, move it from here to there. Metadata, I think, is where the biggest opportunity is right now: how do we make this, as an industry, a cohesive and unifying experience versus the kind of schismatic fragmentation that we have, where I have my metadata engine, it collects the metadata that I care about so I can use it for observability. I have my metadata over here; it does governance. I have my metadata that tells me when this execution ran. I have my lineage data over here. I have my catalog over here. The kind of data catalog space is where I think the broader conversation around metadata started to gain volume and traction.
And if you look at the data catalog companies that started 3, 4, 5 years ago, they're not even about the data catalog anymore, because that has become, you know, a feature, not a product. And so, you know, now it's about, oh, active metadata. It's like, well, that's still not quite a product; that's a feature of the metadata layer. And data lineage, again, you know, there are products that started off as data lineage. It's like, well, okay, it's great that it has lineage, but how do I find the thing that I care about to know where it came from? So that's the data catalog. I think that all of these products that started off as a feature of the metadata fabric are starting to expand.
Recently, I had a conversation where I described it as ripples in a pond. You know, they all started as a drop in the pond of metadata in their own little corner, and now those ripples are spreading out and converging. And so we're getting to the point where, you know, metadata is ubiquitous. It is all-encompassing. It is of critical importance, and everything needs to know about all of the metadata everywhere, without it being a siloed experience of, okay, well, I have this metadata about this tool over here, and now I have to build a linkage to this other tool to map it into their metadata format. But the fact is that there are people who are thinking about how we make this interoperable, so that it is a team sport instead of a gladiatorial arena.
[00:52:59] Unknown:
To me, there should be standard APIs that every tool broadcasts out on, and that mark metadata as kind of alpha or beta, you know, primarily generated from this tool versus not, and then it integrates into your workflow. So if you do have somebody who's, you know, at the CLI with whatever tool, they can see it there, and they can say, okay, this is the all-encompassing metadata around this. Or if somebody's in the data catalog, or kind of the data marketplace, which is becoming that kind of UI for, you know, increasing data literacy and all. But nobody is willing to do it, because it is a moat that they have built. And so, you know, there are a couple of companies, like an Atlan or an OpenMetadata or things like that, that are being more open about it.
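To make that "standard broadcast API" idea concrete, here is a hypothetical sketch in Python. The event shape is loosely inspired by OpenLineage-style run events, but every field name and class here is an assumption for illustration, not a real standard or a real library's API:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

# Hypothetical common metadata event that any tool could broadcast.
# The "confidence" field captures the alpha/beta idea from the discussion:
# is this tool the primary source of this metadata, or a secondary echo?
@dataclass
class MetadataEvent:
    source_tool: str          # which tool emitted this (e.g. "dbt", "airflow")
    dataset: str              # fully qualified dataset name
    event_type: str           # e.g. "lineage", "quality", "run_state"
    confidence: str = "beta"  # "primary", "alpha", or "beta"
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    facets: dict = field(default_factory=dict)  # tool-specific detail

    def to_payload(self) -> dict:
        """Serialize to a plain dict any catalog or observability tool can ingest."""
        return asdict(self)

# A dbt-like tool marking itself as the primary source of lineage metadata:
event = MetadataEvent(
    source_tool="dbt",
    dataset="warehouse.analytics.orders",
    event_type="lineage",
    confidence="primary",
    facets={"upstream": ["warehouse.raw.orders"]},
)
payload = event.to_payload()
```

The point of the sketch is that a downstream catalog, lineage tool, or observability platform could all consume the same payload, rather than each tool integrating pairwise with every other tool.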
I don't know. I hope they get paid off for doing the right things, but I don't know. So before we jump into the thing that I wanted to kind of wrap up with, the call to action around guests: is there anything that we didn't cover, like any advice, that you think we should have covered relative to building out this platform and what others can learn from what you've done thus far?
[00:54:12] Unknown:
I think that 1 of the perennial problems in technology is the bias towards speed and action without necessarily understanding the destination. So, you know, sometimes that's useful when you don't actually know what you're doing and you want to do some initial discovery, but it's all too easy to just take that discovery into the end goal and build the final product without actually really knowing why you built it, or for whom, or for what. And so there are so many kinds of shortcuts and bad architecture and design decisions that I've seen, both in terms of my own work and in talking to people in the community and the different products. And this also brings in the conversation around data modeling: you need to take the time to figure out, like, what are you trying to do?
How are you going to do it? And how are you going to get there? Rather than just, I wanna get over there, so I'm just gonna start running vaguely in that direction and hope I don't hit a wall. And that's not to say that we all need to go back to waterfall. But, and this can be an organizational issue too, if you have people in management who say, no, we just really need to get over there, let's just do it, you need to have that kind of pushback, you know, bottom-up and top-down buy-in of being purposeful about the way that you're doing the work and the destination that you're trying to arrive at, so that you don't get to that point of, hey, I ran straight over there, and now I'm standing in a data swamp, and I have no way out because my boat sank.
[00:55:43] Unknown:
And that you have that communication with people, like data consumers, so that they understand: hey, we're iterating, we're figuring this out, we're trying to get to this thing, like, we're trying to work with you. But, like, that limiting of technical debt is such a challenge, because of exactly what you talked about. I keep talking to people who say, no, the only thing that matters is speed. And it's like, you know, fast, cheap, or well done: you get to pick 1, and if you pick fast, it's not gonna be well done. And in the end, it's not gonna be cheap, and it's gonna cost you. But, you know, I think we need to figure out as an industry how we can capture some of the value quickly but not lock ourselves into that expensive option. Where we went and said, we're gonna capture 50% of the value upfront, with an understanding that it's going to require rework and iteration. But I'm just not seeing ways of doing that. It's like this pie in the sky of wouldn't it be nice, but nobody ever actually tells you how to do it. Yeah. The other piece of it too is
[00:56:43] Unknown:
particularly if you're starting from 0 and you say, I have all of these grand plans and ambitions, but you need to start somewhere, is: okay, well, what is the core piece that you can start with that does some percentage of what you're trying to do and has the capability to expand and evolve and build on top of? Like, what is your foundation? Because if you try to build your system on a shaky foundation, it's all gonna fall down. But if you have a solid foundation that doesn't do everything you want, but it does half of what you're trying to do and it has the capability of doing more, that's the thing to try and build on top of. That's 1 of the reasons I really like this lakehouse paradigm: it gives you a way to say, hey, it's SQL first. Everybody understands SQL. I can use dbt. I can use all these tools that understand SQL. But then I can also still, you know, run a Spark pipeline, or I can still write a bunch of Python and pandas and be able to do other things with the data, where I'm not constrained by SQL, and I don't have to pull it all out of Snowflake before I can go do the thing, or be forced to use, you know, Snowpark and whatever capabilities they decide to offer me and pay however much extra for it.
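As a toy illustration of that "SQL first, but not SQL only" point, here is a sketch using Python's built-in sqlite3 as a stand-in for a lakehouse table. A real lakehouse would use an open table format like Iceberg or Delta with engines such as Spark, Trino, or DuckDB on top, so treat this purely as an analogy for one copy of the data serving both SQL tools and plain code:

```python
import sqlite3

# One shared table, standing in for a lakehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 25.5), (3, 4.5)],
)

# SQL path: the kind of query a dbt model or a BI dashboard would run.
sql_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Python path: the same data pulled into ordinary Python objects, for
# logic that would be awkward to express in SQL.
rows = conn.execute("SELECT id, amount FROM orders").fetchall()
python_total = sum(amount for _, amount in rows)

# Both paths agree because they read one copy of the data; nothing had
# to be exported out of the "warehouse" first.
assert sql_total == python_total
```

The design point being sketched: the storage layer is the shared foundation, and SQL tooling and general-purpose code are both just engines on top of it.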
[00:57:45] Unknown:
Yeah. And this is kind of where I was frustrated by a lot of the modern data stack talk: everything that people are building is built on a shaky foundation. Not necessarily the stack itself, but that everything upstream can continually change and break all of these data products that aren't actual products, because we're not thinking about, like, the life cycle of a product. We're thinking about trying to encapsulate something that we hand to somebody, but it just keeps breaking. And so, yes, we can move fast, and we can get to some value, but it just breaks. And so you can never rely on that for actual, ongoing business things. It's just a frustration of mine that I've seen over and over in these conversations.
[00:58:28] Unknown:
Yeah. 1 of the things that I like about how we've been iterating is being able to have some sort of a buffer layer, where even if that upstream source completely changes or just completely breaks, that doesn't mean that it is immediately reflected in the end-user experience. So, you know, the idea of ELT: let's just land everything in raw, and nobody ever touches raw for anything, so if that just completely breaks today, fine. You know, I'm gonna have a lot of work to do, but it's not gonna be cataclysmic. And then saying, okay, now from raw, I'm going to clean it up a bit and then put it into a staging area. So the dbt model of, you know, land it in raw, stage it so that you can clean it up and get some sort of cohesion around the semantics of the data that you're working with, then put it into your intermediate, which is where people might start interacting with it, and then put it into your mart for, you know, business users who just wanna be able to look at their dashboard and see, you know, are my KPIs correct?
So that if your source system completely crashes and you can't pull raw data anymore, you can still see the report; it's just a little bit stale. Or, I pull everything in and now my dbt model broke because there's an extra column or there's a column missing. Okay, well, I can see that it broke because there's an error message, and I can just go and fix that, and then everything else works. So just having those kinds of safety valves in the process of going from, you know, I don't control the upstream, so if something breaks there, I can figure out a way around it, to, you know, this is the controlled experience that I'm giving to everybody else, and I'm not beholden to all of these other systems being, you know, constantly up to make sure that I can still serve my customers.
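A minimal, hypothetical Python sketch of that layered "safety valve" flow: raw is landed untouched, staging cleans it, the mart is rebuilt from staged data, and a broken upstream batch falls back to the last good mart instead of taking the dashboard down. The layer names just mirror the dbt conventions mentioned above; this is illustrative Python, not dbt itself:

```python
raw: list = []              # landed as-is; nobody queries this directly
mart: dict = {}             # what dashboard users actually see
last_good_mart: dict = {}   # stale-but-usable copy: the safety valve

def land_raw(source_rows):
    """Append upstream rows verbatim; no cleaning, no assumptions."""
    raw.extend(source_rows)

def stage(rows):
    """Clean and normalize: drop rows missing required columns, fix types."""
    return [
        {"customer": r["customer"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
        if "customer" in r and "amount" in r
    ]

def build_mart():
    """Aggregate staged data into per-customer totals for the dashboard."""
    global mart, last_good_mart
    try:
        totals = {}
        for r in stage(raw):
            totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
        mart = totals
        last_good_mart = dict(totals)   # checkpoint the good state
    except (KeyError, ValueError, TypeError):
        mart = dict(last_good_mart)     # serve stale data instead of nothing

# Normal run: raw lands, mart builds.
land_raw([{"customer": " Alice ", "amount": "10"}, {"customer": "bob", "amount": "5"}])
build_mart()  # mart = {"alice": 10.0, "bob": 5.0}

# Upstream breaks: a malformed amount arrives. The error is caught and the
# dashboard keeps serving the last good (slightly stale) totals.
land_raw([{"customer": "alice", "amount": "not-a-number"}])
build_mart()  # mart is still {"alice": 10.0, "bob": 5.0}
```

In a real pipeline the "catch and serve stale" decision would live in the orchestrator or in materialization logic, but the shape of the guarantee is the same: the report stays up, and the error message tells you exactly which layer to fix.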
[01:00:03] Unknown:
Yeah. I had Chris Riccomini on mine, and he was talking about how developers, software engineers, can't really care about certain things because they can't know what changes are gonna break. So I did wanna get to 1 thing that I'm passionate about, which is that a lot of the people who reach out to you, just like me, are probably gonna be the vendors. Right? That's why you've got a lot of vendors on the podcast, and you're also obviously very interested in talking to a lot of the vendors. But, you know, when you think about what is your ideal guest, do you have that, or do you have something where we can have a call to action for more people that are actually building the things? And, no, you're not gonna have all the answers. But, like, what do you want to say to potential guests, listeners out there who aren't thinking right now that they could be a great guest, but that you would love to talk to? Absolutely. So
[01:00:58] Unknown:
my litmus test for understanding, should I have an episode on this topic or with this person, is kind of myself. I use myself as the listener: you know, is this something that I find interesting, that I want to learn about? Will this be helpful to the work that I'm trying to get done? So whether that's talking to a vendor to say, okay, what are you building? Or kind of in-the-trenches engineers or managers: hey, what are you building? Why are you building it? You know, what were the challenges? How did you get around them? What's your process for tool selection and build versus buy? You know, is there some really gnarly bug that you found in Spark shuffle files that kept you up at night for 3 weeks until you finally got it fixed? How did you go through that process of debugging?
What are the things that you are afraid of? You know, you built your data platform and it works, but you know that there's a skeleton in the closet, and it's just waiting to jump out at you. Like, how do you keep it in the closet? What is it? And I definitely want to encourage anybody who, you know, works with data or works with data practitioners to reach out. You know, when I started the podcast, I definitely had a bit of imposter syndrome because I was a systems engineer, not a data engineer, talking to people about, you know, distributed big data systems. Or, you know, particularly now, where I'm starting the Machine Learning Podcast, where I don't have a, you know, a fancy stats background. I've done some advanced math, but I don't have a PhD in neural networks or what have you. You know, I definitely have some imposter syndrome around that. I had it when I started the Data Engineering Podcast, and yet now, 5 years later, somehow I'm seen as 1 of the foremost experts in the space. How did that happen? I'm just a guy with a podcast. So, you know, I definitely don't want people to be intimidated just because I've been doing this for so long. I'm just a guy who is curious and likes to talk about interesting stuff.
[01:02:55] Unknown:
And I think, I mean, I suffer from the same thing of people are like, oh, Scott, you're the expert. And I'm like, I'm not doing this work. I'm going and extracting this information from people. Like, that's what I do with the podcast, because I'm interested, and I think it will be interesting for the audience. But I'm not an expert, especially on data mesh stuff. We're so early, you know. Zhamak and I have talked about this. Nobody can be an expert right now because we're figuring it out. You know, if somebody in 2010 was saying, I have microservices all figured out, you'd just laugh them out of the room, especially in hindsight. Right? I mean, even with data engineering, you are an expert. But, you know, anytime I suggest it to somebody, a lot of people go, that sounds really intimidating. And it's like, Tobias is really nice. He's gonna take care of you. He's gonna ask, you know, good questions. And the worst thing that can happen if they reach out to you is that you say, that's not really an episode I'm that interested in doing right now. You know, I've suggested a few people, and you're like, that's not actually a thing that I'm interested in right now. And it's like, okay. Right? It's not like, oh, no, that's the end of the world.
Do you have any advice? I just don't have vendors on my show, just because right now in data mesh, they just wanna vendor-wash. They just wanna sell, and I don't go super deep into the technology aspects like you do. But, like, do you have any advice for vendors? Like, 1 thing that I've always kind of said is: vendors, create a platform for your users, and they'll positively mention you, and they can talk about really interesting things. Like, do you have advice for vendors reaching out, and then maybe we can go into advice for the nonvendors, the practitioners, reaching out? Yeah. So, I mean, 1 thing is
[01:04:36] Unknown:
do your homework; make sure I haven't already talked about exactly what you're trying to talk about, at least in recent memory. The other piece is, I'm always happy to have vendors on the show, because they're the people building the things that we're all using, but I'm not gonna bring you on the show just to kind of repeat your branding or your marketing material. I'm actually going to dig in and figure out what is it that you're building, and how, and why, you know, so that people who are trying to do that evaluation are a step ahead of the game. So that when they do get on to their sales calls, they can say, well, on the podcast you said that this is how it's supposed to work; how does it actually work, and how are you gonna make it work for me? Versus, you know, oh, well, I'm gonna bring you on the podcast, and I'm just going to happily repeat whatever your marketing material says about, you know, you can pour puppies and rainbows and unicorns into my data lake, and everything will be magical. Like, what are the pain points? What are the things you can't do? You know, when are you the wrong choice? I definitely
[01:05:28] Unknown:
enjoy being able to have people on so that we can dig into all the cool and fun things that you're doing, but I'm also gonna ask about the things that you can't do, and I expect that you're going to be honest about it. Yeah. I'm always skeptical when somebody says, I don't know, there's not that many things we can't do. And I'm like, well, you just lost your credibility. You spent an hour building it up, and you lost it in that 1 little second. But, like, I mean, do you like it when vendors bring you their users? And it's not like we're gonna have you talk about how you used the x y z tool, but it's like, hey, here's somebody who had this big challenge, or they've got this set of 5 challenges.
[01:06:00] Unknown:
Is that something that you're kinda open to? Is that something that you find useful? I've definitely had plenty of vendors who I've worked with, who I've had on the show, and then who have said, hey, I've got this customer who's doing something interesting. They happen to use us, you know, do you wanna have them on the show? I've had lots of conversations like that. Generally, they go pretty well. So definitely interested in just kind of talking to people about the challenges that they have dealing with data, how do they address them, you know, what are the things that they still can't do, what keeps them up at night. So just my general filter is I try to make a show that I find interesting, that I find educational, and helps me figure out how to keep doing my job better.
[01:06:38] Unknown:
Yeah. Same. And 1 thing that I try to tell people is: believe that you are good enough. Right? And if it's like, hey, this isn't really a fit, it's not that you aren't good enough. It's just, I've talked about this already, or this is something that's not really that interesting to me or the audience, or something like that. But, like, for me, I think that's 1 of those things of, you know, at least I just see myself as just a guy that's asking stupid questions and that has done, you know, 120 interviews. So my questions are gonna sound smarter than, you know, somebody who's just starting out, because I've talked to so many people about this stuff. And, like, you are good enough to be on this. And, Keanaret, I'm talking to you specifically at this point, so I told you I would say that you should be on, and I fully mean it. But, like, do you have any kind of words of wisdom for people, to encourage them to be brave?
[01:07:29] Unknown:
The way I always frame these interviews is that I think about it as a hallway track conversation that you'd have at a conference. So you just bump into somebody, and you overhear them talking about, oh, man, I was up late last night because my Spark cluster failed. And, hey, what happened with that Spark cluster? Why did it keep you up? Could you have done it in the morning? No, because I had to have the report out at 5 AM for my stakeholders. Oh, great. Well, you know, why do you have your stakeholders looking at reports at 5 AM? So it was that kind of
[01:07:56] Unknown:
curiosity and interest, where you just walk up to a complete stranger and then you start, you know, talking through all of their, you know, successes and failures as somebody working in this space, because we all have them. So we're all just people trying to do our best. I think that's part of the reason why your podcast has been so successful: you don't ask people to prove themselves when they come on. You've already said, okay, I think you're an interesting person, I wanna dig into that. So I do try and tell people: you are good enough, be brave, and kinda reach out. So I'm all good on my end. Is there any way that you'd kinda like to wrap up, or did you wanna ask yourself your question of any tooling? Or, you've kinda talked about where you're seeing a lot of the metadata issues, but any way that you'd wanna wrap up the episode?
[01:08:37] Unknown:
I'll kind of close with the gap, the 1 that I've brought up before: I still think that there is a missing link in the kind of data production stage, in making sure that there is a smooth transition into the data platform and analytics. Particularly in application frameworks, where you think about, oh, I have my ORM that helps me work with my database. I don't have to think about the mechanics of the database too much. I can just do what I wanna do and write my business logic. How do we get to the point where we can capture the context and semantics of the domain objects that we care about in the application, and expose them without dehydrating them as we send them over into the data lake? How do we maintain all of that richness and context, and do it in a way that the application engineer doesn't have to spend an extra 6 months building some fancy new feature just to pipe it into the data platform and then have to change it again? Like, how do we make that something that is just, you know, pip install a package, or, you know, drop these 5 lines of code into your data model, and it just does what it's supposed to do? Yeah. Zhamak and I have had that conversation about, like, why do we decompose
[01:09:43] Unknown:
everything about the information? And so it's just the ones and zeros, and we're focused on processing the ones and zeros, and then trying to re-add all the context, instead of treating it all as, like, a package of information at once. And so, yeah, I fully agree on that. So, we talked a little bit about how people should reach out. Any other way that you'd kind of... I've got a little call to action that I recorded ahead of time, so that I don't go on and on. The 1 thing I did forget is: if you are interested in figuring out how to do data contracts, do get in touch with me, because I am putting together a working group around that. I don't want to ever have to hear about data contracts again; I want this group to figure it out and kind of move forward relatively quickly. But outside of that, is there any way you'd kinda wanna wrap up the episode?
[01:10:27] Unknown:
Yeah. So I'll say, for folks who appreciate your work on this show, I'll remind them that you do have your own podcast, and I'll have you add your preferred contact information to the show notes for anybody who wants to follow along with the work that you're doing and get in touch. And just thank you for taking the time today to join me and grill me on my experiences with building a data platform and running the podcast. It's always fun being on the other side of the mic and being the person who gets to answer the questions. So I appreciate you leading this conversation and taking the time to help the audience explore a bit about my side of things that doesn't always get out there. Well, you've learned so much. Right? You've done so much of this that it's super valuable for the audience out there. So, again, thank you, Tobias, for having me on, and thank you for being such a great quote, unquote guest on your own podcast.
[01:11:19] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Background
Tobias Macey's Career Journey
Starting the Data Engineering Podcast
Building a Data Platform at MIT
Choosing the Right Tools and Vendors
User Experience and Orchestration
Metadata Challenges and Solutions
Advice for Building Data Platforms
Closing Thoughts and Call to Action