Summary
The ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns in the popular narrative: data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
- Your host is Tobias Macey and today I’m interviewing Colleen Tartow about her views on the forces shaping the current generation of data architectures
Interview
- Introduction
- How did you get involved in the area of data management?
- In your opinion as an astrophysicist, how well does the metaphor of a starburst map to your current work at the company of the same name?
- Can you describe what you see as the dominant factors that influence a team’s approach to data architecture and design?
- Two of the most repeated (often mis-attributed) terms in the data ecosystem for the past couple of years are the "modern data stack" and the "data mesh". As someone who is working at a company that can be construed to provide solutions for either/both of those patterns, what are your thoughts on their lasting strength and long-term viability?
- What do you see as the strengths of the emerging lakehouse architecture in the context of the "modern data stack"?
- What are the factors that have prevented it from being a default choice compared to cloud data warehouses? (e.g. BigQuery, Redshift, Snowflake, Firebolt, etc.)
- What are the recent developments that are contributing to its current growth?
- What are the weak points/sharp edges that still need to be addressed? (both internal to the platforms and in the external ecosystem/integrations)
- What are some of the implementation challenges that teams often experience when trying to adopt a lakehouse strategy as the core building block of their data systems?
- What are some of the exercises that they should be performing to help determine their technical and organizational capacity to support that strategy over the long term?
- One of the core requirements for a data mesh implementation is to have a common system that allows for product teams to easily build their solutions on top of. How do lakehouse/data virtualization systems allow for that?
- What are some of the lessons that need to be shared with engineers to help them make effective use of these technologies when building their own data products?
- What are some of the supporting services that are helpful in these undertakings?
- What do you see as the forces that will have the most influence on the trajectory of data architectures over the next 2 – 5 years?
- What are the most interesting, innovative, or unexpected ways that you have seen lakehouse architectures used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Starburst product?
- When is a lakehouse the wrong choice?
- What do you have planned for the future of Starburst’s technology platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Starburst
- Trino
- Teradata
- Cognos
- Data Lakehouse
- Data Virtualization
- Iceberg
- Hudi
- Delta
- Snowflake
- AWS Lake Formation
- Clickhouse
- Druid
- Pinot
- Starburst Galaxy
- Varada
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.
With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Colleen Tartow about her views on the forces shaping the current generation of data architectures, from the modern data stack to data mesh and beyond. So, Colleen, can you start by introducing yourself?
[00:01:41] Unknown:
Absolutely. Thanks for having me. I'm Colleen Tartow from Starburst. We are based on the open source platform, Trino, and our software allows access to data in pretty much any platform or source. So you can do your analytics on data directly at the source, either with SQL or through an integration with your favorite analytics tool. I personally run the enterprise engineering organization, and I've been interested in enterprise scale data management and architecture for quite a long time because it's a really interesting space with some fun problems to solve and cool ideas like data mesh to think about. And I'm really pleased to be here today and looking forward to a good conversation.
[00:02:18] Unknown:
And can you share how you first got involved in the area of data management?
[00:02:22] Unknown:
Yeah. I kind of fell into it. I really love hard problems, and I love data and math and numbers. And data management, which in my mind is how to organize both the data and the people around it, is a really interesting, complex problem. And so after I left graduate school and got involved in data software, it was the early 2010s, when big data was the hot topic. And I found it really interesting that the focus was always on moving data around. And then, eventually, the people in the data world moved on and started to get into data science and analytics, which, you know, is actually how you get the value out of the data, and it's really incredibly important to the business. And I was working at an enterprise ETL software company at the time, and we kept seeing these huge challenges around organizing, accessing, securing, and then analyzing massive datasets.
And then I moved on, and I was working in analytics and data science because that's also really fascinating. But I kept coming back to the data engineering and the data management side of things. Because without that, you can't do the analytics, and you can't do data science and get the value out of the data. And you need that solid strategy. So I always find that really interesting.
[00:03:30] Unknown:
When I was preparing for this interview, I noticed that you also have a background in astrophysics, which is kind of hilarious given that you now work for a company called Starburst. The company recently released a product called Starburst Galaxy, and a starburst galaxy is an actual astrophysical phenomenon that generates stars. And so I'm wondering if you can, in your expert opinion, explain how the metaphor of a starburst and a starburst galaxy maps to the technical platform that you're managing and building and some of the interesting metaphors and parallels that go between them.
[00:04:02] Unknown:
No. And it's funny because when I was initially speaking to Starburst, I kinda laughed and said to my husband, it would be really funny if I worked there because, you know, my PhD is in it. We just kinda giggled about it. But now here I am 2 years later working here, and it's great. So starburst galaxies are galaxies that are undergoing a period of really intense star formation, and it's usually caused by, like, a gravitational encounter with another galaxy. And then those stars age several billion years, and, eventually, they all explode. And it creates this coordinated burst of luminosity from the stars, and all this energy comes away from the stars. And what's interesting about these is that they're exceptionally bright, but you can also predict how bright they're going to be. So if you know how bright they should be and you know how bright you measure them to be, you can say a lot about the universe that exists between you and the starburst. And so you can study them from really far away too, because they're so bright. And when you study things that are far away in space, you're actually studying them back in time, because light has a finite speed.
And so, you know, for an object that's 1 light year away, its light takes a year to get to you. So when you see it, you're seeing it how it was a year ago. So the farther away you look in space, the farther back in time you're looking, which is really cool. And so, all that said, starburst galaxies are these beacons of light in the universe. And they're interesting both internally and in the context of providing information about the universe around them, because we can study that light that they're emitting and absorbing and bending with their gravity. So to me, there's an obvious parallel with Starburst, the data company. Our technology allows users to use whatever format of analysis works for them, whether it's SQL or a BI tool or Python or whatnot.
And with that capability, we can shed light on new insights faster and easier than we'd be able to if we had to follow the legacy paradigms of moving data around and architecting it specifically for a single use case, which presupposes that you know exactly what you're looking for. So the cool thing to me, about both starburst galaxies and Starburst the data technology, is that they're interesting on their own, but when you apply them to the universe around them, they provide these additional insights and a faster, more convenient way to get information about their surroundings.
So maybe I'm pushing that metaphor a little far, but I do think it's there. Yeah. It's definitely great.
[00:06:27] Unknown:
And so in terms of the overall trends of data architecture and some of the ways that things like Starburst and Trino are able to influence and, in some ways, circumvent architectural constructs and constraints. What do you see as the dominant factors that typically influence a team's approach to data architecture and design given the current ecosystem of technologies and systems and objectives that are in play?
[00:06:57] Unknown:
It really depends on the starting point in a lot of ways. Right? So it depends where you are and how, what I call, data mature you are. So for an older legacy system that's undergoing, like, a multiyear strategic digital transformation, that's gonna look a heck of a lot different than a cloud native startup that's just getting off the ground. And like I said, there's this concept of data maturity that I like to think about, and it's a measure of how much the people, processes, and technologies are data forward at an organization. And so what's interesting is that it's not directly correlated to the size or the age of a company or the technology platforms that they're using or their budgets or the number of people or the overall strategy of the company, but rather it's a combination of all of these things.
And it really asks, how much is this organization enabled to use its data to drive forward its vision and make strategic decisions based on data? So when an organization is making a decision on how to approach or define a strategy, there are all these factors that end up going into it. But what I think is really important is that there's a focus on the business value that will be attained with the data. So you start with the question of how are we gonna make this business more successful through data, and then you get into the details, tracking that back to the actual data needs and making sure that you're leaving room for innovation around that as well. And then from there, you can start to evaluate where you are and design an architecture and choose technologies and a strategy that helps you evolve your data environment into a more mature stance.
[00:08:38] Unknown:
As far as the current patterns of the, quote, unquote, modern data stack, which is still somewhat of a nebulous term (pun intended, but very apt), and also the advent of data mesh, and to some extent the collision of those 2 principles: what do you see as some of the points of confusion or opportunity that exist for people who are either evolving their existing data stack or coming into a greenfield to adopt those patterns and paradigms, and maybe some of the potential pitfalls that they need to be aware of as they go down that journey?
[00:09:16] Unknown:
It's a really good point that the modern data stack is incredibly nebulous. You know? I mean, there's websites dedicated to figuring out what it means, and every vendor has their own definition of it. And it's all over the data zeitgeist these days. Right? And so is data mesh, for that matter. And they kind of, in some ways, in my mind, come in at opposite angles. Right? So let's start with the modern data stack. Right? And I will talk for hours about this, so definitely cut me off if you need to. So the modern data stack is the idea of taking data from a source, copying it into some sort of centralized, data-focused storage, and then using analytics tools to gain insight from that data.
And when you describe it like that, it sounds beautifully simple. But like most things in life, it's not that simple. In reality, you really have to curate that data. And the challenge is where do you transform it and how do you handle issues around things like latency, complexity of pipelines, and the vendor lock-in that you end up getting. Because, you know, as you're trying to evolve your stack, you're working with partners, but you end up getting locked into certain technologies that can get very expensive as you get more mature and more advanced. And so a key point that I think about a lot is that the modern data stack is actually not that modern. Right? 40 years ago, folks were doing the exact same thing. They were taking data from their mainframes and putting it into Teradata and then using Cognos on it. Right? So it's the same idea about curation, and there's nothing super new to that. Right? And so the main improvements that have happened are the separation of storage and compute and cloud architecture, and you've got these managed SaaS platforms and tools, which is great. But, honestly, it's largely the same paradigm we've always had, with the centralized target data store driving that architecture and the organizational structure around it. And, you know, I think the key is also understanding that there's a people story there too. And so, all of that said, you know, I think the modern data stack is really wonderful when you're starting up a data story and getting started on your journey of data maturity. Right? There are really good benefits in that. It can be quick to spin up. I mean, you can go from 0 to having that modern data stack in a day. And if you think outside the box and you focus on the ultimate goal of getting business value and data insights from your data, there are some really modern improvements that you can make that will accelerate the speed to value for that data.
So, for example, you know, you can be cloud native these days, and it's really quick to spin up that stack, where you could use, you know, S3 for storage and Starburst Galaxy for curation, and then use your favorite BI tool layered on top for analytics. And in a short amount of time, you're up and running and producing value, and that is really cool. And so that's sort of where I think the modern data stack is going, and it's allowing organizations to scale and mature and grow as much as they need to without locking you into specific architectures. Then for data mesh, on the other end of the spectrum, right, data mesh really doesn't make sense at the scale of a single modern data stack. Right? It's more for large and complex enterprise companies, like a telecom or a financial services company. Something where you've got regulatory complexity on top of your physical complexity, your organizational complexity, your data complexity. And that's obviously not going to be an environment where you can spin up a quick modern data stack and call it a day. So there are typically completely independent business units with independent data strategies, architectures, structures, and data.
And in these cases, I think the idea of bringing your data together into a centralized store, the way you do with that modern data stack, just isn't gonna fly. You know, organizations have tried for years to have that single source of truth for data, and it was just never truly a success. And that's what leads to the idea of the data mesh: that centralized data architectures and organizations lead to the same challenges over and over and over again. And this is why digital transformations fail, and that's why we need to do better so that we can get business value out of the data at scale. So that's where data mesh comes in, focusing on both the organizational and architectural sides of a decentralized data strategy.
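[Editor's note: the source-to-centralized-store flow described above can be sketched as a toy, in-memory ELT pipeline. Everything here — the source function, the `RawStore` class — is an illustrative stand-in, not any real vendor's API.]

```python
# Toy sketch of the modern data stack's ELT flow: extract from a source,
# load raw into a centralized store, transform inside the store.

def fetch_source_rows():
    """Extract: pull raw records from an operational source (hardcoded here)."""
    return [
        {"order_id": 1, "amount": "19.99", "region": "us"},
        {"order_id": 2, "amount": "5.00", "region": "eu"},
        {"order_id": 3, "amount": "12.50", "region": "us"},
    ]

class RawStore:
    """The centralized store. In ELT, it receives the data untouched."""
    def __init__(self):
        self.raw = []

    def load(self, rows):
        # Load: no curation yet, just copy the raw rows in.
        self.raw.extend(rows)

    def transform(self):
        # Transform: curation happens inside the store, after loading —
        # cast the string amounts and aggregate revenue per region.
        totals = {}
        for row in self.raw:
            totals[row["region"]] = round(
                totals.get(row["region"], 0.0) + float(row["amount"]), 2
            )
        return totals

store = RawStore()
store.load(fetch_source_rows())
report = store.transform()
```

The lock-in the answer describes shows up at exactly the `transform` step: once the curation logic lives inside one vendor's store, moving it is the expensive part.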
[00:13:35] Unknown:
In terms of the sort of technical underpinnings of these different platforms, a lot of the conversation around the modern data stack has been centering around the different data warehouse vendors. Snowflake has definitely been getting a disproportionate amount of that focus. And as somebody who is working on and building a system that can act as 1 of those central clearing houses of information in the form of Trino, and building out the sort of modern lakehouse paradigm where you're getting the benefits of the warehouse with the scale of the lake, I'm wondering how you think about the staying power of the current formulation of the modern data stack versus the overall principles that it's encompassing, some of the ways that things like data virtualization, as offered by Trino, etcetera, are able to facilitate some of the concepts that are core and central to the data mesh paradigm, and some of the ways that those 2 can be in synergy with each other versus at odds with each other.
[00:14:42] Unknown:
Yeah. And I think given that they're sort of coming in from different angles, whereas the modern data stack is more nimble and a faster solution, and data mesh on the other hand is a journey and an evolution of a strategy. You know, I think what's really interesting to me is where these 2 things intersect. Like, how can we be more intentional about transitioning from the modern data stack driven world where we're focusing on speed to value and then maturing thoughtfully into that decentralized and data product driven world. Right? And so I think the lakehouse can be a key part of that story too because the real benefit of the lakehouse is that it's providing the functionality of that warehouse, including that user experience and visibility of the data to the end users with the scale and the low and nearly linear cost per gig of a data lake. Right?
So it's really about being intentional about where you're applying the business logic to the data. Right? Is it coming in initially on write, or is it coming in on read? And so there's sort of this idea with the modern data stack where, if you develop a modern data stack and it gets larger and larger, you sort of end up with a data lake anyway, because you typically do have a staging area before data gets loaded into Snowflake or whatever cloud data warehouse you have. So you end up with a staging area that is much like a data lake, and then you've got the cloud data warehouse sitting on top of it. And so in my mind, you know, you've got this transition from ETL to ELT that focuses on bringing data together into that centralized storage layer, which is the staging area. And then you've got that warehouse part of the lakehouse that's a given in the modern data stack. So in some ways, you're kind of building it without even intentionally building it. And the challenge is to make that a more thoughtful architecture for a lot of modern data stack users. And then as you build out to scale, how do you really articulate that architecture and organizational structure for a larger scale, as you acquire other companies, as you break out different business units, things like that? And that's where the mesh idea comes in, which is really decentralized architectures based on each domain doing what's best for themselves, but producing data products and really treating data as a first class product.
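[Editor's note: the write-time versus read-time distinction in the answer above can be illustrated with a toy sketch. This is plain Python with hypothetical names, not any real warehouse or lake API.]

```python
# Schema-on-write (warehouse-style) vs. schema-on-read (lake-style):
# the same business logic, applied at different points in the flow.

raw_events = [
    {"user": "a", "ms": 1200},
    {"user": "b", "ms": 800},
]

def to_seconds(event):
    # The "business logic": convert raw millisecond timings to seconds.
    return {"user": event["user"], "seconds": event["ms"] / 1000}

# Schema-on-write: transform once at load time; readers only ever see
# the curated shape, and the raw form is gone.
warehouse = [to_seconds(e) for e in raw_events]

# Schema-on-read: store the raw events untouched; each query applies the
# logic when it runs, so later queries can reinterpret the raw data.
lake = list(raw_events)
lake_query_result = [to_seconds(e) for e in lake]
```

Both paths give the same answer here; the difference is that the lake still holds the raw events for future, different questions, which is the flexibility the lakehouse tries to keep while adding warehouse-style usability.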
[00:17:15] Unknown:
In terms of the adoption of the lakehouse, it's definitely been picking up speed in the past year because of the growing maturity of the underlying technologies and capabilities that are offered, particularly with things like the Iceberg and Hudi and Delta formats, to provide a more well engineered table structure on top of the underlying storage where there isn't actually any real table to be spoken of. So you get some of that power of fully integrated database engines, with things like time travel and, you know, MVCC and the ability to evolve schema in a more natural format. And I'm wondering what you see as the pieces that have been missing to date that have made things like Snowflake and BigQuery and Redshift the default, de facto core elements of the modern data stack, and how the evolution of these lakehouse technologies is starting to maybe level the playing field and make that a more viable option as the core central default technology that an organization might orient around to be able to get the kind of combined benefits of cheap storage at scale and performant queries on semi structured and structured data.
[00:18:30] Unknown:
Yeah. I mean, I think you hit upon a lot of the recent developments that have really accelerated the growth of the lakehouse as a viable option. I think with the cloud data warehouse, I mean, in 5 minutes, you can have a query running. Right? And that user experience of going from 0 to queries is really attractive, whereas it's taken a bit longer for data lakes to get up to that user experience. Right? Like, we now have Lake Formation. We have all these other great technologies that are allowing people to spin up lakes, but it's still not as clean and easy for just any data engineer to do this. Right? Like, I think, you know, I don't wanna say my mom could do it, but she probably could. Right? She could probably spin up a Snowflake account and get up and running. But I also think that there's a lot of technologies out there that are allowing you to get to the point where you can kind of set it and forget it with the infrastructure side of things, which is really the power of that cloud data warehouse.
And the challenge is that, unlike a lake, the cloud data warehouse gets very expensive very quickly. Right? Like, you know, my mom's spinning up a data warehouse. My poor mother, picking on her. But if my mom were to spin up a cloud data warehouse and then forget about it, she'd get a crazy bill because she forgot about it. Right? Whereas with the data lake, it's just storage. Right? It's really your storage and your compute as opposed to someone else's storage and their compute. Right? And so that startup cost for lake analytics has been higher than spinning up a cloud data warehouse just by virtue of the technology. But I would argue that by thinking outside the current strengths of the modern data stack, there's this whole new class of tools emerging that provide the speed to value and the user interface on top of the lake or the lakehouse. And so that would be things like Starburst Galaxy, where, you know, you just point us to your storage, and then we handle the compute.
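[Editor's note: the time travel feature of table formats like Iceberg, Hudi, and Delta, raised in the question above, rests on immutable snapshots. Below is a heavily simplified toy sketch of that mechanism — real formats track immutable data files through metadata rather than copying row lists, and this `SnapshotTable` class is purely illustrative.]

```python
# Toy sketch of snapshot-based time travel: every commit produces a new
# immutable snapshot, and readers choose which snapshot to read.

class SnapshotTable:
    def __init__(self):
        self.snapshots = []  # each commit appends a full, immutable snapshot

    def commit(self, rows):
        current = self.snapshots[-1] if self.snapshots else []
        # Writers never mutate an old snapshot; they publish a new one.
        self.snapshots.append(current + list(rows))

    def read(self, version=None):
        # Omitting `version` reads the latest snapshot; passing an older
        # version number is the "time travel" read.
        if version is None:
            version = len(self.snapshots) - 1
        return self.snapshots[version]

table = SnapshotTable()
table.commit([{"id": 1}])  # version 0
table.commit([{"id": 2}])  # version 1 now holds both rows
```

Because old snapshots are never mutated, concurrent readers get a consistent view without locking, which is the MVCC-style behavior the question mentions; query engines typically surface the version selection as an `AS OF` style clause in SQL.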
[00:20:21] Unknown:
Going back to the higher level architectural principles of the data mesh and the modern data stack, picking a bit more on the modern data stack right now. As you mentioned, a lot of the core ideas that are being executed on it aren't really new at all, but they're being repackaged as this bright, shiny approach to how to build things because of the fact that we have these cloud technologies, and so that's what makes it modern. I'm wondering what are the core elements of it that will continue to have staying power, and what are the pieces of how people think about the, quote, unquote, modern data stack that are liable to shift in the next year, 2 years, 5 years' time because it's just the latest trend and some of the areas of convergence that you see coming down the road for some of the currently disaggregated technologies where there's opportunity for simplifying the experience, simplifying the technologies, and starting to combine them into not necessarily as fully integrated of a stack as we had with things like Informatica, but, you know, not as disaggregated as we have right now where you have to have 5 accounts across 5 different vendors to be able to get what most people view as the modern data stack. It's a lot to unpack there. So
[00:21:41] Unknown:
I'll start by talking about the modern data stack. And, you know, I do think that it really is legacy paradigms built with modern tools, which, I mean, there's nothing wrong with that. Right? But like you said, you end up with 5 different tools, which are built around this paradigm of moving your data away from the source. So the closer you can get to analyzing the data at the source, the better off you'll be. And with modern cloud technologies, that's a reasonable expectation, right, because you do have scaling. Right? You have auto scaling. You have horizontal and vertical scaling. There's no reason you can't now query your data at the source or as close to the source as you can get. And so I think the idea of building all these pipelines to centralize your data is, hopefully, a thing of the past, because I do think that is a legacy technology idea. And so, really, the ideas of query engines and data virtualization and data federation and query federation all come into play here, because you do have this premise: your data's already stored. Don't store it somewhere else, but instead use a query engine to query it directly at the source. And then you can hook in whatever analytics tool downstream you want, with all the, you know, modern networking and things like that. And so you really don't need to move data around as much as you used to. And if you still do want that data lake capability, you can have that, but, again, still use that query engine rather than having to actually move all the data into a centralized storage platform. Right? And you can retain control over your data in a way that you couldn't in the past, which is great. And then you had asked, you know, what other forces I think will have influence over the trajectory of these architectures?
Looking into my crystal ball, I do think that, you know, a key deciding factor is this idea of data as a product. And this is what's at the heart of the data mesh, but it's bigger than that; you know, there's so many blogs and podcasts and everything out there about data as a product now, because it's long been the case that data has been a side product or a byproduct of business, and we've been trying to, after the fact, treat it like a product. But we're getting closer to the source again. And so it's no longer a pet project where you hire a couple of data engineers and trust them to just handle it all. It's now a main product that you're creating. And so, focusing your strategy around data, you need to be certain that your data is high fidelity, it's reliable, and it's produced with the consumers and the consumption in mind. So the treatment of data as a first class product in the business is really essential now, and that needs to happen at all levels.
And it also needs to be part of the culture of your company if you wanna truly be data driven. And so I think with the exponential growth of data, because the volume alone has, you know, been just absolutely bigger than anyone could have imagined, interesting features like separation of storage and compute are now essential. They're not optional. And so, you know, if you wanna be data driven, that doesn't come for free. And so folks have really embraced that at a high level, but when it comes to actually executing on making data a key part of culture and strategy, it takes more than good intentions. It takes training and tooling and cultural alignment. So I think that will inform the technologies downstream over the next 2 to 5 years.
[00:25:00] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. What are the pieces of technology or the operating paradigms or, quote, unquote, best practices of today that will go the way of MapReduce 5 years ago as we continue to iterate into the future?
[00:26:09] Unknown:
I mean, I'm not gonna say that cloud data warehouses are on their way out. Right? Because I don't think we're close enough to that. But I do think that people are starting to understand that the modern data stack is, like, a key set of 5 technologies that you put together. You know, it's this magical data stack that you can get value from immediately. You know, I think people are starting to question whether it's as simple as it claims to be because there's all these different plug and play pieces that you need to put in to really make it viable at scale. And so I think folks are starting to think more about what are we gonna do in 3 years. Right?
And what are we gonna do when we have so much data that the modern data stack with its cloud data warehouse is cost prohibitive. Right? And so I think people are starting to think about data as a product and how that works in that world, and how does the modern data stack either serve or not serve that use case.
[00:27:05] Unknown:
In that sense of people starting to, you know, dig themselves deeper as they say, okay. Well, I'll I'll just use these de facto tools that everybody else is using because they say it's easy to get up and running. What are those missing pieces that they start running up against as they go further along in their journey and they start to say, okay. Well, in my business unit, these 5 tools are great, but now I actually need to start expanding out to the entire organization or the entire enterprise, and, oh, shoot. This doesn't work anymore. Or, oh, shoot. I just spent $1,000,000,000 on my data warehouse.
[00:27:37] Unknown:
Yeah. Absolutely. I mean, I think governance is a huge thing. Right? Like, governance, understanding lineage of data, understanding the quality of data. You know, there's all these data observability tools now, and I think that's a really fascinating field because the analytics are only as good as the quality of the data you're putting in. And that's especially true the more advanced you're getting in your analytics. I mean, data science, if you've ever done data science, you need just, like, absolutely massive quantities of incredibly reliable data. Right? And so the modern data stack isn't really intended for that use case in my mind so much as a data lake would be. Right? And so, again, it gets back to leaving the data closer to the source, so you're doing less to it. So it is more reliable.
And so I do think governance is a huge piece of that. I also think performance is a huge piece of that. Right? Like, your cloud data warehouses get really expensive if only because of storage, but the compute to get the performance that people require is just absolutely incredible. And so you end up really paying for that performance.
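The observability checks she alludes to often boil down to simple assertions on the data itself. Here is a hand-rolled sketch of one such check, failing fast when a column's null rate crosses a threshold; this mimics the idea rather than any particular tool's API, and the field names and threshold are invented:

```python
# Illustrative only: a hand-rolled version of the null-rate check that
# data observability tools automate. Field names and the 50% threshold
# are made up for the example.
def null_rate(rows, field):
    """Fraction of rows where `field` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) is None)
    return missing / len(rows)

orders = [
    {"order_id": 1, "customer_id": 101},
    {"order_id": 2, "customer_id": None},   # a bad record slipped in
    {"order_id": 3, "customer_id": 103},
]

rate = null_rate(orders, "customer_id")
print(f"null rate: {rate:.2f}")  # null rate: 0.33
assert rate <= 0.5, "customer_id null rate too high -- block the pipeline"
```

Running checks like this before data reaches the warehouse is exactly the "analytics are only as good as the quality of the data" point: catch the bad record upstream, where the context to fix it still exists.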
[00:28:42] Unknown:
On the data lake and data lakehouse aspect of performance, I know that that's an area that that has seen a lot of investment in tools such as Trino and in the work that you're doing at Starburst. I'm wondering what are some of those performance bottlenecks or the constraints or some of the ways that data teams need to think about the way that they're laying things out on disk, partitioning, which table format to use? Do I need to use Iceberg? Do I need to use Hudi? Do I need to use Delta? You know, maybe I need Hudi because I need streaming inserts, and so I wanna make sure that I have performant queries on newly written data. Like, what are some of those challenges that people are facing as they try to adopt some of the Lakehouse technologies so that they can scale, but they also are looking for high performance or maybe it's just that, okay. I need to actually use my lakehouse for massive scale analytics, but for anything, you know, performant with low latency, I actually need to stick it into ClickHouse or Druid or whichever OLAP store I might have, you know, is the flavor of the month.
[00:29:41] Unknown:
Yeah. Flavor of the month is a good way of putting it. Yeah. I mean, I think that there are still some edges that need to be addressed in that lakehouse world. Right? It doesn't have a seamless experience yet, like I've mentioned. You know, users of cloud data warehouses, they're used to that experience. And so this is something where we wanna deliver the experience within the lakehouse ecosystem as well that you get at that cloud data warehouse. And so there are a lot of tools like dbt and Great Expectations that are delivering cloud data warehouse like value in the lakehouse context as well, which helps it to get more mature and helps it really to allow users to think more about the business value and think less about the particular formats. Right? And I would also add that, arguably, the lakehouse is suffering from all of the associated issues with both the data lake and the data warehouse. Right? You get the best of both worlds, but you also can sometimes get the worst of both worlds. And so in the data lake, you have data that's difficult to access and understand, and you need context added to it. Whereas in the data warehouse, you have issues around agility and enabling different data context. So, you know, the way a marketing organization would consider a customer is gonna be very different than the way a risk team considers a customer. And so, you know, this is 1 reason why things like data mesh become so interesting because you start thinking of data as a product, and it's aligned cross functionally regardless of whether it comes from that warehouse, that lake, or a lake house.
[00:31:11] Unknown:
And so as data teams are faced with these decisions of, okay, do I build around the modern data stack, or do I just evolve my current system to add whatever missing capability there is, or, you know, do I need data mesh, or am I not at the scale where that makes sense? What are some of the ways that you have seen teams start to approach those decisions and questions and some of the discovery efforts that are needed to be able to make informed choices in this constantly evolving and very confusing world of data that we're living in? It's funny because it can be confusing, but it's also
[00:31:47] Unknown:
a buffet of choice in a lot of ways. And so as with any product development, you know, I, again, solidly think that we need to treat data as a product. And so I think domains or the people creating data products can benefit from things like product planning and agile methodologies and creating MVPs just like any other product development organization would. And so finding the right technology partners that allow folks to develop quickly and iterate and get value out of those first few data products in short order is a really great way to start proving value in any transition. So that's the same as you would with any other technology product. And I think there needs to be sort of a people aspect of it, which is why I think data mesh is really hitting home for people these days is because its creator calls it a sociotechnical paradigm. And, you know, I think the fact that people are included in that is really important. So a domain is, you know, a group of data creators with a clearly defined business purpose like finance or sales or manufacturing. And so both within and across domains, you want to make sure that data product developers are using the same definition of what a data product is.
And then on the consumption side, you want a feedback mechanism for data product consumers and developers to work together on how to get data downstream to users. And so while the modern data stack can be helpful in these cases for, like, spinning up data environments for individual domains, you also need to think cross functionally about how that becomes a layer in which data products are produced and consumed. Right? And so I think there's this interesting world where the modern data stack intersects with data mesh. Right? And I think there's an interesting story that's coming out there that I do think is still being formed by the community.
You know, I have thoughts about it, obviously, with Starburst at the center. But, you know, I do think that there's interesting ways of thinking about all of these different paradigms at different scales and where they intersect. And so I think you need to define your strategy and focus on the business value. Right? Like and sort of iterate on all of these things at once. So, you know, whether it's a lake house or a warehouse or a data lake or, you know, the modern data stack and how that intersects with all of this. You know, I think you need to focus on shortening and streamlining the path from the data to the value that it creates.
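The "same definition of what a data product is" that she wants domains to share can be made concrete as a small contract object that producers and consumers both reference. This is a hypothetical sketch, not a schema from data mesh or Starburst; every field name here is an assumption:

```python
from dataclasses import dataclass, field

# A minimal sketch of a shared "data product" definition that domains
# could agree on. All field names are assumptions, not a standard.
@dataclass
class DataProduct:
    name: str
    owner_domain: str            # e.g. "finance", "sales"
    schema: dict                 # column name -> type: the consumer contract
    freshness_sla_hours: int     # how stale the data is allowed to get
    description: str = ""
    consumers: list = field(default_factory=list)

    def validate_record(self, record: dict) -> bool:
        """A record honors the contract if it has exactly the agreed columns."""
        return set(record) == set(self.schema)

orders = DataProduct(
    name="orders_daily",
    owner_domain="sales",
    schema={"order_id": "int", "customer_id": "int", "amount": "float"},
    freshness_sla_hours=24,
    description="One row per order, refreshed nightly.",
)

print(orders.validate_record({"order_id": 1, "customer_id": 101, "amount": 20.0}))  # True
print(orders.validate_record({"order_id": 1}))  # False
```

The point is less the code than the agreement: once every domain publishes something shaped like this, the cross functional consumption layer she describes has something uniform to build on.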
[00:34:12] Unknown:
1 of the complexities that comes up as you are starting to go down this path of data as a product is the need to bring application teams along for the journey. And a lot of times, their incentives aren't necessarily aligned with that of the data team in terms of being able to actually make this a reality because their goal is I need to ship features for the widget because people are asking for the widget to be blue instead of green. I don't have time to figure out all of the things that go along with making sure that the data that I'm generating in that process actually maps to the concepts and the business objects that need to be exposed in this, you know, data interface for other people to be able to consume because that's not something that I'm ever going to be using, and it's not something that this person who, you know, goes to the website to interact with the blue widget is ever going to care about. So how do we think about updating the incentive structures to make sure that application development teams and data teams and business teams are all aligned in that process of both producing these applications that are actually driving the business, but also generating the data products that are necessary internally and for feeding back into those applications to kind of keep everybody moving along and aligned in terms of how to think about things, the sort of data modeling principles that are necessary to make these performant and understandable, and just all of the education that goes along with making this a reality.
[00:35:43] Unknown:
Yeah. I mean, I think you absolutely hit the nail on the head with 2 points there. 1 is incentivization and 1 is education. Right? I don't think there's a recipe you can follow necessarily, but I do think it depends on your overall data strategy. But if you're thinking of data as a product, then the product development teams need to understand that that is 1 of their deliverables. Right? And it may slow down widget production for a bit, right, while they get their feet under them and learn how to do some basic data engineering. And the other option is to take data engineers and put them on the product development team so that they're working alongside the widget developers.
But I do think that it needs to be part of the overall corporate strategy as opposed to being a separate data strategy that is completely independent from a product development strategy. And so I think that's what we mean when we say that data is a product. Right? Like, data is something that that team is now responsible for that they weren't before, and you have to frame it well. Right? There's change management that comes in here, and you have to frame it as, hey. We're gonna teach you new skills, and you're gonna learn new things, and this is something you put on your resume. And, you know, I think that there needs to be product ownership and product management around data the same way there would be around anything else. And so, you know, and there's a product life cycle too. So data can expire or it can be phased out or it can be versioned and all of these good things. But, you know, I think we need to take it that step further and really say it's not just that the developers are now responsible for feeding a pipeline. It's more that the developers are the ones who understand the data. Right? Like, they are the subject matter experts here, and so they should be responsible for it as opposed to throwing it over a fence to another team, that centralized mythical data team that's an expert in all data across the enterprise because, I mean, that never works. Right?
So, I mean, I've worked at companies that I will not name, where I've owned a central data function. And, you know, I've had engineers be like, oh, we deleted all our data because we figured we were putting it in the warehouse. Why would we back it up? And I'm like, that's not how this works. Right? Like, you have to care about your data. And they're like, but we don't. And, you know, it's your problem. So, you know, there has to be, like, that cultural shift where, you know, the engineers understand that they're producing this data, and it is 1 of the products that they're responsible for. Yeah. And
[00:38:05] Unknown:
the metaphor of throwing it over the fence, in some cases, isn't even actually apt because a lot of the times, there isn't even really any intentionality in the application team of handing off the data. It's just it's in the database. Good luck.
[00:38:22] Unknown:
Or, like, I dumped it in an s 3 bucket. Oh, you need to know what it looks like? That's a you problem. Right? Exactly. I can tell you and I have both been in that situation. But, yeah, I think it's something that, you know, it needs to be driven from the top down and the bottom up in different ways. But I think it's education and intentionality.
[00:38:48] Unknown:
you know, top level corporate strategy of we're data driven and then nothing under it. Right? Absolutely. And from the developer tooling perspective, this is a subject that has been coming up a lot is if you're building a regular web application, whether it's using something like Spring in the Java world or Django in Python or Rails with Ruby, a lot of the way that you interact with the data is through this abstraction of the ORM where the database is there, but you don't think about it as the database. You just think about it as this is where the objects go until I need them again. And so your primary interface is through the code, and so a lot of times that leads to if you're just looking at the database tables and the structures there, it can seem very disjointed and chaotic as to why are all these tables named this way, or why do I have 5 tables for this 1 concept?
Because a lot of times, application teams aren't thinking about the data modeling from a database engineering perspective. They're just thinking about it from an object interaction perspective. And I think that that's also where a lot of this confusion comes in for data teams who are trying to then reverse engineer meaning out of these database tables that they're replicating into the warehouse or into the lake. And I'm wondering what you see as some of the opportunity for either injecting a new abstraction layer alongside the ORM or in tandem with the ORM or inside the ORM to be able to build up these domain objects and these business objects that are actually semantically meaningful for building these data products so that you don't have to do as much reverse engineering from the database layer or so that you can provide a more natural API from the application for doing some of this data extraction in a semantic way instead of just in a very mechanical way that then requires a bunch of extra processing steps downstream.
[00:40:39] Unknown:
Right? Like, this is a product. It is a downstream thing that the data creators are responsible for. So they have to start thinking about modeling in that way. And so I do think that there's product management that needs to come in. Right? And whether it's an actual product manager or if it's, you know, some other person who's involved like a data product owner, you know, that's something I see being bandied about a lot these days is that role. But it's the idea that when products are being designed, design the data as well. Right? Like, get out in front of it rather than being more reactive. But, also, when you're designing it and you're thinking about downstream users for the product, also think about the downstream users for the data.
And so, again, it's absolutely additional scope. Don't get me wrong. And it will slow down that product design phase. But that said, you're saving money on the back end because you're no longer having to retrofit things. Right? Like, you're no longer having to say, oh, this thing is this hideous JSON. Let's figure out how to get it into a table that can be used by Tableau or something. Right? So, like, you're actually giving, you know, more thought upfront to save yourself time on the back end. And I do think that that is a different muscle to flex for engineers. Right? Like, that is not something that they're used to. They're used to sort of saying, oh, well, you know, my deliverable is the widget. Right? That is what I care about, and that's what we've designed. And, you know, later on, we can futz with it to make sure that the data is a little better. But, you know, instead of doing that, why not be intentional about both things from the get go?
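The "hideous JSON" retrofit she describes looks roughly like the flattening step below; designing the record shape with downstream consumers in mind from the start is what makes that step disappear. All field names here are invented for illustration:

```python
import json

# What "retrofitting hideous JSON" often amounts to: flattening a nested
# app-centric payload into the tabular shape analysts needed all along.
# (Every field name here is invented for illustration.)
event = json.loads("""{
    "widget": {"id": 7, "props": {"color": "blue", "size": "L"}},
    "user": {"id": 101},
    "ts": "2022-06-01T12:00:00Z"
}""")

def flatten(e):
    """The after-the-fact reshaping a downstream data team ends up owning."""
    return {
        "widget_id": e["widget"]["id"],
        "widget_color": e["widget"]["props"]["color"],
        "user_id": e["user"]["id"],
        "event_ts": e["ts"],
    }

row = flatten(event)
print(row["widget_color"])  # blue
# If the application emitted a record shaped like `row` in the first
# place, this whole translation layer would not need to exist.
```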
[00:42:28] Unknown:
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end to end with a mix of your code and their open source low code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you're ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations. Go to dataengineeringpodcast.com/shipyard today to get started automating with their free developer plan.
Putting you on the spot a little bit in terms of some of the kind of architecture design aspects when you are starting to move towards data mesh where you say, okay, I want to have this domain data product. What are some of the guiding principles that you have seen useful in understanding what level of granularity is applicable where, you know, is the data product just all of the data that pertains to this application where, you know, the data product is the data that lives in the database for this 1 web app? Is the data product this aggregation of web apps that all pertain to a given business unit within the organization? Like, what are some of the guiding principles that you have seen be helpful in figuring out what is that appropriate
[00:44:00] Unknown:
domain boundary for me to then build this sort of mesh interface on top of? Yeah. Absolutely. And I mean, I think at Starburst, we've seen people come at it from different angles, and I do think that as long as you're consistent, you can kind of do it however works for your organization. Because a lot of this is sort of the idea of having a contract with the consumers and the developers who are creating the data products to say, this is what we're building. And a lot of it is just metadata. Right? Saying, like, here's the context for this thing and allowing it to be more self serve on the consumption side. Right? But in reality, you know, there's no such thing as self serve.
And so I do think that, you know, we have a data products interface where we allow people to just use SQL to create data products so they don't have to learn a new technology, and we present it as either a view or a materialized view that they can then consume downstream. And we're building that view, so you're, again, getting closer to the source of the data. So I do think that providing something that has less lift for the engineers creating the data products is really important. On the downstream side, making it so that the consumers are using something that's familiar to them, like SQL, is really important as well. But I do think that when you're talking about, you know, the scope, like I said, the consistency is the key. And then on top of that, you want to make sure that you've provided all the information that a downstream user needs to be independent. You want to make sure that these are, you know, discoverable and accessible and well governed and, you know, clean and reliable and all that good stuff. But if you do that, then you will necessarily have considered the downstream use case as well.
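The "just use SQL to create data products" idea she describes is essentially a curated view over raw tables, which can be mimicked locally. This uses sqlite purely as a demo stand-in; in Starburst the view would sit over federated sources, and the table schema here is invented:

```python
import sqlite3

# Rough stand-in for publishing a data product as a SQL view: consumers
# query the curated view, never the raw table. (sqlite is just a demo;
# the raw schema and business filter are invented.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, cust INTEGER, amt REAL, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", [
    (1, 101, 20.0, "complete"),
    (2, 102, 35.5, "cancelled"),
    (3, 101, 10.0, "complete"),
])

# The "data product": friendly column names, business filter baked in.
conn.execute("""
    CREATE VIEW completed_orders AS
    SELECT id AS order_id, cust AS customer_id, amt AS amount
    FROM raw_orders
    WHERE status = 'complete'
""")

rows = conn.execute(
    "SELECT order_id, amount FROM completed_orders ORDER BY order_id"
).fetchall()
print(rows)  # [(1, 20.0), (3, 10.0)]
```

The consumer only needs SQL they already know; the producer's curation (renames, filters, joins) is hidden behind the view, which is the low-lift contract she's describing.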
So I was talking to someone at a client who had 38 domains within 1 BU. Right? And that made sense for them. Right? And then I've talked to customers who are smaller companies, and they have 4 domains. Right? And so I think it just depends, like, how your business organization is structured as well, how you think about breaking down your business into individual units.
[00:46:00] Unknown:
As the data lake gains steam and gains polish, what are some of the elements that you think are necessary that are still being developed to bring it to the kind of level of accessibility that the data warehouse vendors currently offer? What are some of the areas of investment that are happening that people should be keeping an eye on as they are starting to make these decisions of, you know, which technology stack to use, which vendors to go with, which architectural paradigms are going to make sense for them. Do I want a data lake, a lake house, a warehouse? Do I just want everything to be in my Kafka topics?
[00:46:41] Unknown:
All of the above. Yeah. And I think people do get paralyzed by choice these days in some ways just because there is a ton of choice and, you know, everybody says, oh, you've got the modern data stack. You've got data lake. You've got lake houses. You've got data mesh. And it's like, you know, you can become an expert on precisely 0 of these things just because they're such broad topics. Right? And so I think focusing on the business use cases, you know, job number 1 here, really understanding, like, is your job to get real time analytics, or is your job to provide downstream analytics that your customers are gonna consume and that you're gonna actually sell your data in the long run. Right? So, like, you have to really understand what is your use case and then work back from there.
I think that finding flexible technologies that really focus on performance and scalability and simplicity are really important. You wanna avoid getting locked into 1 vendor because 5 years from now, things are gonna be completely different. Right? They were different 5 years ago. They're gonna be different 5 years from now. That's 1 thing I will say for sure. Right? And so I think that having that flexibility and avoiding getting locked into a specific technology is really key. And so I think the lake house is, you know, an effect of that, right, in that people went all in on a lake or went all in on a data warehouse, and then the lake house allows you to sort of inch away from whatever you chose and sort of get the benefits of the additional architecture. And so I think there's sort of these strategies, these huge strategic things like data mesh where it's a journey and you'll never kinda get there. And then there's things that you could spin up today like the modern data stack. Right? And then there's a whole world in between. So you have to sort of figure out where do you see yourself on that maturity spectrum, and then what are your business goals? And then sort of drill down from there into, you know, maybe that means you need an s 3 data lake. Right? Maybe it means that you already have everything in Parquet and you're good to go. Or maybe it means that everything's in Excel and you have to work back and really start from the beginning there. Not to knock on Excel, but, you know, because I feel like that's a whole episode in itself.
[00:48:42] Unknown:
And as you look toward the sort of continuing evolution of this landscape, what do you think are going to be the major shaping forces over the next 2 to 5 years that push the architectural trends from where they are now to wherever they are going?
[00:49:02] Unknown:
I do think that data as a product is 1 of those trends. And like we were saying, I think allowing non data engineers to produce data products is going to be key. And I think there's a few different factors there. 1 is hiring. It is really hard to hire people, and so you want people and technologies that can be flexible. And, you know, data reliability and governance is key. I mean, governance is always a thing. Right? Everybody's been doing governance forever because it's 1 of the hardest problems we have. But I do think that that is going to influence our trajectory because we've got all these new and exciting regulatory compliance initiatives that we need to handle over the years. And so, you know, now it's GDPR. Who knows where it will be 5 years from now?
And then, you know, I think the technology and, you know, the cloud evolution has been really fascinating. Right? Like, 10 years ago, it wasn't where it is now, and it's gonna continue evolving. And, you know, quantum computing will be a really interesting play here too. Like, I think there's a lot to be done for performance. But, you know, I think people want answers now. They don't wanna be spending, you know, 6 months spinning up a data stack. They want their answer now, and then they wanna know that they can rely on that answer. And then they need to figure out their longer term strategy from there. So I do think that governance and speed are really 2 key pieces
[00:50:20] Unknown:
here. As you have been working at Starburst with your customers, what are some of the most interesting or innovative or unexpected ways that you have seen this sort of data virtualization, data lakehouse technology, however you wanna phrase it, being used particularly in these contexts of the modern data stack and data mesh?
[00:50:39] Unknown:
Yeah. Absolutely. I mean, on the modern data stack side of things, obviously, we have Starburst Galaxy now, which is our completely managed and hosted Trino as a service sort of platform. And so you really can spin up a Starburst environment incredibly quickly. Right? You just create a quick account, point us to your data, and you're good to go. I do love seeing how our customers are spinning up really exciting analytics very quickly with Starburst Galaxy, which is really fun. I love talking to some of our enterprise customers too because they've just done some really interesting things. We have 1 digital customer, Comcast, that built out a lake house that handles all their streaming and traditional structured data. And they built it using traditional data modeling. And they provide this self-service data repository for all their different departments, and each department spins up their own cluster with their own technology, and then they can query the data however they want. So, you know, it's interesting seeing how this works at that kind of scale.
And we also see a lot of organizations that are doing both analytics and machine learning. And in that case, the lakehouse really fits well because the warehouse side of things is serving the analytics use cases, the BI tools and reporting and things like that, whereas the lake is serving the ML case. Right? And I think that's probably, you know, 1 of those best of both worlds situations. And then you've also got things like time travel. Right? Time travel is amazing, and you've got that capability to see how your data was on some arbitrary date in the past, which is useful for debugging and compliance and BCDR and all that good stuff. Yeah. People are doing some cool stuff.
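The time travel capability she mentions rests on immutable snapshots: every commit is kept, and a read picks whichever snapshot was current at the requested point. Below is a toy in-memory version of that idea, not the actual API of Iceberg or Delta, with invented version numbers and rows:

```python
import bisect

# Toy sketch of snapshot-based time travel, the idea behind table formats
# like Iceberg and Delta (not their real APIs): every commit stores an
# immutable copy, and reads pick the latest snapshot at or before a version.
class SnapshotTable:
    def __init__(self):
        self._versions = []    # commit ids, appended in increasing order
        self._snapshots = {}   # commit id -> immutable row tuple

    def commit(self, version, rows):
        self._versions.append(version)
        self._snapshots[version] = tuple(rows)

    def as_of(self, version):
        """Rows as they looked at `version` (latest commit <= version)."""
        i = bisect.bisect_right(self._versions, version) - 1
        if i < 0:
            raise LookupError("no snapshot at or before that version")
        return self._snapshots[self._versions[i]]

t = SnapshotTable()
t.commit(1, [("acct-1", 100)])
t.commit(2, [("acct-1", 100), ("acct-2", 50)])
t.commit(3, [("acct-1", 75), ("acct-2", 50)])  # acct-1 updated

print(t.as_of(2))  # (('acct-1', 100), ('acct-2', 50)) -- the pre-update view
print(t.as_of(3))  # current state
```

Real formats key snapshots by timestamp as well as id and store them as metadata over files, but the debugging and compliance uses she lists fall out of exactly this "latest snapshot at or before" lookup.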
[00:52:13] Unknown:
In your experience of working in this ecosystem and exploring and understanding and helping your customers come to terms with these architectural patterns, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:27] Unknown:
It's really fun being on the cutting edge of data. Right? Like, it's also incredibly challenging. And I think, for me, the key is just realizing that change is hard for people. Right? And whether it's because we're in a remote world and there's a pandemic and all of our data is no longer applicable or it's changed the way we interpret it, or just trying to change the way people think about data. You know? People are comfortable with what they know, and they don't like the unknown. I mean, it's human nature. I have a toddler. I get it. But I do think we're coming up with some really new and innovative ways to think about data management and data access and data processing and then how all of those intersect.
Right? That's the complex problem that I think we're all really trying to answer. You know, convincing folks to think beyond these old paradigms like the modern data stack. Right? It's a fun challenge that, you know, I get to think about it a lot at Starburst. So it's kind of fun seeing how people are innovating to answer all of these questions.
[00:53:24] Unknown:
And for people who are starting to explore the modern data stack and data mesh, what are the cases where the lakehouse paradigm is the wrong choice and they are better suited going in 1 direction or the other of the lake or the warehouse? Yeah. I mean, I think it gets back to what are the business questions you're trying to answer. Right? If you're just trying to do, like, straight reporting and BI tooling, like,
[00:53:45] Unknown:
you know, maybe a warehouse is right for you. If you don't have huge data volumes, a warehouse can work really well. Or if you don't care about performance as much. Right? Or if you don't have a huge budget, maybe on the other side, like, a data lake is better. So, you know, if you've got all of your data in the lake, consider why you need that lake house. What's the business purpose? And I would argue that with more modern technology, the storage layer is actually the key. And the layer on top of that, whether it's within the CDW or within some sort of query engine, is where the business logic gets applied. Right? And so don't lock yourself into an expensive vendor.
I mean, all of these vendors, what they're really doing is, you know, object storage plus a SQL access layer. So you kind of have the object storage already in the lake. So I would recommend focusing on the business driver and streamlining that technology stack. Using something like a query engine on top of a lake is really what the cloud data warehouses are doing anyway, and it's becoming more common to handle this in house because you do have these cool tools like Starburst that can help you do that.
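The "query engine on top of a lake" pattern described here can be sketched with Trino SQL, the engine underlying Starburst. The catalog, schema, and table names below are hypothetical, purely for illustration; the point is that the engine reads the lake's object storage directly and can even join it against an operational database in a single query, with no warehouse load step:

```sql
-- Hypothetical federated Trino query: "hive" is a catalog pointed at
-- Parquet files in object storage (the lake), "postgresql" is a catalog
-- for an operational Postgres database. Business logic is applied at
-- query time; the storage layer stays the system of record.
SELECT
    o.region,
    sum(o.amount) AS total_sales
FROM hive.sales_lake.orders AS o          -- files on S3/GCS, queried in place
JOIN postgresql.crm.customers AS c
    ON o.customer_id = c.id
WHERE o.order_date >= DATE '2022-01-01'
GROUP BY o.region;
```

This is the streamlining being recommended: the business logic layer lives in the query engine, so you are not locked into a vendor's warehouse to get SQL access to data you already have in the lake.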
[00:54:49] Unknown:
As you continue to iterate on the Starburst technology platform and Starburst Galaxy in particular, what are the things you have planned for the near to medium term or any of the pieces of integration that are missing or necessary or upcoming that will help to make it a more equal citizen in the modern data stack with some of these data warehouse platforms?
[00:55:12] Unknown:
First and foremost, obviously, Starburst Galaxy is really taking off. It allows us to bring the power of Starburst to users and really get them up and running with the power of Trino in virtually no time at all for setup, and there's no infrastructure to worry about. It's already fully managed and hosted. So I think Starburst Galaxy really puts us firmly in that modern data stack category. Also, within both Galaxy and Enterprise, which is what I focus on, we have this built in access control system coupled with, you know, data products functionality, and it makes us a really excellent partner for enterprises on their data mesh journey, which is where I think a lot of people will end up. So if you can get started on that journey sooner, I think that's really important. And then the big news from last week is that Starburst just acquired Varada, which is a performance accelerator for the Starburst ecosystem, and it's been great because we have a new office in Israel. We have fantastic new engineers who have just joined us who really understand the power of Trino and what we're doing in the marketplace.
And beyond that, you know, the acceleration of our already best in class query speed that we see in Starburst, I think it's gonna blow people's minds. I'm really excited about that.
[00:56:12] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:32] Unknown:
I do think that there is a gap at that inflection point I discussed, of what happens when you go from small to enterprise. Right? You know, you can have a modern data stack if you're small, and that works really well. But then once you've got multiple business units, you've got multiple modern data stacks, and how do you get from there to something really mature like a data mesh? Right? It's that sort of adolescent phase, the teenage phase, of the data management story. And I'm really fascinated by that, and I think that it hasn't been addressed yet, as far as I know. And I'm really curious to see how it plays out. There are a lot of startups out there, and there are a lot of enterprises out there, and I think that inflection point is a really interesting area to study. So, you know, I like sticky problems, and that's a really cool sticky problem I hope to tackle soon.
[00:57:26] Unknown:
Thank you very much for taking the time today to join me and share your thoughts on the current state of data architectures and the technology forces that are helping to shape them. It's definitely a very interesting, constantly evolving area that is hard to keep track of, so I appreciate all of your time and energy in helping us explore some of those patterns and paradigms and how to think about them. I appreciate that, and I hope you enjoy the rest of your day. Thank you so much for having me. It's been really fun. Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Colleen Tartow: Introduction and Background
Astrophysics and Starburst Metaphor
Current Trends in Data Architecture
Modern Data Stack vs. Data Mesh
Lakehouse Paradigm and Data Virtualization
Future of Modern Data Stack
Data Quality and Observability
Data as a Product and Organizational Alignment
Developer Tooling and Data Modeling
Choosing the Right Data Architecture
Future Trends in Data Management
Starburst Galaxy and Future Plans
Closing Remarks