Summary
Building a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and what he has learned from running the podcast that is influencing his choices.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- I’m your host, Tobias Macey, and today I’m sharing the approach that I’m taking while designing a data platform
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the components that need to be considered when designing a solution?
- Data integration (extract and load)
- What are your data sources?
- Batch or streaming (acceptable latencies)
- Data storage (lake or warehouse)
- How is the data going to be used?
- What other tools/systems will need to integrate with it?
- The warehouse (BigQuery, Snowflake, Redshift) has become the focal point of the "modern data stack"
- Data orchestration
- Who will be managing the workflow logic?
- Metadata repository
- Types of metadata (catalog, lineage, access, queries, etc.)
- Semantic layer/reporting
- Data applications
- Implementation phases
- Build a single end-to-end workflow of a data application using a single category of data across sources
- Validate the ability for an analyst/data scientist to self-serve a notebook powered analysis
- Iterate
- Risks/unknowns
- Data modeling requirements
- Specific implementation details as integrations across components are built
- When to use a vendor and risk lock-in vs. spend engineering time
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
[00:01:46] Unknown:
I'm your host, Tobias Macey. And today, I'm going to be sharing some of the approach that I'm taking while designing a data platform, some of the things I've been thinking about, and some of the lessons that I've learned from the podcast that have fed into those decisions. So if you are listening to the show, there's a good chance that you're familiar with who I am, but just to give another brief introduction, I'm the host of the podcast. I've been running this for about 5 years now.
I've also been running Podcast.__init__, the Python podcast, for about 7 years. And for my day job, I run the platform and DevOps team for the Open Learning department at MIT. And I got involved in data management through my work as a systems administrator, software engineer, and now a DevOps and platform engineer. And in that journey, I've been very interested in the data elements and being able to build reliable and performant data systems, which is what led me to create this podcast. And so in the past couple of years, I've been thinking a lot about how to architect a data platform so that it will fulfill the needs of the organization, some of the specific constraints that I'm dealing with, and just generally the pieces that you need to consider when you're starting down that path, whether you have an existing data platform or if you're starting to build something out from scratch.
And so, in terms of the initial implementation of what we've been using, we started off with installing a system called Redash, which is an open source business intelligence and dashboarding system that allows you to write SQL in its visual editor and execute it, schedule reports, and connect to a number of different data sources. And so we took advantage of that because it was something that we had available. It was fairly straightforward to get it up and running and connect it up to the different systems, but it didn't scale very well, particularly as we started to need to do more complex analysis and we wanted to be able to join across multiple different data sources.
So we've definitely been hitting a lot of limitations of that system. And I've been thinking a lot about how do we want to architect a more full featured and robust platform to be able to address all of the data needs of the organization and how to make that maintainable from a platform perspective and accessible from an end user perspective, which is a lot of the things that we talk about in this podcast, sometimes at very specific component levels, sometimes more broadly. So I just wanted to share some of the thoughts that I've had and some of the considerations that I'm going through right now as I begin some of this architectural planning for implementing these data infrastructure components and how to stitch them together into an overall platform.
And in terms of being able to actually build the data platform, there are a number of different components that go into it. At a high level, there is data integration, which is the extract and load: being able to pull your data from the source systems into some centralized storage layer for all of the downstream analytics. Then there is data storage: are you going to store it in file or object storage and build out some sort of a data lake, or are you going to rely on structuring the data and put it all into a data warehouse? And then you need a data orchestration layer to manage the data integration, the data transformations, and any downstream uses of the data.
And it's important to have a robust metadata repository to maintain the record of all of the different components of the system, the data that you have, the way that it's being used, auditing, and access control. And once you have all of the data in that central storage system, you need to have some sort of a semantic layer, whether that lives in your business intelligence tool or in one of these newer systems that are being called the metrics layer or the semantic layer or headless BI. And all of this is essentially useless if you don't have some utility of the data: some sort of a data application that's actually going to serve end users or be used for building machine learning models. And so these are a lot of the things that I'm thinking about as far as how to stitch together these different pieces and different concerns.
There are a few different philosophies around that, where some folks will say you need to have a fully vertically integrated solution where you have everything living together in one tool so that there's a consistent experience across it. That's definitely something that's very popular, particularly in larger organizations, because of the amount of complexity that already exists. You want to be able to have a way to reduce that complexity. In the other direction, you have what's being termed the modern data stack, where you have individual best of breed components where each different tool will focus on one of these different concerns or maybe have some overlap into a couple of them. So you have this unbundling of the data stack, but then you also have other organizations that are working to repackage that modern data stack, where they will abstract over those different tools and infrastructure components to be able to give you that consistent experience again, but still be able to leverage some of the innovations and new capabilities that are introduced by these more recent contenders in the space.
And so digging deeper into some of those specific layers, beginning with data integration, which is where all data endeavors begin because you need to have some sort of data to work with. That has typically been extract, transform, and then load, where you need to perform some sort of cleanup or initial modeling before you load it into your data warehouse. With some of the cloud based capabilities and more recent advancements in data management layers, that has shifted into an extract and load phase, where you will just pull the raw data, load it as is into some of these destination systems, and maybe do some very light transformation.
And some of the things that you need to think about when you're deciding what you're going to use as that data integration layer: What are some of the sources that you're dealing with? So are you pulling from application databases? Are you pulling from third party SaaS platforms? Are you in control of the data that you have, or are you just pulling flat files off of some sort of file share from a vendor or a partner? And do you need to deal with real time access to data as it changes? So are you just dealing with periodic batches, whether that's on the frequency of minutes, hours, days, or weeks, or do you really need to be able to process each event as it occurs, with as little latency as possible?
Because each of those capabilities is going to bring in different infrastructure and architectural requirements around the entire rest of the data platform. Some people will advocate that you need to start with streaming because batch is just a special case of streaming, where you're just doing coarser grained events. But it also requires more upfront investment in terms of the sophistication that you're dealing with. For my own purposes, I'm primarily dealing with application databases and some third party SaaS platforms, so I'm most likely going to be focusing on a batch approach using something like the Singer specification or something like Airbyte.
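To make the Singer reference concrete, here is a minimal sketch of what a tap's output looks like under the Singer specification: newline-delimited JSON SCHEMA, RECORD, and STATE messages written to stdout for a target to consume. The stream name and fields are hypothetical, and a real tap built with something like the Meltano Singer SDK would layer discovery, incremental state handling, and error handling on top of this.

```python
# A minimal, hypothetical "tap" emitting Singer-formatted messages to stdout.
# Stream name and fields are illustrative only.
import json
import sys


def emit(message: dict) -> None:
    sys.stdout.write(json.dumps(message) + "\n")


# Describe the stream's shape so the target can create a matching table.
emit({
    "type": "SCHEMA",
    "stream": "users",
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "email": {"type": "string"},
            "created_at": {"type": "string", "format": "date-time"},
        },
    },
})

# Emit one RECORD message per row pulled from the source system.
emit({
    "type": "RECORD",
    "stream": "users",
    "record": {"id": 1, "email": "user@example.com", "created_at": "2022-01-01T00:00:00Z"},
})

# Persist a bookmark so the next run can resume incrementally.
emit({"type": "STATE", "value": {"bookmarks": {"users": {"created_at": "2022-01-01T00:00:00Z"}}}})
```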
And in terms of my approach to selecting the different tools and components, I generally bias towards open source, both because that is an ecosystem that I have spent a lot of time in, and so I feel very comfortable there, but also because, as a tinkerer and somebody who works at the platform and infrastructure level in sort of the reliability space, I really like being able to have access to the internals of the tools and the systems that I'm operating, so that if something does go wrong, I have it in my power to dig in and fix it. That is in contrast to the convenience provided by using a vendor, because they will be the ones responsible for a lot of that reliability engineering, so it alleviates a lot of the burden on your own engineering resources.
So there's definitely a trade off to be made there, and that's something that I think about a lot: am I biasing too heavily towards open source? Should I be pushing this onto a vendor, maybe a vendor that's running open source software, so that I can do some of the debugging and be able to provide more detailed feedback to the vendor in the event that something does go wrong? So that's something that colors my overall thinking about the platform layer: that bias towards open source and running it myself. And so, in terms of the data integration layer and some of the choices that I'm making, my current thinking is to buy into the Singer ecosystem, most likely leveraging something like Meltano to be able to handle some of the actual stitching together, monitoring, and generation and tracking of the metadata related to those executions.
I still need to prove out that implementation to make sure that it fulfills all the needs that I have, but I like that using such an open and flexible specification gives me the option to invest in the long tail of data sources and destinations, whereas if I'm using a vendor such as Fivetran, I'm a little bit at the mercy of what they have implemented. And because I'm going to be working with applications that my team owns and modifies, as well as working with some open source components and helping to contribute back to that ecosystem, being able to build my own customized taps and targets means I can add support for the systems that I need to rely on. That being said, there are some event stream components to some of the systems that I'm running, so at some point I will need to consider how I'm going to incorporate streaming capabilities. Currently, I rely on batching up those events and publishing them as JSON files to S3, so I'll still be able to analyze them, but at higher latencies. So it's a question of whether those latency requirements will ratchet down and introduce the need, sooner rather than later, to invest in a streaming environment.
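As a rough illustration of that batch-and-publish approach, the sketch below buffers events and writes them to S3 as a newline-delimited JSON object using boto3. The bucket name and key layout are hypothetical, and credentials are assumed to come from the standard boto3 environment configuration.

```python
# A rough sketch of batching events into newline-delimited JSON files on S3.
# Bucket name and key prefix are hypothetical.
import json
from datetime import datetime, timezone

import boto3

BUCKET = "example-event-lake"   # hypothetical bucket
PREFIX = "raw/app_events"       # hypothetical key prefix


def flush_batch(events: list) -> str:
    """Write a batch of event dicts as one newline-delimited JSON object on S3."""
    body = "\n".join(json.dumps(event) for event in events)
    now = datetime.now(timezone.utc)
    key = f"{PREFIX}/dt={now:%Y-%m-%d}/batch-{now:%H%M%S}.jsonl"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key


if __name__ == "__main__":
    flush_batch([{"event": "page_view", "user_id": 1, "ts": "2022-01-01T00:00:00Z"}])
```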
My personal preference there is most likely to use something like Pulsar because of its flexibility in terms of operational characteristics, as well as the interfaces that it provides. So I like the fact that it has compatibility offerings for Kafka because of the size of that ecosystem, but it also supports the pub/sub approach to messaging, such as what RabbitMQ provides with AMQP. So it's just a very flexible ecosystem. That's something that I haven't explored fully yet, but that's my current thinking on the matter. And I've just heard a lot of reports of people running into challenges dealing with Kafka, particularly as it scales, even when using some of these managed solutions, just because of some of the foundational architectural elements of how consumers are managed, how that maps to topics, and how you need to do a lot more upfront planning and consideration in terms of how you design your topics and event streams. So having that greater flexibility, in terms of being able to iterate and discover and modify as you go without having to do so much upfront investment in designing the system, is appealing.
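If streaming does become necessary, the produce and consume sides of a Pulsar setup could look roughly like the sketch below, using the pulsar-client Python library. The service URL, topic, and subscription name are placeholders rather than recommended settings.

```python
# A rough sketch of producing and consuming events with Apache Pulsar's
# Python client (pulsar-client). URL, topic, and subscription name are
# placeholders for illustration.
import json

import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Publish an application event to a topic.
producer = client.create_producer("persistent://public/default/app-events")
producer.send(json.dumps({"event": "page_view", "user_id": 1}).encode("utf-8"))

# Consume through a named subscription; shared or failover semantics are configurable.
consumer = client.subscribe(
    "persistent://public/default/app-events",
    subscription_name="analytics-loader",
)
msg = consumer.receive()
print(json.loads(msg.data()))
consumer.acknowledge(msg)

client.close()
```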
Moving into the data storage layer, there are some additional considerations to be made. How is the data going to be used? Is it primarily going to be textual data? Is it something that can be easily structured into a table format? Or are you also going to be dealing with unstructured data sources such as images, videos, and binary object data such as PDFs, or specialized formats for genomics or geospatial information? Personally, I'm mostly dealing in the space of structured and relational data, maybe some semi structured JSON objects. But there is still a lot of discovery to be done in terms of some of the unknown data sources that I need to interact with as I start to scale out the platform. And I also like the flexibility of being able to store things in a file layer, because having it in a loosely structured data lake gives me the option to invest in modeling and optimizing some of the access around that data by loading it into a data warehouse or an OLAP store as a secondary concern, without having to make that investment upfront and constrain some of the downstream capabilities that I have. So I'm optimizing for flexibility at the cost of a little bit more complexity in the overall stack that I'm going to be operating.
[00:16:48] Unknown:
TimescaleDB, from your friends at Timescale, is the leading open source relational database with support for time series data. Time series data is time stamped so you can measure how a system is changing. Time series data is relentless and requires a database like Timescale DB with speed and petabyte scale. Understand the past, monitor the present, and predict the future. That's Timescale. Visit them today at dataengineeringpodcast.com/timescale.
[00:17:18] Unknown:
Some of the other things to consider are: What are some of the other tools and systems that you're going to need to integrate with? So, are you planning on using a data quality vendor that only works with some of the main data warehouses? Are you going to be relying on SQL as the primary access mode for managing your data? Are some of the reverse ETL vendors that you're going to be relying on only available with some of these data warehouses? So there are some costs and considerations to be weighed when you're deciding whether to use a data lake or a data warehouse approach, particularly as the data warehouse becomes the focal point of the modern data stack.
And so, there are definitely a lot of benefits to be had by using a data warehouse from one of the big vendors, such as BigQuery, Snowflake, or Redshift, because so much of the ecosystem has invested in working well with those systems: analyzing the query logs to auto generate lineage information, introspecting the table schemas to load them into your data catalog, and so on. So there are a lot of companies that are investing in that ecosystem. That being said, there has also been a movement towards standardizing on what the lakehouse architecture will look like, where you have your data stored in file or object storage and you're building out the storage layer as a data lake, but you're also able to take advantage of some of the warehouse semantics and access patterns through one of these SQL interfaces to object storage, such as Presto or Dremio or Trino.
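To give a sense of what that lakehouse-style access looks like in practice, here is a hedged sketch of querying tables backed by object storage through Trino's Python client. The host, catalog, schema, and table names are assumptions for illustration only.

```python
# A rough sketch of querying data-lake tables through Trino's Python client.
# Host, catalog, schema, and table names are illustrative assumptions.
from trino.dbapi import connect

conn = connect(
    host="trino.internal.example.com",
    port=8080,
    user="analytics",
    catalog="hive",   # a catalog mapped to the S3-backed metastore
    schema="raw",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT date_trunc('day', from_iso8601_timestamp(ts)) AS day,
           count(*) AS events
    FROM app_events
    GROUP BY 1
    ORDER BY 1
    """
)
for day, events in cur.fetchall():
    print(day, events)
```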
And so given my choice to use a data lake storage approach, where I'm using S3 as the storage location for my files, I'm planning on using Presto or Trino as the SQL interface so that I have that data warehouse interface where I can treat everything through SQL and be able to lean on tools such as dbt and some of the other very SQL optimized workflows, but still have the flexibility of access to work with those file objects through other systems, whether that's just pure Python or Dask or Spark. And the other aspect of the storage story when working with data lakes is that you need to think about what the actual format of that data is. So am I just storing it as newline-delimited JSON?
Am I storing it as binary blobs? For any relational data, I'm focused on using Parquet because it's a very well defined and well supported format that gives you some of the advantages of columnar data stores for being able to do aggregate analytics, so it makes it much more performant when you're working with these Presto or Trino or Dremio systems. And in order to be able to stitch all of these things together, you need to have some data orchestration layer that will handle the mapping of the dependencies across these different tasks or datasets, and the periodic schedules that need to be managed, particularly when you're working with these batch systems.
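As a quick aside before going deeper into orchestration, here is a minimal sketch of that conversion step: turning a newline-delimited JSON batch into a Parquet file that engines like Trino, Presto, or Dremio can scan column-wise. The file paths are hypothetical, and in practice the output would land on S3 (for example via s3fs), with pyarrow installed as the Parquet engine.

```python
# A minimal sketch of converting a newline-delimited JSON batch into Parquet.
# Paths are hypothetical; pandas uses pyarrow (if installed) to write Parquet.
import pandas as pd

df = pd.read_json("batch-000000.jsonl", lines=True)   # raw events from the lake
df.to_parquet("app_events.parquet", index=False)      # columnar file for SQL engines
```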
And the data orchestration piece is becoming one of the most important choices that you make, because it becomes the control center of your entire data platform, and it is where a lot of the business logic will live as far as what the mappings of sources to destinations are and what transformations need to be applied and when. It will have a lot of the information about the lineage of your data as it goes from these sources to destinations. The question of whether the primary orientation is task oriented or data oriented will influence a lot of the capabilities that you have. A lot of the earlier generations of workflow orchestration systems were very focused on task sequencing, where you would say, I need to do this task, which is agnostic to the actual data that it's working with, and then I need to do this other task, and it will handle that dependency graph to ensure that those tasks are run in sequence. But without adding in additional logic and additional systems, you're not going to have any concrete guarantees about whether the data that you care about is actually going to be processed properly, or whether the outputs of one stage are consistent with the expected inputs of the next stage. So you need to, again, add in a lot of extra logic and upfront work to make sure that those different task stages are compatible with each other.
And some of the more recent generation of systems are focused on this very data oriented approach, or sometimes data asset oriented approach. So you are able to encode in the logic of the task flows what the types of data are that you're working with, both as inputs and outputs for each of these stages, and then it will be able to give you some early warnings when you're developing your flows if the output of one task is not going to be compatible with the input of another task. And it will also give you insight as to when different data assets are created or modified, where an asset could be a CSV file in S3, it could be a table in a data warehouse, or it could be a dashboard in a business intelligence system.
So being able to get that native view of the actual assets that you care about at the end of the day, versus just whether tasks executed on time, is very valuable. As I'm sure folks who listen to this podcast are aware, I have invested in the Dagster ecosystem, because I think that they do a very good job of bringing that data native aspect to the fore, but there are definitely a number of other great systems out there, so I definitely encourage everyone to go and evaluate those different tools to see how well they suit your own use cases. Some of the other things to think about when you're choosing a data orchestrator are who is going to be managing the workflow logic. So there are a number of different languages and different approaches, where some of them are very code and software engineering heavy, such as Dagster.
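For a flavor of that code-heavy, asset-oriented style, a Dagster definition can be as small as the sketch below. The asset names, input file, and toy aggregation are hypothetical, and this assumes a reasonably recent Dagster release.

```python
# A minimal sketch of Dagster's asset-oriented approach. Asset names, the
# input file, and the aggregation are hypothetical.
import json

import pandas as pd
from dagster import Definitions, asset


@asset
def raw_events() -> pd.DataFrame:
    # In practice this would read the newline-delimited JSON batches from S3.
    with open("batch-000000.jsonl") as f:
        return pd.DataFrame([json.loads(line) for line in f])


@asset
def daily_event_counts(raw_events: pd.DataFrame) -> pd.DataFrame:
    # Dagster infers the dependency on raw_events from the parameter name,
    # so the lineage between assets is part of the definition itself.
    raw_events["day"] = pd.to_datetime(raw_events["ts"]).dt.date
    return raw_events.groupby("day").size().reset_index(name="events")


defs = Definitions(assets=[raw_events, daily_event_counts])
```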
Others are optimized for people who understand the business domain but don't want to dig into the code, where they have visual modeling for being able to build your different flows. So I know that Prophecy is a system that's focused on that, where they're building this low code or no code interface for building your data orchestration, and then it executes on Spark. Other things to think about are what programming languages your team is familiar with. So Dagster, Prefect, and Airflow are all very focused on the Python language and ecosystem, which has grown to become one of the main languages used in data management and data processing.
But there is also still a heavy component of Java developers and Scala developers because of systems such as Spark and Hadoop. And so those are things to think about when you're deciding on what your data orchestration layer is going to be, and who is going to be managing that orchestration layer. Is it something that you and your team have the capacity to deploy and maintain, or do you want to go with some vendor managed or hosted solution for those systems? And then also, as with everything, what is the broader ecosystem around that tool?
Is there going to be general support and accrued industry knowledge about how to do different things with that tool? So Airflow has definitely become one of the main contenders there, where it is a very popular tool and has been around for a while, and so there are a lot of people who understand how to run it, what some of the quirks are, how to deal with some of the different challenges that might come up, and what some of the useful design patterns are. Whereas some of the newer contenders are still going through that process of figuring out what the best practices are for how to deal with the system and how to work with it as you scale, both in terms of data and execution, but also in terms of organizational and logical complexities.
Another piece of the data platform that's very important and has been gaining a lot of attention recently is the metadata layer. So that can take the form of data catalogs, or it can take the form of lineage graphs. Different tools and systems will have their own concepts and representations of metadata, but it's very important to think about how you're going to manage the metadata of your overall platform to be able to get a holistic view of how data is being used and what the different data sources, destinations, and assets are that you have. And there are also different types of metadata, where there's the metadata that is the table schemas and column definitions for information in your data warehouse, but then there's also metadata about who accessed the system at what time, what queries were executed, what the steps were to get this record from the application database that it was created in to this dashboard, what manipulations were performed on it, and how long each of those steps took. And so it's definitely very important to consider what uses of metadata you're going to be relying on and that you need to be able to surface to end users and to operators of the platform.
And what are the different ways that you are able to extract metadata from the different pieces that are working with this information that you care about? And how easily can you transmit that into a centralized metadata repository?
[00:28:21] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and to build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
[00:28:58] Unknown:
There are a lot of different tools out there right now. Some of the main contenders, from a very holistic, metadata writ large perspective, are DataHub and OpenMetadata. There is also the OpenLineage project, with Marquez as the reference implementation, for being able to track lineage specifically and understand what tasks and operations have been performed on data over its lifecycle. And there are a whole host of data catalog options out there for being able to track your data assets, what the access patterns of the data are, and the overall popularity of a particular set of data resources, so that you know what to invest in and what to focus on when there is any sort of a data outage, things like that.
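To make the lineage piece a bit more concrete, here is a rough sketch of the kind of run event defined by the OpenLineage specification, posted to a Marquez backend over HTTP. The endpoint path, namespaces, job, and dataset names are illustrative assumptions rather than a verified integration, and the event is simplified relative to the full spec.

```python
# A rough, simplified sketch of emitting an OpenLineage-style run event to a
# Marquez backend. Endpoint, namespaces, job, and dataset names are assumptions.
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",  # identifies what emitted the event
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "load_app_events"},
    "inputs": [{"namespace": "s3://example-event-lake", "name": "raw/app_events"}],
    "outputs": [{"namespace": "hive", "name": "raw.app_events"}],
}

# Marquez exposes a lineage ingestion endpoint; the path here is an assumption.
requests.post("http://marquez.internal.example.com:5000/api/v1/lineage", json=event)
```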
So those are what I consider to be the main foundational elements of a data platform. Other things that you really need to be thinking about, too, are data quality monitoring and what a lot of folks are calling data observability: being able to know when something goes wrong in your data platform, how you're going to identify where it went wrong and how it went wrong, and how to resolve it. I haven't focused yet on that specific stage of it because I'm still in the early planning phases and initial implementation of some of these more core aspects. But that's definitely something that I've been thinking about as I start to plan this out.
Most likely I'm going to be investing in something like Great Expectations, because it is an open source tool, it's very flexible, and it gives me a way to start writing the contracts that I know I care about. But then there are also a lot of things that you need to be able to monitor and be alerted on that are unknown unknowns, and that's where systems such as Anomalo or Bigeye or Datafold come in, to help with some of that automated detection of sources of errors that you might not know you need to check for until it becomes an issue.
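As an example of the kind of explicit contract that can be written there, the sketch below uses Great Expectations' classic pandas interface. The file and column names are hypothetical, and newer releases favor a validator/checkpoint workflow over this older API, so the exact result objects may differ by version.

```python
# A small sketch of explicit data quality contracts using Great Expectations'
# classic pandas interface. File and column names are hypothetical, and the
# result object shape varies somewhat between releases.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("app_events.parquet")
ge_df = ge.from_pandas(df)

checks = [
    ge_df.expect_column_values_to_not_be_null("user_id"),
    ge_df.expect_column_values_to_be_between("duration_seconds", min_value=0, max_value=86400),
]

for result in checks:
    print(result.success)
```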
Once you have all of your data in this storage layer, you have the orchestration, and you have metadata about it, you also need to think about, for reporting purposes or machine learning purposes, what the actual core data models are that I care about. How am I going to do what some might call master data management or semantic modeling? How am I going to make sure that the definitions that I have about some of these entities in my system are shared across all of the different downstream consumers? That is where the semantic layer comes in. Until recently, a lot of that semantic modeling happened directly in business intelligence tools, but there has been a recent shift towards headless BI or the metrics layer. And there are a number of different tools out there in the open source ecosystem. Some of the ones that I'm considering are Metriql and Cube.js, because they're both very flexible in terms of being able to use them to power data APIs and write additional logic on top of those systems.
But I'm still very early in those phases. I haven't invested a lot of time and energy in deciding on that piece of the stack until I get to the point where I have all my data in a centralized repository and have started doing some exploration of what modeling considerations I need to have and what the core entities are. Once the semantic layer is in place, then you also have different data applications that you can start to build. But those are towards the higher end of the sort of data hierarchy of needs, where before you can think about the exact data applications that I'm going to build, I need to know what data I have in order to know what I can build with it.
And so those are a lot of the core considerations and thoughts that I have around the data architecture elements of it. And then as far as implementation, I'm largely starting greenfield, and there are a number of unknown unknowns as we start to explore the data and scale the capacity of the platform. And so, in order to keep from getting lost in a sea of decisions and end up building, you know, an incredibly sophisticated platform that nobody actually cares about and that doesn't provide any value, I'm starting with a specific end user need for data that I know that I have, because that will give me a way to start with the implementation, explore a lot of these unknown unknowns, and start to discover the edge cases and the problems with some of the choices that I made early on that might become exacerbated as we scale out. And so I'm selecting a single, focused end user need and saying, okay, this is the actual initial data application that I need to provide.
That gives me a way to say, okay, these are the data sources that I need to collect. I don't need to collect everything from everywhere. I just need to collect these pieces of information into a centralized location, figure out the modeling and semantics of that data so that I can then provide it to an end user, validate whether my choices from an architecture and infrastructure perspective still hold, and see how well I'm able to provide a self-service interface for a data analyst or a data scientist to work with that data.
And then based on the feedback that I get from the end users, from myself and my team, and just the overall experience of working through that problem, I can then say, okay, these are the pieces that did work well and that I'm definitely going to stick with and invest more in, or these are the pieces that didn't work well, and I actually had to replace them with something else or add in an additional component that I hadn't considered upfront. And so that gives me a way to mitigate some of the potential risks fairly early by doing an initial research spike: not scaling to production capacity, but just doing a very narrowly scoped, you know, how does this work from end to end, and what are some of the limitations in terms of actual integrations across these different tools that I didn't know about until I really started to get into the guts of it, work with it, and use it for something that an end user is going to rely on.
And so, there's a lot of stuff that I've learned from this podcast, as far as the considerations to be made, the availability of different tools, some of the ways that very sophisticated organizations are working with data. But the other thing to consider as you're going through your own journey is, what are your own specific constraints? What are your own specific needs and capabilities as it pertains to these different tools? And so it's valuable to not just buy into whatever a vendor is pitching, but think about how is it going to fit with the rest of my infrastructure, with my team, with my own experiences.
And so, I definitely appreciate everybody listening. This is the second time I've done a monologue for the podcast. It's an interesting way for me to try to organize my thoughts on the matter. So I appreciate any feedback that you have as to whether you found this valuable, if there are any additional questions that you have about my own thinking and experiences, or if you have any thoughts that you would like to share that maybe you think would be worthy of an episode to discuss from your own experiences. I'm definitely grateful for everybody who listens to this podcast every week.
And so for anybody who does want to get in touch, provide feedback, follow up, or make suggestions, I'll add my contact information to the show notes. And in terms of the final question of what I see as being the biggest gap in the tooling or technology that's available for data management today: in the last monologue, I focused on the lack of first class support in application frameworks for providing data extraction and integration capabilities out of the box. This time, given the focus on architecting a data platform, I'd say that one of the biggest gaps is in widely available and popularized information about how to actually navigate this expanding landscape of the data ecosystem.
There are definitely a lot of valuable resources out there and a lot of personal experiences of building some of these systems, but it's definitely still difficult to be able to say, okay, these are my specific circumstances, what is my best option for building a data stack? There are, of course, opinionated starting points: there are things like AWS Lake Formation, Google has their own opinionated approaches, different vendors will have their own end to end solutions, and Databricks has Delta Lake. But it's always valuable to think about how each of those fits with your own understanding and your own requirements. And so I still think that there is a bit of a gap in more vendor and platform neutral ways to think about your end to end approach to data infrastructure.
So I definitely appreciate all of you listening. I'm glad I was able to share some of my thinking on the matter. Please provide feedback if you have found it helpful. And thank you, and have a good rest of your day.
[00:39:35] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Background
Initial Data Platform Implementation
Components of a Data Platform
Data Integration Strategies
Data Storage Considerations
Data Orchestration Layer
Metadata Management
Data Quality and Observability
Core Data Models and Semantic Layer
Implementation Strategy
Conclusion and Feedback Request