Summary
Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility by their nature as passive storage systems. To level up their value, a new trend of active metadata is emerging, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance. In this episode Prukalpa Sankar joins the show to talk about the work she and her team at Atlan are doing to push this capability into the mainstream.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- Your host is Tobias Macey and today I’m interviewing Prukalpa Sankar about how data platforms can benefit from the idea of "active metadata" and the work that she and her team at Atlan are doing to make it a reality
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what "active metadata" is and how it differs from the current approaches to metadata systems?
- What are some of the use cases that "active metadata" can enable for data producers and consumers?
- What are the points of friction that those users encounter in the current formulation of metadata systems?
- Central metadata systems/data catalogs came about as a solution to the challenge of integrating every data tool with every other data tool, giving a single place to integrate. What are the lessons that are being learned from the "modern data stack" that can be applied to centralized metadata?
- Can you describe the approach that you are taking at Atlan to enable the adoption of "active metadata"?
- What are the architectural capabilities that you had to build to power the outbound traffic flows?
- How are you addressing the N x M integration problem for pushing metadata into the necessary contexts at Atlan?
- What are the interfaces that are necessary for receiving systems to be able to make use of the metadata that is being delivered?
- How does the type/category of metadata impact the type of integration that is necessary?
- What are some of the automation possibilities that metadata activation offers for data teams?
- What are the cases where you still need a human in the loop?
- What are the most interesting, innovative, or unexpected ways that you have seen active metadata capabilities used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on activating metadata for your users?
- When is an active approach to metadata the wrong choice?
- What do you have planned for the future of Atlan and active metadata?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Atlan
- What is Active Metadata?
- Segment
- Zapier
- ArgoCD
- Kubernetes
- Wix
- AWS Lambda
- Modern Data Culture Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey. And today, I'm welcoming back Prukalpa Sankar to talk about how data platforms can benefit from the idea of active metadata and the work that she and her team at Atlan are doing to make it a reality. So Prukalpa, can you start by introducing yourself?
[00:01:29] Unknown:
Hi, Tobias. Great to be back on the show. So, yeah, my background is I have been a data practitioner my whole life. Prior to this, I did a lot of work in the data science for social good space. My cofounder, Varun, and I founded a company called SocialCops. Basically, we were like, you know, there are these large-scale problems like national healthcare and poverty alleviation, and they don't use data. And it felt like they should use data, so we basically said, let's go do something about that. We ended up doing a wide variety of work where we basically became, like, the data team for our customers. So we were working with folks like the United Nations or the World Bank or the Gates Foundation.
At one point, we were dealing with data for 500,000,000 Indian citizens and billions of pixels of satellite imagery. So really, I guess, dream projects in some ways for data practitioners. But that's sort of where the dream stopped, because internally, it was not a dream at all. It was just a lot of chaos. Right? As a data leader, I feel like I have gone through every single fire drill that existed. I have had cabinet ministers call me at 8 in the morning and say, Prukalpa, the number on this dashboard is broken, and I've gone through this wild goose chase of calling my project manager, analyst, engineer, trying to figure out what went wrong. I've legit sat on the top of our terrace and cried once for 3 hours because an analyst quit on me exactly a week before the project was due. And he was the only one who knew everything about the data. And I was like, I don't know how to deliver this project to my customer anymore, and just dealt with what today I think of as the collaboration chaos that is normal in a data team's life. Right? Because data teams are so unique in the sense that they're diverse: analysts, engineers, scientists, analytics engineers, business users.
They all need to come together and collaborate effectively, but that same diversity leads to a lot of chaos. So that was our life. We hit a breaking point, realized we couldn't scale like that, and actually started building internal tooling for ourselves, and that basically made our team more agile and effective over time. And we realized that those tools could actually help data teams around the world. So, yeah, that's basically where I am today: founder of a company called Atlan. We think of ourselves as a collaboration and metadata layer in the modern data stack. So, really, how do you help these diverse tools and people work together effectively,
[00:03:40] Unknown:
to ensure that we hopefully don't have to deal with all the chaos that most data teams and practitioners sort of wake up to every morning these days. And you already gave a bit of your background on how you got into data, but I guess what is it about this space that keeps you engaged and motivated and not deciding to go and, you know, raise chickens somewhere and live a peaceful life?
[00:04:04] Unknown:
Well, that's a great question. I mean, I think a couple of things. One, I truly believe in the power of what data can do and what amazing data teams can do. I saw that firsthand. As a data team ourselves at SocialCops, for example, we powered the cabinet minister and ministry that pushed out clean cooking fuel to 80,000,000 below-poverty-line women, and it was done in a record 12 months. You know, it was the most complex program of its kind; nothing like it had ever happened before in probably the world. Right? Programs at that scale just don't get driven around the world. And I saw how data was used every single day, literally at an individual village level. Like, the cabinet minister used to basically get, these are the exact two people that you need to call today to actually solve problems on the ground, things like that. And, you know, when we built India's national data platform, it was built by an 8-member team in 12 months. It was the fastest of its kind globally. It's used by the prime minister himself. It brings together data literally down to the village level across every single development area that constituents need. Right? These were projects that people thought were impossible. And, honestly, I think they would have been impossible, except that we built these tools that helped these really diverse people, analysts, engineers, political scientists on our team, development economists, come together and collaborate effectively to make these projects successful. Even a couple of years earlier, I think we would have failed in those projects.
And so I saw the impact that that can have on the ground, and I think that's one thing that drives me a lot. The second, I think, is I just don't want, personally, any other data leader to go through what I went through. Like, I still remember the sleepless nights. For 6 months, I practically didn't leave the office because we were firefighting. Every day was a firefight. Slack messages constantly, the amount of chaos that you're dealing with. And sometimes the trust breaking. Right? As a data leader, the people on the other side that you're doing this for are the business. And I think we all truly believe at some level that data will help drive better outcomes. And when a number is broken and you don't know how to answer that question for your business, it's not just agility that breaks, but trust.
The trust that they have in you, the trust that they have in data, those kinds of things break. Right? And just being in a helpless position when you saw those things happening, and then realizing that you don't wanna ever be in that position again. Today, hopefully making sure that no other data leader is ever in that position again is what drives me, and I think also, broadly, our founding team at Atlan.
[00:06:44] Unknown:
On that note of trying to improve the visibility and communication capacity for these different data teams and producers and consumers of data, I'm wondering if you can just start by describing, going back to the introduction of what we're talking about today, what this idea of active metadata is and how it differs from the current approach that most people are taking to their metadata systems.
[00:07:09] Unknown:
Let's talk about traditional metadata management to begin with. Right? If you think about traditional metadata management for the last couple of decades, we've basically said, let's take metadata from a bunch of tools, and we'll go put it in this one tool. And we'll call it the data catalog, or we'll call it the data governance tool. Now the problem with this approach is that it tries to solve a silos problem by introducing one more silo to the stack. So if you think about this from an end user perspective, as an end user, if I'm in a dashboard and I'm like, that number doesn't look right, can I trust this? The last thing I wanna do is actually jump into this new tool and search for that dashboard and figure out if I can trust it or not. Right? As an end user, I want context where I am and when I need it. So, traditionally, these have been very passive, siloed ecosystems, and that's really where there's this new approach, which we think of as the third generation of metadata, centered around this idea of active metadata, which, as the name says, is really about action-oriented metadata. Right? So how do you take this metadata and bring it back into the daily workflows of teams? Be it, for example, when you're directly in your BI tool and you ask, can I trust this dashboard? You should know if you can trust this dashboard through all the context that comes from the pipeline, the metrics, the documentation. Everything is available right there in that BI tool. Or, let's say, for example, we have customers who use Slack. They literally have Slack channels called "data gurus", where if you search for "does anyone know", there'll be hundreds of questions that show up in that channel. Right? So be it in Slack, be it in Jira, how do you really personalize and bring that context back into the daily workflows of teams? But also the daily workflows of tools. So how do you actually improve the tools in the modern data stack itself with the context that comes from all the other tools in the ecosystem?
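To make the Slack idea concrete, here is a minimal sketch of such a bot using the slack_bolt SDK. The lookup_definition helper and the glossary behind it are hypothetical stand-ins for a real metadata store, not any actual Atlan API.

```python
# A toy "data gurus" bot: answer "does anyone know" questions from the
# metadata store instead of waiting for a human to reply.
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

def lookup_definition(text):
    # Placeholder: query your metadata store / business glossary here.
    glossary = {"arr": "ARR = sum of active subscription value, measured monthly."}
    return next((v for k, v in glossary.items() if k in text.lower()), None)

@app.message("does anyone know")
def answer_from_metadata(message, say):
    definition = lookup_definition(message["text"])
    if definition:
        say(f"From the metadata store: {definition}")

if __name__ == "__main__":
    app.start(port=3000)
```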
[00:09:11] Unknown:
And as far as the capabilities and workflows and use cases that this active approach to metadata can provide, what are some of the, I guess, notable examples of ways that people might start incorporating these more active approaches to using that metadata to power the work that they're going to be doing anyway or to reduce some of the points of friction that they are experiencing in the current approach to how metadata is collected and provided?
[00:09:40] Unknown:
Yeah. So I basically think of it in two kinds of workflows. Right? If you think about the problem in this ecosystem, there are humans, a diversity of humans: analysts, engineers, scientists, business. And then there are tools, and there's a diversity of tools: warehouses, BI tools, query tools. There's a whole set. And the intermesh of this diversity causes chaos. So I think of the use cases of active metadata in two buckets. The first is the humans. How do you enrich the experiences that humans are having on a regular basis? The way I see this is, for example, a use case we see a lot is enriching the user experience in BI tools themselves. So this could be, for example, a Chrome extension that sits on top of your dashboards and tells you things like: who owns this dashboard, was the pipeline that's connected to it updated today or not, is it verified or not, what does this metric mean, how do we measure ARR, available right there in the dashboard itself or the widgets in the dashboard itself.
Second, it could be, again going back to that human approach, let's say a Slackbot. So if someone asks, does anyone know how we measure ARR? The bot basically brings in context from everywhere, including previous Slack messages, and says, here's how we measure ARR. Right? So those are some examples of that human approach. The other is the tooling-based approach. If you actually have metadata across your entire ecosystem, there's a ton of use cases that open up. Let's say, for example, cost optimization: allocating compute resources dynamically. If you know that 90% of your users log in to your BI tools at 10 AM on a Monday morning, you could use that to automatically scale up your compute clusters, and you could auto-update your data pipelines at 9:45,
and you can scale down everything a couple of hours later when people log out. Purging stale or unused assets. Right? If you know that an asset does not get used on a regular basis, you could purge it or archive it, or you could make sure that it's not updated regularly so that you save on pipeline orchestration. So things like that, to even examples like the data mesh, which talks about this concept of programmatic governance. As you go from the sort of centralized approach to more like data products in their own domains, you could start looking at approaches where, if you know whether the usage of a particular data product is great or not, you could auto-send a Slack message to the data product owner who owns that data product and say, hey, you're doing a great job, your data products are being used a lot by end users. It's positive reinforcement.
But there could also be the negative reinforcement. You could unpublish it from the central data mesh, and then send a message to the data product owner saying, you should probably be improving your data product shipping standards or documentation to actually help drive end usage. Right? So those are the kinds of things that you could technically do if you bring together all the kinds of metadata: usage metadata, which is, you know, who's using your data products; pipeline metadata, which is, was it updated or not, when, by whom; context metadata, which is what a column name actually means, or descriptions; lineage, which auto-connects all of these pieces. When all of that metadata really comes together, there's a whole host of use cases across security, cost, and observability that can be driven, again, going back to helping those humans and tools operate in controlled chaos in some ways.
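As a concrete sketch of the scheduling idea above: assuming a hypothetical metadata_client that exposes historical login counts, and a warehouse handle with a resize method (both invented for illustration), the automation could look roughly like this.

```python
from datetime import datetime, timezone

def maybe_resize_warehouse(metadata_client, warehouse):
    """Scale the warehouse based on historical login patterns (sketch)."""
    now = datetime.now(timezone.utc)
    # Usage metadata: how many BI users typically log in at this hour/weekday.
    expected_logins = metadata_client.typical_logins(
        hour=now.hour, weekday=now.weekday()
    )
    if expected_logins > 100:
        warehouse.resize("LARGE")   # scale up ahead of the Monday 10 AM spike
    elif expected_logins < 10:
        warehouse.resize("XSMALL")  # scale down in quiet hours
```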
[00:13:19] Unknown:
So one of the things that you mentioned was that the ways that metadata systems are being built and used and thought about right now, they're effectively this additional silo for information in an organization. But they were created as a way to reduce the burden of point-to-point integrations, where if I want to understand how my Spark runs are executing in my business intelligence tool, I need to have an integration from my BI into my Spark cluster, and then if I want to understand when the last pipeline ran, I need yet another integration, and so on for every pair of tools. The data catalog or metadata system was a single point of integration where you could say, everything just needs to write an integration using this format to be able to talk to Atlan or whatever your metadata system might be, but then it all lives in this one tool. And I'm curious how you are approaching this question of: how can I now take this one centralized metadata store, similar to the way that people are using the data warehouse as this choke point for all of their organizational data, and be able to actually feed that back out into the different upstream and downstream systems, so that you can make effective use of the metadata that has been centralized in this one spot? And any of the lessons that you're learning from the approach that people are taking to the, quote, unquote, modern data stack and how they're using their data warehouses?
[00:14:42] Unknown:
I'd written this article where I said, you know, we're probably approaching a time where we need a metadata lake, similar to the way that, about 5 or 6 years ago, there was the data lake, data warehouse, data lakehouse, whatever you decide to call it. Right? The reality is that metadata itself is starting to approach, in some ways, the same paradigms that big data was in about a decade ago. There's so much metadata getting created as a part of every single tool and interaction.
Traditionally, a lot of these systems just collected technical metadata. But the Slack conversation that you're having about a particular data asset is also metadata in some ways; it's social metadata. There's a lot of operational metadata today which is starting to get exposed. One of the things that we're seeing in the modern data stack is that a lot of the tools are starting to open up metadata a lot more, because people are starting to realize the value that opening up metadata can provide. So metadata itself is starting to approach that paradigm of data, and there's a lot of intelligence that can be created on top of this metadata.
So let's say, for example, you could automatically parse through your SQL query history to auto-construct lineage, almost at a column level and a field level, across your entire ecosystem, which generates one more level of intelligence across your siloed metadata. Right? And so I think that's the core layer: how do you almost think of a central store of your metadata? We think of this as the metadata lake, which is built on a knowledge-graph-based ecosystem that creates all the connections between the humans, the tools, the flow of data, the lineage, all of those elements, and brings it together with semantic context. That's really where business metrics and human context get added into it. Where I think there's a lot that you can really learn from what data warehouses have done: what did cloud data warehouses do that made them insanely successful? They made it easy. It comes down to that. Why did Hadoop fail and why did cloud data warehouses succeed? Broadly, they made it easy. Easy to set up, easy to actually run, administer, maintain. So that's one core element: how do you make it really, really easy to collect metadata? In some ways, I think the solution is actually a lot closer to Segment for metadata than it is an architectural standard, you know, like push versus pull. We can keep having all those debates, but the real challenge is that customers don't wanna spend all their time setting up pipelines to bring that metadata into one place. They just want to click and get it there. That's one thing that I think we can really learn from the modern data stack: how do you make it really easy to centralize metadata? And I think we have made much more progress as an ecosystem today than we had even a couple of years ago. And I think the second element is activation. The modern data stack itself is figuring this out today, even in the data layer. That's where, for example, you're seeing companies in the data activation space, or the reverse ETL space, that are coming up, which are starting to talk about why it's important to not just have your data warehouse and your BI tools, but to take that data back into the day-to-day workflows of teams. I think the same paradigm applies at the metadata layer as well, and that's what metadata activation can really do. So I think there's a lot that we can learn from the concepts of data apps that Snowflake, for example, is championing, and the concepts of reverse ETL that are starting to become more mainstream in the data stack layer, and start applying them to the metadata layer as well.
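To illustrate the query-history-to-lineage idea, here is a minimal sketch using the open source sqlglot library; the query itself is invented, and a real system would iterate over the warehouse's actual query log.

```python
from sqlglot.lineage import lineage

# An example query, standing in for an entry from the warehouse query log.
sql = """
SELECT o.amount * fx.rate AS amount_usd
FROM raw.orders AS o
JOIN raw.fx_rates AS fx ON o.currency = fx.currency
"""

def walk(node):
    # Recursively visit a lineage node and the columns it derives from
    # (sqlglot stores those source columns in `node.downstream`).
    yield node
    for source in node.downstream:
        yield from walk(source)

# Trace where the amount_usd column ultimately comes from.
root = lineage("amount_usd", sql, dialect="snowflake")
for node in walk(root):
    print(node.name)  # e.g. amount_usd, then the source columns it maps to
```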
[00:18:05] Unknown:
And so in terms of the work that you're doing at Atlan today, what are some of the steps that you've taken to be able to promote this idea of active metadata and make it available to your customers?
[00:18:24] Unknown:
Yeah, absolutely. A couple of things that we've focused on. First, at Atlan, we've invested significantly in our fundamental platform capabilities, which we think of as the metadata orchestration layer. If you think about any of these use cases, they are basically metadata operations: it's metadata in, it's running operations on metadata, and it's metadata out. The same way that orchestration applies in your Airflow kind of ecosystem or paradigm in the data layer, metadata needs the same kind of paradigm. So one of the biggest investments we've made is, behind the scenes, a metadata orchestration layer that's powered by Kubernetes and Argo, which creates infinite scalability, but allows customers to actually build these active metadata workflows on Atlan itself. And that basically makes it really easy to run these workflows. Because what we have realized is that, if you think about even something like when to notify your downstream consumer about something, there's actually flexibility there. Almost think of what Zapier did in the SaaS world, where there were lots of different tools and Zapier just made it really easy for you to create all these workflows, and there are some pre-built Zaps that you can just pick up and use. Right? So that's the approach that we've taken, where we've realized that a one-size-fits-all approach is not going to solve the problem in its entirety.
So how do you create a platform-like approach? Second, how do you make the platform really easy? Again, metadata workflows are just like data workflows. The biggest challenge with data pipelines is not setting up the data pipeline, it's maintaining the data pipeline. That's what data engineers spend their time on. It's the same thing with metadata. So we've invested a lot in things like making it DIY to set up, and invested a lot in logs, monitoring, a UI that actually allows you to debug, automated alerting if something fails. Those kinds of features are not fancy if you look at them from the outside, but they make a huge difference in terms of being able to maintain these systems, especially at scale. So that's been the second biggest investment.
And the third has been this concept of being open by default. One of our core principles at Atlan has been that we're open by default from day 0. This means for metadata in as well as metadata out. So, for example, we actually expose our metadata via an AWS Lambda kind of function. We've seen customers basically triggering AWS Lambda functions to do a ton of super interesting use cases, which fully opens up the possibilities of what you can do. And in fact, we're learning this even from customers, and these are going back into being packages in our marketplace, hopefully for other customers and the broader community to be able to leverage. But, again, being fully open and allowing interoperability across the entire ecosystem and stack is the third approach that we've taken here.
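For a flavor of that Lambda pattern, here is a hedged sketch of a handler reacting to metadata change events. The event shape and the CLASSIFICATION_ADDED type are assumptions for illustration, not Atlan's actual payload schema.

```python
import json

def handler(event, context):
    """AWS Lambda entry point for (hypothetical) metadata change events."""
    for record in event.get("records", []):
        change = json.loads(record["body"])
        # React to a classification (e.g. "pii") being added to an asset.
        if change.get("type") == "CLASSIFICATION_ADDED":
            notify_downstream(change["asset"], change["classification"])
    return {"statusCode": 200}

def notify_downstream(asset, classification):
    # Placeholder: post to Slack, open a Jira ticket, update an ACL, etc.
    print(f"{asset} was tagged {classification}")
```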
[00:21:21] Unknown:
In terms of being able to actually support these different workflows and provide some of the monitoring and visibility into how they're working, and some of the cases where they might fail, what are some of the additional architectural components and infrastructure capacity that you've had to add in to be able to provide this capability?
[00:21:42] Unknown:
So that's why, as metadata starts approaching data in scale, things like scalability and compute start becoming really important. Right? And that's where, I think, the biggest advantage for us is that we were born in the cloud. We're not trying to take something that was built in a different era and make it work in the cloud. So we're able to truly leverage the fundamental advantage of cloud, which is elasticity of compute. Fundamentally, we are built on Kubernetes, which allows for auto-scaling.
That's been one of the biggest advantages, so customers are able to leverage that fundamentally in their metadata operations. And second, the orchestration layer, for example, is built on Argo, which leverages Kubernetes. So, again, that ability to leverage the fundamental compute elasticity that cloud gives you, through being built on something like Kubernetes, has been the biggest architectural capability that we've had to invest in. And then the second thing that we've invested in on top of that is ease. Because one of the things we have also realized is that data engineers, I mean, I remember for me, it was the hardest resource in the org, because you're always overwhelmed. You have so much to do. I've never seen an organization where there are more data engineers than there's work to do. That just never happens.
And so the second element is: how do you make life for the data engineers really, really easy? I believe that data engineering time should be spent on building new things rather than building things that already exist or can be automated. So that's why we've invested a lot in DIY setup, DIY monitoring, debugging, just those core workflows. Even testing and authentication, for example: when you're setting up a connection, testing the authentication takes a lot of debugging sometimes, because of permissions, which the IT team holds. There are these small things in the workflow that can truly make it magical, but also save tons of time for data engineers. So we're hoping that data engineering time doesn't go into the setup of metadata, which is what, today, generally in the ecosystem, you spend most of your time on: bringing metadata together and putting it in this central space. We're hoping that that time goes instead towards activation of metadata, to actually start driving end value and end use cases. That's where we would ideally like data engineers that are using Atlan to be spending their time.
[00:24:07] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free, or just get the free t-shirt for being a listener of the Data Engineering Podcast, at dataengineeringpodcast.com/rudder. And so the other challenge of being able to actually power this active metadata use case is how to actually build those integrations into the other systems that want to take advantage of the metadata. One of the examples being: I'm in my BI dashboard, and I just want to know, did this data that I'm looking at in this report pass the data quality checks? Or when was the last time that it was updated? I just want a little widget that says "last pipeline ran 3 minutes ago", rather than having to say, oh, I know that this is linked into this other data catalog, so I'm gonna go there and find out, okay, this is when this ran; oh, there was a failure, so I'm gonna go into my other tool to see what the failure was. Being able to actually propagate all of that useful information into those other systems, and just managing that N-times-M integration challenge of saying, I need to have integrations into all these different systems from all these downstream systems, and being able to power that without spending all of your time just trying to keep up with the API surface area of the different systems that you're working with?
[00:25:44] Unknown:
That's exactly the challenge. Right? Because there's a significant diversity of tooling in the stack. And sometimes there are internal tools that teams are using, and internal data quality frameworks, and things like that. So our biggest focus there has been a few fundamental principles. The first is this concept that we almost think of as a widget-like experience for custom metadata, for bringing metadata into your ecosystem. So let's say, for example, we see a lot of customers wanting to bring their metadata from dbt directly and natively into Atlan. Similarly, we have customers that bring in their pipeline context from their Airflow. Now the challenge with how most traditional data catalogs work, and this is where I think active metadata changes the paradigm, is that you treat all of this metadata broadly in a similar way. You take all this metadata, dbt metadata, Airflow metadata, business metadata, all of that, and you go put it in the same sort of view for all users.
And that makes it very overwhelming for end users. All users don't care about all that metadata. As a data engineer, I care about pipeline metadata, but my business user doesn't care at all about pipeline metadata. And so what we've done is created a concept that we call personas and purposes, which actually creates a personalized view. So think of what Wix did in the website builder, where you could create these modular widgets, and then apply that to a Netflix paradigm, where you could basically say who sees which widget in a personalized way. I mean, I guess websites do that now as well. Right? Sophisticated websites, depending on your cookies, can show you personalized views. That's been the foundation for us: make it fully modular to bring in custom metadata across ecosystems.
Take a Wix-like approach to do this, which means it's almost like DIY, like filling in a Google Form: you create your custom metadata and you connect it to that tool. That's how you bring metadata into the ecosystem, and then you can personalize this and serve it up in a personalized way. The final layer is taking this back into the tools that you're using every day. Now, some of the generalized approaches come fully out of the box in our case. For example, a Chrome integration, so that if you're accessing any BI tool, irrespective of which BI tool or what you're accessing, if you're accessing it in Chrome, you have that active metadata available directly in your browser.
Communication tools are a clear one. Right? Like Slack. Again, super deep integrations into Slack, Jira, tools like that, that are very commonplace across the data team ecosystem. And the last thing has been that, beyond that, there will always be diversity. Out of the box, every product will have its own roadmap, and we can try to make it as DIY as possible, but we will only have a certain set of integrations that we prioritize. So the final layer has been going back to making it super open for end users. That's where natively being able to support AWS Lambda comes in, so people can write their own Lambda functions to drive these active metadata use cases. From a principle perspective, the way we think about it is: fundamentally open by default, for metadata in and metadata out.
And then, on top of that, make it as DIY as possible, so that you have to write the least amount of code. And you're only writing code for the truly custom use cases, rather than the generic stuff that can be fully automated and DIY.
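As a toy illustration of the personas and purposes idea, the sketch below filters one asset's metadata down to the facets each kind of user cares about; the facet and persona names are invented for the example.

```python
# One asset's metadata, grouped into facets (names are illustrative).
ASSET = {
    "pipeline": {"last_run": "2022-06-06T09:45Z", "status": "success"},
    "business": {"definition": "ARR = sum of active subscription value"},
    "quality": {"checks_passed": 12, "checks_failed": 0},
}

# Which facets each persona sees.
PERSONAS = {
    "data_engineer": ["pipeline", "quality"],
    "business_user": ["business", "quality"],
}

def view_for(persona):
    """Return only the metadata facets relevant to the given persona."""
    return {facet: ASSET[facet] for facet in PERSONAS[persona]}

print(view_for("business_user"))  # no pipeline noise for business users
```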
[00:29:14] Unknown:
The other interesting question about making metadata available as an active resource is the question of how well the rest of the ecosystem is able to receive it. But in order for that to actually be useful, my business intelligence dashboard needs to understand how to actually process that metadata, or how to issue a request to be able to pull that in, render it, and display it in a way that's useful to the end user. And I'm wondering what you have seen as far as the availability of being able to use this contextual information in these different interfaces that people are going to, to be able to interact with the data that they are either producing or consuming?
[00:29:57] Unknown:
That's a great question. A couple of things. One of the things we're definitely seeing is that some of this has to be market-driven. That's one of the things that I've realized. Philosophically, something we believe in a lot is open platforms and interoperability. But the reality is that there are vendors that have a vested interest in locking up their ecosystems. That's just the reality of the ecosystem. So I think the solution to this has to be market-driven, which means that customers should be choosing tools on the basis of the value that things like this can start providing. And, in fact, we've seen that as being the best way of pushing the market forward: just make customers start to demand it. We're already starting to see this with the modern data stack quite a bit. Things like open APIs and interoperability are becoming a de facto standard, and we're definitely seeing customers expect this out of the tools. And I think this is a great opportunity for the newer-age modern tools to move one step ahead of legacy competitors in some ways. So we are definitely seeing a lot more of that already possible natively via APIs that are available with most of the newer-age tools in the modern data stack. I think the second layer to that is: what are the interfaces that you can use that are tool-agnostic, if possible? That's where the browser, for example, is something that is used across tools. So can you supersede the tool if you have the browser? Can there be a browser extension that supersedes any of the tools? Communication tools are another element: email, Slack, Teams, there are communication tools that can supersede the individual tool in some way. So I think about it, again, as a platform approach, and those are actually what we go after as native integrations to begin with, because that allows us to serve a larger number of use cases with a lower number of integrations. That's sort of been the second approach that we've taken there.
[00:32:07] Unknown:
The other interesting element is the type of metadata that you're working with, and some of the elements that might be present in there, and how to address the challenge of catering to the lowest common denominator to make it more widely available, versus being able to provide richer information to make it more useful in the contexts where you're able to take advantage of that. And just some of the ways that you think about the types of interfaces and the types of information that are necessary, based on how you're using the metadata, what metadata you're using, and where it's being accessed?
[00:32:38] Unknown:
We actually took a couple of approaches here. In fact, in one of our earlier versions of the product, we tried to create a standard for metadata. So, for example, if you think about the BI ecosystem, we tried to come up with our own framework of sorts for the hierarchy of metadata in BI. There's a dashboard, there's a project, there's an individual widget, and we came up with that hierarchy. And we wanted to see if we could map all the different BI tools to that common metadata hierarchy. We failed.
And we failed because every tool is so unique. Let's pick Looker. Looker has a concept called models, Looker has a concept called projects, and Looker has a concept called boards. If you really want to solve for a Looker customer, to help them get the most value out of their metadata, and you try to map that to a generic standard, a lot of the nuance of choosing Looker and leveraging Looker disappears. On the other hand, Tableau has a concept called calculated fields. It's a very, very different concept. And so, in fact, our approach has moved away from standardization to customizability.
So, for example, we actually go super deep now in all our native integrations, and we pull out as much metadata as possible that can help drive value for end users there. And what we have done from a user interface perspective is create customizability. This goes back to that same Wix widget personalization concept. If I am using Looker and dbt and Snowflake, my filters are gonna look different: they're only going to show me the Looker filters, and not the generic BI filters. And I think that has been something that has helped us actually drive value, because the value in metadata is actually in the nuance. It's not in the generic.
And so you actually should allow for going deep, and should allow for that extensibility. That has been our approach, and we're finding much better success with that.
[00:34:40] Unknown:
As far as the capabilities that are offered by automating different activities or infrastructure operations, or triggering different data pipelines based on metadata events: what are some of the capabilities that that unlocks for data teams, and some of the ways that they want to think about when and where to automate fully versus keeping a human in the loop and making that ultimate decision as to what action to take based on the metadata that's available?
[00:35:10] Unknown:
I think that depends a lot on the team itself. I think of it as a scale. It goes from fully human and manual to fully automated, and there's a sliding scale across the entire flow. So let's pick something as simple as a metadata orchestration workflow to determine who owns a particular dataset. Now, the fully automated approach to doing that is to leverage SQL query history and see who uses it the most, or who created that table, whichever, and then you automatically assign the owner. But there are nuances in that approach. Sometimes teams could say, hey, I actually want that as an input, but I want a human to finally verify. And then the final human verification is what gets added as the owner, which is a mix of a machine-led but human-verified approach. Or it could be a fully human approach: as humans, we'll go and add in who the owners of our data assets are. There is a sliding scale, and I think it depends a lot on the team and the accuracy at which the machine can actually help you get to your final output.
That nuance, I think, is what drives it. And I think there are cases where it makes sense to have a human in the loop. There are probably some decisions, like auto-scaling your data warehouse, where you don't want that to happen directly by a machine. Maybe you just want a final dashboard as an output, and then you want a data engineer to make that final determination. Versus sending a notification on Slack that there's been an upstream issue in your data pipeline, maybe that's something you want in a fully automated way. So our approach there has been to allow for that flexibility, for teams to be able to make that determination on what makes the most sense for them, based on the use cases that they're looking to do, and to allow that machine and human to work well together to actually get to the final outcome.
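Here is a hedged sketch of that machine-proposes, human-verifies ownership flow, assuming the query log has already been reduced to (user, table) pairs.

```python
from collections import Counter

def propose_owner(query_log, table):
    """Suggest an owner: the user who queries the table most often."""
    users = Counter(user for user, t in query_log if t == table)
    if not users:
        return None
    candidate, _count = users.most_common(1)[0]
    # Surfaced as a suggestion only; a human confirms before it sticks.
    return candidate

# Toy query log: (user, table) pairs extracted from query history.
log = [("ana", "orders"), ("ana", "orders"), ("ben", "orders"), ("ben", "users")]
print(propose_owner(log, "orders"))  # -> "ana", pending human verification
```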
[00:37:04] Unknown:
In terms of the ways that you've seen some of your customers and people in the community adopting and taking advantage of these active capabilities for their metadata, and being able to actually power some of the downstream workflows or provide contextual intelligence to consumers of the data, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:37:26] Unknown:
Yeah. So we still learn every day from the kinds of use cases we're seeing, and we're working closely with customers, because many of them are pioneering what active metadata looks like in the world. There's no playbook of "these are the 15 ways you can use active metadata today". Some of the things we're seeing a lot, which have been interesting: one has definitely been around emails, notifications, Slack alerts, integrations into internal communication systems. It's exactly the use case that you talked about: knowing if the pipeline was updated or not, how that's tied to your final dashboard, and having that context. So one is just notifying downstream consumers of failures that happen upstream, automatically posting an announcement on Slack, things that help the team work better together and build trust. We're definitely seeing those as use cases. We're also seeing another interesting one, which is actually helping enrich metadata itself. So let's say, for example, someone asks a question on Slack and a bot answers; you can auto-push that back into your central metadata repository to enrich the overall context in your ecosystem. Or sending out emails around the top data assets that haven't been enriched, or have been enriched. So we're seeing those as one set of use cases. The others have been hardcore data engineering type use cases. We're seeing integrations into CI/CD pipelines, and how to incorporate, for example, data quality metadata into the way that your CI/CD operates directly.
Programmatic governance is something we see. One of our customers is implementing a domain-based hierarchy for the data mesh, and implementing these automated programmatic governance, federated governance kinds of ecosystems, which leverage Jira and the communication tools that they're using on a daily basis to actually make programmatic governance happen inside their ecosystem. So those have been some of the most powerful use cases we see on a day-to-day basis. Some of the ones that I'm most excited about are still earlier: I think there's a lot that we can do around cost optimization, especially in the current economic climate. We're seeing that becoming a little bit more important than it was last quarter. And there's a security element to the kinds of things that you could do: you could define access policies at scale, things like that.
[00:39:36] Unknown:
Yeah. The questions of governance and cost control are definitely very interesting ones. The cost one: I actually just did an interview with a company whose whole business model is based around helping people optimize their spend on platforms like Snowflake, because of the fact that it's so easy to accidentally end up with a bill that's tens or hundreds of thousands of dollars. Just like with AWS, where people start spinning things up and then get surprised when they're spending hundreds of thousands of dollars on infrastructure that they're not actually getting value from, you see similar experiences in things like Snowflake or BigQuery.
And then to your point about data governance, another conversation that has been coming up a lot is: how do we actually take advantage of adding some of these tags in your metadata for the different datasets that you have, and being able to propagate that through to some of the downstream systems and consuming systems, to be able to understand what are the policies that I need to apply and enforce on the data as it exists at these different stages of its life cycle?
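As a small illustration of that tag propagation idea, here is a sketch that pushes a policy tag from a source column to everything downstream of it through a lineage graph; representing lineage as a simple adjacency map is an assumption for the example.

```python
def propagate_tag(lineage_graph, start_column, tag, tags):
    """Walk the lineage graph and apply `tag` to every downstream column.

    lineage_graph: dict mapping a column to the columns derived from it.
    tags: dict mapping a column to its set of tags (mutated in place).
    """
    stack = [start_column]
    while stack:
        column = stack.pop()
        if tag in tags.setdefault(column, set()):
            continue  # already tagged; also guards against cycles
        tags[column].add(tag)
        stack.extend(lineage_graph.get(column, []))

graph = {
    "raw.users.email": ["stg.users.email"],
    "stg.users.email": ["mart.crm.email"],
}
tags = {}
propagate_tag(graph, "raw.users.email", "pii", tags)
# mart.crm.email now carries "pii", so downstream tools can enforce masking.
```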
[00:40:48] Unknown:
Yeah. Absolutely. The cost one, the security one: I think there's a lot that using metadata the right way, and activating it, can do to drive a ton of optimization there, and to actually make those use cases work without a huge workforce of people doing manual work to make it happen.
[00:41:06] Unknown:
Definitely. And in terms of your experience of working in the space of metadata management, and starting to figure out how to make this active metadata approach available to your customers, and approachable and understandable, and thinking through some of the potential use cases, or supporting them as they're starting to do their own explorations: what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:31] Unknown:
So I think something that's been very interesting is just making sure that you're thinking problem-first rather than cool-use-case-first. For example, I personally am super excited about some of the dynamic cost optimization kinds of use cases. Those feel like the really cool, futuristic use cases, in some ways, of how active metadata can be used. But what's been really surprising is that some of the use cases that have driven a ton of value have actually been a little simpler. They're about notification systems, and about making sure that end users are getting the right email at the right time. Those kinds of use cases we've seen being more valuable in terms of actually driving impact and value from metadata. So the one major important factor for us, or something that we've learned a lot, has just been to stay very problem-focused.
Metadata is at the point where, I believe, big data was about 7 or 8 years ago, where there was all this hype about big data, but people hadn't figured out how to start driving value from it. And the thing that the modern data stack did was it made some of the architectural stuff easy, so that people could actually start driving value from data. Now I think metadata is at a similar paradigm, and I think it's never been hotter than it is today. It's much more obvious. I mean, I've been in this space for, like, a decade now. And 3 or 4 years ago, we needed to convince people as to why having a central metadata space was important. We don't need to do that anymore. People get that metadata can help drive and solve a lot of problems in your data stack. Now I think the conversation needs to move to value, and how do you start driving value. And there, making sure that you're starting with the right problems, the right scoped problems that can actually drive value for end users, and then going from there: I think that has been the biggest lesson while working with end users.
[00:43:30] Unknown:
As you have been working with customers and they start to think through the ways that they want to apply an active approach to their metadata, what are the cases where that might be the wrong choice and they're actually better served just sticking with the kind of centralized approach to how their metadata is stored and focusing their efforts on other layers of their data experience and their data platform?
[00:43:52] Unknown:
I don't think the centralized approach to metadata is ever right. But I do think there's a time at which investing in metadata becomes important. For example, if you are just setting up your data warehouse or your BI dashboards, or your end teams don't even value the BI dashboards that you're shipping, like your organization doesn't believe in data yet, or you're a very small, two-member team: those are times where maybe some of the more fundamental layers, what I would think of as the decision layer, you know, setting up your data warehouse, are more important to do. When you do start thinking about metadata, we see two approaches here. One, we see people, often at larger companies, who have gotten into the mess of metadata and realized: you know what, when I set up my data stack, I wanna set it up the right way. Which I think is actually a really good approach, where your metadata practices or your data product shipping principles get incorporated even into the way that you're building your data warehouse.
I think at that point, it's important to just set up your metadata platform the right way. Maybe you're not focusing on all the end use cases that active metadata can drive; you're focusing a lot more on metadata in, and the intelligence, the discovery, the collaboration that metadata can create for you. Then, at the time that your team starts maturing a little bit, when chaos starts becoming a real problem: a great internal metric we've seen people use is to monitor the number of questions that come in on the Slack channels, and the number of service requests that are coming to the data team. The minute that number starts becoming something that your team is spending more time on than they should be, that's when I think you really start investing in those metadata use cases. But I think you always set up the platform in a way that it's active metadata, and not a siloed, central metadata approach, because that's how you drive final end value from your metadata.
[00:45:48] Unknown:
As you continue to work in this space and try to help your customers and the overall ecosystem move toward this approach of active metadata, being more integrative in the ways that you think about the information flowing through these systems, both the actual core datasets and the information about them, what are some of the near to medium term plans that you have for Atlan's capabilities, and what are the lessons that you're hoping to promote?
[00:46:18] Unknown:
I think a couple. One is product focused and the other is actually culture focused. From a product perspective, we'll soon be opening up some of the behind the scenes packages and marketplaces to allow customers to start using packages that other customers have built and picked up. So a large focus will actually be on that DIY approach: how do you make the journey easy? How do we help the community leverage what the rest of the community is doing? That will be a huge focus as we scale. The second is a lot more around what we think of as culture enablement inside organizations. When people adopt metadata approaches or tools like Atlan, they're actually going through a bit of a culture change, where they go from almost a data service oriented mindset, just responding to requests that the rest of the team is sending them, to a more proactive approach, from active to proactive.
And that takes a bit of a mindset shift in the team, but also in the rest of the organization. How do you start thinking about your internal outputs as data products that are reusable and reproducible for the rest of the organization? How do you ensure that your end users are not messaging you on Slack but actually trying to self serve? There's a human behavior change there. If I can call somebody, I will call somebody. So it's a little bit of that human behavior change that you have to be able to drive. We think of this as DataOps enablement inside an org: how do you really help data teams adopt the practices of DataOps in their organizations? That's what we think of as culture enablement.
And that's something that we've consciously chosen to focus on, despite being a software platform. Because what we've realized is that it makes a huge difference to the success that data teams can have. If you truly want to help data teams collaborate effectively, then helping data leaders go beyond just the tool, helping them enable the best culture practices inside their org, has to be part of it, and that's been a huge focus for us.
[00:48:26] Unknown:
Are there any other aspects of this question of active metadata and the work that you're doing at Atlan and in the community to help promote it that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:44] Unknown:
I mean, I think just that there are so many use cases for metadata. I feel like we've just scratched the surface of what metadata can do in the ecosystem today. We use it for some use cases like data catalogs and discovery and governance, but there are so many more use cases that metadata can help power, and maybe help us truly get to that intelligent data management platform dream, in a way that embraces the unbundled data stack that we operate in, but still helps that unbundled data stack work together in a way that is truly intelligent. I think metadata really holds the key to that. So I'm just super excited about what the future holds.
[00:49:15] Unknown:
I definitely agree. There is a challenge in that we have all of the raw processing power to do everything that we need to do with data, and most of the challenge now is in figuring out the actual full end to end integrations, and the ways to add polish and ease of use to those systems, beyond just having to understand all of the mechanics that go into them at a deep level.
[00:49:37] Unknown:
Yeah.
[00:49:38] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:53] Unknown:
I actually have a contrarian view here. I don't think the biggest gap is the tooling or technology. I actually think it's the culture stack. We've made a lot of progress from a tooling and technology perspective in the last 4 or 5 years. I wrote a blog post about this, arguing that it's time we move to the modern data culture stack and stop talking only about the modern data stack. Hopefully, the data stack becomes click-and-we're-live. There are so many great tools available today. Obviously, there are things we can do better, and we can keep improving, and I'm obviously biased to think that active metadata is the next big thing. But the reality, I think, for data leaders is that we're still trying to figure out what it takes to make an amazing data team function well together, and how our data team functions within the rest of the organization.
And I think of this a lot as the culture stack. How do you build a great culture inside your teams? How do you think about how different roles communicate with each other? How do we think about growth paths for each of these different roles? How do we think about how the data team communicates with the rest of the organization? I personally believe we actually need roles like DataOps enablement, similar to what sales ops and sales enablement do in sales teams. We need those kinds of roles in our data teams to truly help us drive value from the technology, and help people work together with the technology, to finally help us become more data driven. That's the objective.
[00:51:23] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on this question of active metadata and the work that you and your team at Atlan are doing to make it more of a reality and more accessible to your customers and the community at large. I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
- Introduction and Survey Insights
- Guest Introduction: Prukalpa Sankar
- The Power and Challenges of Data
- Traditional vs. Active Metadata
- Capabilities and Use Cases of Active Metadata
- Atlan's Approach to Active Metadata
- Integration Challenges and Solutions
- Types of Metadata and Customizability
- Automation and Human Interaction
- Customer Use Cases and Innovations
- When to Invest in Metadata
- Future Plans and Cultural Shifts
- Closing Thoughts on Active Metadata