Summary
The binding element of all data work is the metadata graph generated by the workflows that produce the assets used by teams across the organization. The DataHub project was created to bring order to the scale of LinkedIn's data needs. It was also designed to work for small-scale systems that are just starting to develop in complexity. To support the project and make it even easier to use for organizations of every size, Shirshanka Das and Swaroop Jagadish founded Acryl Data. In this episode they discuss the recent work that has been done by the community, how their work is building on top of that foundation, and how you can get started with DataHub to manage data discovery for your own work today. They also share their ambitions for the near future of adding data observability and data quality management features.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Your host is Tobias Macey and today I’m interviewing Shirshanka Das and Swaroop Jagadish about Acryl Data, the company driving the open source metadata project DataHub for powering data discovery, data observability and federated data governance.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Acryl Data is and the story behind it?
- How has your experience of building and running DataHub at LinkedIn informed your product direction for Acryl?
- What are some lessons that your co-founder Swaroop has contributed from his experience at Airbnb?
- The data catalog/discovery/quality market has become very active over the past year. What is your perspective on the market, and what are the gaps that are not yet being addressed?
- How does the focus of Acryl compare to what the team at Metaphor are building?
- How has the DataHub project changed in the past year with more companies outside of LinkedIn using and contributing to it?
- What are your plans for Data Observability?
- Can you describe the system architecture that you have built at Acryl?
- What are the convenience features that you are building to augment the capabilities and integration process for DataHub?
- What are some typical workflows that data teams build out when working with Acryl?
- What are some examples of automated actions that can be triggered from metadata changes?
- What are the available events that can be used to trigger actions?
- What are some of the challenges that teams are facing when integrating metadata management and analysis into their data workflows?
- What are your thoughts on the potential for the OpenLineage and OpenMetadata projects?
- How is the governance of DataHub being managed?
- What are the most interesting, innovative, or unexpected ways that you have seen Acryl/DataHub used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Acryl/DataHub?
- When is Acryl the wrong choice?
- What do you have planned for the future of Acryl?
Contact Info
- Shirshanka
- @shirshanka on Twitter
- shirshanka on GitHub
- Swaroop
- @arudis on Twitter
- swaroopjagadish on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Acryl Data
- DataHub
- Hudi
- Iceberg
- Delta Lake
- Apache Gobblin
- Airflow
- Superset
- Collibra
- Alation
- Strata Conference Presentation
- Acryl/DataHub Ingestion Framework
- Joe Hellerstein
- Trifacta
- DataHub Roadmap
- Data Mesh
- OpenLineage
- OpenMetadata
- Egeria Open Metadata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey. And today, I'm interviewing Shirshanka Das and Swaroop Jagadish about Acryl Data, the company driving the open source metadata project DataHub for powering data discovery, data observability, and federated data governance. So, Shirshanka, can you start by introducing yourself? Hi, Tobias. I'm Shirshanka,
[00:01:47] Unknown:
CEO of Acryl Data. And prior to this, I spent a decade at LinkedIn as tech lead of the big data team and founded DataHub as part of that journey. Big fan of the great work that you're doing here with this podcast and really happy to be here. Hey, Tobias. Great to be here talking to you.
[00:02:04] Unknown:
I am the CTO of Acryl Data. Prior to this, I was at Airbnb leading the data platform and the search infrastructure teams.
[00:02:12] Unknown:
Really awesome to be here and talking to you. Likewise. And going back to you, Shirshanka, do you remember how you first got involved in data management? Maybe since the first moment I tried to organize my movie library. I think, you know, this data management phrase itself is kind of overloaded. Right? The definition of what data management means can depend on who you ask. If you talk to data scientists and analysts, they would say, you know, data management should be about making data easy to access. And if you talk to a data infrastructure person, which I've been for a long time, it's all about indexing and storing data in the right way and making it performant. Is Parquet the right format, or is ORC the right format, or is there a new columnar storage format that we should be using? And we debate about Hudi versus Iceberg versus Delta. Right? But then if you talk to the governance team, they will say, well, what we care about is making sure that data is being handled correctly and that we are not retaining data for longer than necessary. So data management, I think, is all of this stuff. That's what we are finding out. Like I said, when I was tech lead of the big data team at LinkedIn, I was wearing multiple hats. My team was responsible for managing 100-plus petabytes of data at rest, including ingestion from our upstream fire hose of Kafka as well as our online databases.
And I built Gobblin as part of that to basically ingest all of this data from these different sources, land it into our data lake, and then support multi-cloud replication, multi-cluster replication from these on-prem clusters to Azure and back. And then later, as I was tapped to be the tech lead for GDPR at LinkedIn, I ended up having to rethink metadata management and data management as part of all of that. I ended up building DataHub and drove Gobblin to manage data through metadata that was stored in DataHub. So whether that's data deletion or data export or data anonymization, all of these things have been at the center of everything that I've had to solve in my past job.
Data democracy was a big part of it, of course. You know, when I joined the big data team, I had just come off of a stint building an online data store for LinkedIn; Swaroop and I actually built an online database together. The first thing I noticed was that people were just asking around for datasets and not really knowing where things were, and data democracy was kind of the first rallying cry we had. And the first version of DataHub was built literally to solve that problem, making it easy for people to get access to data. And then the next evolutions happened to enable governance and end-to-end observability.
[00:04:54] Unknown:
And, Swaroop, do you remember how you got involved in data management? Yeah. Very similar journey for me at Airbnb. You know, when I got to Airbnb, the data team had to be sort of rebuilt. I had to hire the best engineers and grow the team from 5 to 60 people and lay the groundwork for data democracy. Right? Like Shirshanka, that was the first problem everyone faced, just being able to find data efficiently. So we started with that problem. During this time, we built Airflow, Superset, and the Airbnb Dataportal, which, of course, is widely cited in the area of data discovery.
I was the lead for the team when all these products were built out, and I saw the growth of usage for all these products. Right? So living through that hypergrowth phase and seeing data usage take off at Airbnb at the most critical time of the company was a huge thing. But then the pandemic happened later on and, you know, Airbnb's traffic nosedived and we had a different problem on our hands. We had to very quickly get control over our cloud bills. I was the overall cloud cost efficiency lead for Airbnb, and it was a challenge to understand what workloads are critical, what workloads can be turned off, just the dependencies, and being able to get a clear view of all this through the right metadata substrate is actually a huge challenge. So I lived through that journey of actually building the right metadata substrate and getting our cloud costs under control.
Yeah. And also related to that is when the traffic did come back and we went public, I saw the challenges related to data quality and compliance firsthand when we were getting ready to report our metrics
[00:06:41] Unknown:
to the public for the first time. I think nothing quite, you know, prepares you for building a data management startup like actually being on the front lines and being on call for mission critical data infrastructure. I was on call for LinkedIn's change data capture system, Databus, and then later on the source-of-truth database, Espresso.
[00:07:01] Unknown:
And Swaroop has had a similar on-call journey in his life. Yeah. I mean, same deal for me. I was on call for Yahoo's multibillion dollar search advertising pipelines. You know, people didn't call it data observability back then, but we faced the same challenges around, you know, data quality and how to deal with source-of-truth databases. I was on call for LinkedIn's source-of-truth database and Airbnb's data platform. Countless holidays and weekends were spent debugging mysterious data issues. So we both really understand the mission critical nature of data, and we have a lot of empathy for people on the front lines because we have lived through that life ourselves.
[00:07:44] Unknown:
Yeah. As somebody who has spent more time than I care to think about on call, it's definitely interesting the types of things that will go wrong when you're not expecting it. Absolutely. And so that brings us to what you're doing now with Acryl Data, where you've both been involved in the DataHub project, and you're building the business around it. And I'm wondering if you can just discuss a bit about what it is that you have created with Acryl Data and some of the story behind how you decided that this was where you wanted to spend your time and energy.
[00:08:12] Unknown:
Yeah. I mean, Acryl Data is the company driving the open source DataHub project forward. We're doing it in collaboration with LinkedIn, obviously. LinkedIn is an investor in us. And in many ways, we are following, you know, the Kafka-Confluent path. Acryl Data offers DataHub as a SaaS product, which is currently in private beta. We believe that data driven organizations need a reimagined, developer friendly data catalog because the diversity and scale of the modern data stack just demands it. As you probably well know, the main incumbents in this space are, you know, Collibra and Alation, which in our opinion are struggling to keep up with the fact that data is constantly changing and the tools in the stack are constantly changing.
And we really believe that metadata is not just about serving humans, but also about serving systems. In terms of, like, the main use cases we see, it's data discovery, data observability, and data governance. But our approach, I would say, is fundamentally different in terms of both the project as well as how we're going about it as a company. We are a stream-first metadata platform. We believe in a DataOps, developer-led way of doing federated governance. And we're building a highly scalable and sophisticated platform for tackling the challenges of data quality, which, as Swaroop said, is being branded as data observability these days, but it's really the quality of data that's the underlying problem. And a lot of people ask us, what's the story behind the name? Why did you call it Acryl Data? And how do you pronounce it? And, you know, it's actually an interesting story. This was end of 2020. We were discussing what we wanted to really believe in, what were the things we were finding that are missing in data driven enterprises.
And we really wanted to convey the notion of safety and clarity. Remember, this was 2020, and plexiglass was the material of the year. Right? This is the same material that was allowing our medical professionals to see clearly and treat patients while staying protected from the virus. And while it's not well known, the material used to make plexiglass is actually acrylic, the same thing that you normally associate with color and a lot of, you know, vibrancy. And acryl happens to be the family of chemical compounds that includes acrylic. And so that's how we came up with our name. And our vision is really to bring
[00:10:31] Unknown:
clarity to data and to do it in the most colorful way possible. As they say in computer science, there are two hard problems: naming things, cache invalidation, and off-by-one errors. And naming is definitely the one that always kind of rears its head most frequently. So it's always interesting to hear some of the stories behind how people come up with a particular name, particularly for something as important as their business, and I think that's definitely a very well-thought-through name. And so in terms of the experience that you've had of building and running DataHub, I'm wondering how that has informed the product's direction and the particular areas of focus that you are bringing to Acryl, particularly given the early stage of the company.
[00:11:12] Unknown:
Yeah. As I mentioned earlier, when I was at LinkedIn, I was responsible for both the data democracy hat, you know, making data extremely easy to access and use, as well as the data privacy and data governance hat, making sure that data is not used in the wrong way and doesn't fall into the wrong hands. And this influenced how I shaped the evolution of DataHub at LinkedIn. It started out as a search and discovery product and then evolved towards adding on governance and compliance capabilities. I actually talked about it at length in a presentation I did at Strata in 2019. It's a very interesting paradox. On the one hand, you have to make data easy to use. On the other hand, you have to keep data from being accessed by the wrong people. Can you actually do it with one system? And my realization really was that a platform like DataHub needs to be built to address both sides of this coin to enable companies to establish excellent data practices.
But what I noticed was that building a system for one single company and even open sourcing it, you know, is just one thing. It's one step in a journey. And making it actually work at scale to address the wide variety of integrations, the wide variety of deployment scenarios, the different ways in which people are doing data in their own companies, that's a whole different ballgame. In fact, I'd like to, you know, say that DataHub actually has gone through quite a metamorphosis after I started Acryl and took stock of what was needed to make this project actually achieve its potential. Since we started, we've actually rewritten major parts of the project. Like, we completely rewrote the UI in React, and we made it much more delightful to use and access.
We built a brand new ingestion framework that is super easy to use and extend. And we've gone to extreme lengths to make sure that the quick start experience is delightful. In fact, if you're up for it,
[00:13:10] Unknown:
let's take a risk and try to showcase this live. Would you be willing to go through that? Yeah. So, actually, while we were talking and getting up to this point, I just finished running through the quick start. And so I was just poking around in the UI and looking at some of the sample datasets and looking at the lineage view, and it's definitely
[00:13:26] Unknown:
very easy. So I ran it in maybe about 5 minutes, and it actually came up cleanly without breaking. So well done on that. That's amazing. And that's exactly the kind of feedback we hear from folks. You know, we have obviously also got a demo that's publicly accessible at, you know, demo.datahubproject.io. And you probably navigated to the quick start also from that same home page. And, you know, one of the things we hear often is this common misconception that open source projects are just not going to be easy to install or operate. And we've heard repeatedly from people in our community that DataHub just works out of the box, and they're able to get to use cases and value so much faster than they could with any other technology out there. And so that's really been kind of the learnings from LinkedIn that we've applied to the product direction at Acryl. But, honestly, a lot of it is working with the community and hearing from them what their pain points are and then solving for it. There's quite a bit of gold out there in our community, and we are able to work really effectively with them and make changes to the product. And that's, I would say, something that I'm really enjoying as part of this journey.
[00:14:38] Unknown:
And, Swaroop, in terms of your experience of running the data organization at Airbnb and working with Shirshanka as you've been building up the company, I'm wondering what are some of the lessons that you've brought from Airbnb, and from bringing to the market some of the projects that people are familiar with, like Airflow and Superset, into the work that you're doing at Acryl and how you're focusing your efforts.
[00:15:01] Unknown:
Yeah. Airbnb, as you probably know, is famously a design-led company. To this day, Brian Chesky obsesses over every pixel on the site. And I've been in app reviews with him where there's just so much detail that he gets into. I really, really admired that kind of attention to detail over the years. Back in the day, maybe not so much, but now I appreciate why it's so important. It just needs to be ingrained as part of your DNA to, you know, obsess over these details. So one of my insights while growing the Airbnb Dataportal from 0 to 2,000-plus weekly active users was that this attention to detail and design actually makes a big difference in creating trust in the product.
And also trust in data itself. It's not just about how pretty the product looks. Making it super intuitive to use and being able to cut through the noise and answer key questions very easily is actually critical. Right? So, you know, similar to Shirshanka, of course, I also looked at all the same use cases, data discovery, you know, data quality, governance, and so on. But this design aspect is something that I'm really kind of bringing to the product at Acryl. The other thing is, when I went on to lead search infrastructure for Airbnb, I led the creation of Airbnb's knowledge graph, which is also called the travel graph. And I appreciated what it takes to go beyond collecting raw information to turn that raw information into higher level understanding.
For example, we collected all kinds of signals about destinations, and then we infer whether a given destination is romantic, whether a given neighborhood in a city is family friendly, and so on. Right? The same techniques can be applied to the metadata graph. How do you go beyond collecting raw signals and get to higher level insights? Classifications of data: is this a sales related dataset? Is this a marketing dataset? What is the data quality beyond just the raw signals? Right? What is the categorization you can assign to a given dataset based on its history? So these types of techniques go a long way in going beyond just a raw metadata graph.
And the other thing is, you know, doing cloud cost efficiency, as I mentioned before, really drove home this metadata substrate piece. It's not enough to just build an app. You need to build a substrate which is able to bind together several tools and actually achieve those end to end use cases. And we'll dig into this a lot more, but that's another thing that I really kind of want to bring into the product at Acryl.
[00:17:46] Unknown:
To your point, Shirshanka, about the data integration piece for DataHub, I'm wondering what are some of the lessons that you've learned from your work on Gobblin and the framework there to sort of simplify and streamline the overall process of being able to bring in the metadata, organize it cleanly, and present it meaningfully.
[00:18:06] Unknown:
Yeah. I think, in fact, the Python-based ingestion framework that we designed for DataHub recently was actually inspired by Gobblin. We took a lot of the same ideas around, you know, sources and sinks and extractors and, you know, transformers and applied them to how we architected this specific ingestion framework. And it has actually led to a huge increase in the accessibility of integrating sources with DataHub. Prior to that, a lot of people would say, oh, great. You have a push API, but how do I publish to it? Can you show me how to do it? Right? I've got a bunch of these systems. How do I connect all these systems into DataHub? It's great that I can stream metadata into it, but how? Teach me. And so this ingestion framework really helps you get started and gives you guardrails: let's define a data source, let's define how you get metadata from that data source, let's define how you partition up work for ingesting metadata from these data sources. And then if you want to fix up metadata as you're ingesting it, kind of lightweight ETL on the way in, how do you do it? And so those things have really let us scale out DataHub adoption quite a bit in lots of different companies.
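To make that concrete, here is a minimal sketch of driving the ingestion framework from Python rather than the CLI. It is a sketch under stated assumptions: the connector name, config keys, and API shown follow recent open source acryl-datahub releases and may differ by version, and the connection details are hypothetical.

```python
# A minimal sketch of the DataHub Python ingestion framework: a source
# (here, a hypothetical MySQL database) is extracted and pushed to a sink
# (a local DataHub REST endpoint). Config keys follow recent acryl-datahub
# releases and may vary by version.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",  # hypothetical connection details
                "database": "analytics",
                "username": "datahub",
                "password": "datahub",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)

pipeline.run()                # extract metadata and emit it to DataHub
pipeline.raise_from_status()  # fail loudly if the run reported errors
```

The same recipe is more commonly written as YAML and handed to the `datahub ingest` CLI; the programmatic form here is just the shortest self-contained illustration.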
[00:19:15] Unknown:
Another interesting aspect of just the space that you're working in right now of data cataloging, discovery, quality, observability, governance, you know, whatever label you want to apply to it. And, you know, different people with different areas of focus might apply different labels. But it's become a very active segment of the market, particularly over the past year or 2. And so you're definitely coming in at a point where everything is on the upswing. And I'm wondering, what are your perspectives on that overall market and the particular gaps that you're targeting with what you're working on at Acryl and some of these sort of opportunities for innovation or, you know, potential eventual consolidation?
[00:19:54] Unknown:
Yeah. I actually think it's great news that companies are recognizing that the modern data stack really needs good metadata management. You know, Joe Hellerstein and I first started discussing this problem back in 2016, and that led to the Ground research project and the paper that's widely cited in modern metadata management work. Joe, as you all know, is a professor of computer science at UC Berkeley, a legend in the field of data management, and also founded Trifacta. And having been a proponent of, you know, next generation metadata needs for a while and creating mindshare, I'm really happy that the space is taking off in a big way. Because in the past, metadata was just a word that you tossed around, but you didn't quite know what to do with it.
And even now, I think there are still different segments of the market, and you're absolutely right, there are all these different labels being attached. They're seeing this space very differently, and I think the space will mature into something that doesn't require the customer to buy lots of disconnected tools to handle discovery, quality, governance, compliance, and all the related use cases. But at the same time, we've seen that existing products like, you know, Collibra and Alation are also struggling a bit to keep up with the pace of innovation. On the other hand, we're seeing challenges with hyper specialization. A lot of our customers say, we've got a gazillion tools, and we're still trying hard to connect the dots between data discovery, data collaboration, and data quality tools.
A very simple example is ownership. Right? Every single tool has its own definition of ownership, and yet a company as a whole will struggle to answer the simplest of questions: I've got a data asset right here in front of me, and I want to talk to someone who knows about this thing. Who should I talk to? Right? And so we think that there's a way to support specialization. Specialization is not evil, but hyper specialization can be. Right? So we think that there's a way to support specialization while still ensuring a common substrate. We are starting to call it the metadata fabric in our internal conversations. On top of that, multiple use cases can be built well. The other thing I've seen that's missing often is a focus on data producers.
A lot of catalogs and data management systems have started with just embracing the mess and saying, well, it's all bad, so let's start with just enabling consumers. Let's start with enabling consumers to do whatever they can to make sense of it all. But I think the time has come for us to start shifting left, to integrating into workflows that are in the operational fabric so that metadata is not applied after the fact, but is actually applied in context. When a developer is in their dbt model and they're making changes to it, well, add metadata to it right there. When you're about to check in an Avro schema for a Kafka topic, well, let's provide additional metadata about that schema right there. So we really think about the metadata platform as something that allows metadata in all shapes and sizes to be modeled, produced, and consumed without loss of any consistency or fidelity.
And there's a bunch of hard problems snuck into that simple statement. Right? You have to be stream-first, both in and out, because otherwise you cannot be live. You cannot be operational. You're always behind. And that means having amazing APIs, REST, GraphQL, stream-first, but also supporting extensible modeling. Metadata is actually not a global thing. Every company has its own little concepts for what it considers an important entity in its metadata graph. So you need to be able to add your own concepts, to push metadata in, but also to crawl it if you have to. Right? Supporting both kinds of integration. And then being able to deploy this at scale so that you're not limited by just one janky MySQL database that is sitting in the back. If you, you know, turn your fire hose of operational audit events onto that thing, it's gonna croak and die. Right? You need to be able to throw all of the metadata that you have in your enterprise at the system, and it needs to be able to handle it and thrive. Those, we think, are the hard problems that we are going after. And we think this is a missing piece in the modern data stack. People haven't really gone ahead and said the metadata fabric is the answer. Let's go build the best metadata fabric. And then use cases can be built on top of it. To the point of scalability,
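As one concrete illustration of the push-based, API-first approach described here, the sketch below emits a single metadata aspect to a DataHub instance over REST using the open source Python SDK. The class and method names follow recent acryl-datahub releases, and the dataset URN is hypothetical; treat the details as assumptions to verify against current documentation.

```python
# A minimal sketch of pushing metadata into DataHub over REST, rather than
# waiting for a crawler to observe it after the fact. Names follow the open
# source Python SDK (acryl-datahub) as of recent releases.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Attach human-friendly documentation to a (hypothetical) Snowflake table.
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders,PROD)",
    aspectName="datasetProperties",
    aspect=DatasetPropertiesClass(description="Orders fact table, updated hourly."),
)

emitter.emit_mcp(mcp)  # the change streams in; consumers see it in near real time
```

This is the same API a CI job or a schema registry hook could call, which is what makes the in-context, shift-left workflow described above possible.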
[00:24:16] Unknown:
a lot of times too, when you say, I want this to be able to scale to, you know, enterprise grade, always on, it can do everything, that usually brings in the implication that if I want to use this for, you know, my one- or two-person side project, it's not viable because there are too many moving pieces, and I am not going to be able to actually manage this system because it'll be too complex. And I'm wondering what you're doing to address that end of the market, or if that's something that is just sort of a truism and it's not possible to do both extremes?
[00:24:48] Unknown:
So the enterprise readiness aspects, you know, come in multiple forms. Right? I mean, obviously, there are the classic features like being able to audit exactly what happened, being able to export all the metadata that you put into the system into your own warehouses, and so on. But just from a reliability perspective, the fact that we have this decoupled architecture, this federated model where your existing operational systems don't have to be tightly coupled to the way you deploy DataHub, makes a big difference in the reliability and scalability of the system. Right? So all you need to do is deploy these metadata emitters, agents close to the sources with the minimum security privileges that are necessary, and they all push to the Kafka bus, even if you're, you know, actually deployed in different environments altogether, maybe across different cloud environments.
And we have a reliable way of ingesting all that metadata and getting into the central fabric without really compromising the availability or the resiliency characteristics of your dependent systems. So that, I think, is a fundamental strength of the platform itself, the fact that we enable this loosely coupled federated deployment models.
[00:26:04] Unknown:
At the same time, I think we are able to shrink it down to the point that you could actually get it running on your laptop within 5 minutes or less. And that is actually production code. The code that you are running is the production DataHub. It's not some different implementation of DataHub that gets built specifically for the quick start. It's just the single node, standalone version of DataHub, which just needs a MySQL and an Elasticsearch. In fact, we even dropped the dependency on Neo4j and use Elasticsearch as a graph engine for those kinds of deployments. So we are routinely seeing that a single data engineer doing a hack day project is able to get DataHub up and show value by ingesting, let's say, their Redshift warehouse or their Snowflake warehouse or their BigQuery warehouse. And they're able to do lunch and learn sessions and showcase the value of it to their team before moving on to the next step. And we've done a ton of improvements. Thanks to the new age of Kubernetes and Helm and all of these other systems that have been built on top, it's actually become quite simple, as long as you've approached the problem in the right way and the complexity is there as an enabler, not as a disabler,
[00:27:14] Unknown:
to scale DataHub up and down to your needs. And I think we're seeing that happen live, you know, quite a bit. Continuing on the aspect of the overall market ecosystem right now, an interesting aspect of the fact that you're focusing on DataHub and building on top of that is that there is another company that came out of the DataHub project and the team at LinkedIn in the form of Metaphor. And I'm wondering if you can just give a bit of compare and contrast of your particular areas of focus at Acryl and how that compares to what the team at Metaphor are orienting around. So there isn't a lot of public information about what the Metaphor folks are actually building. There isn't a public roadmap. So we honestly don't know much
[00:27:54] Unknown:
what they're building, and we don't know what their focus is. What we do know is we haven't received any contributions from them to the DataHub project in 2021. But, obviously, those folks were part of Shirshanka's team at LinkedIn, and the community is appreciative of all their past efforts. In terms of what we are doing, we are very much community first and open source first. You know, like Shirshanka shared earlier, we've made step-function improvements in the open source project itself, made it very easy to use and consume, made the product experience delightful, and added a ton of platform capabilities.
And this has actually led to a huge explosion in terms of adoption and community growth. You know, we have roughly tripled the community size. We also publish our roadmap very openly. So there's a public roadmap out there in terms of what Acryl Data is building. We are very community first in terms of how we build things. Shirshanka, of course, runs the monthly town hall meetings with the community. He publishes a monthly newsletter about the updates, and we continue to be very aligned and collaborative with LinkedIn and the broader community to ensure that, you know, we maintain a very vibrant and inclusive community.
And that's really kind of the difference, I would say, in terms of how we are operating. And, of course, you know, when it comes to the SaaS product, we build things which are very complementary,
[00:29:15] Unknown:
to what the open source product is, and we can get into that a bit later. But, essentially, we are very community first and open source first in terms of how we operate. And then as far as the DataHub project itself, I'm wondering if you can talk through some of the notable changes or evolution that it's gone through over the past year or 2 since it was initially open sourced and announced from the LinkedIn organization.
[00:29:38] Unknown:
Since the beginning of this year, when Acryl started, we've seen an inflection point in DataHub, both in terms of the project velocity as well as the community velocity. We routinely get, you know, 100-plus, 130-plus, sometimes close to 150 commits per month. And this is, you know, a fast moving code base. We have a total of 100-plus committers and external contributors. And even on a monthly scale, we get close to 20 external contributors from 10-plus companies that are routinely adding new features and new capabilities into the project. And pretty much every month for the last 6 months or so, we've had about 5 or 6 companies that are new. So it's not just the same folks adding stuff. It's actually a growing base of companies as well as contributors that are getting involved in the project. And that's amazing news for a project that I think is still young and has a long way to go. When I started Acryl, you know, our town halls were, like, very sparsely attended, I would say on the order of tens of attendees.
But in the latest town hall, we had 100 attendees. So we routinely get, you know, 75-plus attendees in our monthly town halls. And those are just the, you know, symptoms. A lot of hard work has obviously gone into it. Like I said, we built a significantly easier onboarding experience and new integrations through this Python-based ingestion framework. We now have more than 25 systems that we integrate with. And many of these popular systems, like dbt, Looker, Redash, were all contributed by companies that are using them in production. So these are not, like, things that we just built in a cave, but things that people actually contributed because they needed those integrations and built them. We have companies like, you know, New York Times, BBC, Grab that are all contributing to all parts of DataHub, whether it is designing in-product surveys or improving our Helm charts or building Docker based deployment recipes and sharing them with the community.
There's, of course, companies like Expedia who were in from the early days, and they've gone all in on ML model management and building MLOps practices using DataHub. And then there are companies doing data mesh, like Saxo Bank, Wolt, and Klarna, that are following GitOps and DataOps principles in practice at their companies. They're connecting their CI/CD systems with DataHub. They're connecting the DataHub APIs to their tools natively to do impact analysis, etcetera. And what we've seen really in the last year is data mesh has erupted. I don't know how many times you get that phrase on your podcast. Right? Not even totally sure at this point, but probably at least once a week, if not multiple times. Alright. So just add this to your, you know, counter. So data mesh obviously has erupted in the last year, and Zhamak and I are good friends. And along with that, DataHub has emerged as the metadata platform of choice for a lot of data mesh implementations. And just the other day, I found out on LinkedIn that the Walter Group has been using DataHub for data mesh. And I didn't even know about that. I just found out because someone tagged me on LinkedIn.
And we made these huge improvements to the platform capabilities, you know, being able to extend the metadata model while retaining strong typing without writing any code. We're calling it no-code metadata. We recently implemented support for time series data in the metadata platform, you know, to help observability. And then we've implemented, you know, DataOps friendly approaches such as having your business glossary written up in YAML and then checking it in, having the ability to define and manage all your business taxonomies just like code, and enabling data engineers to manage all of these things and version them. All of these things have really made a huge difference in how DataHub is being used. And in terms of deployments, like we talked about, there are all these cloud native recipes that we've built, both through our Helm charts as well as how-to guides. And we've made it incredibly easy to install and operate both on AWS as well as GCP, in a really EKS-friendly environment.
[00:33:33] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. As far as the observability aspect, you mentioned that that's another area of focus for Acryl and the DataHub project. And one of the pieces of insight that I've been kicking around as I talk to different companies who are in different aspects of metadata in one form or another, whether it's, you know, data quality or data governance or data discovery or the, you know, observability tag that people are throwing around a lot, is that there is a lot of opportunity for anybody in any one of those particular sort of subcategories to extend into the others. And so it's definitely interesting to see that as an explicit goal of what you're building at Acryl and DataHub. I'm wondering if you can just speak to some of the opportunities for incorporating observability
[00:35:16] Unknown:
into the metadata layer. Sure. As I was saying earlier, both Shirshanka and I have extensive experience dealing with operational fires. We've been on call for the longest time throughout our careers. So we have a lot of empathy for the frontline people. And the funny thing is we've been on call for production services, online services, and also the offline data ecosystem. And we've seen a difference there in terms of the level of rigor and the build-test-deploy kind of discipline. Right? And we're really happy to see that the best practices from the online services are gradually bleeding over to the data world, which is much needed. And we've both been doing it in our respective ecosystems over the years. It's not like this has happened all of a sudden. I think this is very much an evolution. So in terms of how we are approaching it, we are doing it in three parts.
First, we are building a very scalable platform for efficiently handling time series data. Because guess what? All of the observability signals are, at the end of the day, time series data. Right? So you have to be able to ingest data profiles, usage statistics, and so on. So we've made substantial improvements to DataHub to support time series operational metadata. We've always had the ability to track versions, so we have the ability to track schema changes. We store everything as a versioned metadata graph. And when you combine that with lineage, you know, you now have all the signals that you need to investigate data incidents in an end to end manner. So to us, that's just table stakes of building a metadata graph. It's just another form of metadata.
We will continue to improve the coverage of operational signals: freshness, completeness, support for partitioned datasets, support for data lakes, the ML ecosystem. There's no end to the observability signals. Right? We will continue to make it more and more comprehensive. And lastly, we will implement a trigger framework, and this is coming in Q4 of this year, which allows evaluation of predicates based on this comprehensive corpus of operational signals, with the ability to evaluate complex predicates and to alert or even trigger downstream workflows. Right? This can be triggering a data quality check in response to a schema change event, for example. Or you may want to actually go and halt some downstream workflow that you were going to run. So there are lots of things that can be accomplished if you have a flexible trigger framework and a predicate evaluation mechanism and the ability to run your own lambdas in response. So that's how we are tackling it, building all the right primitives, and then assembling it altogether.
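Since the trigger framework was still upcoming at the time of this conversation, here is a purely hypothetical sketch of the idea in Python: a registry of predicate/action pairs evaluated against each incoming operational-metadata event. Every name in it is invented for illustration; nothing here reflects the actual DataHub API.

```python
# Hypothetical illustration of a trigger framework: evaluate predicates over
# incoming operational-metadata events and fire user-supplied actions when
# they match. All names are invented; this is not the DataHub API.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class MetadataEvent:
    entity_urn: str
    event_type: str          # e.g. "SchemaChanged", "RunCompleted"
    payload: dict = field(default_factory=dict)

Predicate = Callable[[MetadataEvent], bool]
Action = Callable[[MetadataEvent], None]

triggers: List[Tuple[Predicate, Action]] = []

def register(predicate: Predicate, action: Action) -> None:
    triggers.append((predicate, action))

def dispatch(event: MetadataEvent) -> None:
    for predicate, action in triggers:
        if predicate(event):
            action(event)

# Example: on a schema change to any "orders" dataset, kick off a data
# quality check (stubbed) instead of letting downstream jobs run blind.
register(
    lambda e: e.event_type == "SchemaChanged" and "orders" in e.entity_urn,
    lambda e: print(f"triggering data quality check for {e.entity_urn}"),
)

dispatch(MetadataEvent("urn:li:dataset:(...,analytics.orders,PROD)", "SchemaChanged"))
```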
We believe that, you know, it needs to be the same metadata substrate that handles
[00:38:04] Unknown:
it. And as far as the specifics of what you're building at Acryl and the system architecture that you're building to support the SaaS aspects of the business, I'm wondering if you can talk through some of the design and architecture considerations that you're focusing on and some of the technology choices that you're bringing in to support the business and operating DataHub as a service for your customers?
[00:38:27] Unknown:
Right. So as we thought about the best way to bring DataHub as a SaaS product to the market, we realized we have to do a few things really, really well and make it really simple. First, you know, for a hosted model, we need to be very aware of the security and privacy considerations. So we have a model where metadata emitters can run with the right isolation levels in the customer's account, and they can send metadata over a private link to the DataHub hosted product, where we have the right isolation and security guarantees. But, you know, there are companies out there who are a lot more sensitive, even about metadata, so it's important to have the capability to run even the metadata graph, the storage, and the indexing tiers in the customer's account, with only the control plane running in our account. That becomes table stakes these days in terms of handling data privacy concerns. So, you know, having an architecture that can flex based on the needs.
And like I said, offering this native integrated experience when you want to run these downstream workflows: how do you make it super easy for customers to register their workflows, subscribe to interesting events, trigger them on demand, and make it all happen in the SaaS environment? That's what we think about a lot in terms of making
[00:39:52] Unknown:
it delightful to use. And as you have started with building out the business and dug further into the problem space and worked with some of your initial customers and design partners, what are some of the interesting lessons that you've learned or assumptions that you had going into the creation of the business that have been challenged or updated in the process?
[00:40:13] Unknown:
I would say that the biggest thing that we've noticed was that even though the ingestion emitters and all of those were built to be easy and, you know, it's very easy to run them, people were still saying, can I just get you to manage even that? So we're building managed ingestion. We already have the ingestion capabilities, but we're adding the management aspect on top of them so that customers can run the ingestion agents in their environments, and we can still administer them. So we can still push upgrades to them, schedule them, and debug issues that might happen every once in a while when a connector fails. And so we're working hard on making that last mile also extremely easy, trivial almost, for onboarding onto Acryl.
Finally, we're also working on, as Swaroop mentioned, a tightly integrated trigger framework for metadata actions that will come prepackaged with several actions for, you know, data quality and data governance: being able to do things like auto-classification of datasets as, you know, gold, silver, bronze, and being able to improve ownership coverage of datasets. The other thing we heard a lot from our customers was, well, it's great that there's DataHub, and I can go to it and find everything I need there. But I also want the same metadata inside my tools. So I might be in Redash or I might be in Looker, and I want to be able to see metadata from DataHub reflected inside those tools as well. So we're working on, you know, bidirectional API integration with other tools, being able to provide in-context metadata inside those tools. And, of course, better search ranking and recommendations. No one says no to that. And in many ways, one of the biggest improvements we can add on top of the platform is being able to learn from the signal and get better and better at recommending to people which datasets they should be paying attention to.
One of the successes, and failures, of having an incredibly scalable and flexible metadata platform is people connect you to everything. And before you know it, you have 100,000 dashboards and 100,000 charts and a million-plus datasets. And so the search relevance and ranking problem becomes a really important and hard problem for you to solve well. And then as far as some of the
[00:42:33] Unknown:
additional convenience features that you've built as part of the Acryl platform, what are some of the sort of rough edges that you have identified as you've brought on more customers who aren't already familiar with DataHub, and who aren't part of, you know, LinkedIn at the time you were building it, or Airbnb, where you have a whole team to support them, and people who are sort of coming into the DataHub community and the DataHub project with their own assumptions about how things should work and then, you know, bumping their shins against sort of mismatched expectations?
[00:43:02] Unknown:
Yeah. I mean, one thing that we notice in our community is that often central teams are coming to us because, you know, they have to tame this complexity. We tend to resonate very well with the problems that central teams bring to the table because we have lived that life. And we often talk to them about how they themselves can get immediate value before they do the wider rollout across the company. So being able to have this bird's eye view through metadata analytics about, hey, here's what is going on in your ecosystem. Here are the problem spots.
Here's where documentation is missing. Here's where, you know, business glossary coverage is missing. Being able to summarize it like that and break it down by platform, break it down by all kinds of interesting facets immediately makes it actionable for them. Right? And then they're a lot more willing to kind of work with us on navigating, quote, unquote, rough edges. I mean, we are still in private beta, but being able to get to that value on day 1 is really important. Right? And the other thing is being able to integrate with other critical workflows, not just the discovery application, which is important, but it takes a while before your tool becomes a daily use product.
Right? Which is very much our goal. How do we make DataHub your daily check-in tool? Being able to integrate with their other workflows is the other thing. So what we have observed in terms of rollouts is that being able to connect to enough systems to first deliver value for the central teams, and being able to work through the other challenges over a period of time, is usually very helpful.
[00:44:43] Unknown:
And you also mentioned that as part of the work towards the observability goals there is the capacity for DataHub to act as the mechanism by which you identify and act upon quality issues in your data. I'm wondering what are some of the trigger points and action steps that you have been focusing on for being able to add in these event hooks to then trigger downstream actions?
[00:45:07] Unknown:
Yeah. I think, you know, customers have definitely found value in connecting us with classification providers. So, you know, these are systems that can send classification information like semantic data types, like, oh, this is an email, this is a Social Security number, this is possibly PII, those kinds of tags, through DataHub. And DataHub can then trigger downstream business workflows or sometimes even programmatic workflows in response to those state changes. Like, this field just got a new tag proposed, and this tag proposal is this. And maybe something else needs to happen. Like, maybe access needs to get locked down for this dataset, or maybe someone needs to get an email or a Slack alert.
And, of course, you know, the entire platform is very open. And so data observability events like schema changes, data publish times, data completeness metrics, these are all an important source of information, and people want to be alerted. Like, data engineers would want to be alerted at a minimum about these things happening, but they might also want to define actions to be run. You know, Swaroop talked about being able to halt data pipelines. Like, there's no point in running a pipeline that is destined to fail if you know that there's a problem with the quality of the data. But we don't necessarily need to be running all the data quality rules ourselves.
You could have a data publish event happen, and then you could trigger a data quality run. Once that manifest is available, then you can decide in your Airflow DAG whether you want to run this forward or not. So this kind of really multi-system signaling, being able to enable multiple systems to react to changes in metadata, we're finding a lot of use cases where customers are using those kinds of workflows. There's, of course, always a lot of governance workflows, like automated ways to improve coverage of ownership, automated ways to categorize datasets into gold, silver, bronze, maybe based on metadata, but also maybe based on data quality scores.
Being able to monitor and alert dataset owners when business glossary term association coverage is low. Being able, when semantic tags show up, to lock down datasets or, you know, maybe open up access for a dataset that no longer contains any PII. And, of course, there's, like, a host of other actions that we're working on that are still under wraps, and we're working with our customers on those. But they honestly generally fall in the same three categories: improving discovery or productivity, accelerating data observability, and improving federated governance for organizations.
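To make this event-driven pattern concrete, here is a sketch of an external action that consumes DataHub's metadata change log from Kafka and reacts to tag changes. It is a sketch under stated assumptions: the topic name matches what the open source project has used, the payload is shown as plain JSON for simplicity (real deployments typically use Avro with a schema registry), and the handler is a stand-in for a Slack webhook or access-control call.

```python
# Sketch of reacting to metadata changes: consume DataHub's change log from
# Kafka and route interesting events to an action. The topic name and payload
# shape should be verified against your deployment; real installations
# typically serialize these events as Avro via a schema registry rather than
# the plain JSON assumed here.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "metadata-actions-demo",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["MetadataChangeLog_Versioned_v1"])

def notify(entity_urn: str, detail: str) -> None:
    # Stand-in for a Slack webhook, an access-control API call, or a
    # pipeline-halting signal to an orchestrator.
    print(f"ALERT: {entity_urn}: {detail}")

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # React only to tag changes, e.g. a PII tag being proposed on a dataset.
    if event.get("aspectName") == "globalTags":
        notify(event.get("entityUrn", "unknown"), "tags changed; review access policy")
```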
[00:47:32] Unknown:
As far as the sort of integration aspect of it, what are some of the challenges that you see teams encountering as they're starting to onboard some of their workflows and link them into DataHub, start to observe the metadata graph in their organization, and figure out how best to sort of structure the schema of the metadata events, structure the workflow of publishing events from their different data systems, and also just understand what the best practices are and how they should even be thinking about what they can use this metadata for? Yeah. We strongly
[00:48:11] Unknown:
believe in the developer-led ways here. You know, integrating into the operational fabric is an important mechanism that we emphasize, versus just observing what happened several days ago and reporting metadata. We, of course, gather a lot of these insights from our community. Right? We don't have to figure all this out on our own. We hear a lot of feedback from our community about best practices and about what they are actually finding really works well in their production environments. So, anyway, going back to what I was saying, integrating into the operational fabric, versus just observing what happened a few days later and reporting, actually makes a big difference in terms of how actionable your metadata is. Right? Collibra and Alation need armies of people to deploy, train, and roll it out widely. We've heard this from some of the customers.
They don't actually bring the developers along in their rollout. We've heard that months go by before any value is derived at all. So it's really important, in terms of the SDKs that we develop, the other tool integrations that we bring to the table, and the CI/CD practices, to get into the operational fabric as much as possible. For example, we have a native integration with the Airflow lineage backend. Anytime the DAG runs, we get that lineage edge for free. Beyond declaring inlets and outlets, people don't have to do any work. And we're going to be doing similar integrations with other data tools, which makes it really easy to report metadata events automatically. Right?
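A sketch of what this looks like from the pipeline author's side, assuming the acryl-datahub Airflow integration is installed; import paths and config keys have changed across versions, so treat the names below as illustrative rather than definitive.

```python
# In airflow.cfg (illustrative; exact keys vary by acryl-datahub version):
#   [lineage]
#   backend = datahub_provider.lineage.datahub.DatahubLineageBackend
#   datahub_kwargs = {"datahub_conn_id": "datahub_rest_default"}
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset  # ships with acryl-datahub[airflow]

with DAG(
    dag_id="orders_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # With the lineage backend enabled, every run of this task emits
    # its upstream/downstream lineage edge to DataHub automatically.
    transform = BashOperator(
        task_id="transform",
        bash_command="echo transform",
        inlets=[Dataset("snowflake", "raw.public.orders")],
        outlets=[Dataset("snowflake", "analytics.public.orders_clean")],
    )
```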
The other thing we emphasize is pushing metadata versus pulling. Of course, there are systems out there where we have no option but to go and crawl to get that metadata out. But wherever possible, we try to emphasize more incremental ways of getting the metadata and piecing together the truth later on in the metadata graph. So, again, in terms of emitter SDKs: there are always in-house tools and in-house pipelines that people build that we don't have connectors for. Giving people a very lightweight emitter SDK to bridge that gap allows them to complete the picture.
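For reference, here is what pushing an event from an in-house pipeline looks like with the acryl-datahub Python emitter. The URN and description are made up for illustration, and exact class names can differ slightly between SDK versions.

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Point the emitter at the DataHub metadata service (GMS)
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Push a metadata aspect for an in-house dataset that no connector covers
mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:custom,orders_pipeline.orders,PROD)",
    aspect=DatasetPropertiesClass(
        description="Orders table produced by our in-house pipeline",
    ),
)
emitter.emit(mcp)
```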
The other thing that comes up is source of truth. We ingest metadata from many source systems, but then we also ask humans to go and edit documentation, add tags, and so on in the metadata graph through the UI or maybe through APIs. So what happens in terms of reflecting that truth back into source systems? It's really important to have metadata flowing freely back to the source systems, to be able to reconcile any differences that you see, and to have the ability to merge different versions. And lastly, as I was saying earlier, it's really important to be in the critical workflows of data developers.
In addition to making data freely discoverable, you have to be able to inject into their workflows. That's why it's so important for us to be focusing on observability in addition to helping teams with governance and discovery initiatives.
[00:51:24] Unknown:
And in terms of the actual integration path, I'm wondering what your thoughts are on the potential for projects such as OpenLineage and OpenMetadata to simplify that process by having some common format where all of the different tools in the data ecosystem can speak to each other without having to do custom implementations
[00:51:44] Unknown:
for each one? Yeah. I think both of those projects are, you know, great initiatives, and much needed. We can always benefit from having better standards, or in some cases any standards at all. But we also really like to think about this problem from the perspective of the customer. And what we see is that the real problem is that metadata is just locked up behind tools and not free flowing. It's not even that the APIs are bad or opinionated or very specific; it's that there are no APIs. So we need to solve that first, in my opinion. That's step 1. It's okay to be a bit too specific, a bit opinionated, as long as it's useful and as long as it's available in real time. I like to think about it almost as an analog to the data world: we've lived for years in a schema-on-read world powered by data lakes.
And I think we can survive quite well using similar techniques for metadata. If I had to choose (and I would rather not choose), I would rather have a free-flowing, schema-on-read metadata world than wait for a perfect schema-on-write, standards-based metadata world to emerge eventually. Because the real problem in many cases is that the APIs don't even exist, not that they are very specific or opinionated. If we can solve that first, then the standards can emerge from it, and we will all win, honestly. I do think standards are very valuable, but I'm not holding my breath, because the problems are here and now. And there are also quite a few emerging standards besides the 2 you mentioned. There's also Egeria's open metadata standard.
And there are also a few other companies, including Microsoft, I believe, that have tried this in the past. So it's not clear which standard will win, and like with all standards, it might take a few years. But we are definitely watching the space closely, cheering for standardization, and we are more than happy to build interoperability with the best standards. In fact, our push-based architecture makes it really easy to do this, so we are not worried about it taking too long once the market makes a choice. It's literally a matter of writing that 1 adapter that transforms 1 format into DataHub's types. But we're not waiting for it. We're forging forward and asking both open source projects and vendors to make their APIs more accessible. There are some good examples and some okay examples, but I have to say that pretty much everyone I have talked to has been very much in the customer's camp in all of these discussions. So we're very hopeful for the future of metadata across tools.
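To make that adapter point concrete, here is a hedged sketch mapping a generic lineage event (the {"inputs": [...], "output": ...} shape is hypothetical, standing in for whatever format a standard settles on) onto DataHub's upstream-lineage types from the Python SDK:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

def to_datahub_lineage(event: dict) -> MetadataChangeProposalWrapper:
    """Translate a generic lineage event into DataHub's lineage types."""
    upstreams = [
        UpstreamClass(dataset=urn, type=DatasetLineageTypeClass.TRANSFORMED)
        for urn in event["inputs"]
    ]
    return MetadataChangeProposalWrapper(
        entityUrn=event["output"],
        aspect=UpstreamLineageClass(upstreams=upstreams),
    )

# Hypothetical event, with dataset URNs already in DataHub form
event = {
    "inputs": ["urn:li:dataset:(urn:li:dataPlatform:kafka,orders_raw,PROD)"],
    "output": "urn:li:dataset:(urn:li:dataPlatform:hive,orders_clean,PROD)",
}
DatahubRestEmitter(gms_server="http://localhost:8080").emit(to_datahub_lineage(event))
```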
[00:54:28] Unknown:
In terms of the governance of the DataHub project, I'm wondering if you can speak to that a little bit, and just some of the current state and future plans.
[00:54:37] Unknown:
Sure. Yeah. I mean, Acryl Data and LinkedIn are the main drivers of the project. I am, I guess, the overall project lead, and I lead community decisions and build joint tech roadmaps with LinkedIn and the community. We have a very vibrant community. Like I said, 100-plus contributors, and pretty much every month we get 20 or more contributors from different companies. In fact, those have been the stats for the past year since Acryl Data started. And this is a great day to be doing this podcast, because I'm really delighted that Maggie Hayes has actually joined Acryl Data as DataHub's community manager. We actually hired her from the community after we worked closely with her on a design sprint along with several other members of the community. She's a force of nature, and the community is really lucky to have her. And in your experience
[00:55:29] Unknown:
of building the DataHub project, founding the Acryl business on top of it, and running the DataHub project both internally at LinkedIn and now as a community effort, what are some of the most interesting or innovative or unexpected ways that you've seen it used? Yeah. It's really interesting. The community is always several steps ahead of us in many ways because of the sheer scale. We've seen all kinds of interesting
[00:55:52] Unknown:
use cases emerge. Often, people have been very API-first and integration-heavy in their approach before they even consider using the UI. We've seen companies like Vault who have basically written their own SDKs on top and have enabled other developers with those SDKs so that they can go and work with metadata in any tool. So in many ways, they're informing our vision alongside us. We've also seen companies like Warung Pintar from Indonesia, who have a very interesting implementation of how they collect lineage. They run SSIS workflows, which we obviously have no support for yet, but they have managed to scrape together all of that information and record it in a spreadsheet. And from there, they built an adapter to easily push that information into DataHub. So we see all kinds of innovation happen in the most unexpected places.
And, of course, there are also industry-leading practices coming from our community. Saxo Bank, for example, has really reimagined the old-school business glossary that everyone is tired of, and worried about, because of how process-heavy it is. They've said, hey, let's make business glossaries be schemas. Let's make them code that is versioned, and let's empower data engineers to just tag the schemas with those terms and make it all linked and checked at compile time. It's amazing how much automation they've actually put in place. Klarna has been working on automated ways to evaluate the downstream impact of any changes: they look at the lineage, and they have very efficient ways to navigate it and evaluate impact.
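As a rough sketch of the glossary-as-code idea, a CI job could apply version-controlled term associations with the Python SDK. The term URN and dataset below are invented for illustration, and the aspect/class names may differ by SDK version.

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Associate a versioned glossary term with a dataset; in a CI/CD setup
# this runs whenever the glossary-as-code repository changes.
terms = GlossaryTermsClass(
    terms=[
        GlossaryTermAssociationClass(
            urn="urn:li:glossaryTerm:Classification.Confidential"
        )
    ],
    auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ci-bot"),
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="snowflake", name="analytics.public.orders"),
        aspect=terms,
    )
)
```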
There's also a lot of innovation going on in the MLOps area around how to enable end-to-end debugging. There's a class of problems that is local to the model training and evaluation world, but there are also upstream problems. How do you distinguish the two? So we've been blown away, honestly, by the amount of innovation happening, and in many ways we are catching up. That's why it's so great to be building alongside the community.
[00:58:09] Unknown:
Something interesting that I have noticed is that there are a couple of companies in our community that are themselves data companies, in the sense that they hand-curate or crowdsource data for other companies. And they have actually been considering using DataHub to offer up as a catalog for their partners. I thought that was interesting. We normally think of DataHub as a business-facing catalog, an internal tool that all your employees are using. But there are a bunch of companies that are actually thinking about turning it around and making it a catalog that their partners can come in and use to explore all the datasets that they have, whether for sale, for download, or for sharing. I thought that was quite interesting: making it a product itself. In your experience of building and working on the DataHub project and then founding Acryl Data to carry it forward, what are some of the most interesting, unexpected, or
[00:59:09] Unknown:
challenging lessons that you've learned in the process?
[00:59:11] Unknown:
So I can definitely say that I did not know how payroll worked until I founded Acryl Data. That's definitely been interesting, somewhat challenging, but now I know how payroll works. So that's been great, and that's a classic first-time founder story. But in terms of DataHub itself, I would say that early on, when we did our plans for the year, we asked: what do we want to do, how do we want to build, and what are our founding philosophies? We really embraced the "do things that don't scale" philosophy. And that meant getting on a Zoom call with someone on a Saturday and debugging their quickstart issues because they're on a Windows laptop with, you know, Windows Subsystem for Linux (WSL) v1, and somehow the Docker container isn't starting, and literally sitting down with them and solving it until it actually worked.
And so that has been great, but it has also meant that as the community engagement has exploded in the past few months, we are starting to hit some real limits in terms of how much we can accomplish and how much we can help individuals. And we're really fortunate to find an excellent community manager from within the community, Maggie Hayes, who has recently joined us. So we're really looking forward to being able to scale out the community engagement a bit more and make it much more structured and efficient. We're also, thankfully, seeing the effects of having a vibrant community: the community is starting to help each other out. Our Slack is often buzzing by the time we wake up. We're in the Pacific time zone, and there are a ton of people from Europe and India in the community, so Slack is just crazy by the time we wake up at 7 AM, or 6:30 in some cases for the early folks.
But we're starting to see, hey, this person actually helped this other person resolve their issues, and we're no longer the single point of failure holding people back from getting to success. And that's great. Another interesting learning I've had: when I was at LinkedIn, I used to run these big data meetups, and we had pizza and beer and appetizers, a big budget, and a lot of planning to do 1 big event a quarter. As we moved into the pandemic phase and everything went virtual, with everything now a town hall and a Zoom meeting away, I actually realized that these town halls work better.
Because, first, you don't have to wait a quarter to get everything approved. Second, we are able to take away a lot of the power dynamic and the location dynamic between people who are local and people who are not. You don't have a swarm of people in a room while everyone else is on their own little screens feeling left out of the conversation. Everyone is on a screen, so it's all live, it's all happening, and it's fair for everyone. I've actually been blown away by how effective virtual town halls are. We get a ton of feedback, we're able to share a lot of content, and we're able to reach all the corners of the world. We have people from Indonesia, Australia, New Zealand, South Africa, all over the planet joining us for these town halls. And I've just been blown away by how worldwide our presence has become over the past year. Do you have any lessons to add, Swaroop?
[01:02:36] Unknown:
Yeah. I'll just say that as a commercial company running an open source project, there's a classic question that often comes up: what do you put in your SaaS product, and what do you keep in the open source project? Constantly reminding ourselves to be community-first, and treating the community as our friend in navigating some of these choices, has been important.
[01:03:03] Unknown:
And for people who are interested in the capabilities of DataHub and what you're building at Acryl, and who are interested in the promise of metadata management and being able to have a cohesive view of their overall data assets, what are some of the cases where Acryl is the wrong choice, and they might be better suited either using the open source DataHub or a different metadata platform entirely?
[01:03:26] Unknown:
Yeah. So, in terms of our philosophy and approach, we are very much developer-friendly, focused on technical users of data: data engineers, data scientists, ML engineers, and so on. We also have a delightful product experience for business users. Don't get me wrong, we love business users; they also want to be alerted when their KPI dashboards break. But we are not necessarily the best choice if you want to run queries from within the tool, or visualize your query results from within your catalog. Catalogs like Alation have tried to do that, and we just think you can't really replace the ease of use of a BI tool like Looker by bringing totally unrelated concerns into the catalog. If you're trying to do that, we may not be the best choice. We believe in a much more automated approach to data management using a DataOps, GitOps philosophy.
You know? If you prefer a more central, process-heavy governance approach, we may not be the best fit. We want to emphasize the federated governance model that is advocated by data mesh. So if you just want a prettier UI than Collibra but still want to use the old heavy-handed governance approaches, we may not be the right choice. But we do believe the timing is right for companies to take a more developer-friendly,
[01:04:48] Unknown:
operational-fabric-first approach to metadata, and that's why we are seeing the space heating up so much. As you continue to iterate on the DataHub project and the product that you're building at Acryl, what are some of the things that you have planned for the near to medium term? Well, our first priority
[01:05:05] Unknown:
continues to be growing the adoption of DataHub in enterprises of all sizes and scales. We're tackling data discovery, data observability, and DataOps-driven governance as our initial sweet spots. We are a remote-first team, so of course we're looking to grow our engineering, product, and go-to-market teams. We are getting a lot of traction for our private beta SaaS product. And really, if you ask where we want to see ourselves 5 years down the road: we want to be the essential weapon in a data developer's arsenal, not to use a military term too lightly. There's a lot of complexity in the modern data stack, and we want to help data professionals get more effective and more creative with the data that they have. We just want to be an essential part of that.
[01:05:49] Unknown:
Are there any aspects of the DataHub project and community, or the work that you're doing at Acryl and the business that you're building there, that we didn't discuss yet that you'd like to cover before we close out the show? I think we covered a lot of it in terms of the gaps in the tooling and the technology for data management today. I think we're repeatedly seeing that, you know, fragmentation
[01:06:09] Unknown:
and all this hyper-specialization is ultimately hurting the customer. The cracks that are forming due to fragmentation are creating issues for data platform teams, who are unable to drive data quality initiatives and unable to drive data governance initiatives. And executives are purchasing a lot of things, but they're unable to see the returns from those investments. We really believe that the biggest gap is a metadata fabric that can connect all these tools together. So maybe that's the only thing that I want to reemphasize: we're missing that important piece of the puzzle.
[01:06:46] Unknown:
Alright. Well, for anybody who wants to get in touch and follow along with the work that you're each doing, I'll have you add your preferred contact information to the show notes. And you've already answered my question about the biggest gap that you see in the tooling or technology for data management today. So I just want to say thank you both for taking the time today to join me and for all the effort that you're putting into the DataHub project, and best of success for you at Acryl. I hope you enjoy the rest of your day. It's been a pleasure chatting, Tobias. Thanks for having us again. Same here, Tobias. It's been an absolute pleasure. I think we blew past our 1 hour mark. I get easily excited.
[01:07:23] Unknown:
Thanks a lot for the really invigorating conversation.
[01:07:32] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Acryl Data and DataHub
Shirshanka's Journey in Data Management
Swaroop's Experience at Airbnb
Acryl Data's Mission and Vision
Building and Evolving DataHub
Lessons from Gobblin and Data Integration
Market Perspectives and Opportunities
Comparison with Metaphor
Recent Changes in DataHub
Data Observability and Quality
Acryl's SaaS Architecture
Customer Lessons and Feedback
Integration Challenges and Best Practices
Trigger Points and Action Steps
Open Lineage and Metadata Standards
Governance of DataHub
Innovative Uses of DataHub
Challenging Lessons Learned
When Acryl Might Not Be the Right Choice
Future Plans for DataHub and Acryl