Summary
One of the biggest obstacles to success in delivering data products is cross-team collaboration. Part of the problem is the difference in the information that each role requires to do their job and where they expect to find it. This introduces a barrier to communication that is difficult to overcome, particularly in teams that have not reached a significant level of maturity in their data journey. In this episode Prukalpa Sankar shares her experiences across multiple attempts at building a system that brings everyone onto the same page, ultimately leading her to found Atlan. She explains how the design of the platform is informed by the needs of managing data projects for large and small teams across her previous roles, how it integrates with your existing systems, and how it works to give every stakeholder a shared view of their data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Prukalpa Sankar about Atlan, a modern data workspace that makes collaboration among data stakeholders easier, increasing efficiency and agility in data projects
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Atlan and some of the story behind it?
- Who are the target users of Atlan?
- What portions of the data workflow is Atlan responsible for?
- What components of the data stack might Atlan replace?
- How would you characterize Atlan’s position in the current data ecosystem?
- What makes Atlan stand out from other systems for data cataloguing, metadata management, or data governance?
- What types of data assets (e.g. structured vs unstructured, textual vs binary, etc.) is Atlan designed to understand?
- Can you talk through how Atlan is implemented?
- How have the goals and design of the platform changed or evolved since you first began working on it?
- What are some of the early assumptions that you have had to revisit or reconsider?
- What is involved in getting Atlan deployed and integrated into an existing data platform?
- Beyond the technical aspects, what are the business processes that teams need to implement to be successful when incorporating Atlan into their systems?
- Once Atlan is set up, what is a typical workflow for an individual and their team to collaborate on a set of data assets, or building out a new processing pipeline?
- What are some useful steps for introducing all of the stakeholders to the system and workflow?
- What are the available extension points for managing data in systems that aren’t supported by Atlan out of the box?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Atlan used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Atlan?
- When is Atlan the wrong choice?
- What do you have planned for the future of the product?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Atlan
- India’s National Data Platform
- World Economic Forum
- UN
- Gates Foundation
- GitHub
- Figma
- Snowflake
- Redshift
- Databricks
- DBT
- Sisense
- Looker
- Apache Atlas
- Immuta
- DataHub
- Datakin
- Apache Ranger
- Great Expectations
- Trino
- Airflow
- Dagster
- Privacera
- Databand
- Cloudformation
- Grafana
- Deequ
- We Failed to Set Up a Data Catalog 3x. Here’s Why.
- Analyzing the Analyzers (book)
- OpenAPI
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Prukalpa Sankar about Atlan, a modern data workspace that makes collaboration among data stakeholders easier, increasing efficiency and agility in data projects. So, Prukalpa, can you start by introducing yourself? Thanks for having me. My name is Prukalpa. I'm one of the founders of Atlan.
[00:01:15] Unknown:
I've been a data practitioner my whole life. Prior to this, I built and ran the data teams that built India's national data platform, which is one of the largest public sector data lakes of its kind, and, you know, really learned all of the challenges of building and running data teams firsthand. A lot of those learnings and failures is what led to Atlan today. And do you remember how you first got involved in the area of data management? So I've been a data practitioner my whole life. My cofounder, Varun, and I actually met in university. And what started out as a passion project turned into something more than that. We said, hey, large scale problems like national level healthcare and poverty alleviation.
They don't use data, and it feels like they should. You know, let's go do something about that. And, bit by bit, we ended up building a world-leading data-for-good organization. You know, we were a World Economic Forum tech pioneer. We were doing a lot of work with folks like the United Nations and the Gates Foundation and several large governments. And, you know, I think we just became data practitioners by teaching ourselves ways to solve these important problems that we cared about. And data management was the biggest challenge that we had in every one of the projects that we implemented. So very organic journey.
[00:02:24] Unknown:
As you mentioned, some of the lessons that you brought in to Atlan were learned in the process of building out the data platform for India. And I'm wondering if you can just start by giving a bit of an overview about what it is that you have built at Atlan and some of the story behind deciding to turn it into a business and some of the core principles that you wanted to be able to address as you were building out the product?
[00:02:48] Unknown:
So the two-liner on Atlan: we sort of see ourselves as a collaborative workspace for the modern data team. So what a GitHub is for an engineering team or a Figma is for a design team, knowing fully well how diverse a modern data team is. Right? Analysts, engineers, scientists, business, all with their own tooling preferences and skill sets. How do you sort of enable this team to work together most effectively? So we sort of see ourselves as that glue that binds the people together with the tools and technology. I think what's useful is, you know, how we got started. Right? We actually started out as a data team ourselves.
We were, as I was saying, doing a lot of work in the data-science-for-social-good space, but, you know, working with a wide variety and scale of data, right, from billions of pixels of satellite imagery to processing data for over 500 million Indian citizens. And while many of these are dream projects for a data practitioner, the reality for us was that every day was chaos. Our Slack channels would be filled with messages like, what does this column name mean, or can I use this final clean version of this data set? I still remember this one time where one of the cabinet ministers of India called me at 8 in the morning and he said, Prukalpa, the number on this dashboard doesn't look right. And then I go like, okay, I open up my laptop, there's a 2x spike in a day. So clearly it was wrong.
But there was nothing I could do at that point, right? So I, you know, said, I'll call you back. And then I called my project manager, who called my analyst, who called my engineer, who pulled out audit logs, but he couldn't troubleshoot it because he didn't know what the variables meant. You know, long story short, we were spending 50 to 60% of our time dealing with issues like this. And if you think about these challenges, they're not as much technology problems as much as they are human collaboration problems. We actually sort of got to this point where we realized we couldn't continue to stay like this. And so we started an internal project to make our team more agile and effective over time. So over a couple of years, we built tooling that made us about 6 times more agile. That was actually how we went on to do things like build India's national data platform, which is awesome because the prime minister uses it. But what was really cool was that it was built by an 8 member team start to finish in 12 months, which is one of the fastest of its kind globally. And we sort of realized we'd ended up building some tooling that might help other data teams around the world. It was really with that question that Atlan was born. We were like, hey, you know, these tools made our data team accomplish things that we thought would be impossible.
Can we help every data team in the world be a little bit more efficient, a little bit more agile, and what could that do, I guess, to the broader world? Right? So that's actually how we started. So you mentioned that Atlan is built as this
[00:05:27] Unknown:
focal point for data teams. And I'm wondering if you can just give a bit of an example of sort of who the target users of Atlan are and just some of the ways that it might fit into their existing workflows or if there's any sort of change in the way that they approach their work and the problems that they're trying to solve that Atlan provides.
[00:05:47] Unknown:
Typical users of Atlan tend to be anyone who's a data user in an organization. Right? So, obviously, the core data team, this could be the analysts, the data engineers, the scientists. But increasingly, we're also living in this world where everyone is an analyst in some ways. Right? So increasingly, you're seeing the adoption of what is being called the citizen data scientist or the power data users inside an organization in the business as well. As you think about the workflow, I think it would be useful to take a step back and just talk about some of the fundamental principles behind the, you know, product at Atlan. So the product itself is built on, like, three fundamental principles.
The first is this principle that we call a data asset. Now traditionally, a data asset would be just a table or a database. Right? But if you think about it, in today's world, a model is an asset, a BI dashboard is an asset, code is an asset, pipelines are an asset. So what we've tried to sort of build is this concept of what does it mean to be a data asset and how can you treat and maintain a data asset the way it should be. Around every data asset, we have this concept that we call a data asset profile. The best analogy to think about this is a GitHub code repo. Right? So when you're onboarding an engineer on a team, you just share a link to a GitHub code repo. It has your code, your documentation, your revisions, everything that an engineer needs to actually just get started to work. But if you think about it, when it comes to data, it's really hard to get down to that single source of truth.
And that's also because the single source of truth in the data ecosystem is spread across a couple of different places. Right? So, you know, the truth about who else is using this data is in the BI tool. The truth about where this data is coming from is in the ETL tool. And so what we've tried to do is say, how can we create the single source of truth in this asset profile that makes it possible for an end user to understand everything that they need to know about the data asset and trust it, knowing that this truth itself is gonna be dispersed in a couple of different places. Right? And finally, we've tried to create this concept of what we call embedded collaboration, right, which is once you've come in and discovered your data asset and understood everything you need to know about it, how do you start using it. Right? And we think about this as almost microflows.
So right from things like, why can't you share a data asset as simply as sharing a link on Google Docs, right? Or when you request access to a data asset, why can't the owner just get a request on Slack and be able to, like, approve it? Or how do you just quickly open up that link in a Tableau or a Jupyter? So what that means for our end users, if you think about these three principles, is that, you know, fundamentally, we end up becoming a layer on top of the existing data stack. So, typically, you know, our customers have implemented, you know, the best tools in the modern data stack. Right? So a Snowflake or a Redshift or a Databricks, you know, they have an existing BI tool like a Looker or a Sisense or a Tableau in place. They're using dbt. Like, typically, that's the stack of a typical team that brings in Atlan, and Atlan becomes a search, discovery, virtual hub layer across all of these, you know, tools in their ecosystem. So the first step for a user is, like, you know, you don't need to message someone on Slack and say, hey, like, you know, which dataset did we use in this project, or can I trust this? We essentially replace those ad hoc flows that are happening inside an org and just bring what an ideal 21st century experience of collaboration should look like on these data assets.
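To make the "data asset profile" idea concrete, here is a minimal, hypothetical sketch of what such a profile might aggregate: context that normally lives in the warehouse, the BI tool, the orchestrator, and query logs, pulled into one record per asset. The field names and the example asset are illustrative assumptions, not Atlan's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ColumnProfile:
    name: str
    description: Optional[str] = None      # human-added business context
    null_fraction: Optional[float] = None  # auto-profiled statistic

@dataclass
class DataAssetProfile:
    """One place for 'truth' that is normally scattered across tools."""
    qualified_name: str                     # e.g. "snowflake://analytics.public.orders"
    asset_type: str                         # table, dashboard, model, pipeline, ...
    owner: Optional[str] = None
    description: Optional[str] = None
    columns: List[ColumnProfile] = field(default_factory=list)
    upstream_assets: List[str] = field(default_factory=list)        # from the ETL/orchestration tool
    downstream_dashboards: List[str] = field(default_factory=list)  # from the BI tool
    last_updated_at: Optional[str] = None                           # freshness, from the pipeline
    top_users: List[str] = field(default_factory=list)              # usage, from query logs

# Hypothetical example of a profile for a warehouse table
orders = DataAssetProfile(
    qualified_name="snowflake://analytics.public.orders",
    asset_type="table",
    owner="data-eng@example.com",
    downstream_dashboards=["looker://dashboards/revenue_overview"],
)
print(orders.qualified_name, orders.downstream_dashboards)
```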
[00:09:13] Unknown:
In terms of the components of the data stack, I'm wondering, for somebody who has an existing data platform, they've got their workflows established. They're feeling the pain that most data teams are as far as not understanding, okay, what are some of the lineage systems, or what was the use case, what's the context around these data assets. You know, there are different systems that are being built out to be able to address some of those. I'm thinking in terms of, like, metadata management tools, so things like DataHub or Datakin. There are data catalogs. There are access control layers, such as things like Immuta or Apache Atlas.
And it sounds like Atlan is either trying to integrate with or supplement or supplant some of those systems. So I'm wondering sort of how does Atlan fit within the broader ecosystem of data tools that people might already be using, and what are the pieces that might be replaced wholesale by Atlan?
[00:10:10] Unknown:
At Atlan, we love open source. We're built on open source. The entire product is built on, like, a fully open API architecture. And, you know, the problem we're trying to solve is, like, we actually fundamentally believe a lot of the fundamental technology infrastructure challenges are gonna be solved in the open source ecosystem or the managed cloud ecosystem. And so, you know, our goal is really to say how do you solve the human collaboration problem? How do you create a product experience that users love, that users want to sort of use on a daily basis, and can get up and running in, like, 30 minutes. Right? And so what that means from a core capabilities perspective is that, you know, Atlan basically becomes, and people use a lot of different words in the industry these days, your data discovery, data cataloging, you know, data governance, observability sort of layer, that sort of virtual layer on top of all of your data assets.
Behind the scenes, again, we leverage the best of open source. For example, we parse through your SQL query history and we auto construct column-level lineage across your data ecosystem. We have a bunch of bots that read through your columns and actually suggest column descriptions and, you know, linkages across all of your data ecosystem. So a bunch of those automations happen behind the scenes. Fundamentally, you know, Atlan itself gets adopted in teams that are building out a modern data platform, just like you said. Right? So, typically, we don't end up replacing any older tech. We're typically greenfield. The ideal team for Atlan is somebody who's looking to bring in a full fledged end to end solution in that metadata management layer broadly, to get started in 30 minutes and actually just get down to the work. So I think that's sort of where we would fit in. But I think all in all, that category itself is still super nascent and still developing. Right? We truly believe that the right solution in this space is gonna look much closer to what a GitHub does in engineering or a Figma in design than any tool that exists today in the ecosystem.
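As a rough illustration of the kind of lineage extraction described here (parsing warehouse query history to infer which tables feed which), below is a minimal sketch using the open source sqlglot parser. It only handles simple CREATE TABLE AS / INSERT statements; it is an assumption about the general technique, not Atlan's actual implementation.

```python
# Derive table-level lineage edges from a warehouse query history with sqlglot.
import sqlglot
from sqlglot import exp

query_history = [
    "CREATE TABLE analytics.daily_revenue AS SELECT order_date, SUM(amount) AS revenue FROM raw.orders GROUP BY order_date",
    "INSERT INTO analytics.customer_ltv SELECT customer_id, SUM(amount) FROM raw.orders GROUP BY customer_id",
]

edges = set()
for sql in query_history:
    tree = sqlglot.parse_one(sql, read="snowflake")
    # The created/inserted table is the lineage target...
    statement = tree.find(exp.Create) or tree.find(exp.Insert)
    if not statement:
        continue
    target_table = statement.find(exp.Table)
    # ...and every other table referenced in the statement is an upstream source.
    for source in tree.find_all(exp.Table):
        if source is not target_table:
            edges.add((source.sql(), target_table.sql()))

for upstream, downstream in sorted(edges):
    print(f"{upstream} -> {downstream}")
```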
[00:12:16] Unknown:
Yeah. So the sort of metadata layer, as you said, is still evolving. There are a number of different players who are trying to carve out their own niche where, you know, I'm thinking in terms of, like, data governance and metadata. There's Data Hub in terms of being able to catalog your assets and understand some of the data flows. Datakin is very focused with the sort of Marquez open source core of it on understanding the lineage aspects. And then there are things like Immuta and Apache Atlas that are very focused on access control and sort of security controls, managing PII.
And it seems that Atlan is trying to make more of a broad platform play that covers all of those different aspects. And then, you know, also going into data quality, there's an explosion of companies that are trying to address different forms of that. And so for somebody who is building out a new platform, they say, okay. I've got my workflow management system, so I might be using a Dagster or an Airflow. I've got my data lake where I'm using S3 and Trino, or maybe I've got everything landing in Snowflake. Then they're saying, okay. Now I need to be able to see where all the flows across this. Would they also be using some of those other tools like Soda and Databand for the observability aspects and then plug that into Atlan for visibility
[00:13:37] Unknown:
and using DataHub for the metadata management that then has a layer in Atlan that they view, or would Atlan replace the choice of bringing in those tools? Maybe I'll help you understand, sort of behind the scenes, the open source that we integrate with. Right? And I think that will help. So behind the scenes, at Atlan, we leverage Apache Atlas for essentially the metadata management layer and Apache Ranger from a security perspective. We integrate into ecosystems like Great Expectations from a data quality perspective. So fundamentally, what we've tried to do is sort of take some of these fundamental open source projects and build a layer on top of that that makes it super DIY for ecosystems like Snowflake, you know, Trino, you know, Airflow. So we have a native plug-in into Airflow, so it's a 30 minute setup on top of all of that. What that means typically is we integrate very well into, for example, a Snowflake or a Trino; all of the fundamental data warehouses, lakes, and orchestration tools are native integrations that Atlan comes with. We would typically end up replacing a discovery solution like a DataHub. I mean, for most customers, it's a choice. Right? You choose between, like, do you wanna take open source and build on top of it, or do you wanna just go with, like, a managed approach? So it typically ends up being that choice, you know, for customers, and I think different teams have different approaches based on the resources they're ready to sort of invest into building on top of some of this. So basically, the layer that Atlan would come in on is your metadata management, data discovery, data lineage, and aspects of access control. Those are the places where Atlan becomes really, really valuable as a full fledged solution.
We would integrate into ecosystems like an Immuta or a Privacera,
[00:15:20] Unknown:
from a policy enforcement perspective. So policy enforcement is something that we don't do. Does that answer your question? Yeah. That's definitely very useful context and useful framing, because as you said, it can be very nebulous, sort of, what are the dividing lines between all of these different systems. And a lot of times there's some slight overlap where you say, okay, there's a little bit of overlap between these two tools, but I need both of them because of their core capabilities. And so understanding sort of, like, where do all these different layers tie together. There are never very clear dividing lines because everybody starts with some core capability and then kind of expands out to the neighboring areas. And so as far as the specific data assets, you mentioned that there are some different ways of thinking about it. So it could be that there's a table in your data warehouse. It could be that there's a machine learning model that you've got deployed to production, the code repository for managing the workflows.
How do you think about the assets, and what are some of the variations in terms of what type of information you need to be able to track for them, and how you represent that within Atlan for being able to unify, in a sort of single pane of glass, how they all relate to each other? So we fundamentally think about every data asset as its own unique entity.
[00:16:34] Unknown:
Right? So a table is its own unique entity, which is different from a BI dashboard, which is different from a BI widget, which is very different from code. So we think of these as very different unique data assets and data asset entities. And as we think about the ideal profile, right, like, the guiding light for us is, like, what does it take to create a single source of truth? And that always starts at the user. Right? Like, so if you're looking at a table, what is all the information that would create the single source of truth for that data asset? So every data asset has its own unique type of profile in Atlan, and I think that's the first layer. Right? Like, what does the ideal data asset profile look like? And, you know, we actually open up data assets asset type by asset type. Right? So our approach on this has been, you know, a couple of different layers. Right? One, how do you create the ideal profile?
Second, how do you create all the integrations that it takes to fill up that profile with the single source of truth with as little human intervention as possible. You know, we've taken a very DIY approach. A core philosophy for us is that you should not need engineering involvement to set up something like an Atlan. And this is a core philosophy for us because, you know, when we were a data team ourselves, our engineering resources were the most scarce resources we had, right? Like, engineering time is so valuable. So you have to be really, really careful about what you're investing that engineering time in. And so one of our core principles has been that, you know, in an ideal case, you shouldn't need a data engineer's time to actually, like, set up something like Atlan. And so that means that we invest in going really deep into our integrations. Right? So it's fully, you know, UI driven, a 30 minute setup. We try and make it possible that even an analyst can set up Atlan, you know, internally. And so that means basically doing different things for different tooling. Right? So if, say, you're picking Snowflake, it means, you know, going super deep into all the integrations and, you know, how do you make that DIY.
But when you're looking at a table, you probably also want information coming in from your ETL, from Airflow. Right? And saying, you know, what does freshness look like for this data asset? Or when was the last time that that ran? Right? And so that's where a lot of the extensibility and the plug-ins in Airflow come in. So we have a native plug-in into Airflow to allow sort of that information to just be directly pushed in via the API, and we try to keep it super customizable and extensible. So that's really the second layer. And the third layer is the relationship layer, like you talked about right now. What is this table linked to? You know, for example, what BI dashboards does this table power? And that's really where elements of lineage, but also other relationships in your ecosystem, come in. Again, there, we've taken the approach of depth versus breadth. For example, I talked about this with Snowflake, but with any SQL ecosystem, we basically parse through your SQL query history. We look through your logs.
And, again, through native integrations, we're able to say, you know, this Looker dashboard is powered by exactly this table and this column in your ecosystem, and then a bunch of bots run on that layer. Right? So this is where a lot of the intelligence of Atlan starts kicking in. So if you know that only your top 20% of data assets are being used and queried, how does that go into the search algorithm? How do you basically link it if you have a business term like annual recurring revenue? These are layers that we've tried to build, again, to make sure that, you know, it's not just setup, but it's also time to value, that becomes as efficient as possible and with the, you know, least amount of human intervention possible. And so digging a bit more into the Atlan platform itself, can you talk through some of the architectural
[00:20:15] Unknown:
core of it and some of the ways that it's implemented, and just some of the overall evolution and shift in terms of the design and the broad goals of the platform? So it's a fully open architecture, you know, built on a fully open API kind of ecosystem.
[00:20:28] Unknown:
You know, Atlan's built on a Kubernetes cluster, so lots of auto scaling and elements like that that are, you know, baked in. We offer the product in two ways. So one is as a deployed solution on a customer's virtual private cloud. So this, for example, on AWS is a CloudFormation template; essentially, you know, the intent has been to make it, like, a one-click deployment process, and then it's powered by a whole ton of things like Grafana dashboards and things like that that make engineering management super, super easy. The other is sort of a managed SaaS kind of offering, in which case we take care of all of the management of the product itself. So the first step is, of course, just setting up the product and the architecture, which is relatively straightforward, typically a one-click process.
The second aspect of implementation is connecting into your data sources. So this could be typically your data lakes or warehouses, your databases, your BI tools, for example. These are very DIY kinds of integrations that are built into Atlan. So it's fully UI driven. Just come in, add in your credentials, set up a scheduler, and say, you know, I wanna run this in this frequency, or pieces like that. So we've taken away any of the management aspects of setting up and bringing your data assets into Atlan, and a lot of monitoring elements. I think the thing that people actually miss a lot is that it's actually not that hard to, like, set things up the first time. Where it really hurts is when things break. Right? So the amount of data engineering time that gets, you know, spent troubleshooting something that you set up is far more than the time you spend setting something up, right? So again, a lot of those pieces we've tried to make super UI driven. So, you know, lots of monitoring, Slack alerts, you know, just things like that that check if things are working the way they should. And, you know, ideally, prevent them from going wrong before they do. So that's the second aspect. So, typically, customers actually get up and running with Atlan in, like, 1 hour, maybe, you know, worst case, 24 to 48 hours if you need, like, you know, credentials from your security team. So it's a pretty easy setup process.
And then from there, you know, of course, there are elements of how you want to organize your data assets in Atlan. Right? And some of those principles of how you want to set it up for your own workspace or your own organization. Those tend to be more human driven aspects that kick in.
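To make the Airflow integration pattern mentioned above more concrete, here is a hypothetical sketch of a DAG that reports freshness metadata (last successful run) for the asset it produces to a metadata service after each run. The endpoint, payload shape, and asset name are assumptions for illustration; this is not Atlan's actual plugin.

```python
# Push pipeline freshness metadata from Airflow into a metadata workspace after each run.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

METADATA_API = "https://metadata.example.com/api/assets"  # hypothetical endpoint

def push_freshness_metadata(context):
    """Report the last successful run time for the asset this DAG produces."""
    payload = {
        "qualified_name": "snowflake://analytics.public.orders",  # asset the DAG updates (illustrative)
        "last_run_at": context["ts"],
        "dag_id": context["dag"].dag_id,
        "status": "success",
    }
    requests.post(METADATA_API, json=payload, timeout=10)

def load_orders():
    pass  # the actual ELT work would live here

with DAG(
    dag_id="load_orders",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_orders",
        python_callable=load_orders,
        on_success_callback=push_freshness_metadata,  # fires after a successful run
    )
```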
[00:22:48] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
And you mentioned that a lot of the core of Atlan is built on top of different open source capabilities. I'm wondering what were some of the main systems that helped you get to where you are today, and what were the pieces that you saw as missing in the broader open source community that you decided you needed to rebuild from scratch either because
[00:23:43] Unknown:
there was a solution that was adjacent to what you were trying to build and it wasn't quite sufficient, or there just wasn't anything that was available that sort of met the needs of what you were trying to build? I think some of the fundamental components have been great: Atlas, Ranger, for example. I think we had to modernize these tools a little bit. Right? Because they were fundamentally built in the Hadoop era, right? So they don't natively come with out of the box support for a lot of the modern data stack tooling. But if you actually think about the fundamental open source ecosystem, I think they're still far ahead of almost anything else that we see in the open source ecosystem right now. So I think those were actually some really good engineering technology choices that we made. Also, because we started building Atlan for ourselves, right, we didn't wanna rebuild anything that exists in the ecosystem. Like, we almost use engineering resources only to solve real problems, not to rebuild something that exists.
So, obviously, I think one piece was just definitely modernizing and building, you know, essentially making a lot of these fundamental platforms compatible with what is the modern data stack. I think that's where we invested a lot of energy. The other, I think, is from a quality perspective. So actually, there's, like, you know, automated data quality profiling and pieces like that. I think there, that's where the open architecture helps a lot. Because one of the things we've realized is that the reality of the fundamental technology layer in the data ecosystem is it's going to continue to evolve really, really quickly, right? So for example, we used to use Deequ, you know, open sourced by Amazon, as a data quality framework, but then, you know, we had been watching Great Expectations for a while. And, you know, I think they've done a fantastic job with, like, you know, being able to sort of drive, you know, high quality profiling at, like, a much lower cost. So within a month, we swapped out sort of Deequ with Great Expectations, and that meant 75% savings on profiling for our end customers.
And so again, there, like, we've tried to keep the ecosystem very, very open. We contribute back to all of these open source communities. And, again, the approach here has been, you know, we wanna solve for the end human experience layer. And as much as possible behind the scenes, we've kept the architecture super open, because we actually think that there will continue to be a lot more innovation in the open source ecosystem. Right? Like, you know, what's right for our end customers, which are data teams, is that they should not be locked in to an individual technology choice or an architecture choice. And that's really where I think, you know, the open architecture has helped a lot.
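For readers unfamiliar with the kind of declarative data quality profiling discussed here, below is a minimal sketch using Great Expectations' classic pandas-backed API. The example dataframe, column names, and checks are illustrative; they only show the general style of expectation-based profiling, not how Atlan wires it in.

```python
# Declarative data quality checks with Great Expectations (classic pandas API).
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [120.0, 35.5, 80.0, 10.0],
    "status": ["paid", "paid", "refunded", "paid"],
}))

# Each expectation replaces a chunk of hand-rolled profiling code.
checks = {
    "order_id not null": df.expect_column_values_to_not_be_null("order_id"),
    "order_id unique": df.expect_column_values_to_be_unique("order_id"),
    "amount >= 0": df.expect_column_values_to_be_between("amount", min_value=0),
    "status in set": df.expect_column_values_to_be_in_set("status", ["paid", "refunded"]),
}

for name, result in checks.items():
    print(name, "->", result.success)
```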
[00:26:02] Unknown:
In terms of some of the early assumptions that you had as you were starting to build out the system for your own use cases, and then began realizing that there was the opportunity for building a business around it, what were some of the ideas or assumptions that you had going into it that have been challenged or invalidated as you continue to build out the platform and bring on customers and learn more about some of the various ways that people are approaching their data problems?
[00:26:31] Unknown:
The interesting story is we actually tried building a tool like Atlan for ourselves three times, and we failed. And it was only the fourth time that we got it right. And so lots and lots of learnings to share about, you know, just early assumptions and, I think, the way to think about sort of tooling in this category. I think the first and the most early assumption that we made, and I think this is a very easy one to make as technologists, right, is to go after the hardest technology problem to solve, which is often not the right problem to solve in a lot of ways, right? So I remember the first time we, you know, started an internal project to solve for this problem.
We basically ended up building, I think, one of our most technically advanced systems even till date. It literally had, like, you know, an NLP-based search engine. If you searched for something like schools in Bangalore with, you know, 10 toilets or, you know, things like that, you would literally get, like, you know, only the list of rows for the question that you're searching for. A graph-based ecosystem. Like, we did all of this technology innovation with the product. And I think what we ended up realizing, and I think this is a constant reminder until date, I keep reminding, you know, both ourselves and the team, is that this wasn't a technology problem as much as it was a human problem. You know, things like search, for example, have actually been, like, commoditized in some ways. Right? Like, with Elasticsearch and Algolia, the challenge isn't really there. The challenge is how do you actually add context, and context is always going to lie with humans. Now if you think about the data ecosystem, like, the beautiful thing about data as a practice is that it lies at an intersection between business, which is fundamentally human driven, and technology. And so all of the context that lies with an individual data practitioner about users and entities and business, that is the context that is going to power amazing search and relevancy and all of those things that come with an amazing metadata management experience, right? And so we ended up then, you know, rebuilding a couple of times. We actually tried to buy a solution, which is, like, a very funny story for another time, and we failed at that miserably.
Then we, like, you know, built out actually an internal hack tool, which I think was, like, semi successful, but we never got users to use it on a daily basis. And I think this is the part that people miss. Right? Airbnb talked about this, I think, in the blog announcing their internal data portal. They talked about this concept of, you know, design can never be an afterthought when you are designing a data platform. And this is so true, because to make context real, context is a team sport. The context about the business use case of the data assets is gonna lie with the business user. The context about the technical pieces is gonna be with the data engineer. You know, business user, data engineer: fundamentally different people. Right? Like, the data engineer just wants to, like, sort of use APIs to push whatever information they have. The business user wants, like, a super easy sort of Dropbox Paper-like experience to add in their context. Like, very, very different people. But without both people loving the product, there's no way you're going to get shared context. And so I think the last time, and, you know, I think this was the time that we were successful and that laid the foundation of Atlan today, we actually started the other way around. We started with the user interface and prototypes. So we would basically run these Analysts Anonymous sessions in our own team. We would call the team together. They would talk about their problems. A data engineer would say something like, you know, I'm never gonna use that. And so we sort of started building, like, flows for each and every one of our personas internally, right? Like, what would make an analyst use this on a daily basis? What would make a data engineer use this on a daily basis? And we actually built from there. And I think that was ultimately the time that we were successful. And so I think, you know, on the early assumption of this being a really hard tech problem to solve: I think this is just a human collaboration problem. How do you help these very, very diverse humans come together and solve problems together? Digging more into the sort of business and process aspects of
[00:30:55] Unknown:
how people are using Atlan and the problems that Atlan solves for, what are some of the organizational practices and organizational processes that can and should be put in place to help people gain the most value from systems like Atlan and being able to have a single place to look at for all of the different assets that they're using and the data flows, and just some of the education that's necessary for being able to onboard all of the different user profiles into a platform like this?
[00:31:36] Unknown:
I think, if you believe that this is a human collaboration problem, then this has a lot more to do with culture than, you know, technology, and great products are just gonna help you, you know, drive a culture. Right? But I think the starting point is really gonna have to come from the team. So something that we typically recommend is, you know, making sure that the team really is bought in to the fact that this is a problem that the team wants to solve for. Right? I think that's really the first step. And the reality is that, you know, sometimes it might not be the highest priority problem. So for example, I've spoken to, like, sometimes three or four member data teams who are still setting up their BI tools and their BI reports and their dashboards. Right? And, you know, I think at that point, if you have, like, 20 tables and, you know, you're working with 4 teammates, you know, it's not that big of a challenge.
What we've seen is that typically the minute you get past, like, you know, 100 data assets, maybe more than that, you start hitting the 500 data assets mark, your data team has started growing, you're working with a large number of business stakeholders, that's when the problem becomes real, and it becomes real very, very quickly. Because there's this tipping point, as I call it, where, you know, you put a new data practitioner onto the team, and the team is not producing more outcomes as a result of the extra member on the team. Right? And a lot of that comes down to context and, you know, your overheads and pieces like that. So I think the first step is just making sure that the team is completely bought into the problem. And so, for example, something that we recommend to our customers, and we used to do this ourselves as a team, is these, like, start, stop, continue sessions, which is, you know, every quarter we would, as a team, say, you know, what are we doing really well? What should we stop doing?
What should we do better? What should we continue doing? And, you know, typically it would come down to, like, two or three priorities that we agree to as a team: these are the top two or three priorities we wanna solve for. And there's going to be a time when problems like this become a reality for the team. And when it does, that's when you start looking for a solution. Not before that, not after that, but at that exact point. I think from there, it really comes down to the approach of how you wanna solve this problem. Right? Like, do you wanna go the open source route and essentially invest engineering resources? Do you have the kind of engineering resources that it's going to take? Like, all of those questions. Do you wanna just buy off the shelf? There are all those questions that you have to answer, but I think that's the relatively easier part of the puzzle. And the final part is once you bring in a tool, how do you actually start driving adoption?
I think there, it's definitely important in the initial part of setting up a product like Atlan, or, you know, any other tool in this ecosystem, to ensure that you have someone who's driving it. So I've been thinking about this a lot, and we've been thinking about this concept of a data ops manager, who's actually slightly closer to, like, a sales ops manager than a dev ops manager in some ways, right? So if you think about, you know, actually, learnings from ecosystems like sales, where CRMs are real: like, in some ways, you know, data teams need a CRM. And that's, I think, where, like, sort of a role like that is important.
It doesn't need to be a full time position. It can just be somebody in the team who takes it on as an additional responsibility, but I think it's really important to have a directly responsible individual. Ideally, somebody who's been there long enough and has enough context about, you know, the overall functioning of the data team and the context of different business units and, you know, things like that. So I would definitely recommend making sure that there's someone in the team who can dedicate time in the first couple of months, where basically what you're trying to do is, like, you know, you have this context in Confluence and you have this context in, you know, your Airflow pipelines and all of these different places, and you're trying to bring it in and make sure you create that single source of truth in some ways. I think that's where it's heavy. I think from there, we've seen adoption kick off. You know, if you invest in the right tooling and users love it, you don't need to, you know, put in significant effort beyond the initial setup. And the question we try to ask is, like, for 80% of the data assets that your users are looking for, if they come on the platform, will they find the single source of truth? Maybe not 100% of the single source of truth; even if, like, they find 80% of their single source of truth, I think that's fine. And so as long as that question gets answered in your setup phase, then there are ways to implement this as a part of your sprints. There's a way of implementing this as a part of your sort of day to day data culture.
[00:36:03] Unknown:
Yeah. I really like the idea of having somebody who's sort of the internal salesperson to the rest of the organization to help them understand what is the pain point that they're feeling, and this is actually the solution for it that they sort of weren't able to put a name to yet. And then helping to sort of evangelize for the work that's being done by the data team and help to provide it as a solution and sort of treat the rest of the organization as a customer. Which is something that, you know, engineers are often encouraged to do, but often sort of lack either the motivation or capabilities to do because they're sort of heads down on building all the technical aspects? Yeah. So I think the
[00:36:40] Unknown:
ideal person in a role like that is typically someone who's maybe been an analyst but wants to move to the product side
[00:36:47] Unknown:
or, you know, someone who has great communication skills, understands data deeply, and has been in the shoes of the user. I think typically folks like that tend to be great fits in roles like this, from what we've seen. I think increasingly, like, if you think about a sales team, they have, like, the sales ops manager who's responsible for maintaining Salesforce and maintaining, like, Looker dashboards on top of it and making sure that, like, you know, there's hygiene across the team. Right? And so as our data stack continues to evolve, I think, you know, every team is gonna use, like, 5 or 6 tools at the least. Teams are gonna get larger. It's important. Like, someone needs to be managing the sprints. There's an element of that person could create a lot of leverage for the rest of the data organization if brought into the team at the right time. Probably I would say at the time that you're, what, maybe, like, a 10 member data team, I think that's when it's, like, a good sweet spot to bring on someone like that. Yeah, I've increasingly been thinking about, like, we've always actually had people ad hoc who've played that role. I played that role. I think there was never really a career path or a designation or any of that. So I've been thinking about what that would mean for the broader data ecosystem and how to create more and more evangelists, I guess, for the data team and the larger organization.
[00:37:59] Unknown:
Yeah. That plays into a conversation that I was having recently that I'm actually gonna be doing another episode about soon about sort of the continued segmentation of roles within the data ecosystem where, you know, at the beginning, it was, you know, you had the data scientist and then you had the sort of BI engineer, and then the organizations realized that they actually need a data engineer to support the data scientist. MLOps came out as sort of a middle ground between the 2, and now there's sort of data ops and data platform and, you know, continued fragmentation of roles and responsibilities as data continues to increase in the level of importance across such a wide variety of organizations.
[00:38:36] Unknown:
There's actually this great book called Analyzing the Analyzers, which talks about this concept of, actually, I mean, what is the true difference between, like, a data scientist and an analyst? And, like, there are some skills here and there, but, you know, there are also overlaps. This concept of actually, you know, looking at your skill sets across, like, three different dimensions to create the role that you fit most into in the org. I've been thinking a lot about that, but I completely agree it's nuts, right? I mean, like, it's ridiculous how many people call me and say, hey, we wanna hire a data scientist in our team. And I'm like, okay, like, how big is your data org at this point? And oftentimes, they realize that actually they don't need a data scientist, they just need an analyst. But, you know, I think there's just so much confusion and hype in the ecosystem that's been created. And so hopefully, I think, you know, as an industry, we'll create more frameworks and standardization around some of these in the next 3 to 5
[00:39:37] Unknown:
years. Yeah. Or the common problem of, we need a data scientist, and, okay, well, how many data engineers do you have to support them? The answer is none. And so in terms of Atlan, so somebody has decided, okay, this is what I need, I've got it set up. What are the steps to actually start using it as part of their day to day workflow? And as they continue to embed it in their broader data platform, what are the available points for being able to integrate new data sources, new sources of information about lineage and data quality? What are the available embedding points for bringing Atlan more fully into their existing systems?
[00:40:21] Unknown:
So I'll answer the first question first, which is around adoption and, you know, integrating the product into, like, daily workflows. So we see this in two phases of implementation. The first is what we think of as almost the enrichment phase, and this is to answer the question of, you know, the 20% critical data assets in an organization. Like, there's a good 80-20 rule when it comes to data assets mostly. And then, you know, on those 20% most useful, most used data assets, you know, will you find the single source of truth for 80% of them on Atlan? This is where I think there is a little bit of heavy lifting for the human context that comes in. Right? There's a bunch of things you can automate. Like, you know, for example, we do, like, frequency distributions so that someone doesn't need to write a column description. Like, there's a bunch of things you can do, but there is a human context here which is really critical.
And I think that's where the initial enrichment comes in, to make sure that the power users or, you know, the data analysts that have been around for the longest time take the time to sort of add that context into the product. That phase itself, it truly depends. Like, it depends on how the team is. Like, for example, we've had teams that just have such good documentation anyway that they end up, like, you know, setting up in, like, less than a week. And then, you know, in some cases, if you have to, like, do all of this from scratch, there are ways to run sprints and things like that. But I think that's really the enrichment piece. And I think the phase from there is adoption. I know traditionally in this category, people have struggled with adoption. I think in our case, because we spent a whole bunch of time thinking through user experience, that's actually been a relatively easy phase for us.
Atlan owns the sharing of a data asset. We've built these, you know, simple things, like, for example, Command-K and advanced search allow you to, like, get to your most relevant data asset in a couple of seconds instead of a couple of minutes. Or a Slack bot, so you can quickly just, you know, share a data asset directly into a Slack channel. So there are, like, these micro flows we've built in that help drive a lot of that adoption naturally inside an org. That's really the human aspect of adoption. The other is what I call the programmatic aspect of adoption, which is how do you integrate, again, because, you know, you sort of sit in the overarching layer, or the metadata management layer, across the data stack. It's really important that context should be integrated with all of the other tools. Because if you think about embedded collaboration, I talk about this all the time: you know, if someone's in Looker and they're looking for what a column name means, they shouldn't have to come to Atlan for it. Like, that should be available in Looker. And so that's really where the open API aspect of Atlan sort of kicks in. So we have a two-way open API. So one aspect of it is, you know, being able to take everything in Atlan and integrate it into the tools of choice. So this could be, you know, a specific Jira workflow, or this could be a specific, you know, BI tool that you wanna add context into. And so I think that's one aspect of integrating deeply into the workflows.
And the other is sending in information from internal tools, so quality frameworks, pipeline tools, into Atlan. And I think that's where, so for example, we have teams that run these custom data quality checks, and they want to, within their data asset profile, enrich it with those custom data quality metrics that they're already computing. So I think that's the second part: bringing in all of your context from other ecosystems into Atlan.
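As a purely hypothetical sketch of this "push context in" direction of a two-way API, the snippet below shows custom data quality metrics being posted onto an asset's profile. The base URL, authentication, payload shape, and asset identifier are all invented for illustration and are not Atlan's real API.

```python
# Push custom data quality metrics into a data asset's profile via a hypothetical REST API.
import requests

WORKSPACE_API = "https://workspace.example.com/api/v1"  # hypothetical base URL
API_TOKEN = "your-token-here"                            # hypothetical credential

asset_id = "analytics.public.orders"  # illustrative asset identifier
custom_metrics = {
    "row_count": 1_204_332,
    "null_fraction.customer_id": 0.0003,
    "freshness_lag_minutes": 42,
}

resp = requests.post(
    f"{WORKSPACE_API}/assets/{asset_id}/metrics",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"metrics": custom_metrics, "source": "custom_quality_checks"},
    timeout=10,
)
resp.raise_for_status()  # the asset profile now carries the externally computed metrics
```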
[00:43:46] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. As you have built out Atlan and used it internally and onboarded customers to the system, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:44:53] Unknown:
Yeah. Every day, I'm amazed by the kinds of use cases that teams start sort of using Atlan for. There are, obviously, the standard ones. Right? Like data discovery and things like that. I'm not gonna go into that. Some of my favorite ones: there was this data science team that actually started using Atlan as their data exploration layer, because their only other way of exploring their data assets ended up being spinning up a Spark cluster, setting up a Jupyter notebook, and writing some code to figure out, like, missing values and frequency distributions and things like that. So, you know, we're very proud of being able to save them, in their minds, like, they said, you know, 50% of their time on data exploration. So one element is definitely that phase of, you know, before you get started on a project, making sure you know everything you need to know about your data without human intervention. Another interesting one has been actually budget prioritization.
Another interesting one has actually been budget prioritization. This one was totally unexpected. Because we act as that layer on top of all of the data assets, we can actually tell what your most used data assets are, what your most unused data assets are, and which tools are being used the most versus which are not. And interestingly, we've seen teams use this as a way to prioritize and say, oh, this table gets used only once a week, so why are we spending on a pipeline that updates this data every 15 minutes? Right? So that's been an interesting one. A few others: there's a fintech startup that's using Atlan to propagate their confidentiality ratings across their ecosystem. I think that's an interesting one, because if you have column-level lineage automatically across your ecosystem, then when you mark one column as confidential, you can have all the downstream columns inherit that classification, and that can happen as part of a cloud data lake or cloud data warehouse initiative.
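As a rough illustration of that inheritance, propagating a tag along column-level lineage is essentially a breadth-first walk over downstream edges. The lineage graph and column names below are made up for illustration; a real catalog would expose this through its lineage API.

```python
# Illustrative sketch: propagate a confidentiality tag to every column
# derived from a source column, using a toy lineage graph.
from collections import deque

# column -> columns directly derived from it (assumed example data)
lineage = {
    "raw.users.ssn": ["staging.users.ssn_hash", "mart.kyc.ssn_last4"],
    "staging.users.ssn_hash": ["mart.risk.features"],
}

def propagate_tag(source_column: str, tag: str) -> dict:
    """Return {column: tag} for the source column and everything downstream."""
    tags = {source_column: tag}
    queue = deque([source_column])
    while queue:
        current = queue.popleft()
        for downstream in lineage.get(current, []):
            if downstream not in tags:
                tags[downstream] = tag
                queue.append(downstream)
    return tags

print(propagate_tag("raw.users.ssn", "confidential"))
```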
What's become interesting is that I think increasingly people are realizing that the modern data stack is incomplete without an overarching metadata solution. I think that's been an interesting trend that we're seeing a lot more of. In terms of your own experience
[00:47:06] Unknown:
of building the Atlan platform and growing the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:15] Unknown:
It keeps coming back to the same one, which is ensuring that you keep remembering, at least for us, that our North Star is that this is a human collaboration and human experience problem and not a fundamental technology infrastructure problem. It's always very easy as technologists to go after the really complex technical problem. But, for example, the other day I was speaking to an analyst who is in biz ops. They get assigned to different business units all the time. Every time that happens, they basically have to fill out this long form that they send to IT to get access to data. Typically, they have to say which datasets they need access to, which they don't know the exact names of. It then goes to a business owner who approves it. So 24 to 48 hours later, you get access to something, and most of the time it's not even the data you actually needed, because you mostly don't even know what you're looking for in the first place. And that is a human problem. That is a problem of how you have this analyst collaborate with IT and that business owner; the ideal solution here should be that the minute they're assigned to a different business unit to solve a different business problem, they should automatically be given rights to the data assets that that business owner has access to. Right? There are so many ways to think about solving this sort of collaboration problem, and I think we've just scratched the surface of what's possible there. So I think the challenging lesson has been to just remind ourselves constantly
[00:48:45] Unknown:
not to go after the shiny problem, but to go after the real user problem at the end of the day. As a brief tangent on what you were just saying about gaining access to the different assets owned by the various business units: each asset at the source level has a particular business unit that it's associated with, but as it propagates through the analytical systems, it's often going to be enriched with, merged with, or combined in some fashion with assets from other parts of the organization. And I'm wondering how you or some of your customers think about the way that influences the ownership, and the types of information that you want to maintain about who owns the derived data asset. Is it a combination of all of the different business units, where the biz ops person needs access to those third-level derived assets, or is it something where they're primarily concerned with the source systems that people are trying to build analysis from?
[00:49:46] Unknown:
I think teams are just starting to figure this out, in some ways. I think the reality for most data teams today is that there isn't a clear structure in place: you know, if this happens, this is who's going to own the data versus that. It does tend to be human driven. Right? There's this one data engineer who's been in the company for, like, 5 years, and so he tends to be the person that everybody pings for all of the issues. I mean, the reality is that today it is still a human problem, and every team is very, very different. We're starting to see more approaches emerging. So there are questions like, how do you think about centralization versus decentralization?
Can you think about hub and spoke models? I think there are more and more approaches evolving, and we've seen all of them. We've seen a fully centralized data platform team that owns everything business critical. We've seen a fully decentralized model where, for example, you have data analysts in the marketing team who own all of the marketing data. And we've seen approaches in the middle as well. I think what that translates to from a product perspective, going back to that principle, is that data teams are diverse, not just in the people, but also in the way that they're structured, which means that we need to be able to support all of these structures, and also the fact that they will evolve over time. I don't expect data teams to have one fixed structure, because the reality is that all of these things are evolving over time, and you have to do what's best for the team and the company and not just stick to some governance model that was created by some management consultant that said this is how we should structure the ownership of our data assets. Right? So, yeah, I wish I had a clearer
[00:51:24] Unknown:
answer for this, but I think the reality is that most data teams are still figuring this out. As people are starting to build out their platforms and looking for this sort of unifying layer to gain visibility into their assets across the overall business and across their life cycles, what are the cases where Atlan is the wrong choice?
[00:51:44] Unknown:
So, a couple of different cases where Atlan is the wrong choice. First, if on-prem is a part of your stack, we are not the right choice. We are fully cloud first, cloud native, and we only support cloud-first, cloud-native stacks. So if there's an on-prem component of the stack, we're just not the right fit. The other is that sometimes we see the starting point for these data discovery and metadata management kinds of problems being compliance as a use case. Right? So, engines to drive policy and compliance, and that lens is what's driving a lot of the initiative. Again, in that case as well, we are the wrong choice.
We're built a lot more for the value side of things: how do you help your team collaborate and get the most value out of your data? And so I think those are the two cases where we would definitely be the wrong choice. As you continue to
[00:52:34] Unknown:
iterate on the product and build out the business and learn more from the customers that you're working with, what are some of the plans that you have for the near to medium term? I think we've still only scratched the surface of what is really possible if you think about the future of collaboration.
[00:52:49] Unknown:
I think we have, as an industry, just started figuring out what collaboration really means in the data ecosystem. So the way we think about this, from a product perspective, is obviously a lot more intelligence. This is actually a funny one, right? We are all data scientists and data practitioners; we spend our lives creating intelligence for business problems. But when it comes to our own lives, there's almost nothing.
Our approach to this is slightly different. We actually think that there will not be one overarching machine learning algorithm that just solves for context across all business use cases and teams and domains and industries. Going back to that concept of context: context is going to be different in different use cases. And so our approach has been to almost make this a platform. Think of what Slack bots are, in some ways; we've tried to bring that same ecosystem. How do you make it really, really easy for different teams to actually build intelligence bots on top of Atlan for classifying and tagging and organizing and building context on top of their data assets? Another aspect of this is that we're actually adding a layer of plugins to the ecosystem. What that will allow teams to do is, tomorrow, if there is a new data asset that they want to integrate Atlan with, or they want to build a custom workflow for the way that Atlan should work with their Jira or their ServiceNow or a custom Slack bot, they can. Sort of like what npm did for coding, we're opening that platform up, which hopefully means the community starts building out a whole bunch of templates that can help other data teams around the world. So I think those are some of the things that we're going a lot deeper into in the short term.
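To illustrate the kind of bot this describes, here is a small sketch that scans column names and proposes classification tags. The matching rules and the function interface are assumptions for illustration, not a real Atlan plugin API.

```python
# Illustrative "classification bot": suggest PII tags from column names.
# Patterns and interface are made up for the example.
import re

PII_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "ssn": re.compile(r"\bssn\b|social[-_]?security", re.IGNORECASE),
}

def classify_columns(columns: list[str]) -> dict:
    """Map each column name to the PII tags its name suggests."""
    suggestions = {}
    for column in columns:
        tags = [tag for tag, pattern in PII_PATTERNS.items() if pattern.search(column)]
        if tags:
            suggestions[column] = tags
    return suggestions

# Example run against a table's column names
print(classify_columns(["customer_email", "phone_number", "order_total", "ssn"]))
```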
[00:54:46] Unknown:
Are there any other aspects of the work that you're doing at Atlan or the overall space of data discovery and metadata management and sort of team unification
[00:54:56] Unknown:
in the data ecosystem that we didn't discuss yet that you'd like to cover before we close out the show? I think the one thing that's important for teams to figure out is the right timing. One mistake I've seen teams make is investing too late, but the other mistake I've seen teams make is investing too early. So I think it's really important for data team leaders
[00:55:18] Unknown:
to realize when they're going to get to that inflection point and start investing in tools to make the team more agile maybe 3 to 6 months prior to that. So, just the importance of timing in initiatives like this. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I might be biased with this answer,
[00:55:47] Unknown:
since I'm devoting my life to solving, hopefully, one of these problems. I think there are two layers. One I definitely see as the human collaboration layer, and I think that's really a user experience problem: how do you bring the principles of what Superhuman has done and Notion has done and Slack has done into the day-to-day lives of data practitioners? And the other, which I touched on briefly, is the intelligence layer. I think we're all great at applying intelligence to everything else except our own daily lives, so I'm hoping to see a lot more innovation in intelligence for data management.
[00:56:20] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Atlan. It's definitely a very interesting project and something that is absolutely necessary to help more data teams unify around the problems that they're trying to solve and help the businesses that they're working in realize more value from their data assets. So I appreciate all the time and energy that you've put into that, and I hope you enjoy the rest of your day. Yeah. Thank you so much for having us. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Prukalpa Sankar and Atlan
Building India's National Data Platform
Challenges in Data Management
Core Principles of Atlan
Integration with Existing Data Tools
Metadata Management and Data Governance
Understanding Data Assets
Architectural Core of Atlan
Lessons Learned and Early Assumptions
Organizational Practices for Effective Data Management
Segmentation of Roles in Data Teams
Innovative Uses of Atlan
Challenges and Lessons in Building Atlan
Ownership and Derived Data Assets
When Atlan is the Wrong Choice
Future Plans for Atlan
Timing and Investment in Data Management Tools
Biggest Gaps in Data Management Tooling