Summary
Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increase, the difficulty of ensuring that everyone has the knowledge they need to get their work done grows exponentially. Wikis and intranets are a common attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Your host is Tobias Macey and today I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what AlignAI is and the story behind it?
- What are the core problems that you are focused on addressing?
- What are the tactical ways that you are working to solve those problems?
- What are some of the common and avoidable ways that analytics/AI projects go wrong?
- What are some of the ways that organizational scale and complexity impacts their ability to execute on data and AI projects?
- What are the ways that incomplete/unevenly distributed knowledge manifests in project design and execution?
- Can you describe the design and implementation of the AlignAI platform?
- How have the goals and implementation of the product changed since you first started working on it?
- What is the workflow at the individual and organizational level for businesses that are using AlignAI?
- One of the perennial challenges with knowledge sharing in an organization is managing incentives to engage with the available material. What are some of the ways that you are working to integrate the creation and distribution of institutional knowledge into employees' day-to-day work?
- What are the most interesting, innovative, or unexpected ways that you have seen AlignAI used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AlignAI?
- When is AlignAI the wrong choice?
- What do you have planned for the future of AlignAI?
Contact Info
- @RehganAvon on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- AlignAI
- Sharepoint
- Confluence
- GitHub
- Canva
- Instructional Design
- Notion
- Coda
- Waterfall Design
- dbt
- Alteryx
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png) Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit [dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo) to learn more.
- Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png) Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to [dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan) and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg) Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: [dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode) today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a-t-l-a-n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data. So, Rehgan, can you start by introducing yourself?
[00:01:40] Unknown:
Yeah. Thanks so much for having me on your show. I'm Rehgan. I am cofounder and CEO of AlignAI, based out of Ohio. Also founder of Women in Analytics, and I have been in the space for the last 10 years. Super passionate about building technology and communities in the analytics, data science, and data space.
[00:02:00] Unknown:
Do you remember how you first got started working in data?
[00:02:03] Unknown:
Yeah. So I went to Ohio State. I was an engineering major there, and I actually studied industrial systems engineering. They had a specialization in analytics basically for that major. And so they were trying to combine the stats departments with computer science and some of the engineering programs and so they created this minor which is essentially just a computer science minor, but it focused on kind of data mining and elements that maybe are a little more specific to the data science and analytics realm. And so I got my very first internship at a start up in Austin, Texas around data where I was actually just troubleshooting SQL queries.
And it was, like, my first time trying to learn SQL, and it was actually 1 of the best experiences, trying to find errors in other people's SQL code. So it was more of a QA position. And then my first job out of college was as a data engineer, building data pipelines for the back end of a software product. So that was kind of how I got my start. I just became completely fascinated with the data space. At the time, the big trend was, like, around big data and Hadoop and distributed systems for data. So that's really where I got my start.
[00:03:21] Unknown:
In terms of what you're building at AlignAI, can you give a bit of an overview about what your goals are and your focus and some of the story behind how it got started and why you decided that this was the problem you wanted to focus your time and energy on?
[00:03:35] Unknown:
I saw a lot of these different kind of interdisciplinary areas having lots of crossover at organizations who are trying to mature their capabilities in the space. So you had your kind of data functions: data management and organization, data pipeline development and deployment management, data quality initiatives around that, more observability type functions. And then you have these intelligence layers on top of that, which is like analytics engineering and metrics and, you know, data science models, dashboards, all of these things that kinda sat on top of data and were fed data to ultimately make value for the organization.
So whether that was automation or, you know, business intelligence or some sort of decisioning system that actually drove value for the business. And so you have all of these different areas intersecting that are trying to coordinate, basically, on business value, on intelligence to drive important decisions, and then ultimately the data that feeds all of that. And from what I observed working as a practitioner, and then also as, you know, a software vendor embedded in a bunch of different customers of ours, I saw a lot of these same patterns happening over and over again. The coordination between those different functions kept breaking down. So the time it took, you know, a team to respond to quality issues that were noticed by operations or the business, the time it took to iterate on a model enough to the point where it's actually useful for the organization.
Some of the standards that were a little bit of the wild west in the last, like, 5 years are now starting to solidify, but no company had a playbook on any of this. And there were usually a couple of folks that knew how to do those standards inside of the company, but the rest of the organization was struggling to get, you know, mature enough to be able to do that as well. And it's just a different transformation process than it was with, like, software development and DevOps and things like that. So we noticed this time and time again. It was always the number 1 reason that companies couldn't move quickly and couldn't actually build things in a meaningful way with data. It was never the tooling or the technology or, you know, the ability of individuals on the teams. It was just this lack of standards, lack of coordination. And so that's why we started AlignAI. We wanted to get all of these teams aligned, have clear definitions of handoff points, have an entire playbook of standards the company could maintain easily and could also actually use, instead of these really hideous kind of intranets and unorganized Confluence pages and things that just aren't practical to maintain, keep updated, and ultimately utilize.
[00:06:32] Unknown:
I have definitely been privy to some pretty unfortunate organizational practices around company intranets, and trying to find anything is an exercise in futility. And so as far as the core problems that you're trying to address, you mentioned these challenges around teams who are trying to do something with data. They've got people who know what they're doing. They've got good tooling. They've got good platforms, and yet they're still not able to succeed or bring their ideas to fruition. And I'm curious, what are some of the ways that you're trying to address those core issues with the AlignAI platform and some of the tactical elements of the work that you're doing to help solve those problems?
[00:07:12] Unknown:
Yeah. I think there's always kind of 1 of 2 ways that people are trying to do this today, and we're kind of fitting into both at the same time. So first is these, like, giant initiatives where they have, like, learning programs where you're trying to get everyone basically up to speed on best practices around data management, to use that as an example. And, you know, you've got lineage and metadata capture and data observability platforms and all of these core elements of tracking quality and stewardship programs. So you've got these kind of big bang type of initiatives that happen. And then on the other side of the coin, there's more of these, like, checklist type approaches. So you've got a couple of people who are like, you know, here are the core fundamentals in terms of capabilities that we need to have in place, that there need to be people accountable and responsible for in the organization. And every time we deploy a pipeline, you know, we have to make sure we go through this set of standards.
And every time we troubleshoot for quality issues, you know, here are the different approaches that we've adopted for doing that. And so as AlignAI, we're trying to fit into both of those workflows in a way that is more seamless, because what's happening today is people are building out this knowledge hub from scratch. So they're using, in some cases, PowerPoint, you know, SharePoint, Confluence, GitHub. It's kind of all over the place, which is good and bad. You know, it's closer to examples. It's closer to data. It's closer to use cases, which is important. Or, on the other side, it's way too generic or way too general and not applicable to what people are doing. And so we're trying to meet that in the middle. So tactically, what that looks like is, if you think of, like, Canva: as a designer, you can go in and grab all of these templates, and you can basically start at, like, 50 to 60% of the way there. And it has all of the core elements that you need.
And there's always specific things you wanna tailor or customize to what you're trying to do, your brand colors, your, you know, language, your fonts. Right? Your logo, iconography, whatever. It's the same kind of concept for some of these standards. Like, there are industry best practices out there. They are very general, and so it's hard to make those customized to the organization. But if we could get you 60 or 70% of the way there, now it's just that last 30 to 40% of kind of pointing to the right references and tweaking the workflows and tweaking the terminology that you have to do and maintain.
And it becomes really applicable. You know, we're trying to design it so that it's in the flow of work. So it's not this, like, massive playbook that you never reference. So it's very practical, and that's kind of been our approach so far.
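To make that template idea concrete, here is a minimal sketch, with entirely hypothetical names and fields, of what starting from a generic industry standard and layering on organization-specific terminology and references might look like:

```python
# A toy illustration of "templates get you 60-70% there, configure the
# rest." Everything here (fields, URLs, terminology) is made up.
from copy import deepcopy

# A generic, industry-standard module: the starting point.
TEMPLATE = {
    "capability": "data_quality",
    "steps": [
        "define quality checks for each pipeline",
        "alert the owning team on failure",
        "record the incident and resolution",
    ],
    "terminology": {"dataset": "dataset", "owner": "owner"},
    "references": [],
}

def customize(template: dict, overrides: dict) -> dict:
    """Overlay org-specific terminology and tech-stack references
    (the remaining 30-40%) onto a generic template."""
    standard = deepcopy(template)
    standard["terminology"].update(overrides.get("terminology", {}))
    standard["references"].extend(overrides.get("references", []))
    return standard

# Org-specific tweaks: local vocabulary plus links into the actual stack.
acme_standard = customize(TEMPLATE, {
    "terminology": {"dataset": "data product", "owner": "steward"},
    "references": ["https://wiki.example.com/dq-runbook"],
})
print(acme_standard["terminology"]["dataset"])  # -> data product
```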
[00:10:09] Unknown:
The kind of core of what you're doing sounds like it's oriented around this premise of knowledge sharing and making sure that people have the information that they need when they need to access it. And as you mentioned, intranets have been a way that people try to build up these knowledge bases and do this knowledge sharing, but they seem to invariably go wrong in some fashion. And I'm wondering what are some of the strategies that you're incorporating in Align AI to make that kind of knowledge capture and knowledge distribution a more kind of uniform experience? Because it seems like the problem that generally happens is that you have to go to the knowledge platform to add anything or retrieve anything, and first, you have to know that it even exists.
Whereas if you're in the middle of doing some complicated, you know, machine learning model or analysis, you don't want to have to context switch and say, oh, where was that thing? Now I have to spend the next 30 minutes digging through to find it, and then I forgot what I was doing. Like, what are some of the ways that you're trying to address some of that existing friction and bring the kind of knowledge element closer to the work that's being done?
[00:11:17] Unknown:
I could probably drone on about this topic for, like, hours, so I'll try to keep it a short response. We look at it from 3 different fronts. The first is the creator or the contributor, the knowledge contributor. Right? So I've got information. I wanna make a tweak. I wanna make an adjustment. I'm setting a standard. I'm updating a standard. Like, that whole flow needs to be fairly seamless and needs to be formatted in a way that is optimal for whoever is retrieving that information. So what we're doing from that approach is essentially incorporating instructional design best practices into the format of this content. So we're basically saying, you know, when you create a document, when you open up a Confluence page, like, you're deciding, okay, a description goes here, some bullet points go there, a screenshot goes here, you know, whatever it is. And you're doing it without thinking, but you're trying to optimize for the reader, at least in some way, shape, or form.
So I know when somebody comes in here, I can hyperlink this, I can click from it to that, and it'll make sense. There are general flows that a consumer is going to experience when they read it. And so we're anticipating all of that and basically baking it into the format so that the creator doesn't even have to think about it, which I think is very, very important and 1 of the big time savers that we're trying to enable inside of the tool. So from the creator perspective, we also don't wanna dictate where it goes. Like, we know that there are existing documentation platforms today, and they're general on purpose because that's what the company is using, whether it is SharePoint or Confluence or some of the newer tools like Notion or Coda.
And, you know, we don't wanna deviate people's workflows away from that either. So our goal is to kind of integrate into some of those systems so that that retrieval experience is easier. And a lot of those systems are heavily investing in search capabilities and querying capabilities so that the consumer experience is better. And so, you know, we're still kind of walking a very fine line there on integration versus bringing our own, because in some cases, companies don't want to even try to fit it into existing documentation paradigms, which is also a reasonable approach.
And so, you know, that's kind of from the consumer side as well, but they're very related. And in terms of, you know, referencing in the flow of work or learning in the flow of work or consuming in the flow of work, I think there's a lot of different approaches to that. Today, it is more of a retrieval workflow or motion. But in the future, we wanna be able to have that information readily available for individuals who are in their Databricks environment, in their SQL environment, their Python, whatever they are doing. We want them to be able to understand what the set of standards is associated with that workload. And so right now, we kind of have a one-way reference system where we will reference out to the tech stack: examples or use cases that people can look at, demo environments, things like that. In the future, it's definitely a problem that we need to tackle in terms of getting people the right information at the right time. And so to wrap it up, the 3rd view is this macro view, which is, like, how effective are these standards at a macro level of the organization?
What kind of ROI do we receive as a company from everybody adhering to these best practices or these standards we've put in place? And that's another element that we absolutely wanna provide the organization visibility into, something they have 0 visibility into today.
[00:15:04] Unknown:
In terms of the specific actions of building analytics and machine learning projects, what are some of the ways that this lack of information or organizational awareness can impact the effectiveness or the success rate of those projects and just some of the ways that they go wrong because of the fact that there is incomplete information at the point of execution?
[00:15:28] Unknown:
Yeah. I mean, we have spent a lot of time root-causing a lot of the problems that people are experiencing in these ecosystems as they're trying to make, you know, significant efforts to mature. I'll be specific in this response because I think it's helpful to kind of ground it in a specific example. So, you know, as an organization, if I'm trying to enable more individuals to interact with data, which is what we call kind of that self-service type of environment, you've got different layers of what that means. So you have very technical folks who wanna access raw data. Maybe they're using that for machine learning development purposes. You've got data engineers who wanna access raw data. They wanna generate some sort of metric or pipeline.
But then you have these higher level individuals who are looking for insights. They're looking at interfacing with the metrics layer so they can build dashboards or reports. They're looking, even in some cases, at Q&A interfaces with data, where I can ask a question of the data and it can give me some sort of insight or response or answer quickly. All of these elements require a lot of thought about how data is curated, how quality is monitored, and how metrics are defined, or what that interface level looks like. And those are standards that the company needs to put in place. Like, that doesn't just happen automatically. You have to intentionally design systems to do that.
And so what we've seen is, if they have a really good set of standards, like a process of stewardship where people are actually tagging data appropriately with really good definitions, and there's kind of metadata associated with it and a full lineage view of the data, then those self-service environments become more of a reality, because I can actually understand where the data came from and I can understand the context of it. I can understand how other people have used it. And so without those core capabilities or fundamentals or standards, I can't do any of those things. And so you end up getting into this, like, horrible cycle of asking the same person to run a query for you, slightly different, every 2 days. You know? It's just, like, these things that frustrate the crap out of everybody, but it's because they haven't intentionally designed the system with a set of standards that will enable more of that self-service ecosystem. And this is not just true for data management environments. It's also true for, like, machine learning development, deployment, and management.
It's true for dashboard development, deployment, and management. That's kind of how we've approached this. Like, every time we've looked at some of those frustrations or inefficiencies or quality issues people are experiencing with other solutions or data, it always runs back to the fact that there is a complete lack of standards and process to curate and maintain these assets at the company.
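As a rough illustration of the stewardship standard described here, this is a toy sketch, with hypothetical names and fields, of a catalog entry that carries a definition, tags, and lineage so that self-service discovery becomes possible:

```python
# A toy catalog: every dataset gets a definition, tags, and lineage,
# which is what makes self-service discovery possible. All names and
# fields are illustrative, not a real catalog API.
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str
    definition: str                                    # steward-maintained business definition
    tags: list[str]                                    # e.g. domain, sensitivity
    upstream: list[str] = field(default_factory=list)  # lineage: where the data came from

CATALOG: dict[str, DatasetEntry] = {}

def register(entry: DatasetEntry) -> None:
    """The standard: nothing is published without a definition."""
    if not entry.definition:
        raise ValueError(f"{entry.name}: a definition is required by the stewardship standard")
    CATALOG[entry.name] = entry

def discover(tag: str) -> list[str]:
    """Self-service search by tag, instead of asking the key person."""
    return [e.name for e in CATALOG.values() if tag in e.tags]

register(DatasetEntry(
    name="orders_daily",
    definition="One row per order per day, deduplicated on order_id.",
    tags=["sales"],
    upstream=["raw_orders"],
))
print(discover("sales"))  # -> ['orders_daily']
```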
[00:18:24] Unknown:
1 of the common themes when it comes to any form of documentation, but particularly when working with data, is that when you have a small enough group, like if you're at a company that's just starting off, you have 3 people all sitting in a room together, you don't really have the incentive to spend a lot of time focused on knowledge capture because you can just ask the person sitting next to you. Everybody has the whole system in their head. And then at a certain point you tip into the space where 1 person has most of the system in their head, but nobody else knows that they know that or what they know. And I'm curious what you see as some of the symptoms that suggest that you've kind of crossed over that divide of we really need to have a holistic knowledge management kind of protocol for being able to make sure that everybody knows what they need to know and that there isn't any 1 person who's the bottleneck in getting things done.
And just some of the ways that you've seen companies effectively be proactive about that so that they don't all of a sudden find themselves kind of out in the cold of, we used to be really effective, and then we hired 5 more people, and now everything's broken.
[00:19:32] Unknown:
Yeah. This is literally the same story, different company, every time we go in. It's like the same thing over and over again, which is exactly what you described. We call it the key person problem. Right? You've got the person who has all the knowledge of where everything sits and where everything lives and all the nuances and how to access this and what that means. And, you know, well, it's not an automated process, so you've gotta do this and you gotta go to this tool. I mean, it's just a nightmare. And so I think the way you get out of that is by starting really early. The problem that we see with a lot of organizations is, a, nobody likes standards and nobody likes documentation, period. Like, we're definitely not tackling a super sexy area of the space, but a necessary 1. And so today it's like, hey, I'm gonna spend all of this time documenting this process, which I know is gonna change in 2 days. We want to make that process less painful, where it's, I just need to configure a couple of things, and as I make improvements to the system, that improvement process is less painful. Because today, it's just a ton of text, typically bullet-pointed, or a bunch of Lucidchart diagrams that are referencing a ton of different parts of the ecosystem.
And the daunting part is making the update. Like, I can sit down and document for, you know, 3 days out of the month, but the daunting part's figuring out what changed, and how do I make that update and keep it up to date so that anybody going through that standard can be onboarded really quickly. I mean, it really should be formatted in a way that's super consumable, very interactive, and references kind of the latest tech stack so people can get to doing their job much faster. And so what we've seen is the people that start early do a great job because they're dedicating time towards making updates to it. And they started early enough where they didn't get into this, like, depth of complexity that they have to try to unwind to explain to somebody else.
And so, like, these giant initiatives where we have to have everything figured out before we go back and document it are the wrong approach. It's the wrong mentality. It is supposed to be super iterative. It is supposed to morph and change as the standard changes, the tech stack changes, and so on. And then, as you mentioned, when teams grow, you experience different pains at scale than you would have beforehand. So if we have 3 pipelines in production that we're managing, and then we hire a team of 5 engineers, and now we've got, you know, 20 or 25 that we're managing, the set of standards you need for 5 pipelines versus 25 is different.
And it's not just because there's more people; the system is different. Like, the type of monitoring you need is different. And I think that's where people start to screw up: they're waiting until there's this scale tipping point, and by then, they have to try to untangle all of the things that they've done.
[00:22:51] Unknown:
Your comment that the teams who start early on documenting their practices tend to do better brings to mind the principle of test-driven design, where if you write your code in a way that is easily testable, it makes it easier to compose the logic. It makes it easier to understand the logic because you're naturally going to break it down into smaller pieces so that it's easier to test. And this seems like the documentation-driven analog to that, where if you are writing down the steps to do something, it's going to force you to think about how you're doing it. And as you evolve the system, you want to do it in a way that's easier to document, rather than just having something grow organically and then, after the fact, having to say, how on earth did this monstrosity come to be?
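For a concrete version of that analogy, here is a minimal sketch, with illustrative names only, of how writing a pipeline step to be testable forces the same small, documentable decomposition:

```python
# A tiny, testable pipeline step. The decomposition that makes it easy
# to test is the same one that makes it easy to document: one behavior,
# one short entry in the standard. Function names are hypothetical.

def normalize_country(code: str) -> str:
    """One documented behavior: trim, uppercase, and map known aliases."""
    aliases = {"UK": "GB"}
    code = code.strip().upper()
    return aliases.get(code, code)

def test_normalize_country():
    # The test doubles as documentation of the expected behavior.
    assert normalize_country(" uk ") == "GB"
    assert normalize_country("us") == "US"

test_normalize_country()
```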
[00:23:37] Unknown:
Right. That's exactly right. And, like, where we are now in terms of this industry, we have solidified on standards pretty well. It's not like people are coming out with revolutionary new ways; like, we'll get a couple of paradigm shifts here and there on how people are approaching data management and data engineering practices, but, you know, there hasn't been 1 that's, like, absolutely fundamentally changed everything. And I say that because there's a set of principles that we're starting from. And why should people have to recreate all of those principles first, and then also all of the elements that are, you know, specific to that company and their tech stack and their data and their use cases?
That's a lot of work. And so I think that's the area we're trying to solve for. It's like, can we get you most of the way there and then just make it easy for you to configure and reference?
[00:24:33] Unknown:
I'm also interested in if there are any kind of commonalities in the failure modes that you see for teams who don't have a cohesive or kind of evenly distributed awareness of the system and how it operates and some of the ways that that incomplete knowledge or understanding of what has been done translates into a lack of understanding of what can be done, where maybe they are missing out on opportunities to leverage the data that they have because they don't know that they have it or what it semantically means or ways that maybe teams will kind of be overambitious and think that they have more capability than they actually do or that they're going to be able to deliver because they don't understand the true complexity of what they're trying to build?
[00:25:21] Unknown:
Yeah. I think the biggest trend that we've seen is that there are a few different phases in which these particular challenges come to fruition. So the earlier ones are, like, kind of, we don't know what we don't know, and they get a little bit stuck in the cycle of, okay, well, what are the best practices? And, you know, I'll give you an example of this. Like, if we have a team of, like, a couple of data engineers, does it make sense for us to be building customized pipelines in Python? Or should we be using some of these, like, automation tools that kind of manage pipelines for us? Or, like, when should we pull the trigger on something like dbt? You know, there's all of these questions that they have in terms of the best approach considering where they're at.
And I think they get kind of hung up on that, spending a lot of time trying to design that system, but they also, at the same time, have to manage operations of what's been built and what is being used. And I think that's where people kind of break down a little bit. It's the whole, like, changing the engine while the plane is flying. All of these things that you've built and that are being actively used are now business critical functions, and you're also trying to make fundamental architectural decisions and improvements. And so that's kind of like the earlier stages. And then the later ones are, like, when you start expanding your team and growing. And then it's, like, who's on first and who's responsible for what. And that might sound like an oversimplification of, like, a very complex problem, but it really is, like, a poor definition of handoff points between teams and who's responsible for which part of the ecosystem.
And I think that's where those teams start to break down a lot. And an example of that would be, you know, do we have, like, data quality monitoring and who is responsible when something happens or goes down and who's responsible for digging in and troubleshooting? Like, you can't have everyone doing everything anymore, whereas that's what you used to do. So you start getting into these more defined functions and therefore, dependencies between teams and managing all of that. And then the bigger ones, which are, like, kind of at the enterprise level of making fundamental improvements. Now you've got kind of hub and spoke models where you've got a centralized group and then you've got all these embedded groups across the organization and trying to keep all of those teams rowing in the same direction. Like, you make a massive change to your metadata structure or to the tools you're using to manage metadata.
Now you've gotta roll out all of those changes to, like, 40 or 50 different people. And that process, we've seen, can take, like, 12 to 18 months in some cases, because they just don't have a coordinated effort of doing it. It isn't well defined to begin with. And they're leaning really heavily on the tooling to drive process, which is just never effective. You know, if they do get a new tool in, they'll be like, well, we have the tooling vendor who's gonna train all of our people. And vendors typically like to be agnostic on opinions because their tool is highly configurable, and so, you know, they don't wanna do solutions engineering work usually. Some of the bigger vendors do, but, like, then you run into those problems. And so I just think, you know, as the growth phases happen, there's just not a lot of thought that goes into what challenges come with that growth of that function.
[00:28:51] Unknown:
Yeah. It's definitely amazing seeing some of the ways that, if you don't have somebody who has a strong opinion about things, then it becomes easy for everyone else to just start flailing around, because there's no kind of true north star about: this is what we're doing, this is how we're going to do it, and if you don't like it, then, well, you're just gonna have to like it. Because otherwise it becomes designed by committee, and nobody wants to be the person to kind of put out the hard line of this is how it should be done, because they don't want to step on anybody's toes, and so then you end up with a monstrosity that doesn't do what it was supposed to do.
[00:29:27] Unknown:
Totally. And change is so hard that they feel this insane amount of pressure to get it right the first time because they know that they're gonna have to roll out the change and so they're like, I'm only gonna do this once and it's like a big bang approach as opposed to these, like, more incremental adjustments to the ecosystem.
[00:29:47] Unknown:
Absolutely. Yeah. We seem to keep finding new ways to reinvent waterfall design approaches. Totally. It's so crazy. And so digging into AlignAI specifically, I'm wondering if you can talk through some of the design and implementation of the platform that you're building and some of the ways that the overall goal and focus of the project have changed since you first started working on it?
[00:30:11] Unknown:
Yeah. So implementation sometimes pairs nicely with, like, bigger initiatives happening at the company. So if they're doing a big catalog rollout, or if they're expanding their team quite a bit and trying to get people onboarded really quickly, like, that could be a trigger point for someone to get the tool in and, you know, start working effectively immediately. Whereas the other approach is more of a passive approach, where you've got kind of a process in flight and you wanna get it all situated and documented, and also do kind of a health check on your existing process: are we missing anything fundamental?
How should we be prioritizing improvements in the future, and so on? So that's usually how we like to get started. So there's kind of a 3 phase approach. 1 is, okay, what capability are we focused on? So are we focused on data enablement? Are we focused on data quality, data stewardship? Kind of going in with a very specific focus, or, you know, model ops or data ops. And then the second is getting the workflow configured. So they take the appropriate templates from the list under that capability. So for data ops, we have a bunch of different what we call modules, and they can grab whichever modules are applicable to what they're currently doing, and they can take those workflows and kind of opt in or opt out of whatever is applicable to them. So, like, yes, we are doing data quality monitoring in an automated fashion, and so this module is gonna be applicable to us.
And then the next part after that is configuring all of the examples. So can you point to the tech stack? Can you point to specific use cases that demonstrate an example of that idea or concept? So 1 specific example could be: all right, in data ops, we are monitoring quality of this pipeline. And if an alert goes off, here's the system that we go into to troubleshoot that. So here's what we look at, and here's why we look at that. And here are the different paths that we can go down if there is an issue. Like, here are the different things that we've seen typically be the root cause of that, and here are the different systems you can go to to continue troubleshooting. And so we start getting into the nitty gritty of referencing out to the stack so that there are very tangible ways to demonstrate those concepts.
So that's usually what an implementation process would look like: starting high level, then getting towards the workflow level, and then getting towards the specific examples. Then you publish it, you make it available, and you have somebody kind of run through the program. Some organizations like to do kind of a bulk completion of running through a program, and some of them like to do it in small chunks and increments. That's what we suggest and recommend. I think what has changed since we started working on this: you know, initially we were taking a much heavier learning approach to solving this problem. So, like, your typical kind of training and educational approaches to solving the problem, and we explored that space a lot. And what we found was, number 1, a lot of things out there are super generic and hard to apply to people's day to day. So the things out there that are highly available are just not very transferable.
And number 2, nobody ever sets aside time for learning. In fact, most companies see it as just, like, nice to have, especially when the economy heads in the direction that it's heading. It's like the very first thing that people cut out is this, like, you know, learning element, which is interesting. And so we've kind of taken more of a harsh pivot into kind of documentation with subtle hints of learning elements incorporated because we don't think about it that way. But when you read documentation, you know, what are you doing? You're consuming information. You're learning about a system, and then you're using that to go make a decision. And so we've kind of pivoted a little harder into that direction. However, a lot of the core learning elements that make that process efficient are still incorporated in the way we're designing the product. So I'd say we've kind of floated around all of these different domains quite a bit just to see how organizations think about this today and what is gonna be the most natural way to integrate this new workflow into people's day to day jobs.
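As a rough sketch of the kind of data ops module described above, here is a hypothetical quality check whose alert carries a pointer to the agreed troubleshooting paths; the check, threshold, and URLs are all made up for illustration:

```python
# A toy quality check that fails with a reference to the standard, so
# the responder sees the agreed troubleshooting paths, not just an
# error. Check names, thresholds, and URLs are hypothetical.

RUNBOOKS = {
    "null_rate": "https://standards.example.com/dataops/null-rate",
    "row_count": "https://standards.example.com/dataops/volume-drop",
}

def check_null_rate(null_fraction: float, threshold: float = 0.05) -> None:
    """Automated check; the alert carries the standard with it instead
    of assuming the on-call person already knows where to look."""
    if null_fraction > threshold:
        raise AssertionError(
            f"null rate {null_fraction:.1%} exceeds {threshold:.1%}; "
            f"see {RUNBOOKS['null_rate']} for troubleshooting paths"
        )

check_null_rate(0.02)    # passes quietly
# check_null_rate(0.12)  # would raise, pointing at the runbook
```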
[00:34:55] Unknown:
Yeah. I really like that you mentioned the challenge of actually setting aside the time to engage with that information and really try to understand the system that you're working with, versus just: I really need to get this done, I just need to take the shortest path to get there. Maybe I'll look at the documentation for this 1 thing, but I don't have the time to really try and get a full view of what it is that I'm trying to work on, because I've also got 5 other things that I have to focus on. And some of the ways that organizations are able to structure the incentives of the workday so that it does encourage people to actually have a more complete understanding of the systems in which they are operating and the work that they're doing. And I'm curious if there are any kind of design elements of what you're putting into AlignAI to make that a smoother path, where it doesn't have that kind of perceived barrier of, oh, well, now I have to go and take this 3 hour long course before I can get the rest of my work done. And just being able to kind of chunk that up into consumable pieces that have the right amount of information to get the next step done, but also have the kind of encouragement to proceed further, even if that's not directly what you're trying to achieve this second, but it will be kind of beneficial to your overall success.
[00:36:13] Unknown:
Yeah. We're definitely trying to go more of a pull versus push approach. So if you think about it, like, the courses are more of a push approach from our side to theirs. So, like, we're pushing information to them, as opposed to, like, if you're actively working with an API and you're looking at documentation, you're gonna go find the part of the API documentation that's applicable to the code you're writing. And that is more of a pull approach from the user, where it's like, I'm gonna go pull the information I need at the time that I need it. And I think it highly, highly depends on where the organization is at with the capabilities. So I have this, like, spectrum that I keep referencing to people in conversations because I think it lays it out pretty nicely. So there's kind of this net new aspect. All the way on 1 side of the spectrum is net new, and all the way on the other side of the spectrum is sustain.
And so net new capabilities are like: we don't have a catalog today. We don't do metadata capture today in a way that's meaningful. And we don't have a way to find data, discover data, search data, access it. Like, that's kind of a net new thing. So as an organization, we want that capability, we don't have it, and if we do, it's in very, like, early stages of maturity. And so that's on 1 side. And the other side is: we do have a catalog. People are using it today. We have stewards. We have people who are in the catalog. And we're trying to sustain that function and make incremental improvements over time so that people aren't as frustrated.
And so I think on the 1 end, the net new, that's behavioral change, and there's really no way around it besides kind of that push approach, like, we're pushing new information to you. Your goals and responsibilities are gonna change as an employee at the company. Like, you're gonna have to interface with this new tool, and you're gonna have to do these new functions that we didn't hire you for originally. And so there's really no way of getting around that. Like, they're gonna have to consume a decent amount of information to understand the general concepts and why, and it's more of a foundational shift. And then the other side of that coin is more of the pull side, where, like, I'm actively doing this function as a day to day thing in my job.
And I have expectations operationally around this function as an individual. And so I'm just trying to get the information I need. And there might be small changes that I need to adjust to, but I should be able to reference something and have more of that pull feeling. And I think there's expectations set on both sides of that. So we're trying to figure out how we design elements of the product that support both of them. So can we get a, you know, program, which is what we call them, up and running inside of the product? And if they are net new, create ways for them to receive that information or facilitation that makes sense. So can we create cohorts of individuals where there's dedicated time to learning this new function that they're going to be responsible for, so there is still a forcing function from the business? And/or can we take those published programs that people are actively using and figure out ways that they can reference them in the flow of work? So it's more of an "and" function for us, because we've seen the capabilities on all sorts of spectrums around maturity.
[00:39:33] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need to look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end to end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. In terms of the kind of organizational workflow around AlignAI, I'm wondering if you can talk to some of the ways that it fits into the day to day work of the different people who are interacting with the data and some of the ways that you think about making it accessible and addressable for the kind of different types of roles that need to interact with the data, so data engineers, analysts, business users, and also some of the ways that you work with organizations to build up the organizational awareness of the fact that we do have this solution. We're investing in AlignAI because we want to be able to be more effective with how we manage our data, and this is the system that is going to be our repository of knowledge so that if you have a question, this is a place that you can go to get it answered?
[00:41:14] Unknown:
Yeah. I think it's really in this handoff point of the intersection of all these individuals. So because it is so cross disciplinary, like, I think that's the biggest organizational shift that companies have experienced with unlocking these capabilities. Yeah. There's really neat technical aspects to this change and transformation, but a lot of it's organizational. Like, you know, I think for the first time, organizations were like, oh, we should have technical teams or analysts or data scientists or data engineers embedded within our business function. Like, that was not a thing. It was like everything was centralized under IT and, you know, technical support. And so I think this is absolutely changing the way people are collaborating at companies because there are so many different personas and so many different individuals who have to coordinate on a specific topic, and they're all related, and it's all dependent on each other. So the tool is meant to create that knowledge hub with that individual in mind. So if we do have a data stewardship class or, like, course or program, the different personas that engage with that are going to see the information relevant to them. So if I'm a data engineer or if I'm a business data steward or if I'm an analyst, we all touch data stewardship in some way, shape, or form. As a data engineer, I have to make sure that that data is getting to the catalog. I have to make sure that it is connected.
In terms of lineage, I have to make sure that all the metadata from the technical systems is getting populated. And as a steward, I have to find the data that I'm responsible for. I have to make sure that the definitions are correct. I have to make sure that the usage of that data is correct. And so, and I'm gonna use, you know, RACI, the responsible, accountable, consulted, informed element of that: there's so much coordination and collaboration for that capability that that program should be able to meet the needs of all those individuals and allow them to collaborate much better, because they do understand where the handoff points are.
And they can also see it from a more macro perspective. And so I think that's 1 of the most interesting things that we're able to facilitate: that kind of softer piece of collaboration that's not just, like, a Slack group, you know, for answering questions between people, or some of these other documentation hubs, like Confluence, for example. It is more in line with and specific to roles and responsibilities across these functions. So I think that's been really interesting to see, like, how people engage with that and how well those standards are adopted by individuals, because they can understand who they have to talk to for what and where their work starts and stops.
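To illustrate the RACI framing, here is a toy sketch, with entirely hypothetical roles and activities, of making those handoff points explicit in a stewardship program:

```python
# Each activity names who is Responsible, Accountable, Consulted, and
# Informed, so handoff points between roles are explicit. All roles and
# activities here are made up for illustration.
RACI: dict[str, dict] = {
    "populate technical metadata": {
        "R": "data engineer", "A": "data platform lead",
        "C": ["business data steward"], "I": ["analysts"],
    },
    "maintain business definitions": {
        "R": "business data steward", "A": "governance lead",
        "C": ["data engineer"], "I": ["analysts"],
    },
}

def handoff_point(activity: str) -> str:
    """Where my work starts and stops: who is responsible for this step."""
    return RACI[activity]["R"]

print(handoff_point("maintain business definitions"))  # -> business data steward
```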
[00:44:07] Unknown:
In your work of building the AlignAI platform and working with organizations to help them get onboarded and address some of these gaps in information sharing and information capture, what are some of the most interesting or innovative or unexpected ways that you've seen your platform applied?
[00:44:26] Unknown:
What's interesting is, like, this desire for community inside of companies. It's come up a couple of times: hey, can we create an expert community through this? Like, can we get all of the experts across the organization who are building models or deploying models and get them connected to each other, especially for large enterprise organizations? Or how do we generate more of a community element to this capability at the company? That was kind of surprising to me in some ways, just to see the major desire to connect folks that are doing similar functions across the organization. I think that's an interesting element, and it's kind of driving us in terms of functionality towards the ability for individuals to collaborate and review new standards with each other in a way that is structured. So it's not just, hey, read this, like, 7 page document and tell me what you think.
It's more of: I have a suggestion on a way to do this better because of this example we just saw. So, like, the standard that's published breaks down when I try to do x. And, like, does that mean that we need to make a fundamental change to the entire standard, or is this an exception to the rule? And I think it's about providing a way for people to contribute and review and understand where those deviations are happening very quickly and efficiently and effectively, and, like, approve or not approve. I think of git branches, right, when it comes to this, except for information.
Like, are we able to effectively and efficiently do that and facilitate that inside of the tool? So I think that's been a very interesting observation of how people want to be able to manage this at a macro level. Because, you mentioned, there's always that person who's, like, leading the charge, you know. Or maybe there's not. But in some cases it's: I'm gonna, you know, put a line in the sand, and here's the way we're gonna do it as a company, and I'm gonna put my neck on the line because of that. Or is it more of a collaborative type of workflow of how we're going to define it and test it and make improvements to it over time?
[00:46:40] Unknown:
Another interesting aspect of the space that you're operating in, focusing particularly on the data workflows in an organization: I'm curious what are some of the types of integrations that you either have had to build or are considering building to be able to bring in some of the relevant details, so that when you're looking at a piece of documentation, it's not just a wall of prose. You also have maybe snippets of SQL or, you know, data lineage views, or views on maybe some of the statistical elements of the tables that you're working with: oh, it's this many rows, this is the last time it was updated, this is the number of times that it's being accessed, and some of those kind of more real time, evolving aspects of the data that you're working with and the context that you're trying to capture around that data with AlignAI?
[00:47:31] Unknown:
Yeah. I'd say there's 2 core areas of integrations. The first is more on, like, a portfolio level. So, like, today we just reference out to the tech stack, so we don't have full integrations with their tools today. That would also just be a very, very hard thing to support in terms of integrations, because there's just a massive amount of tooling in all of these different areas. And so we're almost kind of kicking them out to somewhat of a portfolio of examples that describe that idea well enough.
And so we're avoiding the integrations from that perspective as much as we can. The other element of integration is that, in some cases, we are seen more as a content engine that feeds all of these other consumer-level platforms, like Confluence or SharePoint, where that workflow already exists. They don't want yet another documentation tool that people have to log into and interface with, and that's totally reasonable. So we're seen more as a content engine on the creator side, where we can push some of these more structured elements out into the existing ecosystem, which is helpful for people consuming them who already have a standard workflow for that. So it's kind of: integrate, or build your own. In some cases, they don't have anything they're comfortable using that they think is effective today, and we can bring our own element of that. But in other cases, they definitely don't want to add to the confusion at that initial access point.
So, yeah, those are the two areas of integrations that we're very focused on.
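As a rough illustration of the content-engine pattern described above, the sketch below pulls live table stats and pushes a structured snippet into an existing wiki. The warehouse query assumes a Snowflake-style information_schema and a DB-API connection; the base URL, credentials, and helper names are hypothetical placeholders, though the POST to /rest/api/content is the standard Confluence Cloud REST API.

```python
# Minimal sketch (not AlignAI's actual integration) of a "content engine":
# render live table stats as a structured snippet and push the page into
# Confluence so readers stay in their existing workflow.
import requests

BASE = "https://your-domain.atlassian.net/wiki"  # hypothetical instance
AUTH = ("you@example.com", "api-token")          # hypothetical credentials


def table_stats_html(conn, table: str) -> str:
    """Render freshness and size stats for a table as wiki-ready HTML."""
    with conn.cursor() as cur:  # assumes a DB-API-style connection
        cur.execute(
            "SELECT row_count, last_altered FROM information_schema.tables"
            " WHERE table_name = %s",
            (table.upper(),),
        )
        rows, last_altered = cur.fetchone()
    return (
        f"<h2>{table}</h2>"
        f"<p>Row count: {rows}<br/>Last updated: {last_altered}</p>"
    )


def push_page(title: str, html: str, space_key: str) -> None:
    """Create a page via the Confluence Cloud REST API."""
    payload = {
        "type": "page",
        "title": title,
        "space": {"key": space_key},
        "body": {"storage": {"value": html, "representation": "storage"}},
    }
    resp = requests.post(f"{BASE}/rest/api/content", json=payload, auth=AUTH)
    resp.raise_for_status()
```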
[00:49:21] Unknown:
In your work of building this business and working with your customers and exploring this challenge of knowledge sharing and knowledge capture for data practitioners and data applications, what are some of the most interesting or unexpected or challenging lessons you've learned in the process?
[00:49:37] Unknown:
I'd say companies don't have as much of an emphasis on standardization in this area as I thought they would. What's very interesting is that there are going to be more and more regulations, federal regulation and legislation, around being organized in a lot of these areas. The reason I say that is because there are requirements around auditability and transparency for models that are being developed, deployed, and interfaced with by consumers. There's all sorts of demand for responsible AI, responsible solutions built on top of data, and responsible usage of data. So I think there's going to be a legislative element to that at some point, and if companies don't start getting their ducks in a row, they're probably going to be hit with a lot of challenges, just like they were with GDPR.
So I think that's an interesting observation. I assumed that people had more of this in place and more of a handle on the areas they were actively working in, but that's generally not been the case.
[00:51:00] Unknown:
For people who are struggling with how to popularize the knowledge that they or their team have about how they're working with data, the data that they have, and some of the challenges they're experiencing with it, what are the cases where AlignAI is the wrong choice?
[00:51:17] Unknown:
Yeah. I'd say there are very few people we've talked to who have a homegrown solution they've been able to put together that's working for them at the stage they're at, and I think that's great. If you've got something working for you today that you've been able to build internally and maintain, and it's scalable, there's no need for us to come in and replace it. I'd say, honestly, where those tend to break down is when there's massive change or massive scale. If the company is growing that function quickly, or if there's massive change in the capabilities or tech stack, that homegrown solution tends to get outdated very quickly, and it's just another thing you have to maintain and manage internally.
I'd also say that companies who are not putting much investment into this area, into making improvements or building out those functions and capabilities, probably don't have a huge need for AlignAI, because our product thrives on improvement and change. It thrives with people who are actively working in that function and need to adhere to a set of standards. So if they're not heavily investing in data or AI, it's probably not a good fit for them.
[00:52:41] Unknown:
As you continue to build and iterate on your product and work with your customers and the community, what are some of the things you have planned for the near to medium term or any particular problem or project areas that you're excited to dig into?
[00:52:53] Unknown:
Yeah. We're going to focus probably all of 2023 on really honing those workflows and optimizing for utilization, for users interfacing with our product regularly. But one thing I'm really excited about, that's more mission-setting and feature-focused, is a marketplace, where it's not just the templates we've built over the last couple of years that are available to our customers. Other experts in the industry, who have an opinion on approaches that have worked, or workflows, or specific elements of capabilities that are useful in certain scenarios, can build their own templates very easily through AlignAI, make them available to companies, and make money off of that as creators. I think there's a lot of really fun potential in that marketplace, where we can open it up a little more and provide a mechanism for experts in the industry to interface with companies directly in a way that's very efficient and very practical.
Today, people do that mainly through consulting services, so this would provide a repeatable, recurring income source for those experts who have proven methods and approaches in some of these key areas. I'm actually really excited about that.
[00:54:17] Unknown:
That's very interesting, and I definitely look forward to seeing it become a reality. Are there any other aspects of the overall problem space of knowledge capture and knowledge sharing for data projects, and the work that you're doing at AlignAI, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:34] Unknown:
I think one area that we haven't specifically or intentionally focused on is documentation for individual projects. Oftentimes, when you create a solution, a model, a dashboard, or a data pipeline, there's documentation that comes with it, whether it's in the code or around usage of that solution. That's not an area we touch today; we operate more at the workflow and capability level. But it could be something we explore in the future, in line with what we're doing. We've seen in a lot of our engagements that there tends to be a pretty structured way of documenting how to use a dashboard, or getting people onboarded to different solutions that are built on top of data. So I do think that could be an interesting area to continue exploring.
[00:55:23] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:39] Unknown:
This one's so fun, because there's just so much being built in the data ecosystem right now. For a while, it was more on the warehousing side: cloud migration, efficiency and speed of processing. Then it moved into the virtualization layers: creating logical layers of data without moving the data and incurring the costs of moving it, which is a more expensive proposition now with tools like Snowflake. And on top of that, we've seen a huge surge of catalogs hitting the market that aren't the massive, monolithic, all-in-one platforms that have always existed.
They're now smaller and more focused, with niche emphases on data cataloging, metadata capture, lineage, and things like that, which I think is really interesting. And now, to me, it's more on the analytics engineering side: properly defining and developing metrics, collaborating on those metrics, creating deviations of those metrics, creating context around those metrics. That analytics layer on top of data is really interesting, along with how it affects how data is stored underneath it and how people can iterate between those two layers appropriately.
I think that's pretty fascinating. Some of the other tools coming out around pipeline automation are interesting as well, but I've more often seen organizations really struggle with this metrics layer. Some companies are using dbt for it; others are using tools like Alteryx to automate the workflows on top of data that generate metrics. I just think that's a fascinating space, because there's still so much ambiguity there and a lot of best practices to be developed around creating those stores that people can engage with. I think that'll be an interesting space to watch.
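For a sense of what that metrics layer involves, here is a small, hypothetical sketch of a canonical metric with named deviations and attached context, loosely inspired by dbt-style metric definitions; none of the names come from a specific tool.

```python
# Illustrative sketch of a "metrics layer": one canonical definition,
# named deviations (extra filters), and attached context, so consumers
# use the shared metric instead of re-deriving SQL by hand.
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    sql: str                    # canonical expression over a model
    description: str            # the "context" attached to the metric
    filters: dict[str, str] = field(default_factory=dict)

    def deviation(self, name: str, **extra_filters: str) -> "Metric":
        """A variant that inherits the canonical definition."""
        return Metric(
            name=f"{self.name}__{name}",
            sql=self.sql,
            description=f"{self.description} (variant: {name})",
            filters={**self.filters, **extra_filters},
        )

    def to_sql(self, table: str) -> str:
        where = " AND ".join(f"{k} = '{v}'" for k, v in self.filters.items())
        return f"SELECT {self.sql} FROM {table}" + (f" WHERE {where}" if where else "")


revenue = Metric("revenue", "SUM(amount)", "Recognized revenue, USD")
emea_revenue = revenue.deviation("emea", region="EMEA")
print(emea_revenue.to_sql("fct_orders"))
# SELECT SUM(amount) FROM fct_orders WHERE region = 'EMEA'
```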
[00:57:47] Unknown:
Yeah, it's definitely one of the interesting new areas that nobody has come to any real agreement on yet, so I look forward to watching that space as well. Thank you again for taking the time today to join me and share the work that you're doing at AlignAI, and your perspectives on the challenges organizations face in capturing and spreading information about how data is being used, and the protocols and practices around that. I appreciate all the time and energy that you and your team are putting into making that a more tractable problem. Thank you again, and I hope you enjoy the rest of your day and have a happy new year.
[00:58:23] Unknown:
Thank you so much. I really enjoyed being on the show. Your questions are excellent and you have, like, the perfect podcast voice. So thanks again for having me on here, and happy New Year to you as well.
[00:58:40] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Chapters
- Introduction to Rehgan Avon and AlignAI
- Goals and Focus of AlignAI
- Addressing Core Issues with AlignAI
- Strategies for Effective Knowledge Capture
- Challenges in Scaling Data Practices
- Common Failure Modes in Data Teams
- Design and Implementation of AlignAI
- Pull vs. Push Approach in Knowledge Sharing
- Organizational Workflow and Role Integration
- Lessons Learned and Future Plans