Summary
Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increase, the difficulty of ensuring that everyone has the knowledge they need to get their work done grows exponentially. Wikis and intranets are a common attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Your host is Tobias Macey and today I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what AlignAI is and the story behind it?
- What are the core problems that you are focused on addressing?
- What are the tactical ways that you are working to solve those problems?
- What are some of the common and avoidable ways that analytics/AI projects go wrong?
- What are some of the ways that organizational scale and complexity impacts their ability to execute on data and AI projects?
- What are the ways that incomplete/unevenly distributed knowledge manifests in project design and execution?
- Can you describe the design and implementation of the AlignAI platform?
- How have the goals and implementation of the product changed since you first started working on it?
- What is the workflow at the individual and organizational level for businesses that are using AlignAI?
- One of the perennial challenges with knowledge sharing in an organization is managing incentives to engage with the available material. What are some of the ways that you are working to integrate the creation and distribution of institutional knowledge into employees' day-to-day work?
- What are the most interesting, innovative, or unexpected ways that you have seen AlignAI used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AlignAI?
- When is AlignAI the wrong choice?
- What do you have planned for the future of AlignAI?
Contact Info
- @RehganAvon on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- AlignAI
- Sharepoint
- Confluence
- GitHub
- Canva
- Instructional Design
- Notion
- Coda
- Waterfall Design
- dbt
- Alteryx
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png) Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit [dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo) to learn more.
- Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png) Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to [dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan) and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg) Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: [dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode) today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a-t-l-a-n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data. So, Rehgan, can you start by introducing yourself?
[00:01:40] Unknown:
Yeah. Thanks so much for having me on your show. I'm Rehgan. I am cofounder and CEO of AlignAI, based out of Ohio. Also founder of Women in Analytics, and I have been in the space for the last 10 years. Super passionate about building technology and communities in the analytics, data science, and data space.
[00:02:00] Unknown:
Do you remember how you first got started working in data?
[00:02:03] Unknown:
Yeah. So I went to Ohio State. I was an engineering major there, and I actually studied industrial systems engineering. They had a specialization in analytics basically for that major. And so they were trying to combine the stats departments with computer science and some of the engineering programs and so they created this minor which is essentially just a computer science minor, but it focused on kind of data mining and elements that maybe are a little more specific to the data science and analytics realm. And so I got my very first internship at a start up in Austin, Texas around data where I was actually just troubleshooting SQL queries.
And it was, like, my first time trying to learn SQL, and it was actually 1 of the best experiences, trying to find errors in other people's SQL code. So it was more of a QA position. And then my first job out of college was as a data engineer, building data pipelines for the back end of a software product. So that was kind of how I got my start. I just became completely fascinated with the data space. At the time, the big trend was, like, around big data and Hadoop and distributed systems for data. So that's really where I got my start.
[00:03:21] Unknown:
In terms of what you're building at AlignAI, can you give a bit of an overview about what your goals are and your focus and some of the story behind how it got started and why you decided that this was the problem you wanted to focus your time and energy on?
[00:03:35] Unknown:
I saw a lot of these different kind of interdisciplinary areas having lots of crossover at organizations who are trying to mature their capabilities in the space. So you had your kind of data functions: data management and organization, data pipeline development and deployment management, data quality initiatives around that, more observability type functions. And then you have these intelligence layers on top of that, which is like analytics engineering and metrics and, you know, data science models, dashboards, all of these things that kinda sat on top of data and were fed data to ultimately make value for the organization.
So whether that was automation or, you know, business intelligence or some sort of decisioning system that actually drove value for the business. And so you have all of these different areas intersecting that are trying to coordinate, basically, on business value, on intelligence to drive important decisions, and then ultimately the data that feeds all of that. And from what I observed working as a practitioner, and then also as, you know, a software vendor embedded in a bunch of different customers of ours, I saw a lot of these same patterns happening over and over again. The coordination between those different functions kept breaking down. So the time it took, you know, a team to respond to quality issues that were noticed by operations or the business, the time it took to iterate on a model enough to the point where it's actually useful for the organization.
Some of the standards that were a little bit of the wild west in the last, like, 5 years are now starting to solidify, but no company had a playbook on any of this. And there were usually a couple of folks that knew how to do those standards inside of the company, but the rest of the organization was struggling to get, you know, mature enough to be able to do that as well. And it's just a different transformation process than it was with, like, software development and DevOps and things like that. So we noticed this time and time again. It was always the number 1 reason that companies couldn't move quickly and couldn't actually build things in a meaningful way with data. It was never the tooling or the technology or, you know, the ability of individuals on the teams. It was just this lack of standards, lack of coordination. And so that's why we started AlignAI. We wanted to get all of these teams aligned, have clear definitions of handoff points, have an entire playbook of standards the company could maintain easily and could also actually use, instead of these really hideous kind of intranets and unorganized Confluence pages and things that just aren't practical to maintain, keep updated, and ultimately utilize.
[00:06:32] Unknown:
I have definitely been privy to some pretty unfortunate organizational practices around company intranets, and trying to find anything is an exercise in futility. And so as far as the core problems that you're trying to address, you mentioned these challenges around teams who are trying to do something with data. They've got people who know what they're doing. They've got good tooling. They've got good platforms, and yet they're still not able to succeed or bring their ideas to fruition. And I'm curious, what are some of the ways that you're trying to address those core issues with the AlignAI platform and some of the tactical elements of the work that you're doing to help solve those problems?
[00:07:12] Unknown:
Yeah. I think there's always kind of 1 of 2 ways that people are trying to do this today, and we're kind of fitting into both at the same time. So first is these, like, giant initiatives where they have, like, learning programs where you're trying to get everyone basically up to speed on best practices around data management, to use that as an example. And, you know, you've got lineage and metadata capture and data observability platforms and all of these core elements of tracking quality and stewardship programs. So you've got these kind of big bang type of initiatives that happen. And then on the other side of the coin, there's more of these, like, checklist type approaches. So you've got a couple of people who are like, you know, here are the core fundamentals in terms of capabilities that we need to have in place, that there need to be people accountable and responsible for in the organization. And every time we deploy a pipeline, you know, we have to make sure we go through this set of standards.
And every time we troubleshoot for quality issues, you know, here are the different approaches that we've adopted for doing that. And so as AlignAI, we're trying to fit into both of those workflows in a way that is more seamless, because what's happening today is people are building out this knowledge hub from scratch. So they're using, in some cases, PowerPoint, you know, SharePoint, Confluence, GitHub. It's kind of all over the place, which is good and bad. You know, it's closer to examples. It's closer to data. It's closer to use cases, which is important. Or, on the other side, it's way too generic or way too general and not applicable to what people are doing. And so we're trying to meet that in the middle. So tactically, what that looks like is, if you think of, like, Canva: as a designer, you can go in and grab all of these templates, and you can basically start at, like, 50 to 60% of the way there. And it has all of the core elements that you need.
And there's always specific things you wanna tailor or customize to what you're trying to do, your brand colors, your, you know, language, your fonts. Right? Your logo, iconography, whatever. It's the same kind of concept for some of these standards. Like, there are industry best practices out there. They are very general, and so it's hard to make those customized to the organization. But if we could get you 60 or 70% of the way there, now it's just that last 30 to 40% of kind of pointing to the right references and tweaking the workflows and tweaking the terminology that you have to do and maintain.
And it becomes really applicable. You know, we're trying to design it so that it's in the flow of work. So it's not this, like, massive playbook that you never reference. So it's very practical, and that's kind of been our approach so far.
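To make that template idea concrete, here is a minimal sketch, with entirely hypothetical names and fields, of what starting from a generic industry standard and layering on organization-specific terminology and references might look like:

```python
# A toy illustration of "templates get you 60-70% there, configure the
# rest." Everything here (fields, URLs, terminology) is made up.
from copy import deepcopy

# A generic, industry-standard module: the starting point.
TEMPLATE = {
    "capability": "data_quality",
    "steps": [
        "define quality checks for each pipeline",
        "alert the owning team on failure",
        "record the incident and resolution",
    ],
    "terminology": {"dataset": "dataset", "owner": "owner"},
    "references": [],
}

def customize(template: dict, overrides: dict) -> dict:
    """Overlay org-specific terminology and tech-stack references
    (the remaining 30-40%) onto a generic template."""
    standard = deepcopy(template)
    standard["terminology"].update(overrides.get("terminology", {}))
    standard["references"].extend(overrides.get("references", []))
    return standard

# Org-specific tweaks: local vocabulary plus links into the actual stack.
acme_standard = customize(TEMPLATE, {
    "terminology": {"dataset": "data product", "owner": "steward"},
    "references": ["https://wiki.example.com/dq-runbook"],
})
print(acme_standard["terminology"]["dataset"])  # -> data product
```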
[00:10:09] Unknown:
The kind of core of what you're doing sounds like it's oriented around this premise of knowledge sharing and making sure that people have the information that they need when they need to access it. And as you mentioned, intranets have been a way that people try to build up these knowledge bases and do this knowledge sharing, but they seem to invariably go wrong in some fashion. And I'm wondering what are some of the strategies that you're incorporating in Align AI to make that kind of knowledge capture and knowledge distribution a more kind of uniform experience? Because it seems like the problem that generally happens is that you have to go to the knowledge platform to add anything or retrieve anything, and first, you have to know that it even exists.
Whereas if you're in the middle of doing some complicated, you know, machine learning model or analysis, you don't want to have to context switch and say, oh, where was that thing? Now I have to spend the next 30 minutes digging through to find it, and then I forgot what I was doing. Like, what are some of the ways that you're trying to address some of that existing friction and bring the kind of knowledge element closer to the work that's being done?
[00:11:17] Unknown:
I could probably drone on about this topic for, like, hours, so I'll try to keep it a short response. We look at it from 3 different fronts. The first is the creator or the contributor, the knowledge contributor. Right? So I've got information. I wanna make a tweak. I wanna make an adjustment. I'm setting a standard. I'm updating a standard. Like, that whole flow needs to be fairly seamless and needs to be formatted in a way that is optimal for whoever is retrieving that information. So what we're doing from that approach is essentially incorporating instructional design best practices into the format of this content. So we're basically saying, you know, when you create a document, when you open up a Confluence page, like, you're deciding, okay, a description goes here, some bullet points go there, a screenshot goes here, you know, whatever it is. And you're doing it without thinking, but you're trying to optimize for the reader, at least in some way, shape, or form.
So I know when somebody comes in here, I can hyperlink this, I can click from it to that, and it'll make sense. There are general flows that a consumer is going to experience when they read it. And so we're anticipating all of that and basically baking it into the format so that the creator doesn't even have to think about it, which I think is very, very important and 1 of the big time savers that we're trying to enable inside of the tool. So from the creator perspective, we also don't wanna dictate where it goes. Like, we know that there are existing documentation platforms today, and they're general on purpose because that's what the company is using, whether it is SharePoint or Confluence or some of the newer tools like Notion or Coda.
And, you know, we don't wanna deviate people's workflows away from that either. So our goal is to kind of integrate into some of those systems so that that retrieval experience is easier. And a lot of those systems are heavily investing in search capabilities and querying capabilities so that the consumer experience is better. And so, you know, we're still kind of walking a very fine line there on integration versus bringing our own, because in some cases, companies don't want to even try to fit it into existing documentation paradigms, which is also a reasonable approach.
And so, you know, that's kind of from the consumer side as well, but they're very related. And in terms of, you know, referencing in the flow of work or learning in the flow of work or consuming in the flow of work, I think there's a lot of different approaches to that. Today, it is more of a retrieval workflow or motion. But in the future, we wanna be able to have that information readily available for individuals who are in their Databricks environment, in their SQL environment, their Python, whatever they are doing. We want them to be able to understand what the set of standards is associated with that workload. And so right now, we kind of have a one-way reference system where we will reference out to the tech stack: examples or use cases that people can look at, demo environments, things like that. In the future, it's definitely a problem that we need to tackle in terms of getting people the right information at the right time. And so to wrap it up, the 3rd view is this macro view, which is, like, how effective are these standards at a macro level of the organization?
What kind of ROI do we receive as a company from everybody adhering to these best practices or these standards we've put in place? And that's another element that we absolutely wanna provide the organization visibility into, something they have 0 visibility into today.
[00:15:04] Unknown:
In terms of the specific actions of building analytics and machine learning projects, what are some of the ways that this lack of information or organizational awareness can impact the effectiveness or the success rate of those projects and just some of the ways that they go wrong because of the fact that there is incomplete information at the point of execution?
[00:15:28] Unknown:
Yeah. I mean, we have spent a lot of time root-causing a lot of the problems that people are experiencing in these ecosystems as they're trying to make, you know, significant efforts to mature. I'll be specific in this response because I think it's helpful to kind of ground it in a specific example. So, you know, as an organization, if I'm trying to enable more individuals to interact with data, which is what we call kind of that self-service type of environment, you've got different layers of what that means. So you have very technical folks who wanna access raw data. Maybe they're using that for machine learning development purposes. You've got data engineers who wanna access raw data. They wanna generate some sort of metric or pipeline.
But then you have these higher level individuals who are looking for insights. They're looking at interfacing with the metrics layer so they can build dashboards or reports. They're looking, even in some cases, at Q&A interfaces with data, where I can ask a question of the data and it can give me some sort of insight or response or answer quickly. All of these elements require a lot of thought about how data is curated, how quality is monitored, and how metrics are defined, or what that interface level looks like. And those are standards that the company needs to put in place. Like, that doesn't just happen automatically. You have to intentionally design systems to do that.
And so what we've seen is, if they have a really good set of standards, like a process of stewardship where people are actually tagging data appropriately with really good definitions, and there's kind of metadata associated with it and a full lineage view of the data, then those self-service environments become more of a reality, because I can actually understand where the data came from and I can understand the context of it. I can understand how other people have used it. And so without those core capabilities or fundamentals or standards, I can't do any of those things. And so you end up getting into this, like, horrible cycle of asking the same person to run a query for you, slightly different, every 2 days. You know? It's just, like, these things that frustrate the crap out of everybody, but it's because they haven't intentionally designed the system with a set of standards that will enable more of that self-service ecosystem. And this is not just true for data management environments. It's also true for, like, machine learning development, deployment, and management.
It's true for dashboard development, deployment, and management. That's kind of how we've approached this. Like, every time we've looked at some of those frustrations or inefficiencies or quality issues people are experiencing with other solutions or data, it always runs back to the fact that there is a complete lack of standards and process to curate and maintain these assets at the company.
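As a rough illustration of the stewardship standard described here, this is a toy sketch, with hypothetical names and fields, of a catalog entry that carries a definition, tags, and lineage so that self-service discovery becomes possible:

```python
# A toy catalog: every dataset gets a definition, tags, and lineage,
# which is what makes self-service discovery possible. All names and
# fields are illustrative, not a real catalog API.
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str
    definition: str                                    # steward-maintained business definition
    tags: list[str]                                    # e.g. domain, sensitivity
    upstream: list[str] = field(default_factory=list)  # lineage: where the data came from

CATALOG: dict[str, DatasetEntry] = {}

def register(entry: DatasetEntry) -> None:
    """The standard: nothing is published without a definition."""
    if not entry.definition:
        raise ValueError(f"{entry.name}: a definition is required by the stewardship standard")
    CATALOG[entry.name] = entry

def discover(tag: str) -> list[str]:
    """Self-service search by tag, instead of asking the key person."""
    return [e.name for e in CATALOG.values() if tag in e.tags]

register(DatasetEntry(
    name="orders_daily",
    definition="One row per order per day, deduplicated on order_id.",
    tags=["sales"],
    upstream=["raw_orders"],
))
print(discover("sales"))  # -> ['orders_daily']
```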
[00:18:24] Unknown:
1 of the common themes when it comes to any form of documentation, but particularly when working with data, is that when you have a small enough group, like if you're at a company that's just starting off, you have 3 people all sitting in a room together, you don't really have the incentive to spend a lot of time focused on knowledge capture because you can just ask the person sitting next to you. Everybody has the whole system in their head. And then at a certain point you tip into the space where 1 person has most of the system in their head, but nobody else knows that they know that or what they know. And I'm curious what you see as some of the symptoms that suggest that you've kind of crossed over that divide of we really need to have a holistic knowledge management kind of protocol for being able to make sure that everybody knows what they need to know and that there isn't any 1 person who's the bottleneck in getting things done.
And just some of the ways that you've seen companies effectively be proactive about that so that they don't all of a sudden find themselves kind of out in the cold of, we used to be really effective, and then we hired 5 more people, and now everything's broken.
[00:19:32] Unknown:
Yeah. This is literally the same story, different company, every time we go in. It's like the same thing over and over again, which is exactly what you described. We call it the key person problem. Right? You've got the person who has all the knowledge of where everything sits and where everything lives and all the nuances and how to access this and what that means. And, you know, well, it's not an automated process, so you've gotta do this and you gotta go to this tool. I mean, it's just a nightmare. And so I think the way you get out of that is by starting really early. The problem that we see with a lot of organizations is, a, nobody likes standards and nobody likes documentation, period. Like, we're definitely not tackling a super sexy area of the space, but a necessary 1. And so today it's like, hey, I'm gonna spend all of this time documenting this process, which I know is gonna change in 2 days. We want to make that process less painful, where it's, I just need to configure a couple of things, and as I make improvements to the system, that improvement process is less painful. Because today, it's just a ton of text, typically bullet-pointed, or a bunch of Lucidchart diagrams that are referencing a ton of different parts of the ecosystem.
And the daunting part is making the update. Like, I can sit down and document for, you know, 3 days out of the month, but the daunting part's figuring out what changed, and how do I make that update and keep it up to date so that anybody going through that standard can be onboarded really quickly. I mean, it really should be formatted in a way that's super consumable, very interactive, and references kind of the latest tech stack so people can get to doing their job much faster. And so what we've seen is the people that start early do a great job because they're dedicating time towards making updates to it. And they started early enough where they didn't get into this, like, depth of complexity that they have to try to unwind to explain to somebody else.
And so, like, these giant initiatives where we have to have everything figured out before we go back and document it are the wrong approach. It's the wrong mentality. It is supposed to be super iterative. It is supposed to morph and change as the standard changes, the tech stack changes, and so on. And then, as you mentioned, when teams grow, you experience different pains at scale than you would have beforehand. So if we have 3 pipelines in production that we're managing, and then we hire a team of 5 engineers, and now we've got, you know, 20 or 25 that we're managing, the set of standards you need for 5 pipelines versus 25 is different.
And it's not just because there's more people; the system is different. Like, the type of monitoring you need is different. And I think that's where people start to screw up: they're waiting until there's this scale tipping point, and by then, they have to try to untangle all of the things that they've done.
[00:22:51] Unknown:
Your comment that the teams who start early on documenting their practices tend to do better brings to mind the principle of test-driven design, where if you write your code in a way that is easily testable, it makes it easier to compose the logic. It makes it easier to understand the logic because you're naturally going to break it down into smaller pieces so that it's easier to test. And this seems like the documentation-driven analog to that, where if you are writing down the steps to do something, it's going to force you to think about how you're doing it. And as you evolve the system, you want to do it in a way that's easier to document, rather than just having something grow organically and then, after the fact, having to say, how on earth did this monstrosity come to be?
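For a concrete version of that analogy, here is a minimal sketch, with illustrative names only, of how writing a pipeline step to be testable forces the same small, documentable decomposition:

```python
# A tiny, testable pipeline step. The decomposition that makes it easy
# to test is the same one that makes it easy to document: one behavior,
# one short entry in the standard. Function names are hypothetical.

def normalize_country(code: str) -> str:
    """One documented behavior: trim, uppercase, and map known aliases."""
    aliases = {"UK": "GB"}
    code = code.strip().upper()
    return aliases.get(code, code)

def test_normalize_country():
    # The test doubles as documentation of the expected behavior.
    assert normalize_country(" uk ") == "GB"
    assert normalize_country("us") == "US"

test_normalize_country()
```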
[00:23:37] Unknown:
Right. That's exactly right. And, like, where we are now in terms of this industry, we have solidified on standards pretty well. It's not like people are coming out with revolutionary new ways; like, we'll get a couple of paradigm shifts here and there on how people are approaching data management and data engineering practices, but, you know, there hasn't been 1 that's, like, absolutely fundamentally changed everything. And I say that because there's a set of principles that we're starting from. And why should people have to recreate all of those principles first, and then also all of the elements that are, you know, specific to that company and their tech stack and their data and their use cases?
That's a lot of work. And so I think that's the area we're trying to solve for. It's like, can we get you most of the way there and then just make it easy for you to configure and reference?
[00:24:33] Unknown:
I'm also interested in if there are any kind of commonalities in the failure modes that you see for teams who don't have a cohesive or kind of evenly distributed awareness of the system and how it operates and some of the ways that that incomplete knowledge or understanding of what has been done translates into a lack of understanding of what can be done, where maybe they are missing out on opportunities to leverage the data that they have because they don't know that they have it or what it semantically means or ways that maybe teams will kind of be overambitious and think that they have more capability than they actually do or that they're going to be able to deliver because they don't understand the true complexity of what they're trying to build?
[00:25:21] Unknown:
Yeah. I think the biggest trend that we've seen is that there are a few different phases in which these particular challenges come to fruition. So the earlier ones are, like, kind of, we don't know what we don't know, and they get a little bit stuck in the cycle of, okay, well, what are the best practices? And, you know, I'll give you an example of this. Like, if we have a team of, like, a couple of data engineers, does it make sense for us to be building customized pipelines in Python? Or should we be using some of these, like, automation tools that kind of manage pipelines for us? Or, like, when should we pull the trigger on something like dbt? You know, there's all of these questions that they have in terms of the best approach considering where they're at.
And I think they get kind of hung up on that, spending a lot of time trying to design that system, but they also, at the same time, have to manage operations of what's been built and what is being used. And I think that's where people kind of break down a little bit. It's the whole, like, changing the engine while the plane is flying. All of these things that you've built and that are being actively used are now business critical functions, and you're also trying to make fundamental architectural decisions and improvements. And so that's kind of like the earlier stages. And then the later ones are, like, when you start expanding your team and growing. And then it's, like, who's on first and who's responsible for what. And that might sound like an oversimplification of, like, a very complex problem, but it really is, like, a poor definition of handoff points between teams and who's responsible for which part of the ecosystem.
And I think that's where those teams start to break down a lot. And an example of that would be, you know, do we have, like, data quality monitoring and who is responsible when something happens or goes down and who's responsible for digging in and troubleshooting? Like, you can't have everyone doing everything anymore, whereas that's what you used to do. So you start getting into these more defined functions and therefore, dependencies between teams and managing all of that. And then the bigger ones, which are, like, kind of at the enterprise level of making fundamental improvements. Now you've got kind of hub and spoke models where you've got a centralized group and then you've got all these embedded groups across the organization and trying to keep all of those teams rowing in the same direction. Like, you make a massive change to your metadata structure or to the tools you're using to manage metadata.
Now you've gotta roll out all of those changes to, like, 40 or 50 different people. And that process, we've seen, can take, like, 12 to 18 months in some cases, because they just don't have a coordinated effort of doing it. It isn't well defined to begin with. And they're leaning really heavily on the tooling to drive process, which is just never effective. You know, if they do get a new tool in, they'll be like, well, we have the tooling vendor who's gonna train all of our people. And vendors typically like to be agnostic on opinions because their tool is highly configurable, and so, you know, they don't wanna do solutions engineering work usually. Some of the bigger vendors do, but, like, then you run into those problems. And so I just think, you know, as the growth phases happen, there's just not a lot of thought that goes into what challenges come with that growth of that function.
[00:28:51] Unknown:
Yeah. It's definitely amazing seeing some of the ways that, if you don't have somebody who has a strong opinion about things, then it becomes easy for everyone else to just start flailing around, because there's no kind of true north star about: this is what we're doing, this is how we're going to do it, and if you don't like it, then, well, you're just gonna have to like it. Because otherwise it becomes designed by committee, and nobody wants to be the person to kind of put out the hard line of this is how it should be done, because they don't want to step on anybody's toes, and so then you end up with a monstrosity that doesn't do what it was supposed to do.
[00:29:27] Unknown:
Totally. And change is so hard that they feel this insane amount of pressure to get it right the first time because they know that they're gonna have to roll out the change and so they're like, I'm only gonna do this once and it's like a big bang approach as opposed to these, like, more incremental adjustments to the ecosystem.
[00:29:47] Unknown:
Absolutely. Yeah. We seem to keep finding new ways to reinvent waterfall design approaches. Totally. It's so crazy. And so digging into AlignAI specifically, I'm wondering if you can talk through some of the design and implementation of the platform that you're building and some of the ways that the overall goal and focus of the project have changed since you first started working on it?
[00:30:11] Unknown:
Yeah. So implementation sometimes pairs nicely with, like, bigger initiatives happening at the company. So if they're doing a big catalog rollout, or if they're expanding their team quite a bit and trying to get people onboarded really quickly, like, that could be a trigger point for someone to get the tool in and, you know, start working effectively immediately. Whereas the other approach is more of a passive approach, where you've got kind of a process in flight and you wanna get it all situated and documented, and also do kind of a health check on your existing process: are we missing anything fundamental?
How should we be prioritizing improvements in the future, and so on? So that's usually how we like to get started. So there's kind of a 3 phase approach. 1 is, okay, what capability are we focused on? So are we focused on data enablement? Are we focused on data quality, data stewardship? Kind of going in with a very specific focus, or, you know, model ops or data ops. And then the second is getting the workflow configured. So they take the appropriate templates from the list under that capability. So for data ops, we have a bunch of different what we call modules, and they can grab whichever modules are applicable to what they're currently doing, and they can take those workflows and kind of opt in or opt out of whatever is applicable to them. So, like, yes, we are doing data quality monitoring in an automated fashion, and so this module is gonna be applicable to us.
And then the next part after that is configuring all of the examples. So can you point to the tech stack? Can you point to specific use cases that demonstrate an example of that idea or concept? So 1 specific example could be: all right, in data ops, we are monitoring quality of this pipeline. And if an alert goes off, here's the system that we go into to troubleshoot that. So here's what we look at, and here's why we look at that. And here are the different paths that we can go down if there is an issue. Like, here are the different things that we've seen typically be the root cause of that, and here are the different systems you can go to to continue troubleshooting. And so we start getting into the nitty gritty of referencing out to the stack so that there are very tangible ways to demonstrate those concepts.
So that's usually what an implementation process would look like: starting high level, then getting towards the workflow level, and then getting towards the specific examples. Then you publish it, you make it available, and you have somebody kind of run through the program. Some organizations like to do kind of a bulk completion of running through a program, and some of them like to do it in small chunks and increments. That's what we suggest and recommend. I think what has changed since we started working on this: you know, initially we were taking a much heavier learning approach to solving this problem. So, like, your typical kind of training and educational approaches to solving the problem, and we explored that space a lot. And what we found was, number 1, a lot of things out there are super generic and hard to apply to people's day to day. So the things out there that are highly available are just not very transferable.
And number 2, nobody ever sets aside time for learning. In fact, most companies see it as just, like, nice to have, especially when the economy heads in the direction that it's heading. It's like the very first thing that people cut out is this, like, you know, learning element, which is interesting. And so we've kind of taken more of a harsh pivot into kind of documentation with subtle hints of learning elements incorporated because we don't think about it that way. But when you read documentation, you know, what are you doing? You're consuming information. You're learning about a system, and then you're using that to go make a decision. And so we've kind of pivoted a little harder into that direction. However, a lot of the core learning elements that make that process efficient are still incorporated in the way we're designing the product. So I'd say we've kind of floated around all of these different domains quite a bit just to see how organizations think about this today and what is gonna be the most natural way to integrate this new workflow into people's day to day jobs.
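As a rough sketch of the kind of data ops module described above, here is a hypothetical quality check whose alert carries a pointer to the agreed troubleshooting paths; the check, threshold, and URLs are all made up for illustration:

```python
# A toy quality check that fails with a reference to the standard, so
# the responder sees the agreed troubleshooting paths, not just an
# error. Check names, thresholds, and URLs are hypothetical.

RUNBOOKS = {
    "null_rate": "https://standards.example.com/dataops/null-rate",
    "row_count": "https://standards.example.com/dataops/volume-drop",
}

def check_null_rate(null_fraction: float, threshold: float = 0.05) -> None:
    """Automated check; the alert carries the standard with it instead
    of assuming the on-call person already knows where to look."""
    if null_fraction > threshold:
        raise AssertionError(
            f"null rate {null_fraction:.1%} exceeds {threshold:.1%}; "
            f"see {RUNBOOKS['null_rate']} for troubleshooting paths"
        )

check_null_rate(0.02)    # passes quietly
# check_null_rate(0.12)  # would raise, pointing at the runbook
```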
[00:34:55] Unknown:
Yeah. I really like that you mentioned the challenge of actually setting aside the time to engage with that information and really try to understand the system that you're working with, versus just: I really need to get this done, I just need to take the shortest path to get there. Maybe I'll look at the documentation for this 1 thing, but I don't have the time to really try and get a full view of what it is that I'm trying to work on, because I've also got 5 other things that I have to focus on. And some of the ways that organizations are able to structure the incentives of the workday so that it does encourage people to actually have a more complete understanding of the systems in which they are operating and the work that they're doing. And I'm curious if there are any kind of design elements of what you're putting into AlignAI to make that a smoother path, where it doesn't have that kind of perceived barrier of, oh, well, now I have to go and take this 3 hour long course before I can get the rest of my work done. And just being able to kind of chunk that up into consumable pieces that have the right amount of information to get the next step done, but also have the kind of encouragement to proceed further, even if that's not directly what you're trying to achieve this second, but it will be kind of beneficial to your overall success.
[00:36:13] Unknown:
Yeah. We're definitely trying to go more of a pull versus push approach. So if you think about it, like, the courses are more of a push approach from our side to theirs. So, like, we're pushing information to them, as opposed to, like, if you're actively working with an API and you're looking at documentation, you're gonna go find the part of the API documentation that's applicable to the code you're writing. And that is more of a pull approach from the user, where it's like, I'm gonna go pull the information I need at the time that I need it. And I think it highly, highly depends on where the organization is at with the capabilities. So I have this, like, spectrum that I keep referencing to people in conversations because I think it lays it out pretty nicely. So there's kind of this net new aspect. All the way on 1 side of the spectrum is net new, and all the way on the other side of the spectrum is sustain.
And so net new capabilities are like: we don't have a catalog today. We don't do metadata capture today in a way that's meaningful. And we don't have a way to find data, discover data, search data, access it. Like, that's kind of a net new thing. So as an organization, we want that capability, we don't have it, and if we do, it's in very, like, early stages of maturity. And so that's on 1 side. And the other side is: we do have a catalog. People are using it today. We have stewards. We have people who are in the catalog. And we're trying to sustain that function and make incremental improvements over time so that people aren't as frustrated.
And so I think on the 1 end, the net new, that's behavioral change, and there's really no way around it besides kind of that push approach, like, we're pushing new information to you. Your goals and responsibilities are gonna change as an employee at the company. Like, you're gonna have to interface with this new tool, and you're gonna have to do these new functions that we didn't hire you for originally. And so there's really no way of getting around that. Like, they're gonna have to consume a decent amount of information to understand the general concepts and why, and it's more of a foundational shift. And then the other side of that coin is more of the pull side, where, like, I'm actively doing this function as a day to day thing in my job.
And I have expectations operationally around this function as an individual. And so I'm just trying to get the information I need. And there might be small changes that I need to adjust to, but I should be able to reference something and have more of that pull feeling. And I think there's expectations set on both sides of that. So we're trying to figure out how we design elements of the product that support both of them. So can we get a, you know, program, which is what we call them, up and running inside of the product? And if they are net new, create ways for them to receive that information or facilitation that makes sense. So can we create cohorts of individuals where there's dedicated time to learning this new function that they're going to be responsible for, so there is still a forcing function from the business? And/or can we take those published programs that people are actively using and figure out ways that they can reference them in the flow of work? So it's more of an "and" function for us, because we've seen the capabilities on all sorts of spectrums around maturity.
[00:39:33] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need to look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end to end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. In terms of the kind of organizational workflow around AlignAI, I'm wondering if you can talk to some of the ways that it fits into the day to day work of the different people who are interacting with the data and some of the ways that you think about making it accessible and addressable for the kind of different types of roles that need to interact with the data, so data engineers, analysts, business users, and also some of the ways that you work with organizations to build up the organizational awareness of the fact that we do have this solution. We're investing in AlignAI because we want to be able to be more effective with how we manage our data, and this is the system that is going to be our repository of knowledge so that if you have a question, this is a place that you can go to get it answered?
[00:41:14] Unknown:
Yeah. I think it's really in this handoff point of the intersection of all these individuals. So because it is so cross disciplinary, like, I think that's the biggest organizational shift that companies have experienced with unlocking these capabilities. Yeah. There's really neat technical aspects to this change and transformation, but a lot of it's organizational. Like, you know, I think for the first time, organizations were like, oh, we should have technical teams or analysts or data scientists or data engineers embedded within our business function. Like, that was not a thing. It was like everything was centralized under IT and, you know, technical support. And so I think this is absolutely changing the way people are collaborating at companies because there are so many different personas and so many different individuals who have to coordinate on a specific topic, and they're all related, and it's all dependent on each other. So the tool is meant to create that knowledge hub with that individual in mind. So if we do have a data stewardship class or, like, course or program, the different personas that engage with that are going to see the information relevant to them. So if I'm a data engineer or if I'm a business data steward or if I'm an analyst, we all touch data stewardship in some way, shape, or form. As a data engineer, I have to make sure that that data is getting to the catalog. I have to make sure that it is connected.
In terms of lineage, I have to make sure that all the metadata from the technical systems is getting populated. And as a steward, I have to find the data that I'm responsible for. I have to make sure that the definitions are correct. I have to make sure that the usage of that data is correct. And so, and I'm gonna use, you know, RACI, the responsible, accountable, consulted, informed element of that: there's so much coordination and collaboration for that capability that that program should be able to meet the needs of all those individuals and allow them to collaborate much better, because they do understand where the handoff points are.
And they can also see it from a more macro perspective. And so I think that's 1 of the most interesting things that we're able to facilitate: that kind of softer piece of collaboration that's not just, like, a Slack group, you know, for answering questions between people, or some of these other documentation hubs, like Confluence, for example. It is more in line with and specific to roles and responsibilities across these functions. So I think that's been really interesting to see, like, how people engage with that and how well those standards are adopted by individuals, because they can understand who they have to talk to for what and where their work starts and stops.
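To illustrate the RACI framing, here is a toy sketch, with entirely hypothetical roles and activities, of making those handoff points explicit in a stewardship program:

```python
# Each activity names who is Responsible, Accountable, Consulted, and
# Informed, so handoff points between roles are explicit. All roles and
# activities here are made up for illustration.
RACI: dict[str, dict] = {
    "populate technical metadata": {
        "R": "data engineer", "A": "data platform lead",
        "C": ["business data steward"], "I": ["analysts"],
    },
    "maintain business definitions": {
        "R": "business data steward", "A": "governance lead",
        "C": ["data engineer"], "I": ["analysts"],
    },
}

def handoff_point(activity: str) -> str:
    """Where my work starts and stops: who is responsible for this step."""
    return RACI[activity]["R"]

print(handoff_point("maintain business definitions"))  # -> business data steward
```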
[00:44:07] Unknown:
In your work of building the AlignAI platform and working with organizations to help them get onboarded and address some of these gaps in information sharing and information capture, what are some of the most interesting or innovative or unexpected ways that you've seen your platform applied?
[00:44:26] Unknown:
What's interesting is, like, this desire for community inside of companies. It's come up a couple of times: hey, can we create an expert community through this? Like, can we get all of the experts across the organization who are building models or deploying models and get them connected to each other, especially for large enterprise organizations? Or how do we generate more of a community element to this capability at the company? That was kind of surprising to me in some ways, just to see the major desire to connect folks that are doing similar functions across the organization. I think that's an interesting element, and it's kind of driving us in terms of functionality towards the ability for individuals to collaborate and review new standards with each other in a way that is structured. So it's not just, hey, read this, like, 7 page document and tell me what you think.
It's more of: I have a suggestion on a way to do this better because of this example we just saw. So, like, the standard that's published breaks down when I try to do x. And, like, does that mean that we need to make a fundamental change to the entire standard, or is this an exception to the rule? And I think it's about providing a way for people to contribute and review and understand where those deviations are happening very quickly and efficiently and effectively, and, like, approve or not approve. I think of git branches, right, when it comes to this, except for information.
Like, are we able to effectively and efficiently do that and facilitate that inside of the tool? So I think that's been a very interesting observation of how people want to be able to manage this at a macro level. Because, you mentioned, there's always that person who's, like, leading the charge, you know. Or maybe there's not. But in some cases it's: I'm gonna, you know, put a line in the sand, and here's the way we're gonna do it as a company, and I'm gonna put my neck on the line because of that. Or is it more of a collaborative type of workflow of how we're going to define it and test it and make improvements to it over time?
[00:46:40] Unknown:
Another interesting aspect of the space that you're operating in, focusing particularly on the data workflows in an organization: I'm curious what are some of the types of integrations that you either have had to build or are considering building to be able to bring in some of the relevant details, so that when you're looking at a piece of documentation, it's not just a wall of prose. You also have maybe snippets of SQL or, you know, data lineage views, or views on maybe some of the statistical elements of the tables that you're working with: oh, it's this many rows, this is the last time it was updated, this is the number of times that it's being accessed, and some of those kind of more real time, evolving aspects of the data that you're working with and the context that you're trying to capture around that data with AlignAI?
[00:47:31] Unknown:
Yeah. I'd say there's 2 core areas of integrations. The first is more on, like, a portfolio level. So, like, today we just reference out to the tech stack, so we don't have full integrations with their tools today. That would also just be a very, very hard thing to support in terms of integrations, because there's just a massive amount of tooling in all of these different areas. And so we're almost kind of kicking them out to somewhat of a portfolio of examples that describe that idea well enough.
And so we're avoiding the integrations from that perspective as much as we can. The other element of integration is that, in some cases, we are seen more as a content engine that feeds all of these other consumer-level platforms, like Confluence or SharePoint, where that workflow already exists. They don't want yet another documentation tool that people have to log into and interface with, and that's totally reasonable. So we're seen more as a content engine on the creator side, where we can push some of these more structured elements out into the existing ecosystem, which is helpful for people consuming them who already have a standard workflow for that. So it's kind of: integrate, or build your own. In some cases, they don't have anything they're comfortable using that they think is effective today, and we can bring our own element of that. But in other cases, they definitely don't want to add to the confusion at that initial access point.
So, yeah, those are the two areas of integrations that we're very focused on.
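As a rough illustration of the content-engine pattern described above, the sketch below pulls live table stats and pushes a structured snippet into an existing wiki. The warehouse query assumes a Snowflake-style information_schema and a DB-API connection; the base URL, credentials, and helper names are hypothetical placeholders, though the POST to /rest/api/content is the standard Confluence Cloud REST API.

```python
# Minimal sketch (not AlignAI's actual integration) of a "content engine":
# render live table stats as a structured snippet and push the page into
# Confluence so readers stay in their existing workflow.
import requests

BASE = "https://your-domain.atlassian.net/wiki"  # hypothetical instance
AUTH = ("you@example.com", "api-token")          # hypothetical credentials


def table_stats_html(conn, table: str) -> str:
    """Render freshness and size stats for a table as wiki-ready HTML."""
    with conn.cursor() as cur:  # assumes a DB-API-style connection
        cur.execute(
            "SELECT row_count, last_altered FROM information_schema.tables"
            " WHERE table_name = %s",
            (table.upper(),),
        )
        rows, last_altered = cur.fetchone()
    return (
        f"<h2>{table}</h2>"
        f"<p>Row count: {rows}<br/>Last updated: {last_altered}</p>"
    )


def push_page(title: str, html: str, space_key: str) -> None:
    """Create a page via the Confluence Cloud REST API."""
    payload = {
        "type": "page",
        "title": title,
        "space": {"key": space_key},
        "body": {"storage": {"value": html, "representation": "storage"}},
    }
    resp = requests.post(f"{BASE}/rest/api/content", json=payload, auth=AUTH)
    resp.raise_for_status()
```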
[00:49:21] Unknown:
In your work of building this business and working with your customers and exploring this challenge of knowledge sharing and knowledge capture for data practitioners and data applications, what are some of the most interesting or unexpected or challenging lessons you've learned in the process?
[00:49:37] Unknown:
I'd say companies don't have as much of an emphasis on standardization in this area as I thought they would. What's very interesting is that there are going to be more and more regulations, federal regulation and legislation, around being organized in a lot of these areas. The reason I say that is because there are requirements around auditability and transparency for models that are being developed, deployed, and interfaced with by consumers. There's all sorts of demand for responsible AI, responsible solutions built on top of data, and responsible usage of data. So I think there's going to be a legislative element to that at some point, and if companies don't start getting their ducks in a row, they're probably going to be hit with a lot of challenges, just like they were with GDPR.
So I think that's an interesting observation. I assumed that people had more of this in place and more of a handle on the areas they were actively working in, but that's generally not been the case.
[00:51:00] Unknown:
For people who are struggling with how to popularize the knowledge that they or their team have about how they're working with data, the data that they have, and some of the challenges they're experiencing with it, what are the cases where AlignAI is the wrong choice?
[00:51:17] Unknown:
Yeah. I'd say there are very few people we've talked to who have a homegrown solution they've been able to put together that's working for them at the stage they're at, and I think that's great. If you've got something working for you today that you've been able to build internally and maintain, and it's scalable, there's no need for us to come in and replace it. I'd say, honestly, where those tend to break down is when there's massive change or massive scale. If the company is growing that function quickly, or if there's massive change in the capabilities or tech stack, that homegrown solution tends to get outdated very quickly, and it's just another thing you have to maintain and manage internally.
I'd also say that companies who are not putting much investment into this area, into making improvements or building out those functions and capabilities, probably don't have a huge need for AlignAI, because our product thrives on improvement and change. It thrives with people who are actively working in that function and need to adhere to a set of standards. So if they're not heavily investing in data or AI, it's probably not a good fit for them.
[00:52:41] Unknown:
As you continue to build and iterate on your product and work with your customers and the community, what are some of the things you have planned for the near to medium term or any particular problem or project areas that you're excited to dig into?
[00:52:53] Unknown:
Yeah. We're going to focus probably all of 2023 on really honing those workflows and optimizing for utilization, for users interfacing with our product regularly. But one thing I'm really excited about, that's more mission-setting and feature-focused, is a marketplace, where it's not just the templates we've built over the last couple of years that are available to our customers. Other experts in the industry, who have an opinion on approaches that have worked, or workflows, or specific elements of capabilities that are useful in certain scenarios, can build their own templates very easily through AlignAI, make them available to companies, and make money off of that as creators. I think there's a lot of really fun potential in that marketplace, where we can open it up a little more and provide a mechanism for experts in the industry to interface with companies directly in a way that's very efficient and very practical.
Today, people do that mainly through consulting services, so this would provide a repeatable, recurring income source for those experts who have proven methods and approaches in some of these key areas. I'm actually really excited about that.
[00:54:17] Unknown:
That's very interesting, and I definitely look forward to seeing it become a reality. Are there any other aspects of the overall problem space of knowledge capture and knowledge sharing for data projects, and the work that you're doing at AlignAI, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:34] Unknown:
I think one area that we haven't specifically or intentionally focused on is documentation for individual projects. Oftentimes, when you create a solution, a model, a dashboard, or a data pipeline, there's documentation that comes with it, whether it's in the code or around usage of that solution. That's not an area we touch today; we operate more at the workflow and capability level. But it could be something we explore in the future, in line with what we're doing. We've seen in a lot of our engagements that there tends to be a pretty structured way of documenting how to use a dashboard, or getting people onboarded to different solutions that are built on top of data. So I do think that could be an interesting area to continue exploring.
[00:55:23] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:39] Unknown:
This one's so fun, because there's just so much being built in the data ecosystem right now. For a while, it was more on the warehousing side: cloud migration, efficiency and speed of processing. Then it moved into the virtualization layers: creating logical layers of data without moving the data and incurring the costs of moving it, which is a more expensive proposition now with tools like Snowflake. And on top of that, we've seen a huge surge of catalogs hitting the market that aren't the massive, monolithic, all-in-one platforms that have always existed.
They're now smaller and more focused, with niche emphases on data cataloging, metadata capture, lineage, and things like that, which I think is really interesting. And now, to me, it's more on the analytics engineering side: properly defining and developing metrics, collaborating on those metrics, creating deviations of those metrics, creating context around those metrics. That analytics layer on top of data is really interesting, along with how it affects how data is stored underneath it and how people can iterate between those two layers appropriately.
I think that's pretty fascinating. Some of the other tools coming out around pipeline automation are interesting as well, but I've more often seen organizations really struggle with this metrics layer. Some companies are using dbt for it; others are using tools like Alteryx to automate the workflows on top of data that generate metrics. I just think that's a fascinating space, because there's still so much ambiguity there and a lot of best practices to be developed around creating those stores that people can engage with. I think that'll be an interesting space to watch.
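For a sense of what that metrics layer involves, here is a small, hypothetical sketch of a canonical metric with named deviations and attached context, loosely inspired by dbt-style metric definitions; none of the names come from a specific tool.

```python
# Illustrative sketch of a "metrics layer": one canonical definition,
# named deviations (extra filters), and attached context, so consumers
# use the shared metric instead of re-deriving SQL by hand.
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    sql: str                    # canonical expression over a model
    description: str            # the "context" attached to the metric
    filters: dict[str, str] = field(default_factory=dict)

    def deviation(self, name: str, **extra_filters: str) -> "Metric":
        """A variant that inherits the canonical definition."""
        return Metric(
            name=f"{self.name}__{name}",
            sql=self.sql,
            description=f"{self.description} (variant: {name})",
            filters={**self.filters, **extra_filters},
        )

    def to_sql(self, table: str) -> str:
        where = " AND ".join(f"{k} = '{v}'" for k, v in self.filters.items())
        return f"SELECT {self.sql} FROM {table}" + (f" WHERE {where}" if where else "")


revenue = Metric("revenue", "SUM(amount)", "Recognized revenue, USD")
emea_revenue = revenue.deviation("emea", region="EMEA")
print(emea_revenue.to_sql("fct_orders"))
# SELECT SUM(amount) FROM fct_orders WHERE region = 'EMEA'
```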
[00:57:47] Unknown:
Yeah, it's definitely one of the interesting new areas that nobody has come to any real agreement on yet, so I look forward to watching that space as well. Thank you again for taking the time today to join me and share the work that you're doing at AlignAI, and your perspectives on the challenges organizations face in capturing and spreading information about how data is being used, and the protocols and practices around that. I appreciate all the time and energy that you and your team are putting into making that a more tractable problem. Thank you again, and I hope you enjoy the rest of your day and have a happy new year.
[00:58:23] Unknown:
Thank you so much. I really enjoyed being on the show. Your questions are excellent and you have, like, the perfect podcast voice. So thanks again for having me on here, and happy New Year to you as well.
[00:58:40] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Chapters
- Introduction to Rehgan Avon and AlignAI
- Goals and Focus of AlignAI
- Addressing Core Issues with AlignAI
- Strategies for Effective Knowledge Capture
- Challenges in Scaling Data Practices
- Common Failure Modes in Data Teams
- Design and Implementation of AlignAI
- Pull vs. Push Approach in Knowledge Sharing
- Organizational Workflow and Role Integration
- Lessons Learned and Future Plans