Summary
Data governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process. In this episode he shares his thoughts on the balance between human and technological processes that are necessary for a well-managed data governance strategy, how Collibra is designed to aid in that endeavor, and his experiences using the platform that his company is building to help power the company. This is an excellent conversation that spans the engineering and philosophical complexities of an important and ever-present aspect of working with data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Stijn Christiaens about data governance in the enterprise and how Collibra applies the lessons learned from their customers to their own business
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Collibra and the story behind the company?
- What does "data governance" mean to you, and how does that definition inform your work at Collibra?
- How would you characterize the current landscape of "data governance" offerings and Collibra’s position within it?
- What are the elements of governance that are often ignored in small/medium businesses but which are essential for the enterprise? (e.g. data stewards, business glossaries, etc.)
- One of the most important tasks as a data professional is to establish and maintain trust in the information you are curating. What are the biggest obstacles to overcome in that mission?
- What are some of the data problems that you will only find at large or complex organizations?
- How does Collibra help to tame that complexity?
- Who are the end users of Collibra within an organization?
- Can you talk through the workflow and various interactions that your customers have as it relates to the overall flow of data through an organization?
- Can you describe how the Collibra platform is implemented?
- How has the scope and design of the system evolved since you first began working on it?
- You are currently leading a team that uses Collibra to manage the operations of the business. What are some of the most notable surprises that you have learned from being your own customer?
- What are some of the weak points that you have been able to identify and resolve?
- How have you been able to use those lessons to help your customers?
- What are the activities that are resistant to automation?
- How do you design the system to allow for a smooth handoff between mechanistic and humanistic processes?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Collibra used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing Collibra, and running the internal data office?
- When is Collibra the wrong choice?
- What do you have planned for the future of the platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Collibra
- Collibra Data Office
- Electrical Engineering
- Resistor Color Codes
- STAR Lab (semantics, technology, and research)
- Microsoft Azure
- Data Governance
- GDPR
- Chief Data Officer
- Dunbar’s Number
- Business Glossary
- Data Steward
- ERP == Enterprise Resource Planning
- CRM == Customer Relationship Management
- Data Ownership
- Data Mesh
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan. That's A-T-L-A-N, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Stijn Christiaens, also known as Stan, about data governance in the enterprise and how Collibra applies the lessons learned from their customers to their own business. So Stan, can you start by introducing yourself?
[00:02:10] Unknown:
Yes. Of course. Thanks for having me over. I'm Stan from Collibra, one of the cofounders of the company. That was in 2008. So we've been doing this for about 13 years now. And I've had a variety of roles in the company. I've been responsible for presales, post sales, partnerships, product, the whole nine yards. But right now, what I'm responsible for is our own data office, our own data lake. And on Friday evenings, we call that drinking our own champagne, and on Monday mornings, we call that eating our own dog food.
[00:02:41] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:44] Unknown:
Yes. I do. And you're probably not gonna believe it, but I studied as an engineer, an electrical engineer, but I'm color blind, so you can imagine me putting resistors on a circuit board. One of the courses we got there was on SQL, and we got this thick, thick book of SQL, the language, and databases. And I was thinking to myself, who needs this? What is this? And mind you, this is, like, 1998, the year 2000, maybe, something like that. I forget. Who needs this? You have to pay money for such a book. Right? So I was thinking, as soon as I run through this course of something that nobody will ever need, I'm gonna sell that book. So the only book that I ever sold was a SQL book. So, obviously, I wasn't intending to go into data management necessarily when I was in school, but I went to a software company after my studies, also did some AI. So, obviously, the data immediately comes back.
And then after that software company stint, I ended up at the university again, but as a research engineer. And there we were, doing research in a semantics lab called STAR Lab, the semantics, technology, and applications research lab, with a database professor. So I worked there for about 3 years, and there I was really sucked into the deep state of the art, if you will, when it came to data. That's how I got sucked back into it. And then I realized, our professor always taught us this, you know, legacy is always going to be around. Right? And people have been putting their data in relational databases and in data warehouses and whatever else. And that data is gonna be there for a while just because these large organizations don't change very quickly.
So that's how I ended up in data management. And from that university stint, we then saw the opportunity. Data really has a problem. Right? The way organizations deal with data, the way people deal with data, something can be improved. We can do better than this. And from that idea, from that belief, we then started the spin off, Collibra, and that's how I kept being in data management until today.
[00:04:53] Unknown:
Yeah. And your point about the legacy technologies never really going away is definitely a valuable thing to keep in mind because, you know, in the current day and age, where there are new databases and new technologies coming out, it seems like, every other day, you know, it's easy to think, oh, well, you know, we'll just use this new system, and I'll have a greenfield. But there are companies that have petabytes or terabytes of data living in a data warehouse or living in a, you know, an appliance on a rack somewhere that's been chugging along just fine for a while. So why are they gonna spend money to migrate it into the cloud or use all the shiny new tools if they've already got something that works?
[00:05:22] Unknown:
Exactly. And even if you're a person with a technology background and a technology job, even then, it is hard to keep track of all the new technology that comes out. So there's so much new data technology that comes out, new ways to move data, new ways to store data, and they come up with new names all the time. Right? From the warehouse to the lake, and I'm sure it's gonna be called something else tomorrow. But even for the technology people, it's hard to keep track. And indeed, if you're in a large company and, okay, I'm learning this new tool, 3 years later, there's new tools again to learn, but you're stuck also with managing the stuff you've already built. Right? You can't just resolve what is there all of a sudden.
[00:05:46] Unknown:
Yeah. And it definitely seems like a lot of these technologies come out overnight. But then as you dig into it, you realize, you know, oh, this new data warehouse that I'm using today has actually been in development for the past 10 years, and it's using lessons that it learned from systems that are 20 or 30 years old. Postgres has been around for, what, 30 or 40 years now, and it's still going strong. So there's a lot to be said for the value of legacy technology because it is so well understood and has so much existing tooling and, you know, general knowledge in the community about how to actually run it versus all these new systems that you have to constantly be trying to run and keep up with.
[00:06:38] Unknown:
A 100%. The key principles in data, they don't change. Look at even Microsoft, right, who hasn't always been popular with developers or data people, necessarily. But when you look at Azure now, it's a slick piece of work. But when you look at when they started with their SQL database, that's probably at least 2 decades ago. Right? So data and data management is definitely here to stay.
[00:07:00] Unknown:
Absolutely. And so digging more into Collibra itself now, can you give a bit of an overview about what it is that you're building there? And you mentioned already a little bit of the story behind the company, but maybe talk a bit of how the lessons that you've learned in the early days have allowed you to keep it going and, you know, some of the ways that it has maybe changed or evolved since you first had the idea and launched it as a product?
[00:07:31] Unknown:
Collibra today, we're known as the data intelligence company, and I'll get back to that. But the way our company grew up, if you will, and got known in the market was as the leader in data governance. That's how we really started our platform, and I can tell you a lot of stories about data governance and what it is and what people like about it or maybe dislike about it. But, essentially, that's where we started, and we also saw then, okay, next to governance, people are also working on data lakes, and they wanna know what's in there, and they wanna have a catalog to do that. And then, let's say, around 2016, the granddaddy of privacy regulations came out, GDPR. Everybody's uncle and aunt knows about GDPR because of all the checkboxes they have to tick on websites these days. So we added privacy. We added lineage. We recently acquired a data quality company. So over the years, we've really tacked on a number of things in our platform related to more than just data governance. So that's why we started calling it data intelligence.
And in short, what that is about is trying to connect the data to the right people, insights, algorithms, and ultimately outcomes. Because sometimes people in data forget, okay, let me crunch this data or let me build a beautiful model. If you don't manage to harness that inside the business and realize its value in growth and revenue or cost savings or what have you, then no matter how beautiful or elegant it is, it doesn't really affect the organization. I think if you're asking, you know, what did you learn about data governance, it's that most people don't really understand fully what it is. So I came up with a spiel to explain to you what it is. And if you permit me, we can do it live.
[00:09:07] Unknown:
Yeah. Absolutely. Definitely would be interested to get your take on, you know, what is data governance and what does it mean to you. Because depending on where you look, it could mean access control. It could mean, you know, data stewardship. It could mean business glossaries. It could mean lineage. And then if you go to Wikipedia, it's actually all of those things and more. So I'd definitely be interested to get sort of your take on what is data governance for somebody who is uninitiated to the idea.
[00:09:30] Unknown:
Everybody who's sort of been in data governance for a while typically develops their own definition of what it is, which is probably why there are so many different definitions out there. Right? I've tried to keep mine short and simple, and I call it the control and enablement of any and all data management activities. Now let's break that down for just a second. Right? So data management activities, in my view, it's about you're storing data somewhere, databases. You're moving it somewhere, ETL, data pipelines. You're consolidating it somewhere, a warehouse, a data lake, and then you're reporting on it. And I'm purposely simplifying the whole space a little bit. We're talking about a collection of really software industries. Right? We're talking about tens of billions of dollars. So I'm sorry if I'm simplifying a little bit. That's what companies have been doing for decades. Right? Storing, moving, consolidating, and reporting on data. And governance is sort of a layer above that. So governance is not about storing the data itself, but rather about, hey, how should you best store data? Right? And how should you best provide access to it? So governance is more about making sure that the people who have to do something with data know what they have to do, how they have to do it, and who is responsible for doing it or helping them.
That's all the things that governance is about. But the challenge, I think, that has happened in the market is that, well, there's a number of challenges in the market, but essentially, people have overly focused on the control part of my definition. Because in 2012, there was a regulation that came out for the banks to mitigate the financial crisis of 2008. That regulation essentially said, hey, mister big banks. You now have to have a chief data officer. You now have to do data governance. So they did that, but their charter was regulatory driven. So they over focused on the control part, which is why data governance also got this. And people on the podcast can't see it. Right? But I'm waving my little finger here. Like, no. You cannot touch the data. Right? Data governance is the data police. So it became about the control part. But from my point of view, governance is like good governance in a company.
It's about organizing a company in such a way. It's about making sure that the management of stuff is done in the right way. And if you go into a company and the company has good governance, then a lot of things are arranged for you. You know, you're getting a laptop. You can ask for vacation days and a whole bunch of other things. So the company, the business, just works really well. And the same is true for good data governance. If you have good data governance in place, then everybody in the company is better enabled to do their work with data. Does that help?
[00:12:14] Unknown:
Yeah. It's definitely a very good and holistic view of data governance where, like you said, a lot of times, you think about it and you think, oh, it's just a matter of control and making sure that the data doesn't get leaked or that you have good data security in place or personal information management. But my own understanding of data governance is definitely more in line with what you're saying of it's the entire sort of overarching piece of how the data is actually interacted with by the people in the business and understanding where it lives, why it is where it is, how it is where it is, and all of the sort of, in some ways, it's sort of the metadata of the data platform, but on the sort of people and organizational scale more than just, you know, in terms of how that underlying technology actually links things together.
[00:13:00] Unknown:
You should come work for us.
[00:13:05] Unknown:
Maybe someday. And so with this definition in mind, you know, there are a number of different companies that claim to offer data governance in different capacities. And so given that Collibra's main mission is this overarching data governance capacity and you have acquired and built these different capabilities for data quality and data cataloging and all the different things that fall under the umbrella of data governance, I'm curious to get your perspective on what the current landscape of data governance looks like and how Collibra fits within that overall space.
[00:13:38] Unknown:
So I think for one part, you have the people who look at data governance as a people and process problem. Like, you and me were talking just now about what we think data governance is, our view on it. A lot of people can easily say, that's a people and process problem. Right? Because people have to do the right thing, then they have to follow the right steps. So this is not something you solve with software. So you have a lot of those situations in the landscape where people are just having something, some kind of technology that's maybe solving a piece or a portion of it. And everything else that's not solved, that's a people and process problem. Like, it's an escape. You can't solve it with a piece of software. Obviously, that's not the thesis we subscribe to. This adds to the confusion in the market, I think, because you also have what I call the sticker people. People have some kind of, again, technology, like you said, maybe an access technology or maybe a glossary technology or some ETL capability and have some metadata sprinkled on top of it. And then they put a sticker on that and then they say, this is data governance. So those are the data governance sticker people.
And I don't understand that at all. Right? Because for some reason, they think that data governance is sexy, and they feel the need to put a sexy sticker on whatever it is that they're offering. But, typically, the sticker doesn't solve the problem that is there. Right? So you have the people and process teams, you have the data governance sticker people. And then I think you also have the players out there who are offering really, like, a point solution, which they claim is all of data governance. So maybe they just solve glossary, for example, or maybe they just solve one piece related to GDPR.
But governance is, to your point also earlier, more holistic. Right? So from our viewpoint, governance requires sort of platform capability that does cover glossary, that does cover GDPR, that does cover catalogs, and that does cover this and that and the other. And we like to refer to that as an old data intelligence stage. Think of it like this. When you have a data lake, right, and people are asking you, I need to know what's in my data lake. People typically answer this with, okay. Let me build you a catalog of data. Then you'll know what's in the lake. But then you also have on the other side of the organization people in legal or privacy who have to make something called a process registry for GDPR purposes.
And the process registry essentially tells you, okay, what business processes do you have and which data are they consuming. So essentially, it's a data map and how the data is being used, which you've also created for your catalog. Like, you're making a data map twice for different purposes, and it's obviously gonna be different. But in reality for organizations, those data maps should be the same thing. So that's why our viewpoint, and I think we're quite unique in that in the market, is that this space requires a horizontal platform solution that touches on all of the capabilities.
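The point about building the same data map twice can be sketched in a few lines. This is only an illustration under stated assumptions, not how Collibra implements it: the `Usage` record, the sample processes and datasets, and both view functions are hypothetical names invented for this example. It shows one shared data map from which both a catalog view and a process-registry view are derived, so the two cannot drift apart.

```python
from collections import defaultdict
from typing import NamedTuple


class Usage(NamedTuple):
    """One edge of the (hypothetical) data map: a process consuming a dataset."""
    process: str
    dataset: str
    personal_data: bool


# Single source of truth, maintained once. All values are made up.
DATA_MAP = [
    Usage("billing", "crm.customers", True),
    Usage("billing", "erp.invoices", False),
    Usage("marketing", "crm.customers", True),
]


def catalog_view(data_map):
    """Catalog perspective: for each dataset, which processes consume it."""
    view = defaultdict(set)
    for usage in data_map:
        view[usage.dataset].add(usage.process)
    return dict(view)


def process_registry_view(data_map):
    """Privacy perspective: for each process, which personal datasets it uses."""
    view = defaultdict(set)
    for usage in data_map:
        if usage.personal_data:
            view[usage.process].add(usage.dataset)
    return dict(view)


if __name__ == "__main__":
    print(catalog_view(DATA_MAP))
    print(process_registry_view(DATA_MAP))
```

Because both views are computed from `DATA_MAP`, a change to the underlying map shows up in the catalog and the registry at the same time, which is the argument made above for a horizontal platform.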
[00:16:31] Unknown:
To your point about the sort of sticker companies where they say, this is data governance, and this is why you wanna buy it. We'll solve all your data governance needs, but it's really just a point solution in disguise. I think, you know, one of the driving factors there is, you know, sort of fear based, where a company realizes, oh, I have to be in compliance with this regulation. And so they say, you know, just sell me a product that'll solve my problem for this pain point, not really realizing the potential value that they can get by viewing it from that horizontal perspective.
[00:16:59] Unknown:
Exactly. Exactly. And, also, I get the fear thing, right, with the fines and what have you. But what I've seen, and, again, I go back to that 2008 crisis and the 2012 regulation that came out of it. At that point in time, I had the impression that the regulations started changing. They became more principles based, and they went more to the fundamentals of the underlying problem rather than saying, hey. I'm the regulator, and you need to provide me a report that you're doing a good job. Right? You needed to change how you were doing things. And GDPR was similar to that. GDPR also took that principles based approach. Obviously, people still put the checkboxes on their website. They got the easy stuff out of the way fast.
The real meat of the GDPR, just like with that financial regulation, is still out there waiting for most organizations how they deal with data. And that trend of regulation is not stopping. You probably also saw the AI regulation that Europe brought up this week, where again it's the same thing. Right? Like, you're putting a model in production. Do you know what data goes in it? Do you have explainability around the algorithm? Do you know that there's no bias in it? How do you know how this is put together? How do you know that you have control over your output? How can you explain it? And and a bunch of other things. You can't just checkbox that away.
[00:18:19] Unknown:
Yeah. And so digging more into kind of the more horizontal aspects of it and all of the different little detail pieces that go into data governance, in your experience of working in this industry and working in that particular space for a while now, what have you seen as being the elements of governance and those little detail points that often get overlooked or sort of willfully ignored, particularly in smaller to medium businesses as they're first getting started and building out their data capacity?
[00:18:49] Unknown:
Yeah. For smaller organizations, let's say you're 10 people. Right? You're a start up. I was gonna say you're probably all in the same room, but nowadays, that's not true. Right? You're probably all in the same Zoom room or something like that. But, essentially, you're still small and connected. You don't have that, what is it called, the Dunbar's number problem or something like that, where there's just too many people and too many connections to keep everybody aligned. So small organizations, in my view, what they will typically skimp on is everything to do with risk. Smaller companies tend to be a little bit more risk taking than larger ones.
So you'll find less process in smaller companies. You'll find less clearly delineated responsibilities, and all of those are governance pieces. So in a small company of 10 people, nobody is gonna be talking about data owners. Right? Because they know it's Joe who manages the CRM system who's the owner of that data. And if there's a problem, they're gonna go into the next room and talk to Joe. Same thing with, like, a business glossary, for example. If you have 10 people and you talk about ARR, you're just gonna assume you're talking about the same thing because you're talking to each other so much that you probably don't need to do anything else. But the risk part, even for small organizations, is not to be underestimated.
Recently, I heard a story of some start up with a machine learning thing in production, and it wasn't controlled in any way. Whether it was garbage in, garbage out, I don't know what the cause was, but, essentially, it went haywire and the company went bust. So even if you're an SMB, there's still some, like, sliver of governance that you have to do. Otherwise, you're exposing your business to just too much risk.
[00:20:31] Unknown:
All of that also goes down to the fundamental aspect of trust in your data, trust in the other people in your organization who are controlling that data and working with that data. And I think that a lot of the aspects, at least in terms of the people layer of data governance, is in managing trust and establishing trust both with the data that you're working with and across the organization, you know, in terms of the ways that the different units interact with each other and exchange data. And I'm curious what you see as being the kind of biggest challenges to establishing and maintaining that trust and the biggest obstacles to overcome, particularly if you start off as a small and medium business and don't have all of these principles and ideas baked in from the beginning of your working with data?
[00:21:16] Unknown:
So trust is a very complicated concept. So fundamentally, it's something in between people, I believe. But from a data point of view, there's trust because you know where it came from. Like, you know the data came from that system, and I have some kind of trust in that. Maybe you know how the data was manipulated or how it was changed. You know how it flows. That factors into trust. Quality also factors into trust. Like, the data, its accuracy is high. Its timeliness is correct. Its relevance is correct, and a bunch of other dimensions. So all of these you can translate into some kind of score, simplifying a bit. Right? But you could say the data is 80% up to par according to our planned use. So data quality factors into trust. It has to have some kind of minimum level of quality before you can trust it. That's continuous. Data in that sense is a little bit, like, flowing like water.
It's not because the data is qualitative enough today that tomorrow, it's still the same thing. Right? So it's a continuous thing that you have to do to make sure that the trust remains high. Understanding is a big thing. You know, maybe you found the data. Maybe you know where it came from. Maybe you check the quality. But if you don't understand the data, is this somebody's first name or last name? Is that a birth date or a credit card transaction date? If you don't understand what that field is, for example, or how the data can be used or what it really means or what it could mean, you're not gonna really trust it. Like, if like, we're speaking to each other, and you may be a friend of mine for many years. But if I cannot understand what you're saying, it's gonna be really hard for me to trust you. Right? So understanding of what data means or what it can mean, what it can be used for is super important.
Ultimately, like I said, trust is a social aspect. Right? Let's say I'm an expert in data in our company, and everybody knows I'm an expert in data. And I say about this dataset, 2 thumbs up. This is a great dataset. People will trust that. Right? Because I sort of put my reputation on the dataset by thumbing it up or by working with it. Or when I produce an output, a report, an insight, or a model, or what have you, people will trust it again based on the trust coming through me. And that's also, I think, where if you're a larger organization, it's a little bit harder. Right? Because maybe the colleague that you worked with yesterday is not there anymore tomorrow.
Things are a little bit more transient in a larger organization, and that also factors into trust, which is why in these larger organizations, you have to include the trust aspect also more in the process rather than solely depending on the people.
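The idea mentioned earlier in this answer, translating quality dimensions such as accuracy, timeliness, and relevance into a single score like "80% up to par", can be sketched as a weighted average. This is a minimal illustration, not a description of any product: the `QualityCheck` class, the equal weights, the sample pass counts, and the thresholds are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class QualityCheck:
    """Result of one (hypothetical) quality dimension check on a dataset."""
    name: str
    passed: int          # number of records that passed the check
    total: int           # number of records that were checked
    weight: float = 1.0  # relative importance for the planned use

    @property
    def ratio(self) -> float:
        return self.passed / self.total if self.total else 0.0


def quality_score(checks):
    """Weighted average of the pass ratios, as a value between 0 and 1."""
    total_weight = sum(check.weight for check in checks)
    if total_weight == 0:
        return 0.0
    return sum(check.ratio * check.weight for check in checks) / total_weight


def fit_for_use(checks, minimum):
    """True when the score reaches the minimum level agreed for this use."""
    return quality_score(checks) >= minimum


# Made-up numbers for illustration only.
checks = [
    QualityCheck("accuracy", passed=90, total=100),
    QualityCheck("timeliness", passed=70, total=100),
    QualityCheck("relevance", passed=80, total=100),
]

if __name__ == "__main__":
    print(f"quality score: {quality_score(checks):.0%}")
    print("fit for use at 75%:", fit_for_use(checks, 0.75))
```

Since quality is described above as continuous, a check like this would have to be rerun on a schedule rather than once; the score only describes the data at the moment it was measured.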
[00:23:53] Unknown:
As you scale up to these larger organizations, particularly when you talk about the, quote, unquote, enterprise or, you know, Fortune 500 companies, what are some of the data problems that you start to see come up in those more large and complex organizations that you're unlikely to run into in a smaller business like a startup or an SMB?
[00:24:12] Unknown:
Where do I start that? How much time do I have?
[00:24:16] Unknown:
As much time as you want.
[00:24:18] Unknown:
Again, some of these are gonna come back in smaller organizations as well, I think. Right? So nobody is really safe from these problems. But 1 of my favorite data problems is what I refer to as the data brawls. And I've seen it now so many times. This goes all the way up to CEO level, with the CEO asking silly questions like, hey, how many customers do we have? Right? And then 1 person from sales comes back and says, oh, we have 5,000 customers. And then somebody from finance comes with their beautifully designed dashboard and says, no, no, we have 50,000 customers. And then somebody else says, no, we have 500,000 customers. So in these kinds of situations, with this wide range of answers, there's a variability that seems crazy.
These are real cases that happen. Right? So data brawls, like, my numbers are better than your numbers or my beautiful dashboard is better than your beautiful dashboard, they happen all the time. You know, politics factors into it. Culture factors into it, because in the end, it's an important decision. Right? Based on whoever's numbers are gonna be believed by that decision maker, a certain decision or action will be taken, and then if it was your number, you're in the best position. If it was somebody else's number, then you're at a disadvantage.
So data brawls are a big thing. Data lineage is a huge problem in these large organizations, not just because they're moving data all over, but also because they're moving it through all sorts of systems. Name a data pipeline or ETL tool and they have them all. Right? So it flows through all these systems, which makes the complexity greater. They've hired consultants or outsourced managing their data or building the pipelines. So 2 years later, nobody knows what's, you know, running in that box in the closet anymore. Quality: data quality is a forever problem. In large organizations, it's a bigger problem than in smaller ones because they just have more data and more places to put and move the data.
Access to data is also a big problem. Recently, I heard a story where they moved data into the cloud. By the way, the cloud is, in my mind, commoditizing data management. So in 5 years' time, I would expect all of the data management software to in some way be in the cloud. Essentially, they moved their data lake into the cloud and put all the data in there. And then apart from the group who was really running that data lake, they didn't even dare to open it up to others, because they didn't know who was allowed to have access to it, or they felt it was too big of a risk. So you now have this pretty data lake, and because you don't have a proper way to manage the access to that data, you're just keeping it hoarded up, just because you think the risk is too big or you're too uncertain about who should be getting access. Another big hot potato topic is data ownership.
If you wanna see somebody juggle in a Zoom meeting, just ask, oh, will you be the data owner for the data? And then see how they're trying to get away from that. That's a big 1. But if there is no ownership on the business side around data, then it means that you're not really treating it as an asset. Right? Ownership has to be established. And then there's collaboration. Right? How do you even know in a large organization who that owner is? Who do I call if I wanna get access to this data? The Ghostbusters? Right? So in larger companies, the collaboration with the different stakeholders and personas is also harder because there's just more of them, and you don't know who they are necessarily.
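Editor's note: the "data brawl" over customer counts usually comes down to each team applying a different, unstated definition of "customer" to the same records. The following is a minimal sketch of that effect, not anything taken from the episode. The `Account` fields and the three definitions are hypothetical, chosen only to show how one dataset can yield three defensible numbers.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Account:
    # Hypothetical account record; field names are illustrative only.
    account_id: int
    is_trial: bool
    has_paid_invoice: bool
    last_order: Optional[date]


def customers_per_sales(accounts):
    # Assumed sales definition: every account ever created, trials included.
    return len(accounts)


def customers_per_finance(accounts):
    # Assumed finance definition: only accounts with at least one paid invoice.
    return sum(1 for a in accounts if a.has_paid_invoice)


def customers_per_product(accounts, today, window_days=365):
    # Assumed product definition: accounts that ordered within the window.
    return sum(
        1
        for a in accounts
        if a.last_order is not None
        and (today - a.last_order).days <= window_days
    )


accounts = [
    Account(1, False, True, date(2021, 3, 1)),
    Account(2, True, False, None),
    Account(3, False, True, date(2019, 6, 15)),
    Account(4, False, True, date(2018, 1, 10)),
]
today = date(2021, 5, 1)

counts = {
    "sales": customers_per_sales(accounts),
    "finance": customers_per_finance(accounts),
    "product": customers_per_product(accounts, today),
}
print(counts)
```

All three functions are correct for their own definition, which is why the disagreement cannot be settled by comparing dashboards. It is settled by agreeing on, and recording, which definition the business runs on.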
[00:27:51] Unknown:
So tying that in with what you've built at Collibra, can you talk through some of the ways that Collibra helps to address these various challenges, who the actual end users of the platform are, and just the overall process of getting it integrated into a business to help them tame and understand the overall scope of their data?
[00:28:11] Unknown:
Well, we're still back to the same question. How many minutes do you have? Right? I'll try to be short. But summarizing, I would say the position we've taken in the market is what we call a system of record position. All sorts of companies and people are familiar with systems of record. They have a system of record to manage their customer assets, the CRM system. They have a system of record to manage their talent asset, the human capital management system. They have a system of record to manage their money asset, like an ERP or some financial system. So systems of record are key.
They're circling around some sort of asset in the company that the company wants to manage in a formal and properly defined way, which is always changing, by the way. Right? Like, the processes in companies always change. And Collibra is seen as a system of record for the data assets. Right? And because we have that position, all of the things that circle around it in terms of shopping for data or glossaries or access or lineage or quality, they all fit neatly into that system of record concept. So if an organization works with Collibra, they start building up, use case by use case, their system of record capability, flexing that muscle. You know, the how-do-you-treat-data-as-an-asset muscle.
That's how we step by step solve all of these large challenges. Now how do you do that? You're asking, like, okay, there's a piece of software that helps, but how do you roll that out in a large company? And that's where I say a data office is always going to be a scarce resource. You will always have 10 or 100 or 1,000 times more other employees in the company than you will have data office employees. The data team will always be small. So if you wanna solve those big, big challenges in data brawls and whatever else, you have to take that scarce resource and focus it on the biggest problem that is most simple to solve and adds the most value.
And you solve that use case. That may be a glossary use case. It may be a privacy use case. It may be a lineage use case. It depends on where you're at as a company. Right? You solve that problem, and by solving it in a system of record, you're sort of building up small Lego blocks of a bigger house that ultimately has solved all of those challenges. But you can imagine how for a significant company, like, tens of thousands or hundreds of thousands of employees, this is a multiyear effort. Right? Until data as an asset is really business as usual, and nobody's even talking about data governance anymore. They're just saying, this is how we've always done it.
[00:30:48] Unknown:
Patrick is a diligent data engineer, probably the best in his team. Yesterday, when trying to optimize the performance of a query running over 20,000,000,000 rows, he was so eager to succeed that he read the entire database documentation. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday. Want to become the liberator of data?
Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in sub seconds with minimum resources. No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt.
Digging more into the Collibra platform itself, can you talk through how it's architected and how it actually gets deployed and used within an organization?
[00:31:58] Unknown:
If you go all the way back to how we started, of course, it's a very different piece of software back then from now, you know, from the semantics data to first us identifying digital pieces like glossary, stewardship, catalog, lineage, privacy, etcetera. But today, Collibra is a software as a service solution. Right? So cloud. But because we're touching on data with customers, people are still a little bit skeptical or afraid, oh, should my data really move into the cloud? Right? So next to our cloud in the architecture, we also have what we call an edge component. And that edge component lives inside the customer's network, connects to the data source that is maybe on premise with the customer or in their VPC, and then it pulls out certain information from a data source. Metadata, but also it takes some profiling information and some mobile information, and it only sends back to the cloud what is safe and acceptable. And in the cloud, the customer now has this catalog, like a shopping window, like you would buy a book online. Right? You would see a list of recommendations around all your favorite data assets. But you'll see those dressed up with the additional information that it pulled through the edge component, like profile, like metadata, and all other sorts of context.
So we have a cloud component and we have an edge component, and we shuffle safe information in between the 2. Does that sort of help?
[00:33:25] Unknown:
Yeah. That's definitely useful because particularly for a lot of software that you're selling into the enterprise, most of the time, they don't wanna necessarily deal with something that's running in the cloud or software as a service. They just want an appliance, whether it's a hardware or a virtual appliance that they just put into a rack in their data center, and then they manage it. And then they just have a support license where they get, you know, a software upgrade every 6 months or every year or something. So understanding where you fit in terms of the SaaS versus on premise kind of deployment model is definitely helpful for my own understanding of how you're working as a business. And I think that that sort of hybrid approach is definitely in line with a lot of things that I'm seeing in the space now, where in order for you to operate as a business, you need to have this kind of self contained, well managed SaaS platform, but then you also have these kind of agents or sensors that live inside a segmented network environment so that the customer has complete control over what data is actually being accessed, how it's being accessed, and then they can manage that egress on their own.
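Editor's note: the cloud-plus-edge split discussed here can be pictured as an agent that reads rows inside the customer's network and forwards only metadata and aggregate statistics. The sketch below illustrates that general pattern under stated assumptions. It is not Collibra's actual edge component or API. `ColumnProfile`, `profile_column`, and `build_cloud_payload` are hypothetical names, and which fields count as "safe" is an assumption made for the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ColumnProfile:
    # Aggregate, non-identifying facts about one column (assumed "safe" set).
    table: str
    column: str
    dtype: str
    row_count: int
    null_count: int
    distinct_count: int


def profile_column(table, column, values):
    # Runs inside the customer's network, where the raw values are visible.
    non_null = [v for v in values if v is not None]
    dtype = type(non_null[0]).__name__ if non_null else "unknown"
    return ColumnProfile(
        table=table,
        column=column,
        dtype=dtype,
        row_count=len(values),
        null_count=len(values) - len(non_null),
        distinct_count=len(set(non_null)),
    )


def build_cloud_payload(table, rows):
    # Only the profiles leave the network; raw cell values are never included.
    columns = sorted({key for row in rows for key in row})
    return [
        asdict(profile_column(table, col, [row.get(col) for row in rows]))
        for col in columns
    ]


rows = [
    {"email": "a@example.com", "age": 31},
    {"email": None, "age": 45},
    {"email": "b@example.com", "age": 31},
]
payload = build_cloud_payload("customers", rows)
print(payload)
```

The design choice worth noticing is that the filter lives on the customer's side of the boundary: the catalog in the cloud can show row counts and null rates next to each asset, while the decision about what is allowed to egress stays with the customer.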
[00:34:23] Unknown:
Exactly. And this is a pattern that we've seen grow over the years, and it works. Now I have full belief that, like I said earlier, right, cloud is gonna be the big standard 5 years from now. Even today, the cloud vendors, the 3 big ones, Google, Amazon, and Microsoft, together, their revenue is $100,000,000,000. That's a lot of money. Right? And they're still only capturing, what is it, 20 or 30% of the market. So take it 5 years further and a lot will be in the cloud. But like we were discussing earlier in the beginning of this call, the legacy is not just gonna go away. Right? So people will always need that combination of, like you said, something that's inside their environment and something that's outside, and that speak to each other: the cloud and the edge.
[00:35:09] Unknown:
As you mentioned at the beginning, you are leading a team that's actually using Collibra to run against Collibra's own data sources, and you're using your own product to run your business. And I'm interested in digging into your experience of actually turning your product in on yourself and giving an understanding of the lessons that you've learned or the surprises that you've had as you've actually been your own customer and seen your product from the other side.
[00:35:34] Unknown:
Yeah. That's why I said on Friday evening, it's drinking your own champagne, and on Monday morning, it's eating your own dog food. Right? And we're very serious about that because Collibra has always had a thought leadership role. Right? In a way, we've created the software category of data governance where others believed it was a people and process problem. We believe in leadership and thought leadership by trying, doing, and failing, and learning. Right? So that's how you create leadership. So by eating our own dog food and drinking our own champagne, we've definitely learned a few lessons and incorporated those already in our product or in our practice. Here's a few examples. So 1, we have a lot of engineers, a lot of product managers, and they're all doing great work, creating all these wonderful features and capabilities in the software.
Now as a consumer, yay, there's another release and there's another bunch of features. I love them. Right? I love that we're getting all these features on a regular basis. I think it's on a monthly cadence even today. But as a consumer who then has to roll that out to the rest of the organization, I'm realizing, well, that's a lot of work because for every new feature that's there, I need to make sure that the users know about it, that they're trained on it, that it's set up or configured in the right way. So, in a way, eating our own dog food, drinking our own champagne is a little bit of drinking from the fire hose of the innovation that the company brings.
You really have to find a good cadence or pace to roll out use cases or capabilities to these users. Otherwise, for them, it will just be too much at a given time. So pacing the rollout in terms of use cases is definitely a lesson. And then another lesson that's a personal lesson: I've been doing this for 13 years. Right? So I've seen everything and the kitchen sink, but Collibra continuously grows with new people. I think we're about 700 people right now. That doesn't mean that all of those people who just joined yesterday have my 13 years of experience in data management. Right? So sometimes I have to sort of go back to level 1, if you will, and really take the time to explain to people, okay, data management 101. What does it mean to be a data owner, for example? Right? And how does that translate into your practical daily reality?
What is the conceptual model? Right? And so on and so forth. So the pacing of the use cases is 1. The remembering not everybody has been doing this for 13 years, so explain it to me like I'm 5 is another good lesson.
[00:38:02] Unknown:
And in your experience of actually helping those people who are new to the company and giving them that data governance and data management 101, what are some of the pieces of information that you find are commonly either missing or misunderstood or aren't being taught well enough in the current generation of people who are coming into data management?
[00:38:21] Unknown:
Well, I think that's a lot because I still don't think there's enough data teaching in, like, executive schools, management schools. I still don't think there's enough data teaching in any sort of curriculum. So in its broader sense, there has to be more teaching around data in any school, right, in any curriculum. We need to upgrade our data literacy skills as human beings. You know, when we go work in companies, everybody today is in some way, shape, or form using data to do their job. Right? So we like to say that everybody in that sense is a data citizen. By that data teaching, I mean the whole 9 yards. Right? SQL, maybe you can argue about that, but bias. Right? Like, when is a bar chart a good instrument versus a pie chart?
All of it just needs a lot more teaching. It's not just about teaching somebody how do you make a nice dashboard. Right? But it's also teaching somebody how do you do proper data analysis and how do you avoid making mistakes in your data analysis so that the decision maker isn't making the wrong decision based on a pretty dashboard. So, yeah, there's not 1 single thing except we have to upgrade people.
[00:39:33] Unknown:
In terms of the actual experience of using Collibra and running it against your own data, what are some of the sharp edges or weak points that you've identified and gone about resolving and just closing that loop of being able to actually take the lessons that you've learned as your own consumer to build a better product and help your customers as a result?
[00:39:52] Unknown:
Well, 1 of them is 1 that's always a goal at our company: usability. So whether you're talking to us today or if you're talking to us 5 years ago or 5 years from now, usability will always be a primary topic for us because it always needs to improve. And the reason for that is that in this process of becoming a data intelligent organization, there's just so many different user personas involved. You know, we have the data stewards. We have the data scientist. We have the data analyst. We have my friend, the data architect. We have the data czarina. We have data ninjas. You have data privacy people. You have data engineers.
And they all have different backgrounds and working contexts. Like, a data privacy person is often more of a legal type person, a lawyer with a legal background somehow. Right? A data scientist is all about the model and just give me data, and I'll program my own tools to work with it. So you're talking about very different experiences. And because they all have to take part in the data process in some way, the usability for different pieces just has to be different for different personas, and those personas are still changing. 5 years ago, the data privacy role was less visible, less prominent than it is today. You know, 5 years ago, the data scientist was the king. And today, it seems like, well, maybe the data engineer is gonna be the new king. Right? And that last 1, by the way, is also another lesson that we learned ourselves: okay, we focused a lot on the business audience, but I think in the past, we were a little bit underappreciative of the data engineer persona. Right? And when I started taking up this data office role, had to put up my own data lake, had to run my own data pipelines, feed the data in there, and so on and so forth, I saw how crucial the role of the data engineers is when it comes to data products. We can't run, build, or maintain the data product properly without having proper data engineers in place. So from Collibra's point of view, we said, okay, how can we also make sure that this user persona is having what they need? Because they're also part of the process. Right? That's why you're seeing things like the data quality acquisition that we recently did, because they're so close to it. Right? If you're building pipelines, you gotta put a quality control on there somehow.
[00:42:09] Unknown:
Absolutely. And in terms of the actual usability aspects, 1 of the things that I've been thinking about a lot is that, because of the variety of different roles that are all necessary and crucial for a successful data outcome and the different ways that they're working with the data and the different environments in which they're doing that work, the actual access patterns and the ways that they wanna integrate with the system are much different. And so I'm curious how you have thought about and approached being able to actually surface the information at a useful point and reduce the impedance mismatch where, you know, the data scientist wants to work in a Jupyter notebook, and they wanna be able to understand, okay, you know, what's the sort of rating of data quality of this set that I'm working with? Who's the data owner? You know, the business analyst wants to know where did this data come from? What are the kind of core metrics that are being used to calculate this? The data engineer needs to know who's the downstream consumer of this, and how do you actually expose all of that information at the point in time and in the tools that those people are working with?
[00:43:06] Unknown:
I like the word that you're using, the impedance mismatch, because, indeed, even a data scientist and a data engineer or a data analyst, you would think, aren't they the same beast? But, no, they do work in different environments. They have very different backgrounds. Like, 1 is a PhD in physics. Right? And the other is a crack in Excel or a master of spreadsheets, if you will. I think for us, the important bit is that they're all in a way having different lenses or perspectives on the same thing. And we call that thing the data intelligence graph, right, where a data asset sits in the middle, let's say, and then there's an owner associated with it. Right? It has a more business facing side.
But there's also a more technical facing side to it, like the physical data dictionary, you know, the schema, the table, and a whole bunch of other things, lineage, quality, the whole 9 yards. And maybe the data scientist is looking at it from the left angle, right, looking at it more from the physical aspect. But at a certain point in time, they'll also need to go talk to the data product owner. So they'll need to go to the, quote, unquote, right hand side of that data intelligence graph, which is typically more where the data analyst is watching. Right? So they're looking through different lenses at the same graph in the end. And what we are doing as a software company is saying, okay, how can we expose this graph as easily as possible in those working environments?
Right? So if you go into a Jupyter notebook, you can run a command, and all of a sudden, there's some information coming back from the catalog, from the data intelligence graph, and showing up in that Jupyter notebook. If you're sitting in Tableau all day long, well, we have a small plug in in Tableau. We have Collibra for desktop. We have it on mobile. So we try to not make you go to the mountain, but make the mountain come to you. And again, in terms of usability, that's also a never ending exercise. Right? Because a few years ago, you had mobile. Maybe in 2 years from now, we'll need to put the data intelligence graph on the watch somehow. Right? So the user interface keeps on changing as well.
[00:45:18] Unknown:
In terms of the actual breakdown of the pieces that you're able to control in software versus the human side, I'm curious what you've seen in terms of the pieces of data governance that are resistant to automation and management with software and how you manage that handoff between the mechanical and the human processes that are involved in the overall scope of data governance?
[00:45:41] Unknown:
I think a lot of things can be automated. Almost all things. But fundamentally, there are things which cannot be automated. Like, you and me agreeing on something, for example, on what something means or let's say, the both of us have a small business. Right? And it's a completely data driven business. And we have to pick a KPI on which we wanna measure this business. Like, this is gonna be the metric that we run the business on. Yes. We can automate, like, making a selection of the most important metrics, right, or having those metrics be calculated automatically. But in the end, we make a decision. As human beings, we have to make a decision. Like, this is gonna be our metric. This is what we're gonna run the business on, and this is how we're going to do it. This is how that metric is gonna be calculated and so on and so forth. And that is a human choice.
Now that part, you cannot automate. Like, a lot of other things, you can automate. But the human decision, in the end, the human choice that has to be made, the agreement between 2 human beings, like, this is how we're going to work and that's what it's going to mean, that's something that you fundamentally cannot automate. You can automate the whole surroundings around it. Right? We can facilitate the process and make it as easy as possible, but in the end, humans have to make that decision. And in terms of your question about the hand off between, you know, automation and the human parts, I think for us as a company, it starts on a strategic level. Like, strategically, this has to be important.
You have to make it a priority. You have to make it a culture. But in the end, it's also trial and error. Like, if we have this algorithm, for example, that tries to auto classify things. Oh, this is an email address. Oh, this is a Social Security number, what have you. But a human user still has to say yes or no at some point, right, to confirm the guess that the machine has made, if you will. And next to all the strategy and all the culture and all the priority around automation, in the end, you have to just do trial and error. Right? And see, okay, how is it working for the end user? How is it working for that persona? And is it working good enough or fast enough? And if not, let's do better.
[00:47:48] Unknown:
In your experience of building Collibra and building the business around it and now running Collibra on the internal data for the business, what are some of the most interesting or innovative or unexpected ways that you've seen it used or challenges that you've run into in the process?
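Editor's note: the hand-off described in the previous answer, where an algorithm guesses "this is an email address" and a person confirms or rejects the guess, can be sketched roughly as follows. This is an illustrative assumption, not Collibra's classifier. The regular expressions are deliberately simplistic, and `Suggestion`, `suggest_label`, and `review` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Deliberately simple patterns, for illustration only.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


@dataclass
class Suggestion:
    column: str
    label: str
    match_ratio: float
    status: str = "pending"  # stays pending until a human decides


def suggest_label(column, values, threshold=0.8):
    # Machine step: propose a label when enough sampled values match a pattern.
    sample = [v for v in values if isinstance(v, str) and v]
    if not sample:
        return None
    for label, pattern in PATTERNS.items():
        ratio = sum(1 for v in sample if pattern.fullmatch(v)) / len(sample)
        if ratio >= threshold:
            return Suggestion(column, label, ratio)
    return None


def review(suggestion, accepted):
    # Human step: the guess only becomes a classification once confirmed.
    suggestion.status = "confirmed" if accepted else "rejected"
    return suggestion


suggestion = suggest_label(
    "contact", ["a@example.com", "b@example.org", None, "c@example.net"]
)
print(suggestion)
review(suggestion, accepted=True)
print(suggestion.status)
```

The point of the `pending` state is that automation narrows the work down to a yes-or-no question, while the decision itself remains a human one.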
[00:48:01] Unknown:
I've lost track of all the customers. When we started as a company, I knew all the customers by name and in detail what they were doing. But now, like, my brain has become too small, unfortunately. But there are a few cases that still stick out for me. 1 is from a bank. All the other banks were trying to, like, do their governance and comply with the regulations, and they were focused on this thing called critical data elements, your most important business data elements or data domains. And this bank was doing it slightly differently. They were focused on what they call data usages. If you move data between 2 departments or lines of business, that's a data usage.
And around the data usage, you need to wrap a data usage agreement. Like, who's the producer? Who's the consumer? Who's responsible for quality and whatnot? I love that. And you see that come back now in frameworks like the data mesh. And in other cases, there's a few tech companies who use our software more in an engineering context where APIs are more important. I like those. They're pretty cool as well. And then there's 1 that will always remain with me, an interesting 1 that they had done related to reference data. Reference data is like codes, classifications, like ISO country codes, stuff like that.
And they had ETL moving around. ETL and data pipelines always need these mapping tables. Like, if you get this source input, you have to map it according to that table to that output. So what they'd done is they had connected that ETL to our software, and then when the stewards would be working on the reference data and approve new ones or deprecate old ones, that would automatically be connected to the mapping table and used at run time in the data pipeline. And I like that as well.
[00:49:45] Unknown:
And in terms of your experience of building Collibra and growing it and running the internal data office, what are the most interesting or unexpected or challenging lessons that you've learned in the process?
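Editor's note: the reference-data use case from the previous answer, where stewards approve or deprecate codes and the pipeline reads the resulting mapping table at run time, might look something like the sketch below. It is a simplified, in-memory illustration under assumed names (`ReferenceCodeSet`, `map_countries`). A real setup would read the approved mappings from the governance system rather than from a local object.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceCodeSet:
    # Hypothetical steward-managed mapping: source value -> (target, status).
    name: str
    entries: dict = field(default_factory=dict)

    def propose(self, source, target):
        self.entries[source] = (target, "proposed")

    def approve(self, source):
        target, _ = self.entries[source]
        self.entries[source] = (target, "approved")

    def deprecate(self, source):
        target, _ = self.entries[source]
        self.entries[source] = (target, "deprecated")

    def active_mapping(self):
        # Only steward-approved entries are visible to pipelines.
        return {
            src: tgt
            for src, (tgt, status) in self.entries.items()
            if status == "approved"
        }


def map_countries(records, code_set):
    # The mapping is fetched on every run, so steward changes apply
    # without redeploying the pipeline.
    mapping = code_set.active_mapping()
    mapped, rejected = [], []
    for record in records:
        code = mapping.get(record["country"])
        if code is None:
            rejected.append(record)
        else:
            mapped.append({**record, "country": code})
    return mapped, rejected


countries = ReferenceCodeSet("iso_country")
countries.propose("Belgium", "BE")
countries.propose("Holland", "NL")
countries.approve("Belgium")

records = [{"id": 1, "country": "Belgium"}, {"id": 2, "country": "Holland"}]
first_run = map_countries(records, countries)

countries.approve("Holland")  # steward approves; next run picks it up
second_run = map_countries(records, countries)
print(first_run, second_run)
```

Rows with unapproved or deprecated codes are set aside rather than silently passed through, which gives the stewards a concrete list of values that still need a decision.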
[00:49:54] Unknown:
What are the things that I learned in drinking our own champagne? Well, we were talking earlier about the people and process problem. So I said, okay, it's not a pure people and process problem, and there is software available that can help you solve it. But 1 of the things that I did learn is that even with the software, there is still a people and process part to it. Right? So inside Collibra, I had to really get close with the business around ownership, around what are your goals, what are your priorities, what are your objectives, and how can I make data intelligence fit in? Right? How can data as an asset help you according to those priorities? So you really have to stay close to the business. And the big importance of data quality, of course, I knew that. Right? I knew data quality was important. But in practice, once you have the pipelines up and running, the problem becomes very real, of course.
And, obviously, we're on to that. And last but not least, the speed of technology change. If you're in data and if you're in data technology, it feels like every day there's new tools and technologies that come out that you wanna play with or that you have to integrate with or what have you, that you have to learn. That's also a big lesson because from our viewpoint, no matter what new stuff comes out, no matter where in the cloud people put their data tomorrow, we as a software company and data intelligence will have to plug into it neatly, of course.
[00:51:18] Unknown:
For people who are starting down the path of trying to get control of their data governance and build out a strategy and a system for it, what are the cases where Collibra is the wrong choice and they might want to build something in house or use a collection of point solutions or go a different route?
[00:51:35] Unknown:
When is Collibra the wrong choice? Well, we've been doing this since 2008. Right? So what is it? About 13, 14 years almost? I would say that for me, personally, I'm very biased, obviously. But for me, Collibra has never been the wrong choice. No. But let me give a good answer. Right? I think if you're a company who has no data, then, yeah, Collibra is the wrong choice. We're about the data asset. But, also, if you're looking for, you know, a tool to move data around or if you're looking for a data lake, this is not what we are. Right? We are not a tool to move data around. We're not a data warehouse. So if you're looking for a place to store your data, that's not us. We're a system of record for that data asset. So we'll work with wherever you store the data. So if you're looking for a tool to move data or store it, then you're in the wrong place. You know, install that thing first, and a week later, you'll still come knocking because now you have a place to store the data, and now you have an asset that you have to manage. And that's where the need for Collibra comes back.
As you look to the future of the business and the platform, what are some of the things that you have planned for the near to medium term?
Our journey of being a thought leader in this space has been ongoing for 13 years and is not stopping yet. Right? Remember, we started with governance, added catalog, privacy, lineage, quality just now. So there's more pieces in what we call that data intelligence journey.
But obviously, the platform, the horizontality, the breadth, depth, etcetera, of the platform is a continuous effort for us. Usability is a continuous effort for us. And, ultimately, we and our customers, we go on that data intelligence journey together, where Cliff, the business analyst, the data analyst persona can shop for data as easily as they can shop for a book. The customers are going through that journey. So we are going on that journey with them, and their needs in that journey are also our priorities for extending the platform.
[00:53:44] Unknown:
And are there any other aspects of the work that you're doing at Collibra or the overall scope of data governance and the impact that can have on a business that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:01] Unknown:
I would say we have a big event, June 16th 17th atcitizens.colibra.com, our data citizen event. So whatever I missed, and that's probably a lot, we'll cover it there.
[00:54:08] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:23] Unknown:
The importance of people. Most of the data tools are focused around, I need to do something with data rather than I need to make sure the people who do something with data are properly serviced. So people.
[00:54:34] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Collibra. I'm sure I could easily talk to you about this all day. But given that we've got limited time, I appreciate you sharing it with me and appreciate all the time and energy you've put into helping people gain better control of their data and have a holistic view of their data governance model. So thank you for that, and I hope you enjoy the rest of your day.
We share the same time, so thank you for yours as well. It was a pleasure, and hope you enjoy your weekend. Bye bye.
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Guest Introduction: Stijn Christiaens from Collibra
Journey into Data Management
Overview of Collibra and Data Governance
Current Landscape of Data Governance
Challenges in Data Governance for SMBs
Data Problems in Large Organizations
How Collibra Addresses Data Challenges
Collibra Platform Architecture and Deployment
Using Collibra Internally: Lessons Learned
Automation vs. Human Processes in Data Governance
Interesting Use Cases and Challenges
Future Plans for Collibra
Closing Remarks and Contact Information