Summary
The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed, it can be easy to be overwhelmed by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I'm interviewing Juan Sequeda and Tim Gasper about their views on the role of the data mesh paradigm for driving re-assessment of the foundational principles of data systems
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the areas of the data ecosystem that you see the most turmoil and confusion?
- The past couple of years have brought a lot of attention to the idea of the "modern data stack". How has that influenced the ways that your and your customers' teams think about what skills they need to be effective?
- The other topic that is introducing a lot of confusion and uncertainty is the "data mesh". How has that changed the ways that teams think about who is involved in the technical and design conversations around data in an organization?
- Now that we, as an industry, have reached a new generational inflection about how data is generated, processed, and used, what are some of the foundational principles that have proven their worth?
- What are some of the new lessons that are showing the greatest promise?
- data modeling
- data platform/infrastructure
- data collaboration
- data governance/security/privacy
- How does your work at data.world support these foundational practices?
- What are some of the ways that you work with your teams and customers to help them stay informed on industry practices?
- What is your process for understanding the balance between hype and reality as you encounter new ideas/technologies?
- What are some of the notable changes that have happened in the data.world product and market since I last had Bryon on the show in 2017?
- What are the most interesting, innovative, or unexpected ways that you have seen data.world used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data.world?
- When is data.world the wrong choice?
- What do you have planned for the future of data.world?
Contact Info
- Juan
- @juansequeda on Twitter
- Website
- Tim
- @TimGasper on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- data.world
- Gartner Hype Cycle
- Data Mesh
- Modern Data Stack
- DataOps
- Data Observability
- Data & AI Landscape
- Datadog
- RDF == Resource Description Framework
- SPARQL
- Moshe Vardi
- Star Schema
- Data Vault
- BPMN == Business Process Model and Notation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Upsolver: ![Upsolver](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/aHJGV1kt.png) Build Real-Time Pipelines. Not Endless DAGs! Creating real-time ETL pipelines is extremely time-consuming and engineering intensive. Why? Because when we attempt to shoehorn a 30-year-old batch process into a real-time pipeline, we create an orchestration hell that makes every pipeline a data engineering project. Every pipeline is composed of transformation logic (the what) and orchestration (the how). If you run daily batches, orchestration is simple and there’s plenty of time to recover from failures. However, real-time pipelines with per-hour or per-minute batches make orchestration intricate and data engineers find themselves burdened with building Directed Acyclic Graphs (DAGs), in tools like Apache Airflow, with 10s to 100s of steps intended to address all success and failure modes and task dependencies, and to maintain temporary data copies. Ori Rafael, CEO and co-founder of Upsolver, will unpack this problem that bottlenecks real-time analytics delivery, and describe a new approach that completely eliminates the need for orchestration, so you can remove Airflow from your development critical path and deliver reliable production pipelines quickly. Go to [dataengineeringpodcast.com/upsolver](https://www.dataengineeringpodcast.com/upsolver) to start your 30 day trial with unlimited data, and see for yourself how to avoid DAG hell.
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) Datafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold.
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
- Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg) Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: [dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode) today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost effective solution. Plus, it gives you more technical controls so you can fully unlock the power of your customer data. Visit rudderstack.com/legacy to take control of your customer data today.
[00:01:09] Unknown:
Your host is Tobias Macey, and today I'm interviewing Juan Sequeda and Tim Gasper about their views on the role of the data mesh paradigm for driving reassessment of the foundational principles of data systems. So, Juan, can you start by introducing yourself? Hello, everybody. My name is Juan Sequeda. I'm the principal scientist at data.world, and I've been here for almost 3 years. I joined when I sold my previous company, which came out of my PhD in computer science at the University of Texas at Austin.
[00:01:34] Unknown:
And, Tim, how about yourself?
[00:01:36] Unknown:
Thanks for having me, Tobias. I'm the chief customer officer and product strategist at data.world. And previously, I was the VP of product at data.world. So I've just recently moved into the CCO role. I've been in data analytics and AI for 15 plus years in various product, services, and other types of roles. So I've got to see all sorts of different phases and hypes and evolutions, and I know that's a big topic for today.
[00:02:01] Unknown:
And, Juan, do you remember how you first got started working in data?
[00:02:05] Unknown:
That was undergrad. I did my undergrad in computer science and my PhD in computer science at the University of Texas at Austin in databases and semantics and the semantic web, what we now call knowledge graphs. Actually, before, while I was doing my undergrad, I had another startup I was doing, a project management system for temporary staffing companies. I remember that was, like, 2006, 2007. But then, yeah, building databases for that stuff. Then we took our technology, which is all about data integration, semantic data integration, virtualization, and I started working with all types of companies on mapping all their inscrutable, complicated databases. Dealt with SAP ERP systems, Oracle EBS systems that have 10,000 tables. Right? Trying to make sense of that stuff. That has always been fascinating for me. So I did that at my previous company, Capsenta, for a while, and then data.world was actually 1 of my customers, using a lot of the tech, the IP that we had, and it just made sense to join forces. So the company was acquired, like, 3 years ago, and we're so aligned on our mission and vision about integrating data and knowledge, and that's where I am. That's 15 plus years of my life very quickly there. And, Tim, how about you? Do you remember how you got started in data?
[00:03:13] Unknown:
When I was in college, I started a small startup that was focused around analyzing social media data. And so we ended up getting funded by Capital Factory, which has an investment group and co-working space over in Austin. And that's how I ended up moving down to Austin, Texas. And ever since then, I've just been in the data space, from social media analytics to that company getting bought by a company called Infochimps, which was focused around big data platform as a service, especially Hadoop and Spark. That ended up getting acquired by Computer Sciences Corporation, which is a very large systems integrator services company. That was where I decided I didn't wanna work for a 50,000-person consulting company, but obviously got some really great data challenges and worked with some amazing companies.
And I've just been at a bunch of different data startups along the way of varying sizes, now at data.world, a catalog and governance company.
[00:04:08] Unknown:
Bringing us around into the topic at hand for today, I'm wondering if you can just start by giving your view on which areas of the data ecosystem you've seen experience some of the most turmoil and confusion, given the fact that it is a very large and constantly expanding ecosystem.
[00:04:26] Unknown:
Yeah. Absolutely. It is very large and expanding rapidly. And I feel like the trends are emerging and they're going through I think of, like, the Gartner hype cycle where, like, somebody posts something on LinkedIn, and then it turns into hype and the trough of disillusionment. And we go through the whole thing in just a couple of months' time. It feels like there is definitely a lot of hype around now and both excitement and concern, I feel like, around a few different topics. And I'll just kinda list a few off, like data mesh, for example. Data mesh has become so hugely interesting and popular, and a lot of different vendors, a lot of different people are really putting out a lot of ideas, a lot of, hey, my technology fits into the data mesh picture. Juan and I always kinda joke about data mesh. Like, if you hear somebody say, oh, hey. We're a data mesh company. Then you should probably run away because data mesh is not a company. It is not a technology. Right? But data mesh has a ton of sort of hype and turmoil and confusion right now. The modern data stack is another big topic of excitement, but also fear and confusion.
And just to list off a few others: real-time AI, DataOps. Right? You can smile or frown when you hear that phrase. Also, semantics and the metrics layer. So these are all things that are going through a lot of hype but also have a lot of confusion right now.
[00:05:49] Unknown:
You can keep adding to it. There's data observability. Right? There's almost a data quality category, and then there's, like, the ETL vendors out there, and then there's reverse ETL, which changed names to data activation. So much stuff. Right? So I think it's way too much. Frankly, it's way too fragmented. You know Matt Turck. Right? Every year he has the data and AI landscape, and you look at those boxes. I mean, now it's just really ridiculous. I mean, no way in hell you're gonna say, I need something, I want 1 of every box. Right? I think we've come to the point that everything is just complicated. That's it. And I think we just need to get out of it. And I tell everybody, we just need to go back to the principles.
I would argue that if you look at the data and AI landscape that Matt Turck does, or any type of architecture, the principles are always the same. You move data. Oh, it's ETL. It's streaming. Whatever. You're still moving data. You store and compute data. Oh, it's a data lake, a data warehouse, whatever type of database. Or you use data. Right? I'm doing dashboards, I have an AI, whatever. You look at those 3 main principles and they apply there. They apply for data, and they apply also on the metadata side. I would argue that moving data, storing and computing data, and using data applies both for data and for metadata. That entire landscape can fit in there. The other issue that makes things more complicated is that everybody tries to define a, quote, unquote, category when that category is really just a feature. So we're like, oh, I built this new category. No. You built a feature.
[00:07:15] Unknown:
And we just gotta call BS on all this stuff. I'm super freaking tired of it, period. Yeah. Definitely a lot of areas to riff off of there. I think 1 that's interesting to explore is this concept of how do you know when you have something that is momentous enough to be considered a new category versus just being a feature on the, you know, existing set of tooling, or, you know, a tool that is interesting but is not category defining?
[00:07:41] Unknown:
I think the test there is understanding inputs and outputs. So it's a box. It has boundaries. What goes in that box, and what goes out of that box? And then you're like, hey, isn't there this other box that takes the same inputs and the same outputs? Like, shouldn't those 2 things do the same thing or not? Then you realize that, or that they have some complements to each other. I mean, start doing that exercise. I think buyers and vendors, we need to just be more critical about it and stop drinking all this Kool-Aid. It's just too tiring. Please, it's not that complicated. Look at the inputs and the outputs.
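That inputs-and-outputs exercise can be sketched as a quick comparison. The tool names and their input/output sets below are made-up illustrations of the idea, not claims about any real product:

```python
# A rough sketch of the "inputs and outputs" test for whether two tool
# categories overlap. All tool definitions here are hypothetical.

def overlap(tool_a, tool_b):
    """Return the inputs and outputs shared between two tool 'boxes'."""
    return {
        "shared_inputs": tool_a["inputs"] & tool_b["inputs"],
        "shared_outputs": tool_a["outputs"] & tool_b["outputs"],
    }

# Two illustrative "boxes": what goes in, what comes out.
catalog = {"inputs": {"schemas", "query logs"},
           "outputs": {"search", "lineage"}}
observability = {"inputs": {"query logs", "metrics"},
                 "outputs": {"alerts", "lineage"}}

result = overlap(catalog, observability)
print(result["shared_inputs"])   # {'query logs'}
print(result["shared_outputs"])  # {'lineage'}
```

If two boxes share most of their inputs and outputs, that is a hint you are looking at overlapping features rather than distinct categories.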
[00:08:13] Unknown:
1 interesting example of that, of, like, looking at the inputs and outputs is, like, looking at the data observability space. Right? And obviously, there's a lot of excitement there, a lot of really great tools and companies there. But, like, I was having an interesting conversation with 1 of my data engineer friends the other day, and they use Datadog very heavily at their company. Right? And we were starting to talk about data observability. Like, oh, yeah. We don't have a data observability tool yet. And I was like, oh, well, it collects all your logs and, you know, provides you great graphs and alerts and all that kind of good stuff. It does anomaly detection. And then they were like, oh, well, we use Datadog for that. Like, couldn't I just point my Datadog at my data warehouse? And I was like, oh, well, that's a good question. I should study up a little bit more and maybe I can talk to that a little bit more. And I'm sure there's very good reasons for why you might use a, you know, observability tool. But it starts to go into, like, oh, wow. There is a lot of overlap. There's a lot of tools that do very similar things. How do you navigate all that? There are a lot of tools in the data observability space where their tagline is, we are Datadog for your warehouse.
[00:09:09] Unknown:
So it just makes it even more confusing.
[00:09:12] Unknown:
I started doing this experiment when going to conferences: I walk around every single vendor booth, I take every single tagline, and I create a word cloud out of them. You can go find my LinkedIn post and my blog post about this stuff. It's all insights, faster insights for everyone. Everybody's saying these things, and you're all doing different things, but you're all saying the same thing. So, too much confusion. Go back to the principles and ask about inputs and outputs.
[00:09:43] Unknown:
To that point of data observability, not to, you know, cast any shade, but just because it's a very active area of conversation and 1 that we've already started on. I think an interesting area there is that I started to view data observability as a subset of the larger metadata space, and that whole category began its journey with the idea of data catalogs, which I know is where data.world got its start. That's where Amundsen got its start. And I think that that was kind of the most immediate pain point that people were experiencing of, I don't even know what data I have or how to get to it or, you know, how it's being used. And so that was the start of the journey, and a lot of people said, okay. This is a category, the data catalog.
And then people started using them and realizing, oh, well, I can actually get other information out of this if I start piping all this metadata in. So then it was data lineage, which then tried to be its own category. But if you squint, they're actually kind of the same thing. And then it was, okay. Well, now it's data observability, which again is feeding off of metadata. And so now all of these different tools that tried to say, oh, we're our own category, are starting to blur and blend together where, you know, the data catalogs are adding lineage and the data observability tools are adding catalogs, and, you know, now it's starting to get even messier where it's like, okay. Well, it's all just a metadata layer. Okay. Well, what does the metadata layer do? Now there's active metadata, and it's all this you know, everybody started sort of like ripples in a pond. Like, metadata is the pond. Each of these different tools started as its own little drop in the pond, but the ripples are starting to converge and fill the entirety of the space. It's interesting to see how that whole space is evolving now.
[00:11:26] Unknown:
It is super interesting. And I agree with you that, like, this idea of catalog, if you just look at that specifically, it's 1 of the sort of, you know, oldest members of sort of the metadata layer, but also continually is going through more and more evolution and change. When you look at it in terms of functionality, it looks super disconnected and disjoint. And then all of this sort of, you know, divergence and then convergence that's happening. Like, oh, lineage came out and, oh, that was a separate space. But now it's like, oh, wait a second. Maybe lineage actually is part of the catalog. And, you know, 1 sort of meta comment here is that I feel like this concept of catalog, even though some people would say like, oh, catalog is dead or that's the past. We're kind of seeing the opposite at data.world: the catalog used to be like a small piece, and it's actually starting to become the thing that's eating the metadata layer. And it's kind of expanding and taking more and more use cases.
But I think what gets simpler is to go back to what Juan said, where, like, it's like move, store and compute, and use. That if you're trying to, like, collect context, store it somewhere, do analysis on top of it, and use it for various use cases, it actually does start to say, like, oh, well, lineage is just, you know, I wanna understand the provenance between things. And a glossary is just, I wanna apply some semantics on top of things and then draw a relationship.
[00:12:52] Unknown:
And everything actually fits into this framework. So I wanna go back to the principles when we think about this move, store and compute, and use. If you think about it from the metadata perspective, if you're moving metadata, I mean, it's like your catalog. You're observing what's going on, and you bring in that stuff. What is that in the data management world? That's the ETL type of stuff. Right? Now the store and compute side is that I need to have a place to literally store all my metadata. For us at data.world, our foundation is a knowledge graph. We believe metadata is a graph problem. We represent everything in a graph. So we store everything in a graph, and you actually can go compute on that. I mean, you can write queries on that graph. We use RDF and SPARQL, all these open standards. That's our storage and compute.
Once you do that, you wanna go do something with that metadata. That's the use part. Search and discovery is a type of usage. I wanna go find the lineage of this asset, that's a type of usage of that stuff. Right? I wanna go make some inferences from here so I can notify somebody else. That's a type of usage. So if we look at the principles about these things, it's all the same. And then, honestly, Tim and I, we do our podcast, Catalog and Cocktails, the honest, no-BS, non-salesy data podcast. The honest, no-BS take here is that we love changing names just for the sake of changing them. Data catalog. Well, the data catalog is dead. We go do this, and now it's metadata, active metadata, but it's all the same thing. You're just doing stuff with the metadata.
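As a small, hedged illustration of treating metadata as a graph problem, here is a sketch that stores metadata as RDF-style (subject, predicate, object) triples and computes lineage as a graph traversal. The predicate names and asset identifiers are invented for the example and are not data.world's actual model:

```python
# Minimal sketch: metadata as a graph of triples, lineage as traversal.
# All identifiers below are hypothetical.
from collections import deque

triples = [
    ("report:revenue", "derivedFrom", "table:orders"),
    ("table:orders", "derivedFrom", "source:crm"),
    ("report:revenue", "hasOwner", "team:finance"),
]

def lineage(asset):
    """Walk 'derivedFrom' edges breadth-first to find upstream assets."""
    upstream, queue = [], deque([asset])
    while queue:
        node = queue.popleft()
        for s, p, o in triples:
            if s == node and p == "derivedFrom":
                upstream.append(o)
                queue.append(o)
    return upstream

print(lineage("report:revenue"))  # ['table:orders', 'source:crm']
```

In a real RDF store the same question would be a SPARQL query over the graph; the point of the framework is that lineage, search, and inference are all just different "use" operations over the same stored metadata.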
[00:14:12] Unknown:
It's metadata management the same way we do data management. I think the buyers out there and people who are just trying to look at these tools, you really need to just brush aside all that marketing fluff and get into the principles around this stuff. And again, the inputs and the outputs. Yep. And focus on the use cases. Right. Because if you focus on the use cases, then it becomes a lot clearer. Oh, to move, store and compute, and use in the way that I want to, I need an observability tool. I need a catalog. I need this, and that's my solution.
[00:14:41] Unknown:
Another term that we threw out earlier was this idea of the modern data stack, which is, again, another kind of rebranding of the same old thing. And I'm wondering what you see as the influence of that kind of frame of thinking on the ways that you and your customers think about the kind of skills and technologies and domain experience that's necessary to be effective at working with data and maybe some of the kind of foundational principles that are useful in this idea of the modern data stack and some of the ways that it is just kind of throwing new names at the same old things and just sowing confusion?
[00:15:23] Unknown:
You know, I feel like around modern data stack, we're seeing sort of 2 cohorts forming. 1 of them is more of your newer companies, your startups and folks where they get to kind of go into MDS pure. Right? And I think that what we're seeing there is, like, really rapid adoption of, like, the analytics engineer concept. We're seeing lots of sort of, like, data engineers being able to come in relatively junior and pick up, you know, Fivetran and Snowflake or Databricks and, like, get to make traction really quickly and do some really advanced stuff around data very rapidly. And so that's all, like, really good stuff that's happening around some of those companies that get to go into it really pure. And those are some of the skills that they're picking up around some of those tools, around analytics engineering and around sort of lightweight agile data engineering practices.
But then you have sort of this second cohort, right? And that second cohort, I think, is more of your sort of companies that have been around for a little longer. They've already invested in more, you know, legacy technologies or they've become very committed to a particular cloud or a couple of clouds in their stack. And they're in more of a hybrid state. They have more of a mixed stack. And so for them, their skills have been much more, hey. How do I pick up a couple of incremental things and start to make some, you know, incremental improvements here? It's really hard to kinda do the full lift and shift all at once. You kind of need to approach it pretty incrementally.
[00:16:53] Unknown:
That last part, I think if you're coming from the, quote, unquote, legacy world, people start off with, I need to go move to my cloud warehouse or whatever. Right? So that's their first step into the modern data stack. And if you're a young company, then you can go all in because you're just gonna start everything completely modern. But something I wanna remind everybody is, like, data is a means to an end. I can go create the most beautiful warehouse or whatever you want, but if nobody uses it, it doesn't mean anything. I mean, I haven't accomplished anything. Oh, I invested all this time, and, yes, we celebrate that we, quote, unquote, finished something with the data. But, no, people have to go use the data to answer some critical business questions that are going to, at the end, make money or save money for the organization.
That's what success looks like. So I think it really doesn't matter if it's modern or not modern or whatever. Are we able to help the people who are making the decisions make money or save money? That's it. I think we've lived in what we call a data-first world. We just focus on the technology and the data. Let's go dump the data here, and AI is all about give me more data, give me more data. I'm like, wait. Wait. Really, all you need is more compute and more data, and that's gonna solve your problem? BS on that. Because what we really need to understand is, what does that stuff actually mean? Like, is this actually what you really need to go answer that question? We need to understand what the end users are thinking about. How does that business work? And understanding, meaning, that's semantics. That's the knowledge. And I think that's what's missing right now in the whole modern data stack and the data world in general. We live in this data-first world, and we need to shift to a knowledge-first world. I call it a knowledge-first world because it's people first, context first, relationships first. And how do you start?
Start with data modeling, something we've completely lost. And now people are starting to come back to it. I can't believe that so many people think, why do I need to go do that? Right? It takes too much time. No. Modeling is really trying to understand what people are thinking about, because they're the ones who are gonna go use the data. So start with data modeling. That's what's missing.
[00:18:54] Unknown:
I think that that's an interesting area to explore because, as you said, data modeling is something that a lot of people are debating right now. Okay. Well, do I need to do my modeling first? You know, is it schema on write, or is it schema on read? It's the same experience we went through with the whole NoSQL movement of, oh, SQL doesn't scale, so I'm just gonna throw it all in a bunch of document stores. And now I actually have no idea what I have or how to use it, and so now I've gotta deal with all these bugs because there's no consistency as to what the underlying data looks like. And now in the kind of data engineering ecosystem, it's a question of, oh, I don't need to model my data. I'll just dump it all into raw and then transform it however I want to. But now I have 15 different schemas that all look at the same data, but I don't know which ones are actually being used for what, where.
And then there's, like, the debate of, oh, well, do we still need star snowflake schemas? You know, do we do Data Vault because the hardware that we have is more scalable and all of these patterns were developed when we were resource constrained? So are they still applicable, or do I just go wide table? I'm curious what you see as some of the useful kind of fundamental elements of data modeling that hold true and which of these kind of approaches to how to think about the architectural aspects of modeling are necessary and applicable in this kind of cloud native world.
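The schema on write discipline being debated here can be sketched in a few lines. This is an illustrative example only; the field names, types, and records are invented, not drawn from the conversation:

```python
# Hypothetical sketch of "schema on write": validate records against one
# declared schema before loading, instead of dumping raw documents and
# discovering the inconsistencies downstream. Field names are invented.

EXPECTED = {"user_id": int, "event": str, "ts": str}

def validate(record: dict) -> dict:
    """Reject records that are missing fields or carry the wrong types."""
    missing = EXPECTED.keys() - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for name, typ in EXPECTED.items():
        if not isinstance(record[name], typ):
            raise TypeError(f"{name}: expected {typ.__name__}, "
                            f"got {type(record[name]).__name__}")
    return record

good = {"user_id": 1, "event": "login", "ts": "2022-11-01T09:00:00"}
bad = {"user_id": "1", "event": "login"}  # wrong type, missing "ts"

validate(good)  # passes
try:
    validate(bad)
except (ValueError, TypeError) as err:
    print("rejected:", err)
```

The point is simply that a declared model rejects inconsistent records at load time, instead of leaving 15 divergent ad hoc schemas for every downstream reader to reconcile on their own.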
[00:20:14] Unknown:
Let's zoom out for a second. I always talk about this balance between efficiency and resilience. There's this phenomenal talk by Moshe Vardi, a very brilliant computer scientist at Rice, where he talks about what computer science can learn from COVID when it comes to efficiency and resilience. So the talk track is, COVID starts, and we ran out of toilet paper. Why? We had a very efficient, just in time supply chain, but it wasn't resilient for something like that. The Suez Canal is extremely efficient, but you have a ship that goes just a little bit sideways like that, and an entire economy of the world goes kaput for 10 days. Right? Not resilient at all. So let's understand this balance of efficiency and resilience. You said something right now about, oh, this is not efficient for our hardware or whatever. That's efficiency from a technical point of view. Is it efficient for the people actually using the data? Are we actually talking to those people about it? So let's think about efficiency in this case. Somebody asks a question. I need to answer this question. You go tell your data engineer. They do all this work, and they answer the question.
Perfect. You very efficiently answered that question. The next question comes along. You just do the same work. Add more work, and then you answer that question. So you're doing almost a 1 to 1 kind of labor: you do 1 unit of work to answer that question. That scales linearly. Is there a way that I can do the least amount of work and go answer more questions? How can we go scale this? So somebody says, we need real active users. We need to do analysis. Give me all the data about real active users. I do all this work. I need to go find all the features that are highly used, the most popular features. You go do that work. Well, if I would have sat down and thought, well, there's this notion of a user first. What does a user mean? Let's go talk to people. Oh, users actually have some sort of activities within our platform. Oh, what are those activities that they go do? So let's go define what are the most important activities. By the way, these activities are based on some features. Right? So if you think about the modeling, I model the concept of a user. I model the concept of the feature that a product has. There's some activities that a user will have with respect to the features.
If I do that work up front and I model, I can now start answering questions about users specifically, about activity specifically, about features, and then the combination. Where do real active users come from? Your definition of how many users were clicking on things. Right? You're combining users and activity. What are the most popular features? You're combining activities with the features. So you do this upfront work, which is a little bit of extra cost, but you can then, later on, go answer a bunch more questions. So what I say is, it's dealing with the known and the unknown use cases. If you have very specific known use cases, then you can be very efficient about it. But if you want to be resilient, you need to be able to go deal with the known use cases of today and the unknown use cases of tomorrow.
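The user, activity, and feature walkthrough above can be sketched as a toy model. This is an illustrative sketch only; the entity names and sample data are invented to show how one upfront model can answer several questions:

```python
from collections import Counter
from dataclasses import dataclass

# Toy sketch of the modeling idea in the discussion: define the concepts of
# user, feature, and activity once, then answer multiple questions from the
# same model. All names and data here are illustrative, not from the episode.

@dataclass(frozen=True)
class Activity:
    user_id: int
    feature: str  # which product feature the user touched

activities = [
    Activity(1, "search"), Activity(1, "export"),
    Activity(2, "search"), Activity(3, "search"),
]

# Question 1: active users = users with at least one activity.
active_users = {a.user_id for a in activities}

# Question 2: most popular features = activities grouped by feature.
popular = Counter(a.feature for a in activities).most_common()

print(len(active_users))  # 3
print(popular[0])         # ('search', 3)
```

Modeling the concepts once lets both "active users" and "most popular features" fall out of the same structure, rather than 1 unit of bespoke work per question.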
Now the problem with everything I just said is that it's all about incentives. We are incentivized as humans in our organizations to be efficient and not to be resilient. This is a change that needs to happen from an organizational point of view. And for everything I just said, there's no need for new technology. It's just about thinking about how we're gonna be efficient and how we're gonna be resilient. I think modeling is a way to help us find that balance.
[00:23:20] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Another aspect of where we are with this question of the modern data stack, touching on what you were saying about adoption for legacy players, is this idea of inputs and outputs. According to some people's definition, the modern data stack is decomposing the inputs and outputs of the different stages of the data life cycle so that you can compose together these different utilities, and that is what forms the modern data stack. Different people have different ideas about what those levels of granularity are. But in that question of adoption and evolution from legacy technologies to where we are today, I'm curious what you see as some of the ways that this evolution into the modern data stack is proving useful because it is more clearly defining those interfaces and maybe standardizing them a bit more.
And maybe some of the ways that we are starting to go too far into that kind of deconstruction of what a data platform is supposed to be.
[00:25:14] Unknown:
Look at this from 2 perspectives: the technical and the social aspect. If we go from the social aspect, it's, what are the problems that we're trying to go solve? And you can be a large organization or whatever, but maybe a lot of the stuff that you need to go do is just a bunch of, like, just good reporting. Yeah. If you don't have strong foundations to go do your reporting, you won't be able to go do great AI or whatever. So from there, it's like, what do I need to be able to accomplish that stuff? And, actually, I'll be very honest. I see a lot of large organizations who decide, I just need to have a really good BI tool to go do this. I'm gonna move to the cloud data warehouse, and I'm going to start moving my ETLs off some kind of legacy tools and onto some cloud ETL tools.
That's probably a lot of the interfaces that you need at the beginning. Now if you go from the technical side: a couple months ago, I was at the Kafka Summit, and I remember everybody there was talking about, oh, everything's real time. Like, if I could start everything from scratch, I would do everything real time. I was talking to everybody in the hallway. I'm like, wow. Okay. Why would you do that? Because eventually, we're gonna need real time. I buy that, because eventually, yeah, eventually we need a lot of things. But then I think if you go from the pure technical side, you're gonna start overengineering for things that you don't know, and you start setting up all of this stuff. And then the other thing is that you don't set up things that you will eventually need, but you don't know that you need them at that moment.
So I'm at this Kafka Summit, and there are so many talks about how to deal with schema evolution. And you have all these schema registries, and then they're like, oh, how do you go deal with all the serializations? And I'm like, wow. You're all reinventing databases in here just because you decided to go do everything streaming. So they went all in on these very specific layers when you really didn't need that, because that's not what the end goal was. Right? I mean, some people do need that. I'm not saying that you don't need real time or streaming at all. There are particular use cases. All this to say is, finding those interfaces and bringing all these tools together really just depends on the use cases you're gonna go do. And, associated with that, your organization needs to have a clear strategy of where we're going. Right? Where are you trying to go today?
What are we planning to go do 3 months, 6 months, a year from now? And that's what's gonna inform you of what you're gonna need to go buy and put together.
[00:27:32] Unknown:
I thought that was beautifully said. And the thing that I'll add is that I think a lot of what folks are starting to realize now, especially as they've done a lot of adoption around some of the modern data stack tooling and technologies, is that getting to an ideal data platform, data architecture, data solution, however you want to describe it, is not a destination. It is something that you are iterating towards. And as your business grows and as the macro economy changes, there is no perfect plan.
Even if you had like a legacy infrastructure, there's no perfect migration. And in many cases, I'm finding that there isn't even a completion to the migration, which is maybe scary. Sorry for those that are doing migrations right now. That might be a little bit of a scary thing to tell you. But I think what that means is that I think we're all learning as an industry to appreciate the journey a little bit more and start to think of these things as like, oh, these are tools that are in our tool belt that we can deploy that help us handle different challenges in different situations. And there are pros and cons. There is investment and there's return on that investment.
And I think this is making us all smarter and better for it.
[00:28:43] Unknown:
Going on a bit of a tangent, another aspect that is becoming common among the different players in this so called modern data stack is the idea of bottom up adoption and developer led growth, you know, product led growth. And I'm curious how you see that influencing the ways that teams get introduced to and start adopting tools and some of the useful practices of kind of skepticism that teams need to adopt to make sure that they're actually asking the right questions about, do I need this tool? What does it solve for me? And why does it seem appealing?
[00:29:19] Unknown:
Product led growth is very interesting in terms of how it impacts, you know, the data industry. And I think that 1 of the recent experiences that I've had here is that 1 of our data engineers at data.world started to play around with, you know, some of the reverse ETL tooling for our own internal uses. And it's so easy now with a lot of different data tools, especially SaaS based ones, right, to just go and sign up for a trial and connect a few things together. And it's like, hey. Look. It's flowing. It's showing up in Snowflake or whatever, or it's showing up in my Salesforce instance. It's like, sweet, cool.
We just deployed another tool into our stack, and all it took was somebody experimenting and running it on a, you know, a credit card for $100 a month or $50 a month or whatever. Right? And so I would say, you know, this is helping us really evaluate and try tools faster. We're seeing that across our entire space: you're able to really build out your data stack and iterate and add to it. You can do in a year what used to take a whole team 5 years. Now 1 person can do it in a year. But, you know, going back to the turmoil and the confusion, it's very easy now to build sort of the modern day version of spaghetti at the wall. And so, you know, let's avoid that. And there's probably some new governance paradigms that we have to start to put in place here where, you know, for certain tools, certain markets, it makes a lot of sense for a data engineer or analytics engineer to go evaluate and just go pluck those things off the shelf.
But, you know, there are other choices that really need to be more part of a roadmap. And so you should have your, you know, data engineering or your data stack roadmap and really be methodical about how you're approaching your strategy there, even if you are going for more of the product led growth tools.
[00:31:03] Unknown:
1 of the things we need to start really focusing on is, hey, just because I can doesn't mean I should. This goes into being very critical about it, because then you can't do this other thing and this other thing. And then, like, who's gonna maintain that stuff? Like, wait. You're the 1 who started this thing, and then you're leaving the organization. And what are we gonna go do about this stuff? So, talk about generating spaghetti code. We are already creating this modern spaghetti mess of stuff where we don't even know what the interfaces are. And let's be very honest about this stuff. We're all trying out these latest tools. These companies are probably not gonna be around the whole time. So you're gonna start leaning on this stuff, and then something's gonna happen to them, because with how many different vendors there are, there's gonna be consolidation in the next couple of years. So this is something that needs to be included in the strategy. Whoever defines the strategy needs to be very careful about, we're selecting these types of companies for these reasons. Now, it's different if you're a small organization that's growing versus if you're already a big organization with a bunch of legal hoops. I mean, clicking accept on these terms and conditions, you have no idea what you just accepted. And if your legal team realizes that, you can get into a lot of trouble around that stuff. Talk about governance. So you gotta be very careful. That's 1 thing. Now, the other aspect, and this is the drum I'm banging so much, is that we need another shift. It's not just data literacy anymore, but what I've been calling business literacy.
So we're telling all the business people, you need to go learn how to use data. How about all the data people? Do you know how the business works? Do you know that if you're gonna make that decision, how that's actually gonna impact the business? Is that actually something that we need, or is that a nice to have? Do you understand the priorities? Do you understand how you're contributing to the OKRs of your team? Do you even know what your OKRs are? Do you even know what the strategy of the company is and how your work is delivering value towards that? I mean, I get really pissed off when people tell me, oh, we're the data team. We support everybody. You can't put an ROI on us. I was like, uh-uh, I'm sorry. The frickin' CEO has to be accountable to the board. Everybody else is accountable. You are accountable too. If you're not able to tell me the value you're delivering, maybe you should not be here.
I think we need to have more of those really honest, no BS strong conversations. Data folks, you're not some special person out there just because you help everybody. You need to be able to show it. And in these macroeconomic times, guess what? Look at all those layoffs. Be prepared.
[00:33:19] Unknown:
Another interesting aspect of the kind of technological enablement from a data perspective is that it is introducing some more of these kind of organizational and social patterns about how to think about working with data because of the, you know, enablement, the fact that, you know, business users can get involved in the data, and, you know, technical people need to be more involved in the way that the business operates to understand what problems they're actually trying to solve. 1 of the terms that has already been thrown out there is data mesh, then there are also things like data fabric and, you know, data products. And I'm curious how you see that influencing some of the decision making about who is actually involved in data. You know, who is the part of the, quote, unquote, data team, and how does that influence some of these patterns around the kind of data platform and infrastructure, the way that collaboration happens and the kind of level of collaboration that needs to happen, some of the ways to think about governance and security and privacy, both from the data and the technical perspectives?
[00:34:23] Unknown:
I think that data mesh, as it's really emerged, has been a good thing for getting the business and the data folks talking more with each other. So I think even though there's a lot of hype around it and a lot of confusion around it, a lot of good is coming from it as well, because of this idea of the domains of our business. Right? What are the different sort of business functions, the business capabilities that we have, which sort of expertise orients itself around? And then saying, hey, well, that expertise is knowledge, and that knowledge gives context to the data. So really our data should be more reflective of how we do business and how we organize our business. That's been really good. And so I think we're seeing, you know, a lot more business folks and data folks working together in that context. And it's been a good thing, you know, from a governance standpoint too, because, you know, folks who were a little bit more seen as, oh, they're the protectors of the data. They're the security folks. They're just making sure that we, you know, do GDPR compliance and things like that.
There's a little bit of a flip happening now, which is a really good flip, to say, oh, wait a second. Like, governance and data enablement are actually 2 sides of the same coin. And let's get, you know, these governance people and our data engineers and our analysts all working together and very cross functional. I think that's been a really good thing in terms of people roles. And another, which I know Juan and I are both very passionate about, is this idea of treating your data as a product. That's obviously having a really big impact: to think of your data more as, hey, what's the user experience around this data? What's the surface area of this, and the maintainability and the life cycle around this, you know, piece of data? How do I create sort of a marketplace, you know, whether internal or external, around my data, and think about, what's that Amazon experience for data?
And it's putting a little more emphasis on, Hey, who's the person who takes care of the data and the data product. And, you know, we're seeing a lot more companies now actually bring on people to be or anointing people to be data product managers, you know, not just for externally facing data products, you know, because data product managers have been around for a while for companies that, you know, sell data or prepare data more in an external way. But actually data product managers that are internally facing, thinking about how to treat data more like a product for different teams as those are sort of like end users or consumers much in the same way that with a software product, you have software and consumers. So I think that's been a really good evolution there.
[00:36:56] Unknown:
What I would say is, with data mesh, let's be honest, a lot of people make fun of it, and it's, like, so confusing and stuff. This is an opportunity. And I know I've been sounding a bit harsh with, like, oh, we need to make sure that data engineers are doing these things, that they understand the business. This is an opportunity to really uplevel the value we're providing. By treating data as a product, we're really connecting to the business and understanding how that business works. Tim and I have been working on this framework we call the ABCs of data products: accountability, boundaries, contracts and expectations, downstream consumers, and explicit knowledge.
So a data team is gonna consist of data engineers; they need to be able to go manage all this data. We wanna be able to go have the data product managers, right, who take responsibility, who are product managers for these data products. And there's something I've been calling for a while the knowledge scientist, and it doesn't have to be a particular person. This is, like, the role which is the bridge, the translator, who understands the business and is able to go translate things to the data. And what we see is that sometimes data engineers are playing that translator role. It's being able to say, hey, I can go talk to folks on the user side. You're using a word called order customers. Hey, let's talk more about this. What do you mean by it? And then I have knowledge about the data, so I can make those direct connections. So this is what I call the knowledge scientist, the data translator. I think the analytics engineer type also falls into this. It's about building bridges, and I think that's a huge opportunity.
And going back to our ABCs, a data product should have accountability. Like, who's responsible for this? Who takes ownership? Who fixes it when it breaks? Who's on call? The boundaries: let's go define a box. What's in this thing? What's outside of this thing? What is the road map? How is it gonna evolve? The contracts and expectations: what are the SLAs and SLOs? What are the policies? Who can use this? For what purposes? Where does this live? How often is it updated? Downstream consumers: we're defining this product for whom? Where are we getting these requirements? Who are potential new consumers for this? And the explicit knowledge is, let's have the semantics well defined. What is the schema? If I'm using a particular attribute, I know what that means. And if it shows up somewhere else, is it the same thing or not? It has clear identifiers. We know what we can join it with. We know what we can't join it with.
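The ABCs described above could even be captured as a machine readable contract. This is a hypothetical sketch; every concrete value below is invented for illustration, and only the five dimensions come from the framework as described:

```python
from dataclasses import dataclass

# Hypothetical sketch of the "ABCs of data products" as a data contract.
# Only the five dimensions (Accountability, Boundaries, Contracts and
# expectations, Downstream consumers, Explicit knowledge) come from the
# discussion; all field values are invented examples.

@dataclass
class DataProductContract:
    name: str
    owner: str                  # Accountability: who is on call, who fixes it
    boundaries: list[str]       # Boundaries: what is inside the box
    freshness_sla: str          # Contracts and expectations: SLAs/SLOs
    allowed_uses: list[str]     # Contracts and expectations: policies
    consumers: list[str]        # Downstream consumers: who this is for
    semantics: dict[str, str]   # Explicit knowledge: schema and meaning

active_users_product = DataProductContract(
    name="active_users",
    owner="growth-data-team",
    boundaries=["user", "activity", "feature"],
    freshness_sla="updated daily by 06:00 UTC",
    allowed_uses=["internal analytics"],
    consumers=["growth dashboard", "exec reporting"],
    semantics={
        "user_id": "stable identifier, joinable with crm.user_id",
        "activity_count": "count of activities in the reporting period",
    },
)

# Every product names an accountable owner before it ships.
assert active_users_product.owner
```

Writing the contract down this way makes the accountability and semantics questions answerable by inspection instead of by tribal knowledge.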
So I think this notion of data mesh is telling us to treat data as a product, and these are more of the social aspects that we need.
[00:39:15] Unknown:
Data management is really a sociotechnical phenomenon, and we need to look at it from more of that social perspective. Yeah. And technically, you can implement it in different ways. Right? Like, the explicit knowledge. If you wanna define your schema and provide the semantics as comments in your LookML, cool. Like, you could do that. Right? Do you want it in your catalog? Like, that's a great place for it. Right? Is it in a spreadsheet? You know, like, I think, really, how you do it is less important than that you're doing it.
[00:39:43] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQL Lake, a new self-service data pipeline platform that lets you build batch and streaming pipelines in SQL, and SQL Lake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQL Lake supports a broad set of transformations, including high cardinality joins, aggregations, upserts, and window operations. Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQL Lake is simple. You pay $99 per terabyte ingested into your data lake using SQL Lake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs.
Another area of exploration is the question of what skill sets need to be contained within a given role or person. Because, as you were listing off all the different things that need to happen, that sounds like either an overwhelming body of work for 1 or a small handful of data engineers, or something that needs an entire team dedicated to it. And that's another aspect of the space that is still kind of in flux and open to consideration: who are the people that need to be involved in this? What are the skills that are required? Like, do I need to be a so called full stack data engineer, where I know everything from, you know, tuning the size of the shuffle files in my Spark cluster up through to making sure that I have the right semantic models defined to be able to have the proper, you know, golden master records for the business to be able to answer their queries? Like, how do you think about the staffing and skills requirements for being able to make sure that all of these things are happening, while you also have all the engineering talent needed to be able to build your core product?
[00:41:53] Unknown:
1 of the things about data mesh, and I'm a big fan of data mesh, is that it's a sociotechnical paradigm shift. And 1 of those paradigm shifts is finding this balance of centralization and decentralization. And I think that's how we need to start thinking about it. And that depends on the culture of your organization, on the size of your organization, how it's growing, on the industry, all of that. So some things need to be centralized. Some things may need to be decentralized. So let's take an example. If I have a very centralized kind of engineering workforce, then you may have those core data engineers who are setting up that platform that's gonna help the entire organization as much as they can. So they're the ones who are focusing on all the tooling and indexing.
But I may have data engineers, or what I'm calling these knowledge scientists or these translators, and we may be decentralizing a lot of the parts of understanding what that data is, defining the requirements for that, and defining the semantic models. Those may then be pushed out to the decentralized domains. And so you may have data engineers over there too. Other things you wanna centralize: depending on the industry you're in, you need to have very clear policies around what is PII, and GDPR, all that stuff, and you basically need to mandate everybody else to go do this. It's a stick approach right there because, hey, we can get fined. That will have to be centralized. Maybe you'll have some core models that you want to be centralized that everybody goes and uses, but then you let these decentralized domains extend those models, those semantic models, because you still want people to be very efficient and go do things fast.
So we can go back and forth on it, but at the end of the day, it really depends on the culture, on the size of the organization, the industry. It's all about balancing that centralization, decentralization.
[00:43:30] Unknown:
And the only other thing I'll add to that, exactly, Juan, is, like, how much is data driving what you need to do as a business? From, you know, hey, we just need it to kind of manage our normal decisions and things like that, to, hey, we're actually building out data products and that sort of thing; this is how we actually make money as a company. And then also, like, what's the complexity of that data? Right? Is our data primarily coming from IoT sensors and we're having to, you know, unite that all together? Versus, oh, no, actually, all of it is very neat and already organized. It's like, we're just looking at CRM data, and we just need to put some, you know, BI dashboards on that. I think when you think about the complexity and also how much data is driving your business, there's probably actually an equation that we can kind of put together here that's like: this is how many engineers you need, this is how many data scientists or analysts you need, and, you know, based on what Juan said, here's how much you should centralize versus decentralize. And there's a correct medium there.
[00:44:33] Unknown:
Bringing this to a case study of your work at data.world, I had Brian on the show back in 2017, you know, around the time that I first started the show. That was an eternity in the data world. And so I'm wondering if you can kind of take what we've been talking about of kind of bringing it back down to first principles, being kind of skeptical and practical about the ways that you think about the technical and organizational aspects of working with data and the kind of explosion of the ecosystem and what the different tools and categories happen to be, and maybe bring us on a journey of data.world over that time span, talking about kind of the ways that the scope and focus of the product have evolved along with the ecosystem and some of the ways that you're thinking about your own internal platform development and tooling to be able to support yourself and your customers?
[00:45:29] Unknown:
Well, you know, back in 2017... data.world is now almost 7 years old. In that first phase, data.world was actually focused a lot around building the world's largest open data catalog. And so, you know, we're built completely on a semantic knowledge graph foundation, the same technology that powers the World Wide Web. And we were really focused on creating this, like, Wikipedia and this, like, GitHub for data where everybody could participate and contribute: you know, data.gov, nonprofits, your own data that you're working on in your own projects. It could be public. It could be private. And that still exists today. So you can go to data.world and you can sign up for that today.
But in sort of the broader landscape, even at that time (and this is how fast things change, as you mentioned, Tobias), catalog was still in sort of its 1st gen mode, and it was kinda like the card catalog that you would get from IBM or Oracle or something like that. Right? And just at that moment, a lot of the AI movement really started to blow up, a lot around the modern data stack really started to take off, as Snowflake started to really grow, as Databricks started to shift its model. You know? And so what we found is that a lot of companies really started to struggle with that problem that you brought up earlier, Tobias, of the complexity.
Right? It's like, oh, now we're moving to the lake model. Oh, just kidding, we're actually moving to the lakehouse model. And, oh, by the way, we're gonna do a Lambda architecture or whatever, and it's going to have streaming on one end, a real-time database on this end, and search on this end, and we're going to make it all work together. So things got so complicated. And we switched from the focus being so much on volume, which I think was the 2010s, being like, oh, we need to do open stuff, and we really switched into variety mode and also just multi-cloud mode. Like, where is it? I have no idea. And so that really put a lot more emphasis, in our entire space, on: you need to find your data, you need to understand your data.
Metadata and context are just as important as the data itself. We need to be able to balance SaaS as well as on-prem, my cloud, your cloud, five clouds; we need to be able to balance all of this together. And so a lot of the tooling around catalogs, around observability, around even things like cloud ETL and reverse ETL, is all now revolving around this multivariate stack that we find ourselves in. How do we make that work well? How do we become efficient with it? How do we become resilient with it? I think, from the technology
[00:48:10] Unknown:
and the vision, they're the same. That has not changed. I mean, our vision is to build the most meaningful, collaborative, and abundant data resource in the world, to maximize data's societal problem-solving utility. And to accomplish that vision for the world, you'd better start within your own organization, and that's what we're doing. So that first phase was when you spoke with Brian. By the way, before coming here, I listened to that episode too, and I was just so happy to listen to Brian talking about that. I think that was episode 7 or 9, something like that. That's about right. Now we're at, what is it, episode 200, 300? We're up at around 340-some-odd now. And I listened to it, and I'm like, wow, the stuff that Brian said continues to hold. The technology he was talking about, the RDF, the HTTP, all of that continues to hold, and it is the basis of our foundation.
So the vision is still the same, and we're just like, we've gotta start from the enterprise, we've gotta start with building it from the metadata. And the knowledge graph foundations are still there, which makes us very powerful. We had move, store and compute, and use. I mean, move data means that we can go catalog anything: you can turn anything into triples, into a graph. No ifs, ands, or buts, we can go catalog anything. Store and compute: it's so cool to see our customers doing all these crazy kinds of transformations of metadata, graph queries, graph analysis, all of that stuff. And use: how to go use the data, even search, doing recommendations over your metadata.
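As a rough illustration of the "turn anything into triples" idea described above, here is a minimal sketch of catalog metadata as subject-predicate-object triples with a pattern query over them. All of the dataset, column, and team names are invented for illustration; a real system would use an RDF store rather than a list of tuples.

```python
# Catalog metadata expressed as (subject, predicate, object) triples.
# All names here are hypothetical examples, not data.world's schema.
triples = [
    ("orders_table", "isA", "Dataset"),
    ("orders_table", "hasColumn", "customer_id"),
    ("orders_table", "ownedBy", "sales_team"),
    ("customer_id", "isA", "Column"),
    ("sales_team", "isA", "Team"),
]

def query(triples, s=None, p=None, o=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything the catalog knows about orders_table:
print(query(triples, s="orders_table"))
```

The same wildcard-pattern idea underlies SPARQL queries over a real knowledge graph; because everything is a triple, the same query mechanism works for datasets, columns, people, or anything else you catalog.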
I will acknowledge that one thing that has shifted over time is that we were kind of expecting folks to get more into the data management within data.world. Because in data.world, you can upload data, you can do virtualization; we have virtualization and federation. But when we went out to the market around 2019, 2020, people were thinking about catalogs as just pure metadata. And we would show them, oh, you're actually accessing data and loading data, and that would confuse people, because they said, oh, I thought this was metadata. I'm like, no, I've got metadata first and then data access later. I think that's something where I would argue we've been ahead of our time. What is fantastic is that we are seeing our customers saying, we've kind of, quote, unquote, graduated from the metadata part, because, hey, metadata by itself is just another means to an end; I've gotta do something with the data. And they're like, oh, great, I can now access the data and create these data products in data.world. And, yes, they can live in Snowflake or wherever, but once I find it in data.world, I can actually go use it in data.world, regardless of where it lives. So they're going through that evolution, and we're now seeing them go into the evolution of adding more semantics: what does this mean? So we've been at this for 6, 7 years. I joined the company 3 years ago because we've been so aligned on the vision and the technology. So this is the mission that we're on, and we're still working towards it, and I'm thrilled to be
[00:50:47] Unknown:
here. In your experience of working at data.world, working with your customers, running your own podcast, and exploring the ecosystem and the ways that people are interacting with it, what are some of the most interesting or innovative or unexpected ways that you've seen people think about first principles and how to apply them in the ecosystem?
[00:51:07] Unknown:
I think one of the biggest things that we're seeing that has been interesting is that, at first, kind of like what Juan was just mentioning, folks tend to think of things very... especially larger companies, companies that have been at this rodeo for a long time. They tend to think of enterprise cataloging and governance through a very metadata-centric lens, almost a librarian kind of lens. We've gotta catalog all the things; it has to be organized; apply the taxonomy, organize into the domains, and wrap some policy around it. Right? It's very much that kind of approach.
And I think one of the interesting, innovative, and unexpected ways that we've seen data.world being used, and the catalog space in general trending towards, is: well, metadata is just the beginning. Let's do things with the metadata. Let's use it as a way to actually define, in a policy-driven way, what data you should get access to. Or, you know, we have one customer of ours that is actually leveraging the metadata that they have in data.world to power their targeting engine for personalized digital experiences and ad targeting. It's a really large media firm. And so the metadata that's in the knowledge graph in the catalog actually allows them to power these data products that help other companies provide those more meaningful and personalized experiences. A lot of people don't think, oh, metadata could be part of a data product engine that's intelligently doing ad targeting and personalized experiences. But the truth is that the catalog is your context. It's the context for your business, and that can translate into so many different things. So those advanced AI applications have been pretty interesting.
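The "policy-driven access from metadata" idea mentioned above can be sketched very simply: tag datasets with metadata labels and grant access only when a user's clearances cover every tag. The tags, dataset names, and users below are all invented for illustration, not any real product's policy model.

```python
# Hypothetical metadata tags on datasets and clearances on users.
DATASET_TAGS = {
    "customer_emails": {"pii", "marketing"},
    "ad_impressions": {"marketing"},
    "public_census": set(),
}

USER_CLEARANCES = {
    "analyst_a": {"marketing"},
    "engineer_b": {"marketing", "pii"},
}

def can_access(user, dataset):
    """Grant access only if the user is cleared for every tag on the dataset."""
    required = DATASET_TAGS[dataset]
    return required <= USER_CLEARANCES.get(user, set())

print(can_access("analyst_a", "ad_impressions"))   # True: marketing covers it
print(can_access("analyst_a", "customer_emails"))  # False: missing pii clearance
```

The appeal of this approach is that access rules follow the metadata automatically: tag a new dataset with `pii` and the policy applies without anyone editing per-dataset grants.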
[00:52:57] Unknown:
What I like seeing is when people realize that a data catalog is about much more than cataloging data; it's about cataloging knowledge. And what I mean by that is: catalog the business questions people have. I tell people, remember that data translator role, that knowledge scientist? It was almost like a data therapist. And I tell people to go do this: ask your leadership team, what keeps you up at night? They're gonna tell you something. Alright, put that in the catalog then. I mean, start with a spreadsheet. If you have a data catalog, you should be able to go add that there and also relate that this person in this department is asking these questions. And then you have all that stuff and you're like, wait, I can now see all these different people asking very similar questions. Is this aligned with the business strategy that we have? That's what I mean by cataloging knowledge. Actually, very recently a customer literally said, let's go push the barrier of data.world: we want to go catalog the business processes that we're defining in our BPM engines and so on. I'm like, this sounds super cool, let's go do it. A day and a half later, I'd hacked it up: oh, here it is. And now, you know, we all talk about data lineage; now they're talking about business lineage. They're like, oh, here is this actual decision model.
And if we make a change to the decision model, what business processes is it going to affect? That was super cool to see. We've also been cataloging role-based access control: we have all these different users in the database and all these different objects; does this person have access to this object? And if so, through what roles? That was another experiment we did with another vendor. Our own engineers have used data.world to create a recipe catalog and a Marvel movie catalog; I have my own wine cellar catalog inside data.world. So you can go do all these crazy things with it. And I think it's not just about cataloging the data and the metadata; it's literally cataloging knowledge. You're now genuinely creating a graph of all knowledge. That's what a knowledge graph is.
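The "does this person have access, and through which roles?" question described above is naturally a graph query: user to roles to objects. Here is a toy sketch under invented users, roles, and grants (not any real vendor's RBAC model):

```python
# Hypothetical user-to-role and role-to-object mappings.
USER_ROLES = {
    "alice": ["analyst", "auditor"],
    "bob": ["analyst"],
}
ROLE_GRANTS = {
    "analyst": ["sales_db.orders"],
    "auditor": ["sales_db.orders", "finance_db.ledger"],
}

def access_paths(user, obj):
    """Return every role through which `user` can reach `obj`."""
    return [role for role in USER_ROLES.get(user, [])
            if obj in ROLE_GRANTS.get(role, [])]

print(access_paths("alice", "sales_db.orders"))  # granted via both roles
print(access_paths("bob", "finance_db.ledger"))  # no path: empty list
```

The answer is not just yes/no but the full set of paths, which is exactly the kind of question a graph-backed catalog answers more naturally than a relational permissions table.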
[00:54:52] Unknown:
In your experience of working in this space and working with your customers and on the business, what are the most interesting or unexpected or challenging lessons that you've each learned in the process? I would say that
[00:55:03] Unknown:
the big lesson I've learned is that discovery, and especially governance, is hard, and not just for technology reasons. The systems are all very different, and so you have to accommodate a lot of complexity, be the master of complexity, to then make it simple for end users. But it's also because it involves people, and people are messy. And with this whole metadata layer, and even the data layer, we love to talk about technology; we love to talk about architectures and things like that. But at the end of the day, there are people with questions and people with goals, people interacting with each other, and in some cases not interacting with each other, working in silos. So that's been a big lesson learned: how do we get people to work well together?
How do we help to simplify the experience as much as possible for those people so they can just focus on what they need to get done? And how do we take all that context that's in people's heads, all that tribal knowledge, and how do we get it to go into a system, into the knowledge graph, into the catalog?
[00:56:12] Unknown:
I've got two things. One is incentives: how do you get people to start contributing knowledge? And this is adding to what Tim was saying. My wife has her PhD; she's a behavior analyst. And I've learned this myself: trying to understand people's behavior, and how to change behavior, is something you need to go do. We work with a customer whose way of putting in incentives was this: they wanted to increase the quality of their data because they had identified very specific use cases for how they could make more money and save more money. You know what they did? They told everybody, 20% of your bonus at the end of the year depends on data quality. That's how they're incentivizing. These are really crazy incentives, but incentives are something really important.
And then another aspect, which is not so much unexpected but is just a constant reminder, and Tim said it too: humans are complicated. And with data, if we don't deal with the human aspect, then we're dealing, frankly, with just the easy part. The technology part is the easy thing. When you start talking with the people, that's when you realize, oh, shoot, this is getting much more complicated. That's why it's a socio-technical paradigm shift. There's this joke I heard once, and I repeat it all the time: we can send a rocket to space, we can bring it back to Earth, it can land on a platform in the middle of the ocean, but we still can't say whether these two spreadsheets match. So I can argue, as a proof by example: wait, is rocket science actually easier than data management? And you can say, probably, yeah. Because rocket science is physics; it's natural science; I can study it on its own. With data, I've gotta deal with humans, and that's complicated.
[00:57:33] Unknown:
Yeah, it's a great insight. And so for people who are trying to figure out what is the one piece of foundational infrastructure or tooling or utility that they can lean on to understand their place in the data ecosystem and their organization, what are the cases where data.world is not the right choice for them?
[00:57:59] Unknown:
So definitely, if you're a smaller company, or you have smaller use cases, sometimes a spreadsheet is probably enough, and you probably should start with that to figure out how things break. And again, it's not the tech; it's the social side, it's the people and the process. And then from there, you can figure out what the requirements are. So I think small organizations,
[00:58:16] Unknown:
you don't need data.world. That's my take, at least. Well, I don't know, Tim, how about you? I would say one other thing about navigating the data landscape and where a tool like data.world maybe doesn't make sense. There are sort of offense use cases, which are about productivity: creating new value, creating new data products, collaborating with each other, building data literacy and a data culture. And then there are more defensive use cases, which are more, hey, we've gotta protect our data; we've gotta be focused on security and things like that. I think that if you only care about the defensive use cases, then a tool like data.world probably doesn't make sense. Right? There are tools and technologies out there that are purely focused on, say, scanning all the operational systems and finding all the PII. But if you care about offense plus defense, that's where something like data.world can make a lot of sense.
[00:59:08] Unknown:
As you continue to explore and report on this data ecosystem and the ways to think about it in a holistic manner, what are some of the ways that you are keeping apprised of what's happening in the ecosystem, and how are you thinking about evolving the data.world product to keep pace with it?
[00:59:30] Unknown:
I'll do short term and long term. Short term: everything in data.world is a graph, so we're generating what we're calling knowledge-graph-powered automations. Basically, how can we use the power of the graph to infer new things and give recommendations? I want people to come into data.world and say, oh, wow, I had no idea about that, and now you're telling me what I should do. An example is lineage. Everybody uses lineage for impact analysis, right? If I change this column, what's gonna happen? Or root cause analysis: I don't trust this number; where does it come from? Okay, great. But what comes after that? We just analyzed your lineage graph, and we realized that these are the hot spots. These are the nodes, which represent, I don't know, a view, a stored procedure, some job, where you have so many things going into it and so many things coming out of it. This is a critical bottleneck. And by the way, you don't have any steward associated with this. Our recommendation is that you should assign a steward from these people, for such-and-such reasons, because we can analyze the graph and tell you: this is the prioritized backlog of things you should be paying attention to, and these are the people who should be doing that type of work. Those are the types of automations that we're looking into. That's the short-term view. For the big picture: we're all talking about AI. We're having so much fun with GPT and ChatGPT and all that stuff, and all this AI is so grandiose because it trains on a lot of data.
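The lineage hot-spot analysis described above can be sketched roughly as a degree computation over the lineage graph: flag nodes with high combined fan-in and fan-out that have no steward assigned. The edge list, steward assignments, and threshold below are all invented for illustration.

```python
# Hypothetical lineage edges as (upstream, downstream) pairs.
EDGES = [
    ("raw_orders", "orders_view"), ("raw_customers", "orders_view"),
    ("raw_refunds", "orders_view"), ("orders_view", "daily_report"),
    ("orders_view", "churn_model"), ("orders_view", "exec_dashboard"),
    ("raw_clicks", "clicks_agg"),
]
STEWARDS = {"daily_report": "ana", "clicks_agg": "ben"}

def hotspots(edges, stewards, threshold=4):
    """Return unstewarded nodes whose total degree meets the threshold,
    most-connected first."""
    degree = {}
    for up, down in edges:
        degree[up] = degree.get(up, 0) + 1
        degree[down] = degree.get(down, 0) + 1
    return sorted(
        (node for node, d in degree.items()
         if d >= threshold and node not in stewards),
        key=lambda n: -degree[n])

print(hotspots(EDGES, STEWARDS))  # orders_view: 3 in + 3 out, no steward
```

A production system would work over a much richer graph and could also rank candidate stewards, but the shape of the recommendation (find structural bottlenecks, check for missing ownership) is the same.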
But we don't train on a lot of metadata. Well, guess what happens if you are able to start training on a lot of metadata? Frankly, you can then truly start automating data integration. You can automatically start generating the transforms, the mappings, that integrate source to target. I always say data integration is an AI-complete problem, because you need to have all this context, all these people. But if we start seeing a lot of this metadata, and we start incorporating a lot of the knowledge from people with what we're seeing in the metadata, I hope to prove myself wrong, and we can actually automate data integration. I think that's the big picture.
[01:01:23] Unknown:
I'll note a different gap, because I agree wholeheartedly with the gap that Juan mentioned. To add something: I think that in the data landscape right now, there's a lot of middleware orchestration that we're doing, where a thing happens, and if it's this, then it needs to be a Jira ticket; and if it's this, we need to ping the data engineer and ask them to add a comment; and if this thing happens, then we need to lock this table down. There's a lot of stuff that we have in our heads that is the burden of the data engineer or the admin to take care of.
And a lot of that stuff is business process and tribal knowledge, but it really is metadata as well. And I think a big gap right now is: how do we capture these things that we know and that we're doing, and actually try to automate some of them? Because that's where a lot of the burden will come off of the data team having to constantly be on their toes at all times. And I suggest that this kind of automation probably needs to be part of the metadata layer and the catalog layer, to make our lives easier in managing the data.
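The "when X happens, do Y" orchestration described above can be captured as declarative rules living alongside the metadata instead of in people's heads. The event names and actions below are toy examples, not any real tool's vocabulary:

```python
# Hypothetical metadata-driven automation rules: event -> action.
RULES = [
    {"when": "schema_changed", "then": "open_jira_ticket"},
    {"when": "pii_detected", "then": "lock_table"},
    {"when": "test_failed", "then": "ping_data_engineer"},
]

def react(event, rules=RULES):
    """Return the actions triggered by an incoming metadata event."""
    return [r["then"] for r in rules if r["when"] == event]

print(react("pii_detected"))  # ['lock_table']
```

Because the rules are data, they can themselves be cataloged, reviewed, and evolved, which is exactly the argument for putting this automation in the metadata layer rather than in ad hoc scripts.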
[01:02:44] Unknown:
Are there any other aspects of the kind of broader data ecosystem or the fundamental principles of working with data that we didn't discuss yet that you'd like to cover before we close out the show? Just a reminder that it's focused on the social side. I think the people side is what's missing.
[01:03:01] Unknown:
I believe that a lot of the tech is there. I don't think we need to invent more tech. We really need to just start being empathetic and being more curious. That's my main parting thought.
[01:03:12] Unknown:
I'll add that I think a core fundamental principle sometimes gets lost as we get very excited about technology and architecture, as we rightfully should, and it's ROI: return on investment. What are we putting in, in terms of people, time, energy, and literal money going out to various tools? And then what is the return? It's easier said than done, because I think sometimes, as data people and data organizations, it's very hard to account for the impact that we're having. But if we can do a better job of that accounting, then that's gonna make us focus on the right stuff, focus less on the stuff we shouldn't be focused on, pick the right tools, pick the right approaches and techniques, and really make sure that our people are having the biggest impact that they can.
[01:04:05] Unknown:
Well, you've both already preempted my final question about the biggest gap in the tooling or technology. So with that, I'll say that for anybody who wants to follow along with the work that you and your teams are doing, I'll have you add your preferred contact information to the show notes. Thank you again for taking the time today to join me and share your perspectives on the fundamental principles of the data ecosystem that are starting to become lost in the noise. It's always great to revisit some of those ideas and the ways that we can try to recapture them and bring them into our work. So thank you again for your time and efforts, and I hope you enjoy the rest of your day. Thank you very much. Thank you.
[01:04:48] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Guests and Their Backgrounds
Current Trends and Challenges in the Data Ecosystem
The Role of Metadata in Data Management
Modern Data Stack and Its Implications
Data Mesh and Organizational Impact
Evolution of Data.world and Its Vision
Innovative Uses of Metadata and Knowledge Graphs
Lessons Learned in Data Management
Future Directions and Closing Thoughts