Summary
This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist, Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thoughts on the year to come.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy about the key themes of 2021 in the data ecosystem and what to expect for next year
Interview
- Introduction
- How did you get involved in the area of data management?
- What were the main themes that you saw data practitioners and vendors focused on this year?
- What is the major bottleneck for data teams in 2021? Will it be the same in 2022? One of the ways to reason about progress in any domain is to look at what was the primary bottleneck of further progress (data adoption for decision making) at different points in time. In the data domain, we have seen a number of bottlenecks, for example, scaling data platforms, the answer to which was Hadoop and on-prem columnar stores and then cloud data warehouses such as Snowflake & BigQuery. Then the problem was data integration and transformation, which was solved by data integration vendors and frameworks such as Fivetran / Airbyte, modern orchestration frameworks such as Dagster & dbt, and “reverse-ETL” tools such as Hightouch. What is the main challenge now?
- Will SQL be challenged as a primary interface to analytical data? In 2020, we’ve seen a few launches of post-SQL languages such as Malloy and Preql, and metric layer query languages from Transform and Supergrain.
- To what extent does speed matter? Over the past couple of months, we’ve seen the resurgence of “benchmark wars” between major data warehousing platforms. To what extent do speed benchmarks inform decisions for modern data teams? How important is query speed in a modern data workflow? What needs to be true about your current DWH solution and potential alternatives to make a move?
- How has the way data teams work been changing? In 2020, remote seemed like a temporary emergency state. In 2021, it went mainstream. How has that affected the day-to-day of data teams, how they collaborate internally and with stakeholders?
- What’s it like to be a data vendor in 2021?
- Vertically integrated vs. modular data stack? There are multiple forces in play. Will the stack continue to be fragmented? Will we see major consolidation? If so, in which parts of the stack?
Contact Info
- Maura
- Website
- @outoftheverse on Twitter
- David
- @davidjwallace on Twitter
- dwallace0723 on GitHub
- Benn
- @bennstancil on Twitter
- Gleb
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey. And today, I'm joined by Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy to talk about some of the key themes of 2021 in the data ecosystem and some of the things that we're all looking forward to for next year. So
[00:01:46] Unknown:
going in order, Maura, if you can start by introducing yourself. Thanks, Tobias. My name is Maura Church. I run the data science and data engineering teams at Patreon, which is a platform for creators and artists to make money from their fans. I lead a full stack data science team of about 15 folks and then a data engineering team of 5.
[00:02:02] Unknown:
And, David, how about yourself?
[00:02:04] Unknown:
Hey, everyone. I'm David Wallace. I'm a data engineer here at Dutchie. We are the world's leading cannabis ecommerce and point of sale platform right now, and the last year of my time here has been spent building out the data platform.
[00:02:16] Unknown:
And, Benn, if you can introduce yourself. I'm Benn Stancil. I'm one of the founders of Mode. We build products for analysts and data scientists for creating analysis and distributing it out to their organization. I do a few different things there. I help run our internal data team, and then spend a good bit of my time yelling into the void on the Internet.
[00:02:35] Unknown:
And, Gleb, how about you? Yeah. Hi, everyone. I am Gleb, CEO and founder of Datafold. Datafold is a data reliability platform, so we help data teams never worry about the quality of their data. And we do this by automating some of the most tedious parts of the data workflow, like testing ETL changes to your pipelines, providing visibility into data flows, and doing anomaly detection. Before Datafold, I was a data engineer at a number of companies, including being one of the founding data team members at Lyft. So I got to build a lot of things there and also caused some really bad data incidents, some of which inform my current work at Datafold.
[00:03:18] Unknown:
And going back around to you, Maura, do you remember how you first got involved in the data ecosystem?
[00:03:22] Unknown:
Yeah. So I initially thought I wanted to be a software engineer and then found myself being more sort of a code monkey implementing features and wanting a broader range of sort of what a company did and all the data that it was involved with. And so that was my interest in data science. I'm a huge math nerd and a huge stats geek. And so I wanted to have a role that would let me see a broad view of the company. And that started with trust and safety and spam and abuse at Google, which is where I started my data career.
[00:03:45] Unknown:
And, David, how did you first get involved in the data ecosystem?
[00:03:49] Unknown:
I would say about 5 ish years ago, I actually started more in, like, the analytics consulting realm of things. I worked at a small startup out of Philadelphia called RJ Metrics. Shortly after that, I think I just realized that I liked building things a lot more than doing the analytics side of things, and I started transitioning more into, like, data engineering and data platform style work.
[00:04:08] Unknown:
And, Ben, how about you? I started my career doing policy research. It's data work ish. It's like the DC version of data work where you work with data, but in a very different way. Basically, sort of accidentally found myself at a tech company doing analytics work essentially, and essentially met people there and sort of found my way sort of falling deeper and deeper down the rabbit hole, I guess. And, Gleb, how about you? So I started a few years ago as a software engineer that was
[00:04:36] Unknown:
at the time, which feels like a completely different world in terms of the data domain, back in a time when building a desktop Java app to collect metrics and display them as an Excel spreadsheet still seemed like a good, sustainable pattern. And then my career progressed as a data engineer, so I was building a lot of data pipelines, breaking things, and started building tools. And since I started building tools back at Lyft, I never could really stop doing that, because I've always been frustrated by how inefficient the workflow was and how much I was doing manually as a data person, so I'm continuing to do this now as part of Datafold.
[00:05:13] Unknown:
So as I mentioned at the open, the core topic that we're focused on here is talking about some of the trends and themes and events that we've seen in the world of data over the past year or so kind of in the time honored tradition of the end of year recap. So I'm gonna just kind of open it up to everyone now and just ask what were some of the main themes and most notable events that you saw in the overall space for data practitioners and vendors and just the areas that we've all been focusing on and talking about.
[00:05:45] Unknown:
Two things really come to mind for me as someone who's been leading a team for the past year. One is, like, the explosion of tooling and the tooling space. I think as someone who is a customer for data tools, my inbox is filled with people wondering if Patreon wants to use their data tools. And a lot of that, I think, is in the observability and reliability space. But then there's also been this huge shift I've seen in the last year of connecting your data warehouse to many, many other things and all the tools that it can take to do that and do it in a way that is smart and stable and sort of visible from other parts of the ecosystem. The other major trend to me seems like an increase in the conversation about how a data team should function. I think prior to 2021, it felt like, you know, there's this big debate around, like, centralized versus embedded teams. But I think in the last year, there's really been more of a question of, like, who is a data team? Are they a product team? Are they an analytics engineering team? How do those things relate to each other? And so I sort of saw the tenor of that conversation shift from just where does your team sit to what is your team and how does it interact with others.
[00:06:45] Unknown:
I think too with the introduction of the data mesh becoming more visible and generally people becoming aware of it, it also colors that question of, do you need a data team? Is the data team just the people who run the platform, and everybody else is just an application engineer that happens to work with data? So, yeah, I think the kind of organizational questions are definitely in flux right now, particularly with the rise of new job titles with the, you know, increasing presence of the analytics engineer and the, you know, continued rise of machine learning engineers and all of the different, you know, data product engineers, the different proliferation of roles as people try to figure out what it is that I'm actually even doing with all of this data. So I have a question about that, Tobias, since you talk to a lot of people.
[00:07:29] Unknown:
Is the data mesh real? Like, do we know what that is? Because you said the rise of the data mesh and, like, us all seeing that, like, is it though?
[00:07:40] Unknown:
Well, I guess it depends on where you spend your time. And I've definitely been seeing more commentary about it. It's not as ubiquitous certainly as things like data catalogs, data observability. But I think it is starting to filter into the kind of general awareness, probably helped along by the fact that the person who kind of kicked off the whole idea is nearing completion of the book titled The Data Mesh. So
[00:08:04] Unknown:
it'll be interesting to see how that manifests in the next few years. But, yeah, I do think it's still very much in the kind of early nascent phases of people trying to figure out what does this mean and do I care? I think it's real. I think we'll see it implemented slightly differently than the way it's being pitched in the next few years. I think there are a lot of companies right now that are taking a much more tactical approach to describing and implementing what the data mesh may look like. And I think what we're starting to see is that it will probably end up in the data platform first, the realm of the data platform. What I mean by that is you see a lot of tooling right now that are implementing features with the state of mind that data teams sometimes operate in silos, but there are almost implicit dependencies sometimes between the artifacts that are produced within those teams. Right? So an example of this is sometimes the BI team produces a data artifact that is then leveraged by the data science team or something like that. Right? I think a lot of tooling right now is acknowledging that world and saying, hey. These are artifacts. These are data products the same way that they're being described in the data mesh. We need to find a way to actually describe those dependencies between them and express them using tools and technology.
[00:09:08] Unknown:
Yeah. I think the idea of data artifacts being a sort of component of the ecosystem is also something that's been gaining ground recently, where different vendors have been talking about data artifacts and data products. And I think that brings to mind too some of the work done in the DevOps ecosystem where you have an artifact repository. So every time you build a unit of software, it goes into that repository so that you can then consume it downstream. And so you have this reproducible way of being able to say, at any point in time, I want to be able to understand what was the state of this system at this point in time given all of the inputs. And so I think that we're starting to recapture that idea in the data space.
[00:09:45] Unknown:
And my sense is that the reason that so much attention is being paid to this right now is just because I think we all can see that it's easier than it's ever been to produce data artifacts. Right? Like, the tools that we've received over the past 5 years or so have just made it so easy to publish data products and curate them in a way that anyone can use them, which has also led to the rise of, you know, things like reverse ETL, like operationalizing
[00:10:08] Unknown:
data and stuff like that. So I definitely think that that's why so many eyes are on the problem right now. To the points that I think rolled up around data reliability and kind of accessibility and democratization of data, whether we call it data mesh or not. We ran a survey of about 200 data teams earlier this year, and one of the questions was, what are the KPIs that you are having this quarter, next quarter? And we gave a whole bunch of different answers ranging from, you know, infrastructure to democratizing data, improving data quality. So the top 2 KPIs by frequency that were mentioned by data teams were, one, around improving data quality and reliability, and second was around improving data accessibility and collaboration around data. So it sounds kind of supportive to that theme of whether, again, we call it data mesh or not, that data teams maybe take a step back from being the bottleneck or the single point of producing all the data artifacts to enabling others to actually leverage data. And it sounds like right now, the big challenge, at least what we've drawn from the survey, is around, one, making sure that whatever data is in the realm of the data team is reliable, that the products that are produced are trustworthy and reliable, and, also, how can they involve the larger organization in creation of data and consumption of data.
[00:11:27] Unknown:
And I think that we'd be remiss in recapping the year and getting too far along without using the term modern data stack, because that's definitely been sort of the biggest piece of buzzword bingo this year, of everybody saying, oh, it's the modern data stack, and then everybody else wondering what are people talking about, trying to define this term. And all of the vendors saying, oh, that sounds good. I'm gonna call myself part of the modern data stack. So discuss.
[00:11:52] Unknown:
I've used a definition before, and I stand by it, that the modern data stack is just data tools that were released on product time. Like, it's tools that were aimed at a Silicon Valley audience that were released somewhere between, like, 2011 and now, that are kind of, like, consumerized data products that aren't, you know, Oracle's new database. It's the random YC startup that is trying to do a data thing, that there are now 50 of a year, that are all, like, launching themselves on Product Hunt. To me, that's as good of a definition as I've got. I think it's a pretty great definition.
[00:12:27] Unknown:
I would actually push back a little bit there, because I feel like when people say modern data stack, we kind of implicitly mean different things. To some, it's about awesome user experience and maybe modularity and a low pricing point to start that also scales, and everything supposedly plays nicely with each other, all the tools. I think we're still seeing, you know, companies and products that try to adopt, like, an older approach of vertically integrated solutions to data and Mails on Product Hunt. So are there any more, like, characteristics of vendors or tools or solutions that we would say belong to the modern data stack, or is it just a temporal thing? Yeah. I would, like, certainly, it's not strictly temporal.
[00:13:15] Unknown:
Like, there's a question here to me of, like, so Oracle released, like, the Oracle autonomous database or whatever, like, last year. Is that a part of the modern data stack? And it's like, I don't know. It's Oracle. Feels like not. I think most people would kinda just reject it sort of on its face. The question, though, is, like, what about vertical tools? So Amplitude, or whatever the current version of Amplitude is, is that part of it? And, like, kind of no to me, except I think we're gonna see a lot more of that style of thing where it's more vertical. Not necessarily vertically integrated, like it cuts all the way down to, like, a logging framework, but vertical where it's like, this is aimed for a particular audience that is more narrow. There's kind of an application built on top of the modern data infrastructure or whatever, such that it is still very much a part of that ecosystem, but it doesn't kind of fit the same paradigm of we're all trying to build a platform for the next data thing. Like, there will be a point at which we realize platforms aren't worth it, that we just need to make apps that make money, and we need to stop trying to build iOS, and somebody's gonna come along and be like, I can actually make a whole bunch of money building just Instagram. I think that is a day that comes relatively soon. And then once that happens, everybody will start building Instagram instead of everybody trying to build a platform.
[00:14:25] Unknown:
Yeah. That example of sort of the modern data stack being things that integrate with the modern data infrastructure, like things that integrate with the cloud data warehouse and make your cloud data warehouse more accessible to the other parts of your organization, feels like the closer definition to me. You know, Patreon uses Amplitude. I would not consider Amplitude part of the modern data stack, because they don't easily integrate with Datafold or, like, DBT, and even our Redshift integration is, like, challenging. And so I think maybe part of it is, like, it's not quite ease of integration, but something around the tie into that infrastructure, and the cloud infrastructure specifically, feels like the delineation to me, whether it's a mesh or not.
[00:15:03] Unknown:
I think that one of the common themes in this sort of current batch of products and businesses that have been gaining a lot of popularity in trying to adopt the mantle of modern data stack is the kind of general sense of coopetition among them. Everybody says, you know, I wanna be the best of breed in my category, but I'm not gonna bad mouth everybody else because we're all in this together. And, you know, we'll gain more ground by kind of working together and trying to figure out what are the useful abstractions and useful kind of dividing lines across the life cycle of data than we will trying to, you know, snipe at each other and shoot each other down, because then we're all gonna lose. Gleb, I'm sure you experienced this as well. Speaking as a vendor in this space, there's a whole lot of, like,
[00:15:45] Unknown:
the friend of my enemy is my friend kind of thing. Wait, we all wanna be partners with each other, but we want you to be partners with me most of all, but you gotta be partners with everybody else. Like, there definitely is a lot of trying to play Switzerland in this, which I'm not sure how all that plays out longer term.
[00:16:00] Unknown:
I think what's also interesting about the modern data stack and being a vendor nowadays is that I think the part that data communities play in the selection of vendors is probably very different from what it's been even a few years ago. So I think right now, basically, lots of data practitioners spend time and meaningfully engage with other data practitioners in different companies and different places, ranging from Slack channels like Locally Optimistic to dbt Discourse forums. And it seems like this is the place where the hive mind actually generates opinions and decisions about what is the optimal way to put together a stack, what are the vendors. And I think in certain ways, it also influences not only how vendors decide to promote themselves, but also certain rules of the game, basically. You have to be part of the community, so you can't really be mean to other vendors or to anyone.
And I think it also sets a pretty high bar for vendors in terms of even how they do marketing and what kind of content they promote, because they get an immediate feedback loop. And if they go too aggressive with just a sales, no-value-add pitch, which I admittedly sometimes also do, there's immediate feedback that you get from the community. And so I feel like, overall, it becomes a healthier ecosystem.
[00:17:17] Unknown:
Yeah. If the outcome of the modern data stack is we all get nicer towards each other, that's probably not a bad outcome for us to have as a broader data science and engineering community. So okay. So is that healthier, though? Like okay. It's definitely healthier for the people,
[00:17:29] Unknown:
but does that actually get us to a better place with tools? Like, there's a certain level of, like, competition helps, and it helps sort of weed things out. It helps develop standards. It helps make sure everybody sort of, like, agrees to what's going on. I've never thought about that before. I think that's true that, like, basically, everybody existing together in these ecosystems, they tend to be pretty nice to one another, at least in public. Like, does that actually get us to a place where we're solving problems the best? Or do we end up trying to, like, do a little bit too much kumbaya and all of the fighting is kind of happening under the water? Are we all playing, like, water polo, where, like, we're all just kicking and screaming at each other under the water, but, like, it looks kinda nice on top? Actually, water polo looks nice on top. I guess it's probably pretty brutal on top too, but you get the idea. I think that there's definitely been a lot of kind of
[00:18:13] Unknown:
crowded markets where, you know, different themes crop up and everybody jumps in, and, you know, whether it's that everybody happened to have been working on the same thing before they went public with it and they all just happened to come out at the same time, or if it's a lot of kind of copycats of, oh, that's a good idea. I'm going to do it with this slight twist, is debatable. I don't really know sort of how that has played out in different areas. But I do think that because we have a number of different businesses that are working in these different spaces, it gives opportunities to explore slightly different tangents and manifestations of these ideas. So, you know, some of the markets that have become particularly, I guess, vocal, is one way to put it, are, you know, data observability is one. Data catalogs and discovery is another.
You know, data governance is one that's still a little nascent, I think. And, you know, but each of these, they definitely have a number of different vendors and open source projects that are all competing for attention and mindshare. And I think that we're still early enough in the kind of global data journey that there's opportunity for all of them to succeed to an extent, but there will definitely come a point, I'd say, in the next, if not year or 2, within the next 5 years, where there's going to be a cycle of consolidation across these different product categories to say, okay. This is the clear winner. This is what makes the most sense in the most cases. You know, there will still be, you know, maybe a long slow death of some of these companies or projects. But I think right now, we're still very much in the, you know, everybody let's explore the space kind of phase, where before, you know, that was the case for, like, the Hadoop ecosystem of how do we do distributed systems and massive data? And now, okay, well, that's mostly been solved, and we realized cloud data warehouses and, you know, some of these data lake technologies have won out. So Hadoop is kind of, it's around, but nobody really talks about it anymore.
And I think that we're entering that cycle of the kind of higher order data experience, and, you know, the foundational technologies have been, quote, unquote, solved to some approximation, and now we're figuring out, okay. Well, now how do we actually make use of these systems and work together on them and bring this to an organizational scale, not just a technological scale?
[00:20:24] Unknown:
I agree with that, though. I think this is kinda where, again, I've never thought of this before. But I think, like, the competition part actually might make this worse, where, like, Census and Hightouch, those 2 companies seem to wanna kill each other. Like, they basically are, like, constantly putting, like, speed tests and stuff on their Twitter accounts. That probably makes for a better product. Like, Snowflake, Redshift, BigQuery definitely wanna kill each other. Redshift kinda sold everybody on Snowflake, which is a little weird, but that's another story. But, like, I think that makes databases better. BI tools, so the space that Mode is in, there has always been this sort of delicate dance of, like, we all do slightly different things. Let's carve out our own niche. Okay. Part of that is it's a complicated space, but part of that I think leads to everybody sort of looking for their angle rather than just saying, like, we're gonna make a really amazing thing and go directly at another one and just make a better version of what they have. And so I've never thought about this before, but it's like, it's an interesting dynamic that I think some of the friendliness of the space is maybe creating, where I think that will eventually shake out. But, like, the products are still kind of trying to find their niche rather than say, no. We're just gonna make a better version of this.
[00:21:29] Unknown:
So spitballing here. Is it the case that the reason that we're all being so nice to each other is that all of these companies are still very early stage and run by nerdy engineers who don't wanna get into public fights? And we're just waiting for the, you know, business people and salespeople to take over, and then we'll actually take the gloves off.
[00:21:45] Unknown:
Yeah. It's the team. Once you get enough salespeople in the door, then the dynamics start to change. But I do think, Benn, to your point, there's an interesting power that focus has. Like, having focused competition, like, having Datafold and Monte Carlo focused on observability and reliability, means they're probably gonna create a better product than if, like, Amazon goes out and creates its own observability product on top of Redshift. And so I do kind of see this world where we have these super focused products that actually solve the problem. And then eventually, some sort of, you know, rebundling or acquisition spree in late 2022, 2023
[00:22:16] Unknown:
that actually makes these fully vertically integrated for a given team or a given stack. I actually really agree with that. I can personally say that I have firsthand experience with the focus problem. I come from a company called RJ Metrics, as I said previously, where it was very early in the BI stages of things, and we were effectively trying to be full stack. Right? We were trying to be the data warehousing portion, the transformation portion, and also the BI visualization layer. And, unsurprisingly, we sort of got our lunch eaten by, you know, the Lookers and the Modes of the world. But the interesting part about this is that a small company that some of you may have heard of called Stitch actually spun out of RJ Metrics, which is effectively like a hyper focused version of what RJ Metrics was already doing, which is data pipeline as a service. And as we can now see, that's an entire industry unto itself. Right? So I think I wrote about this at Mode a number of years ago, but I think this is a trend that we've seen coming for a long time, which is focus is good; trying to do everything, you end up, you know, being a jack of all trades, master of none. And when we can get a bunch of people focusing on hyper specific problems, we ultimately end up with better tools. The issue that we're running into now, which we're sort of all dancing around, is, I think Nick Schrock from Elementl said this. I'm stealing his quote, but I love it. He said that we're leaving the age of big data, and we're entering the age of big complexity. And I think that's almost exactly what's happening right now. I think all the problems we're talking about right now have to do with basically how inherently complex and entangled the modern data platform actually is because of this sort of natural trend towards hyper specialized modular tooling.
[00:23:40] Unknown:
Yeah. And that's why we're now starting to see the rise of companies such as Mozart, where you get your modern data stack in a box and you don't have to do the tool selection and integration. We've done that for you. So we've gone from, you know, we have this one end to end solution, to we have lots of point solutions, to now we're an end to end solution that's actually just a bunch of point solutions under the covers.
[00:24:00] Unknown:
But I think complexity is also not just in the number of integrations, the number of tools you have to select from, but it's also in how much content or data products or data artifacts you're creating. And I would definitely agree with Nick. Basically, we're, like, in a big metadata world, in a world where it's not unheard of for companies to have tens of thousands of tables in a warehouse, millions of columns. And we recently started working with a customer claiming to have millions of tables and billions of columns, which is basically an enormous amount of metadata. So definitely beyond the human capacity to deal with. Why are they doing that?
[00:24:39] Unknown:
The question then just goes, why do they have millions of tables? Let's go down there.
[00:24:46] Unknown:
Because they didn't buy a data catalog early enough. Nobody knew they already had that table.
[00:24:52] Unknown:
I think a big part of that is data becomes, to a large extent, generated by software and not just, I mean, like, events that aren't just in the warehouse. It's basically a lot of processes around experimentation and automation and machine learning are now fully automated. So every single version, every single experiment is now getting potentially entire versions of pipelines recreated. And for auditability reasons, companies sometimes just prefer to keep everything, just pay for storage, which is cheap, rather than try to clean it up and potentially risk the downside of making mistakes there.
[00:25:26] Unknown:
Yeah. That was kind of the whole promise of the initial era of big data of just throw everything in the data lake. You never know what you might need, and, you know, it could be useful someday, which is also what happens when you talk to a hoarder. So
[00:25:38] Unknown:
Yes. It's kind of interesting. I think that, like, the term data janitor has gotten a bad rap. But I do think that in a world of increasing complexity, like, we need a function, either people or tools, to be the data janitor to say, like, maybe you shouldn't have millions of tables in your warehouse, and maybe these, you know, 45,000 of them have never been queried in the past 5 months. So it will be sort of interesting to see, with that increasing complexity, like, the tools that come in and the roles maybe and the functions that come in to make sure that not only is your data abundant and reliable, but that it is actually interpretable and useful for the organization and not just sort of a big spaghetti mess in a warehouse.
[00:26:14] Unknown:
Yeah. And going back to the sort of themes of the year, I think one of the other ones is the growing awareness of metadata as an important first class concern of anybody who's working with data, because we need to have this consistent and universal view of what information we even have, what form does it take, how is it being accessed, what are the producers and consumers of it, and being able to figure out how do we actually make this ubiquitous, where, you know, up to now, it's been a lot of different tools will have their own metadata layer that they use internally that are useful for their specific use case. But now we're starting to see the need to actually break down those, you know, metadata silos and say, no. Actually, I need to integrate the metadata from, you know, this tool, you know, whether it's my Fivetran metadata with my metadata that Datafold is owning with my metadata that I'm relying on for my Superset and the metadata that is being produced and consumed by Hightouch or what have you, and just saying, you know, no. I don't want you to own each of your own metadata catalogs. I want you to talk to the one that I control so that I can see everything that's happening in my data ecosystem. I think that that's still sort of the utopian vision that hasn't been realized yet, but more people are starting to awaken to the fact that that is a necessary capability in order to be able to actually go to the next stage of data evolution.
[00:27:33] Unknown:
Yeah. I agree. We actually had a really interesting use case for metadata when I was at Good Eggs, similar to sort of what you're describing, which is one of the things I noticed about DBT is, you know, it's a great tool. I think we can all agree about that. But I think that one thing it inevitably leads to is uncertain life cycles of artifacts. Right? I think I used the term artifact atrophy or model atrophy before, where it's like, you know, you can build it once, but if you don't use it, who knows how good that artifact is at a certain point. Right? Historically, in the past at Good Eggs, we actually built a pipeline that managed the metadata from DBT and also from Mode that effectively said, if someone hasn't queried a Mode report or looked at a Mode report in, you know, x number of days, then automatically deprecate that Mode report. In addition to that, you know, any artifacts that were fed into that Mode report, that were only fed into that Mode report, deprecate them as well or mark them for deprecation. So I think that that kind of thing is only possible by combining the metadata between all the pieces of various tooling that we use. It solved a lot of really critical pain points for us as well.
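As a rough illustration of the kind of deprecation logic being described here, this is a minimal sketch in Python. It is hypothetical, not the actual Good Eggs pipeline: the report names, the 90-day threshold, and the idea that usage and lineage metadata have already been extracted into plain dictionaries are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical, already-extracted metadata: when each BI report was last viewed,
# and which warehouse models feed each report.
report_last_viewed = {
    "weekly_revenue": datetime(2021, 12, 1),
    "legacy_funnel": datetime(2021, 6, 15),
}
report_dependencies = {
    "weekly_revenue": {"fct_orders", "dim_customers"},
    "legacy_funnel": {"stg_funnel_events", "dim_customers"},
}

STALE_AFTER = timedelta(days=90)  # assumed threshold


def plan_deprecations(now):
    """Return (stale_reports, orphaned_models) based on the metadata above."""
    stale = {r for r, seen in report_last_viewed.items() if now - seen > STALE_AFTER}
    live = set(report_last_viewed) - stale

    # A model is marked for deprecation only if every report that uses it is stale.
    used_by_live = set().union(*[report_dependencies[r] for r in live]) if live else set()
    used_by_stale = set().union(*[report_dependencies[r] for r in stale]) if stale else set()
    return stale, used_by_stale - used_by_live


if __name__ == "__main__":
    stale_reports, orphaned_models = plan_deprecations(datetime(2021, 12, 20))
    print("Mark reports for deprecation:", stale_reports)
    print("Mark upstream models for deprecation:", orphaned_models)
```

In a real setup, the two dictionaries would be populated from the BI tool's usage metadata and the transformation tool's lineage graph, and the output would feed whatever deprecation or archival process the team already uses.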
[00:28:37] Unknown:
Yeah. And that brings us to the natural progression of DBT as the other kind of breakout topic of 2021, where, you know, they've just recently hit 1.0, so that's definitely noteworthy. It has, you know, given rise to a whole industry of analytics engineers. You know, everybody says, okay. The day of SQL has come again. So being contrarian, I guess, what are the problems that DBT is starting to build up, and what is the ticking time bomb in this overall space of, you know, SQL and Jinja that we are not yet willing to come to terms with, and what is a sort of potential future solution for being able to get the benefits that DBT is offering in a way that is reducing the amount of potential technical debt that we're accruing in the process?
[00:29:25] Unknown:
I think I already played my hand here, but I think the absolute biggest problem is the endless proliferation of data artifacts and overwhelmingly a lack of any kind of curation for those artifacts.
[00:29:36] Unknown:
I have some real skepticism about this, like, Jinja thing, honestly. Dbt does it, and, like, it works. It feels a little bit like a hack. Dbt has built it, and, like, you can do a ton of stuff with it, and Mode has some implementations of things that sort of do this. We don't sort of embed it quite in your, like, infrastructure the way that you see this. But the way they have talked about this before is sort of like it is like React or some JavaScript framework that is like templatized HTML, basically. They are templatizing SQL.
Okay. Same idea. Oh, maybe. The reason to me it's a maybe is, with templatized HTML, you see when it breaks. If you templatize a bunch of SQL queries that are like black boxes, you have no idea what they do. They generate a bunch of, like, really crazy SQL. It's like trying to debug some, like, LookML-generated thing, except you can't judge the truth, like, how true the thing is that comes out the other end, because it's kinda like, by definition, it's the metric. Like, is it right? I don't know. It's what the metric says. I guess it's right. You don't have anything to compare it to, whereas with the web page, it's like, well, this looks really wonky. And so I think there is some danger in how far we go down that rabbit hole of sort of Jinja on top of Jinja on top of Jinja, to the point where one of the sort of beautiful things about DBT in the early days is it's all SQL, which is, like, easily legible.
If that changes, at some point it becomes as incomprehensible as, like, a complicated Airflow DAG.
[00:30:58] Unknown:
But instead of it being a language like Python, it's a templated language, which kinda feels like a nightmare. I totally agree. I think there's, like, interesting... and I know that my view here is different from some others. But I think when you have that many layers of abstraction, you sometimes lose the business context of where you're starting and how the data is ingested and also where it's going. And I do think that there's value. Like, when you imagine the role of an analytics engineer, there is value in that person understanding the origin of the data and then the consumption of it and not just sort of making these abstraction layers of metrics that are gonna be consumed all over the business. So I do think that that's one of the challenges it presents, that you sort of could create this role that is very abstracted from the actual context of the business rather than having a data scientist or data analyst or data engineer who is sort of, like, vertically seeing the whole journey of the data all the way to its consumption point and making sure that it's making sense along the way, and not sort of a black box that's subtracting all of the actual business value.
[00:31:53] Unknown:
I think one of the challenges of really betting 100% on SQL is that what we're currently seeing happening in a data platform is that it's inherently a multi language, multi environment, multi paradigm system. Right? And SQL is maybe good for operating on simple data transformations within the warehouse, but there is a lot of machine learning that typically happens in systems like Python. And I think that one of the challenges with having a super SQL focused solution that now becomes the hub of everything is how do you integrate, how do you glue that with everything else that's happening, how do you build data applications that are not 100% SQL. And I think we're starting to see some integrations with, for example, tools like Dagster that sort of wrap DBT under the hood, but then you also start losing some of the magic that DBT provides as this fully integrated experience. And I think DBT has grown to a large extent because they enabled lots and lots of data teams that otherwise would have no tooling to actually organize their transformations.
I think what'd be really interesting to see is how DBT makes its way up in organizations that have mature data stacks with existing Airflow, Dagster, and, you know, more sophisticated multi language orchestration setups.
[00:33:08] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo swag box.
Yeah. So a couple of points here. One, I'd like to pile on the Jinja question by saying that, you know, when you first start down the path of, oh, I just wanna be able to template this one string. Okay. That's fine. But then as you evolve, you've got 15 layers of Jinja inheritance with macros thrown in there, and you have no idea what's happening until you just run it. And having spent 5 years working with SaltStack, which is a similar case of YAML and Jinja and trying to figure out how does this all compose together? I don't know. Just run it and see what breaks. I think that that's where we're starting to head with the DBT ecosystem and this idea of just templated SQL. But the other problem of, you know, SQL being the primary interface is that it is inherently locked to the platform on which it's being executed and built. So, you know, one of the promises of DBT is, oh, it's, you know, composable, it's flexible and reusable, but only within the bounds of that organization and only for so long as they stay on the same database engine, unless you wanna go down the path of having 15 layers of Jinja inheritance.
And so I think that that's definitely one of the complexities and challenges that we all are trying to come to terms with is, you know, how much do I want to accept the fact that I'm locking myself into this database engine and, you know, down the road, maybe I'll need to refactor it to run, you know, from Snowflake to BigQuery or vice versa. And how much do we say, no, I need to have this, you know, completely generic logical layer that sits above the specific syntax of the database engine so that I can be, you know, multi platform or so that I can have this reusable component that I can deploy as a vendor even to multiple different database engines.
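To make the layering concern concrete, here is a minimal sketch using the jinja2 library directly. The macro, the model text, and the ref() helper are made-up stand-ins for dbt-style templating, not dbt's actual internals; the point is just that the SQL that really runs only appears at render time.

```python
from jinja2 import Environment

# A made-up macro layer: per-engine date truncation, the kind of thing teams
# template in order to stay portable across warehouses.
MACROS = """
{% macro date_trunc_day(col, target) -%}
  {%- if target == 'bigquery' -%} DATE_TRUNC({{ col }}, DAY)
  {%- else -%} DATE_TRUNC('day', {{ col }})
  {%- endif -%}
{%- endmacro %}
"""

MODEL = MACROS + """
SELECT
  {{ date_trunc_day('created_at', target) }} AS order_day,
  COUNT(*) AS orders
FROM {{ ref('fct_orders') }}
GROUP BY 1
"""

env = Environment()
rendered = env.from_string(MODEL).render(
    target="snowflake",                    # swap to "bigquery" and the SQL changes
    ref=lambda name: f"analytics.{name}",  # stand-in for a dbt-style ref()
)
print(rendered)
```

One layer of this is easy to read; the concern voiced above is what happens when macros call macros across many models and the rendered output is the only place the real SQL is visible.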
[00:35:38] Unknown:
Can I go back to the Jinja thing for a sec? Because, Gleb, I have a question for you about this, which is also, like, this is a preview of my rant for tomorrow. So okay. So I am kind of an observability hater. In that, to me, it solves the problem of sort of, like, statistical anomalies in the data, but it doesn't solve the problem of, like, semantic ones. It doesn't solve the problem of this thing actually just doesn't represent what I think it represents. And the reason I bring it up is, like, this Jinja stuff feels like it adds a ton of semantic complexity that goes beyond the kinds of things that you would get. Like, your data can come in fine. Your pipelines can be running. Everything can be shaped the way you expect it to be shaped. Sort of all of the Datafold slash Monte Carlo slash Bigeye slash whatever else sort of dashboards are green. And yet, the number that you see on the dashboard is wildly off. And, like, that to me is the biggest... it's like, observability is a thing that is meant to make me sleep better at night, and that is the problem. And all of my numbers are built through sort of layers of Jinja or even just regular dbt models and dashboards, but especially with layers of Jinja.
The observability dashboard is, like, nice, but I still have to go digging through all these sorts of things. And, like, I'm still scared when I open a dashboard to know if, like, my exec is gonna kill me when this thing opens up wrong. So, like, or really what's gonna happen is 3 months later, they'll find out, and they'll kill me 3 months after the fact that they told the board the wrong number. So, like, how do you think about that problem? Why, you just go to a different job. This is why careers are so short in analytics engineering. It's like you have a year or so until you blow up an exec team. How do you think about that problem, where the sort of, like, statistical and structural problems underneath observability are in some ways the less hairy ones, and the ways that you really get yourself, like, shot in the foot are these other ones that are higher level? I totally agree, Benn. And I think observability
[00:37:27] Unknown:
is quite a vast topic. And if we think about, okay, what observability means, there are multiple tools and different vendors kind of presenting their own, you know, view of observability. But in my mind, fundamentally, it comes down to really understanding what's happening in your data environment at every point in your workflow. So when you develop things and you write all this Jinja and all this SQL, how do you know that you're doing the right thing for the business? And you probably have some model, some semantic model of how the business works, and you wanna lay it down in SQL. And I think where it can break is that we just can't think about all the edge cases, all the different values that can come in, or all the different queries of SQL. SQL is not inherently testable.
And so this is one area where this kind of semantic understanding of SQL can break. And I think where observability can help you here is, for example, every time you build anything, you run a Mode query or you test a DBT model, just know exactly what you get out of your table. Know the distributions. Know what are the values, both on a statistical and a value level, what comes out of it. And people are doing it naturally. Right? We all prototype. We iterate. We kind of run things, and we build a staging environment. Then we do, like, select star, count star, group by things. And I think what observability can help with here is just making this, what's currently manual work, faster. I think another example of where things break between kind of semantics and the actual data is when you're making changes. Again, because of the complexity of SQL and because of multiple layers of abstraction, there is a lot of complexity you have to manage, and data teams move fast. Right? And then we are always evolving our pipelines, and ensuring that every time you make a change, you actually know exactly what's going to happen and what's gonna be the impact on your business metric is actually a pretty hard problem and also very prone to failure. And I think those are things that you actually don't need statistical analysis or anomaly detection for. Those are things that you need to embed as part of the workflow. Just making sure that whenever someone is modifying the logic, either writing logic or modifying it, they have a full understanding of what they are getting as a result. And I think then we'll actually get to a much better place in terms of, you know, not being surprised by data.
[00:39:44] Unknown:
This is kind of an interesting question of, like, it's almost a different framing of, like, the data science hierarchy of needs. Like, maybe as a foundational layer, you wanna know that this column that was previously non null is suddenly 20% null. But then on top of that, there's your question, Benn, around, like, the semantics. Right? Is the metric actually defined well? Is it providing the sort of measurement and indicator that we want for the business? And that's where I'm very interested to see, like, the promise of the metrics layer. Obviously, with all the tools coming out in that space this past year, will that actually substitute for, like, a PM and a data scientist getting in a room and debating, like, hey, this dashboard is showing this thing and I don't think that this dashboard is the right way to measure it. Or I think there's something messed up in the pipeline that's not related to observability that has caused it to look the way it looks. That sort of, like, context and semantics and actual deep discussion is, I think, gonna be very hard to automate or replace with a tool. Like, you still ultimately need an exec to look at a dashboard and say, this is not right. And the only reason I can say this is not right is, like, they have the business knowledge of what right and wrong looks like, hopefully.
[00:40:43] Unknown:
So I don't know. I don't know about that semantic layer. Like, getting to that next level of the semantic layer of the hierarchy of needs might take way more than just tooling and may take a long time. Yeah. I have, like, the observability stuff, it makes me a little nervous, honestly. So this was an analogy that was gonna be too morbid for the piece tomorrow. But, like, there was a famous plane crash, the Air France, I think 447, that crashed, like, in the Atlantic Ocean. One of the reasons it crashed was because, like, their instruments broke such that they told them stuff that looked good. And so they were like, oh, we're flying at the right speed, but it was all wrong. And so there's, like, confidence that comes with seeing a dashboard that says all this green stuff, where that's the sort of piece that makes me nervous. It's like you tend to not investigate things. So it's like, oh, yeah. It all seems fine because this thing that's supposed to be checking it tells me it is. When in reality, if something's actually bad happening under the hood or in the engine or whatever, then, like, I don't know, that's how you start to really get off the rails, to me. But I don't know if we have a better solution for it now. Like, we're not, like, actually solving any problems around this today other than, as you said, Maura. It's, like, very much just like a people process.
[00:41:43] Unknown:
I think the problem is not basically to say whether it's green or red. And I think that one of the challenges of data observability is that tools don't have enough context to actually say whether this change in your distribution shape or this drop in your number of rows is actually good or bad, except for saying, maybe, like, this looks anomalous or suspicious, right, which is not getting you quite where you want in terms of being really confident in the dashboards. But I think if we frame observability as really understanding what's happening with your data when you develop the code, when you deploy it, when you ship it to production, just doing that enables you to have more confidence in your own work. Because you, at the time when you actually do this, have full context on what the business is probably supposed to look like or what the data is supposed to look like. It's just that the fact that you may not be fully aware of what the data is could be a problem.
[00:42:33] Unknown:
The proper analogy isn't, like, warning lights. The proper analogy is more like it's profiling during development. So that it makes the development process: if I did a thing, check it. Yeah. Okay. That thing looks good. I can be confident in it as I'm developing it. Like, it's not so much
[00:42:47] Unknown:
it's green or red. It's like, here's what it looks like. Did you expect that? And if, like, yeah, that seemed normal, then you can move on. Exactly. That's just sort of, like, how we approach it at Datafold, and I think we've been seen as somewhat contrarian at, like, playing in the trend of anomaly detection. And I think the reason why I decided to approach it that way is because I feel like when you are actively working on developing something, having full context and being able to make a decision whether this is right or wrong at this point in time is much better, because you potentially prevent things from breaking later, versus if you start from the other end, which is, well, let's see what are the things in production that can be, you know, looking wrong. This is sort of like defeating the purpose, because you're, one, detecting things that are already broken.
Two, you have to drop whatever you're doing and then go and debug these things. And it's a really, really high friction action for a data team, which is already busy. So I think, to me, observability means more understanding what's happening as you're actually working on data and being fully aware, rather than having a system that knows everything about whether your data is right or wrong.
[00:43:50] Unknown:
Using it more as a linting framework for your data, similar to how you would run a pre-commit check before you commit your code and push it to production.
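A minimal sketch of that pre-commit idea, assuming a hypothetical payments table with an amount column and a local SQLite copy of the warehouse; this illustrates the pattern, not Datafold's or any other vendor's actual implementation:

```python
# Illustrative pre-commit style data check (assumed table/column names):
# compare a modified dev table against its production counterpart and
# exit non-zero if summary statistics drift beyond a tolerance.
import sqlite3
import sys

TOLERANCE = 0.05  # assumed acceptable relative drift: 5%


def table_profile(conn, table):
    """Collect a few cheap summary statistics for a table."""
    rows, null_amounts, avg_amount = conn.execute(
        f"SELECT COUNT(*), SUM(amount IS NULL), AVG(amount) FROM {table}"
    ).fetchone()
    return {
        "row_count": rows,
        "null_amounts": null_amounts or 0,
        "avg_amount": avg_amount or 0.0,
    }


def relative_drift(old, new):
    if old == 0:
        return 0.0 if new == 0 else float("inf")
    return abs(new - old) / abs(old)


def check(conn, prod_table="payments", dev_table="payments_dev"):
    prod = table_profile(conn, prod_table)
    dev = table_profile(conn, dev_table)
    ok = True
    for metric in prod:
        if relative_drift(prod[metric], dev[metric]) > TOLERANCE:
            print(f"{metric}: prod={prod[metric]} dev={dev[metric]} drifted > {TOLERANCE:.0%}")
            ok = False
    return ok


if __name__ == "__main__":
    conn = sqlite3.connect("warehouse_copy.db")  # hypothetical local copy
    sys.exit(0 if check(conn) else 1)  # non-zero exit blocks the commit
```

Wired into a pre-commit hook, the non-zero exit stops the change before it ships, which matches the profiling-during-development framing above rather than alerting on production after the fact.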
[00:43:57] Unknown:
That does seem different than how other folks are approaching it. I hadn't thought of it that way. That's interesting.
[00:44:02] Unknown:
Yeah. And, I mean, to get us back to the million tables, billion rows example: if you throw anomaly detection on top of that, how useful is that really gonna be to you when you're probably checking hundreds of thousands of anomalies across datasets that wide? So instead, being able to have, again, confidence in your diffs, confidence in the changes you're making, rather than just, here are all the things going wrong in your data warehouse, which, if your data warehouse is big enough, is probably many, many things.
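To put rough numbers on that alert-volume problem (all of the figures below are assumptions for illustration, not numbers anyone cited), even an optimistic false-positive rate swamps a team at warehouse scale:

```python
# Back-of-the-envelope: alert noise from blanket anomaly detection.
# All inputs are assumed for illustration.
tables = 1_000_000           # warehouse size in the example above
checks_per_table = 5         # e.g. freshness, volume, nulls, schema, distribution
false_positive_rate = 0.001  # an optimistic 0.1% per check per day

spurious_alerts_per_day = tables * checks_per_table * false_positive_rate
print(f"~{spurious_alerts_per_day:,.0f} spurious alerts per day")  # ~5,000
```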
[00:44:28] Unknown:
Yep. Same problem you get when you're doing statistical measures on website traffic: if you say, alert me if 20% of my requests are errors, and you're only getting 5 requests per minute, then, you know, you're gonna get a lot of noise. And it's like, okay, so 1 person hit a 404, and now I'm getting woken up at 2 in the morning. So another interesting thing possibly to talk about is that 2021 has been a very anomalous year by any measure. You know, we're still in the middle of the pandemic, and everybody has varying perspectives on whether or not there is an end in sight. And 1 of the outcomes of this is that people in software particularly, but people in business in general, have become much more acclimated to the idea of remote work. And there have been a number of businesses that have been founded and grown entirely remotely in the midst of the pandemic. I'm just curious what folks' opinions are on the long term normalization of remote and distributed workforces versus the office centric environment that we have come from, and the impact that that has on us as data professionals and our ability to collaborate
[00:45:38] Unknown:
effectively with business users and understand the appropriate context of the companies and the organizations that we're working with. I think it's here to stay. I think this is how the world's gonna work. I think for most data people, it's a step in a bad direction. Not, like, culturally; that is what it is for different people and stuff like that. But I think the job is much harder to do remotely. It's not just the serendipitous office connections, the hallway conversations and the sort of things that people say the trend is about. It's the, let's sit down, let's talk about it, we don't understand what we're each saying, we need to take 5 minutes to figure out, as Maura was saying, wait, we see these 2 things that don't quite make sense, how are they different? That ends up happening over a 4 hour Slack conversation instead of a 5 minute conversation where it's just, let's poke at this together.
Given how
[00:46:26] Unknown:
collaborative, like, data work has to be, I think that's gonna be a big thing to try to figure out. I think there's also such an implication for data teams on learning and mentoring. The Patreon team is doing a hybrid approach, so we've actually spent some time now together back in the office, which has been amazing. And seeing data teammates connect, and more junior data scientists learning from more senior data scientists, being like, hey, take a look at this query. It feels like the bar to do that on Slack and Zoom is still so high. Like, you have to Slack the staff data scientist and say, hey, can you review my query? And so I do think we still have so many cultural processes to figure out as data teams, like, how do you make sure an environment of learning stays strong? And that's also true for stakeholder teams, like a new PM onboarding to all of Patreon's data. You know, I would sit down with them, probably over lunch, and walk them through all of our dashboards live and point out charts, and that just can't quite be replicated the same way remotely yet. So I think we're gonna see data teams start to adopt new processes to make that better, in addition to what Benn was talking about, those serendipitous conversations where you see someone using your chart in a presentation by walking by the room and you actually know that they're using your data artifact. Like, maybe that's where metadata comes in and becomes very useful, that you actually know people are consuming your dashboard rather than just sort of throwing it into the void. I agree. I think remote work is here to stay. I do openly think a lot of positives will come out of this. I think it's almost a forcing function
[00:47:44] Unknown:
to get data people to start thinking about things that have historically been neglected, like documentation
[00:47:49] Unknown:
and onboarding and stuff like that. So I'm already seeing kind of a lot of positives come out of this trend. The solution is just everybody buys whatever the shiniest new data catalog is so that they can have all of their business context in there and make sure that everybody's using it. Right?
[00:48:04] Unknown:
I mean, if I'm a hater of observability tools, I'm really a hater of data catalogs. But that's a different conversation.
[00:48:14] Unknown:
Benn, you're starting to come across as the grumpy old man of data. Starting to.
[00:48:22] Unknown:
1 other thing that I noticed, basically because we went all remote, is the integration of global, international data folks into a community that has historically been dominated by, probably, a few tech hubs in the US. That's been very noticeable and, I think, an awesome trend. What I've also observed is that there is much more transparency around data roles, even salaries. It's not uncommon these days to see, on certain, for now, niche job boards in data communities, jobs published with salary ranges, which, again, in the tech industry in Silicon Valley is pretty much unheard of, but I think it leads to a more transparent, more equal, more efficient market as well. And I think that the competition companies are now having for the workforce, especially in the data domain, given the shortage of qualified data engineers, analytics engineers, and data scientists, has now shifted a little bit from, you know, just the brand or the salary or being another fancy company in the Bay Area, to what is the impact, what is the work, what is the environment, what are the tools that you're gonna be working with. And I think that's great. It probably means that the importance of the role, and even how comfortable and convenient it is to work as a data person, changes for the better. And I think that became pretty stark in the last year and a half with the shift to remote, compared to even 3 or 5 years ago.
[00:49:52] Unknown:
Yeah. And to that point about hiring from different places, 1 thing I hope it also encourages, once you're sort of out of the tech hubs, is hiring people into data who aren't coming from engineering backgrounds, who have more of the, like, social science type of backgrounds. It's a great way to go hire people in Chicago who went to UChicago to study history. I would love to see more of that, and I think the remote stuff might open that up a little
[00:50:18] Unknown:
bit more. And I think another element of this year in the kind of job market is what everybody's calling the great resignation where everybody's realizing that maybe I actually don't care if I can get 5% more people to click on this widget. You know, maybe I care more about going to, you know, try and save the polar bear or what have you. So I'm interested to see what the near to medium term future looks like for data professionals working in more sort of socially or ecologically or environmentally conscious roles versus just how am I going to, you know, increase the bottom line of x company that's VC funded by 5%.
[00:50:58] Unknown:
Yeah. We definitely, I mean, for sure see that at Patreon. Like, a lot of candidates that I talk to are excited that they'd actually be working to help creators rather than just trying to get people to click ads. And, of course, you know, Patreon's not focused on the environment or some of those more, like, socially responsible areas. But I do think that is for sure a trend: people kind of looking up from their computers and realizing that they can be working on more important things that are gonna help the world.
[00:51:20] Unknown:
I have a question, which is, all of us probably do this, like, get kind of buried, especially in these conversations, in the data ecosystem, data tooling, all that kind of stuff. Like, okay, all the things that we're building, the assumption is that they serve some point at the end of it. Is the effort worth what that point currently is? Like, does it actually work yet? Great, we have all these systems to make sure data is reliable and that we're serving the dashboards to the right people and everything is cataloged well. The infrastructure is good. But what do we get on the other end? Like, what's the actual value of all of that? And is it worth the amount of effort that we're putting in? Particularly around the stuff you were mentioning, Maura. It feels like there's a lot of stuff where it became very much, we need these systems to work, and then we got very focused on the systems and started to forget what it's all for. Benn, are you advocating us all out of our jobs? Should we A/B test data teams and see what happens? I mean, I do think maybe there's some joke, Benn, that you're trying to make, like, maybe the real success is the PMs we helped along the way or something. But there is that question of, like, what does a business get when it has a data team, if it actually works?
[00:52:24] Unknown:
I'm kind of a cornier person than most. But for me, it's people actually feeling like they're having impact and knowing that their work is helping a business achieve its mission, and teaching people, too. Like, a lot of people at Patreon have, you know, gone through Mode's SQL School and learned how to write SQL and feel like they're more proficient and more technical and can do more things on their own. And that, to me at least, feels worth it. But maybe there's still that question of what it's all for.
[00:52:48] Unknown:
And so with that philosophical bent, I guess that brings us to the future predictions. As we're closing out this year, we've talked a lot about the things that we've seen. What are the things that we're predicting for the upcoming year, or the things that we would like to see, or things that we want to advocate people focus on and try to improve in the data ecosystem, or whatever sort of areas of effort we want to encourage?
[00:53:12] Unknown:
I think that 1 of the trends that we've started seeing, and probably will continue to see, is that there is a huge disparity between the shortage of data professionals and how hard it is to hire, where everyone's saying, you know, we can't hire the data team we need, we can't expand, and how much manual work there still is in the workflow, for different reasons, ranging from data testing and observability to putting together pipelines and prototyping those pipelines to communicating the results, and all the different small and large pieces of the workflow.
And I think that a plethora of vendors, including, you know, Mode and Datafold, are trying to solve that problem. But I would predict that an increasing number of data teams will start thinking about how to invest in productivity, how to invest in tooling and processes to make the team more effective, as opposed to just focusing on building for or hitting certain business KPIs. We've seen that big trend in software over the last decade, where lots of great tooling, lots of great thinking, and books and practices have evolved. And I think a similar thing will happen in data over the next year or 2.
[00:54:32] Unknown:
What does everybody need to stop doing?
[00:54:34] Unknown:
Need to stop founding new BI tool companies.
[00:54:37] Unknown:
Stop building dashboards.
[00:54:39] Unknown:
No need for new BI companies. Got enough of those. Just choose between the ones that are already out there. But only if they start with an m.
[00:54:52] Unknown:
I think we'll just continue to see more of the integration trend, to be honest. I think everyone is already on a path of acknowledging their place as a single tool within a larger ecosystem, and I think you're already starting to see a huge trend of integration technology because of that. I remember hearing a really funny quote, I forget who it was from, but at a certain point they described the modern data stack as, like, having 2 kids that play next to each other but not with each other. And I think that's exactly the place we're in right now, and the trend we're seeing is acknowledging that and trying to fix it. I personally am sort of curious about how the roles will evolve, like, the role of the data scientist as the analytics engineer becomes more prevalent. How will that change the teams?
[00:55:32] Unknown:
What will data engineering mean as, like, some of this tooling makes parts of data engineering way easier? And so I'm interested to see in 2022 whether data science and engineering teams become closer to engineering, or whether, because of the ease of the tools, they become even more fragmented and focused on the business. That's 1 of the things I've been thinking about for the next year. I agree with that. It feels like there's been a real acceleration in conversations about
[00:55:55] Unknown:
who we are and stuff like that, which I think is probably useful. I'm with you, David, on the questions about sort of bringing stuff together, the 2 kids playing next to each other but not with each other kind of piece. The 1 other thing I would add is it feels like the mega money corporation data companies are coming, where these things are just straight up trying to make cash and stop trying to be sort of community players, whether it's existing tools or new sort of vertically integrated products. There is so much money in this ecosystem that I think it's inevitable that the corporate dollars are gonna show up, and people are gonna stop trying to build sort of platforms and frameworks and the kind of kumbaya stuff and just try to build, like, the ruthless money making machine.
[00:56:45] Unknown:
Maybe a slightly more optimistic counterargument to Benn's point would be to say that, because of the data community, and because vendors are now much closer to users and potential users than before, when we would probably be interacting, you know, at trade shows and events, I think there will be a proliferation of more interesting ideas. I've already witnessed dozens of data practitioners start a side project, and that side project all of a sudden picks up with the community. People start using it, be that, you know, a new attempt at BI, or an improvement in how SQL is written, or something even bolder like semantic knowledge graphs for managing your events.
And I think coupled with the funding, we'll probably have more good things come out of this and not just money making
[00:57:35] Unknown:
corporations. So I'm quite optimistic there. I don't think that's actually a bad thing. Like, I don't think it's ruthless money making machines doing, like, financial engineering. I think it's just, we wanna build stuff that people find valuable and will pay money for, full stop. And I think that makes things great. People who build products like that, I buy them, and they make me happy. And do I feel great about the funding? Oh, maybe not. But, like, my life is better because of it, so that's okay. Alright. Well, I'm sure we could probably keep talking about all of these topics
[00:58:02] Unknown:
for hours on end. So I guess I'll give 1 more call to see: is there anything that we didn't touch on yet that we wanna, you know, briefly cover before we close out the show?
[00:58:11] Unknown:
dbt Labs is apparently worth $6 billion.
[00:58:16] Unknown:
I think HashiCorp is worth 14. Wow.
[00:58:20] Unknown:
Still some room to grow.
[00:58:22] Unknown:
Yep.
[00:58:24] Unknown:
I think Apple just hit $3 trillion, so I think that's the new ceiling.
[00:58:28] Unknown:
Wow. Apple's at 3 trillion? Weren't they at a trillion for the first time, like, a year ago? Things move fast, Benn. Okay. I'm not keeping up.
[00:58:39] Unknown:
Just as a final question to close this out, I'll go in sequence again, starting with you, Maura. From your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today? And maybe as a corollary, what do you hope
[00:58:54] Unknown:
to see happen with that gap in the upcoming year? I'm a somewhat biased audience here because I use both Datafold and Mode at Patreon. But I think the biggest gap goes back to what we talked about around semantic and contextual understanding. 1 of the biggest things that happened in 2021 is this discussion of lineage, seeing where your data is coming from and where it's flowing. But I still want to be able to put a big circle around that and say, okay, that whole line is payments. That's our payments data for the $100 million we process each month. And then be able to hand that to the new payments PM and say, here's everything you need to know about payments data at Patreon. So I still see that gap between the actual data and what the business cares about, and making that a little bit more visible. And to me, that's way beyond the catalog. It's, you know, the metrics and the growth and the context and the actual columns, all of that. So that's the biggest gap. Hoping we can solve it with the mix of those tools in 2022.
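As a toy sketch of that put-a-circle-around-payments idea (hypothetical asset names and a made-up domain tag, not any catalog's real API), one way to express it is to seed a lineage graph with a few explicitly labeled assets and then pull in everything upstream and downstream of them:

```python
# Toy example: group lineage into a "payments" business domain so it can
# be handed to a PM as one unit. All asset names here are hypothetical.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.stripe_charges", "stg_payments"),
    ("raw.paypal_txns", "stg_payments"),
    ("stg_payments", "fct_payments"),
    ("fct_payments", "payments_dashboard"),
    ("raw.users", "dim_users"),  # unrelated asset, stays outside the circle
])

# Assets someone has explicitly tagged as belonging to the payments domain.
domain_seeds = {"stg_payments"}

# "Draw the circle": the seeds plus everything upstream or downstream of them.
payments_domain = set(domain_seeds)
for seed in domain_seeds:
    payments_domain |= nx.ancestors(lineage, seed)
    payments_domain |= nx.descendants(lineage, seed)

print(sorted(payments_domain))
# ['fct_payments', 'payments_dashboard', 'raw.paypal_txns',
#  'raw.stripe_charges', 'stg_payments']
```

The hard part, as discussed above, is not the graph traversal but attaching the business meaning, the metrics, owners, and context, to that circle so it can be handed over as a unit.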
[00:59:46] Unknown:
Alright. Moving on to you, David. Yeah. I'm gonna sound a little bit like a broken record, but I think that the biggest gap we have right now is finding ways to manage and express the implicit dependencies between not only data tooling, but also data assets.
[00:59:59] Unknown:
Alright. And Benn? I mean, it's sort of an extension of those same things. To me, it's the experience of using the whole thing, and not even just as data people, but as the people who are consuming the final asset. To them, this whole thing is 1 product. That's what they care about, the 1 product. And we're sort of shipping the org chart of the stack in some ways. So I think it's, okay, how do we make that experience something where we can finally get past the, this number and this number don't match, what's going on, I spend all of my time bickering with somebody else about whose dashboard is right? We are basically gonna have a ceiling until we can solve that problem, and we haven't quite gotten there yet. We've built a lot of great tools underneath it, but we haven't made that experience better yet. I think we're now kinda getting there, because the stuff is there, but we still aren't using it in the next level of ways that we could. And, Gleb, why don't you close us out? Yeah. I'd actually
[01:00:51] Unknown:
100% agree with everyone, but I also wanted to double click on what Maura said around semantics. I think that 1 of the challenges we have in the data stack, and maybe 1 of the reasons why there's so much manual work, is that we're just pushing so much thinking and so much context into, you know, people's heads. And I think that the big advancements in productivity will probably come from data systems and data tools understanding much more context behind the data, be that dependencies between assets, be that semantic meaning, you know, what kind of data it is, understanding the shapes of data, and then helping people to make decisions much faster, be that what line of SQL to write, or whether this dashboard is actually okay or not, or what dataset to use for work.
And I think that observability, although Benn is not a fan, is actually a stepping stone towards that, because through observability, you have a system that actually learns about data, and then it helps people learn about data, and then you can build lots of applications on top. I would agree that the current observability platforms are still nascent, and we're still figuring out how to plug them into workflows and how to extract more value than noise from them. But I think that, long term, that work is really fundamental to actually providing a cohesive experience where people do creative stuff and not, you know, manual tasks and basic things.
[01:02:10] Unknown:
Alright. Well, thank you all for taking the time today to join me and share your thoughts, perspectives, and experiences from this past year, and your thoughts and hopes for the year to come. It's definitely been an interesting few years, so I appreciate all the time and energy that all of you are putting into helping make the data ecosystem as strong and enjoyable as it is. So thank you again for that. I hope you have a good rest of your day, a good end of your year, and happy holidays.
[01:02:36] Unknown:
Thanks, Tobias.
[01:02:43] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Meet the Guests
Getting Started in Data
Key Themes of 2021
Modern Data Stack
Vendor Ecosystem and Competition
Data Complexity and Metadata
DBT and SQL Challenges
Remote Work and Data Teams
Future Predictions