Summary
The data ecosystem has been building momentum for several years now. As a venture capital investor, Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more
- Your host is Tobias Macey and today I'm interviewing Matt Turck about his annual report on the Machine Learning, AI, & Data landscape and the insights around data infrastructure that he has gained in the process
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the MAD landscape report is and the story behind it?
- At a high level, what is your goal in the compilation and maintenance of your landscape document?
- What are your guidelines for what to include in the landscape?
- As the data landscape matures, how have you seen that influence the types of projects/companies that are founded?
- What are the product categories that were only viable when capital was plentiful and easy to obtain?
- What are the product categories that you think will be swallowed by adjacent concerns, and which are likely to consolidate to remain competitive?
- The rapid growth and proliferation of data tools helped establish the "Modern Data Stack" as a de-facto architectural paradigm. As we move into this phase of contraction, what are your predictions for how the "Modern Data Stack" will evolve?
- Is there a different architectural paradigm that you see as growing to take its place?
- How has your presentation and the types of information that you collate in the MAD landscape evolved since you first started it?
- What are the most interesting, innovative, or unexpected product and positioning approaches that you have seen while tracking data infrastructure as a VC and maintainer of the MAD landscape?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the MAD landscape over the years?
- What do you have planned for future iterations of the MAD landscape?
Contact Info
- Website
- @mattturck on Twitter
- MAD Landscape Comments Email
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- MAD Landscape
- FirstMark Capital
- Bayesian Learning
- AI Winter
- Databricks
- Cloud Native Landscape
- LUMA Scape
- Hadoop Ecosystem
- Modern Data Stack
- Reverse ETL
- Generative AI
- dbt
- Transform
- Snowflake IPO
- Dataiku
- Iceberg
- Hudi
- DuckDB
- Trino
- Y42
- Mozart Data
- Keboola
- MPP Database
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit [RudderStack.com/DEP](https://rudderstack.com/dep) to learn more
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real time with your own JavaScript or Python code. Join the RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a transformation to the open source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more.
Your host is Tobias Macey, and today I'm interviewing Matt Turck about his annual report on the machine learning, AI, and data landscape and the insights around data infrastructure that he has gained in the process. So, Matt, can you start by introducing yourself?
[00:01:08] Unknown:
Yeah. Absolutely. Thanks for having me. Long-time listener, first-time caller, big fan of the show. I'm a venture capital investor. I'm a partner at FirstMark, which is an early-stage venture capital firm based in New York. I've been a big fan of, and a very active investor in, the general space of data infrastructure, machine learning, and AI, from the infra layer all the way to applications.
[00:01:28] Unknown:
And do you remember how you first got involved in the overall area of data and data management?
[00:01:33] Unknown:
Yeah. I started my career in technology as an entrepreneur. I was the cofounder of a company called TripleHop Technologies that did enterprise search and knowledge management. I like to joke that today it would be a really hot AI company because, ultimately, it was all about unstructured data management, and we used a lot of Bayesian techniques at the time. But I remember that we had to work really hard to convince my now peers in venture capital that AI was really a thing. In particular, we had a CTO, Joaquin Delgado, who had a PhD in AI.
And after a few pitches, we learned that we needed to downplay the fact that his PhD was in AI and instead position him as having a PhD in computer science, just because VCs were looking at us like, you know, kid, have you not gotten the memo? AI is dead. Obviously, that was pre deep learning and in the middle of an AI winter. So, different times.
[00:02:39] Unknown:
And so now that brings us to today, where you have been compiling and publishing this MAD landscape report for the past few years. And I'm wondering if you can just start by giving an overview of what that even is and some of the story behind how it got started and why you thought that it was a useful exercise.
[00:02:56] Unknown:
I've been doing this for 11 years now. And, yeah, look, the general idea is, quite frankly, that originally it's for my own benefit. It's an exercise that I feel I need to do to be able to keep track of what's going on in the space. So the key goal is really doing the work. I personally think that the era of the generalist VC is long gone, to the extent that it ever existed. To be a good investor and a good board member, you need to have deep expertise in the space, or at least spend a considerable amount of time in the space. So that's one of the tools I use to do that. It's really the forcing function of publishing that landscape every year that makes me keep on top of the companies and the trends, and that's really all there is to it. Then, you know, there are 2 different approaches. I could decide to keep that as a sort of proprietary knowledge, or I could decide to sort of open source it. And the approach has always been to do the latter because I think it's just more fun, and I just get more out of it by doing that. It has started a never-ending number of conversations with the community, and I just find it both useful and immensely enjoyable.
[00:04:20] Unknown:
But, you know, if you publish all of your hard work as a VC, then aren't all of your competitors just going to snipe all of your potential investments?
[00:04:35] Unknown:
Look, it's even worse than that. Actually, I don't know if that's a podcast conversation, but it's happened more than once over the years. You know, there's the landscape, but I also write this long kind of state of the union blog post. And over the years, more than once, I've had some VC colleagues, in meetings with me, one on one, say something in a certain way with the sentence, you know, organized in a certain way. And I was like, I literally wrote that, and you're playing it back to me during this meeting without realizing that you are, which is absolutely priceless. Yeah. Look, the way I think about it is, again, like open source.
Yes, there are downsides to open sourcing the analysis, but, you know, eventually, the information is all out there and people can compile it. And, you know, I don't write everything that's going through my mind as I do this. I write mostly what's out there in terms of facts and in terms of analysis, and it's more of a compilation exercise and an analysis of market trends, as opposed to where I think the world is going or where the most interesting opportunities are. So I try to strike a balance. But ultimately, I try to make this a free resource to the community for, you know, the bulk of it.
[00:06:01] Unknown:
And in terms of that decision of what is actually useful to include in the landscape, what are the things that are in my best interest to include in the landscape versus leave out? I'm just wondering what your initial approach was for figuring out what are the actual pieces of data that are relevant to the broader community, you know, what are the questions that people are going to be asking about it, and how can I help to answer those questions based on the information that I include, and just kind of how that process has evolved from your very first edition to where you are today, where you've been doing it for, I think you said, 11 years?
[00:06:38] Unknown:
Yeah. Look, I try to be broadly inclusive. I think the difference this year is that we've had a much more opinionated approach to selecting which companies get into the landscape or not. In prior years, we tended to give a particular priority to companies that were a bit later stage. Either they had higher revenue levels or they had raised more money or that kind of thing. This year, we decided to also include a bunch of companies that we found very interesting. That's particularly because we wanted to give a good amount of real estate on the landscape and in the writing to generative AI companies. And, literally, most of those companies did not exist 6 months ago. So if we said we're only going to include Series B or later companies, then all the generative AI companies would not be in there. So that's one of the ways we're opinionated. But look, at some point, we have 1,416 companies on the landscape. We could easily fit another 1,000 or 2,000 on it. So we have to make decisions.
And, you know, by the way, we miss some companies, we get it wrong all the time, but the community is sure to let us know, and that's one of the ways it's been really interesting. We have learned a lot from it. And also, 11 years ago, the overall space was much smaller, so you could fit it all in one graphic, which I'm sure is why you had that initial ambition of, hey, let's put it all in one place. If only I had known. Yeah. I imagine you're regretting that now. I'm gonna need to get one of those giant rolls of butcher paper. Yes. Exactly. Making a mental note next time to pick a space that doesn't expand.
[00:08:21] Unknown:
Maybe COBOL programming. There you go. And to your point of, you know, we publish it, we get it wrong, people give us feedback: I'm wondering what you see as the potential for being able to make this a properly open source activity, where here's the code that we use to generate the visual, here's kind of the set of fields that you need to fill in, and, you know, let's just update it piece by piece by pull request so that people can add and remove things as they see fit. Like, what are some of the potential risks that you see in that approach?
[00:08:59] Unknown:
Yeah. That's an interesting thought, and that has crossed our minds a couple of times over the years. I think, you know, one of the key editorial decisions that wouldn't work there, or would be much harder, is that a lot of companies, and quite often rightly so, think that they should be in many different categories at the same time. In some cases, it's probably true. If you look at the cloud hyperscalers, if you look at Databricks: Databricks at this stage could probably have a logo in most of the boxes, especially on the left side of the landscape. However, it gets a little more complicated when you have a seed-stage company that says, hey, you know, we should really be in 27 subcategories.
[00:09:36] Unknown:
Absolutely. And whenever I look at the MAD landscape, it also puts me in mind of the CNCF landscape that they've built, and I'm wondering, kind of, did they copy you? Did you copy them? Is this just kind of a general paradigm of how to organize information? Like, I'm wondering what you see as the interplay between these types of visualizations of an overall ecosystem.
[00:09:54] Unknown:
Yeah. It all started, I think, with the LUMAscape back in the day, which was a martech landscape started by, I think, LUMA, the investment bank. I may forget the exact details, but that was the granddaddy, or the OG, of all those market maps, as far as I know. Where we did get inspired by the CNCF landscape this year, in part, is that for the first time we launched an interactive version of the landscape. If you go to mad.firstmarkcap.com, you'll see the interactive version. So, you know, I like to joke that it's a big innovation because, apparently, this World Wide Web thing is major, and you can actually get on this thing they call the web, a website where you can click on the logos and have an interactive experience. So we're very proud to have done this this year. But it's actually been very helpful.
And, you know, you can also bring up a card view, and as you click, you get data that was provided by our friends at CB Insights. So it's good. Now the MAD landscape is really a combination of the PDF landscape, the interactive version, and the write-up, the kind of state of the union write-up that we produce around it.
[00:11:16] Unknown:
And as far as the overall landscape and the ecosystem that we're operating in and trying to catalog, as these snapshots in time as you do this every year or roughly every year, I'm wondering, over that period, how you have seen the influence on the types of projects and companies that are founded as we go from this early stage of big data, where everything is Hadoop, to where we are now, where everything is the modern data stack, and kind of the different splinters that break off as we go along that journey.
[00:11:47] Unknown:
Yeah. Absolutely. So indeed, the big initial sort of burst of energy into that ecosystem was really Hadoop and all the related technologies. And, actually, if you fast forward to this year, 2023, for the first time we actually killed the Hadoop box. We had kept it on there because the Hadoop footprint is actually much wider and stronger than one would suspect. So we kept it on there up until now, but now we've merged the vendors and the companies into the data lakes and data lakehouse box, and that was a little bit of the end of an era. And, you know, separately, indeed, the modern data stack was the big next phase, and, as we all know, the creation of this whole ecosystem. So not just the cloud data warehouses, but all the tools, you know, before and after.
So in the last landscape, we had the emergence of brand-new categories like reverse ETL, you know, on the left side of the data warehouse. And on the right side of the data warehouse, we had metric stores. So there were, like, all those new boxes that appeared. But, yeah, the landscape has very much followed this. You know, the poor parent of the landscape for many years was the right side, which is applications. So the way to think through the landscape is that, roughly, the left is data infrastructure. That's where stuff is stored and computed and processed. Then the middle is data analytics, so data leaves and gets analyzed. And then the right side is where data gets used, so those are data applications. And for a very long time, the action was on the left side.
And it really feels that this year, in part with generative AI, a lot of the action has truly and sort of finally started moving in earnest to the application side of the house, where it's become more apparent than ever how you use all those technologies. So, you know, think again of data moving from the warehouse, or wherever it's stored, to BI on one side of the fork, and then, on the other side of the fork, machine learning and AI, and then a lot of ML and AI related applications.
[00:14:15] Unknown:
Another interesting challenge of this ecosystem and this problem that you have created for yourself is that question of categorization, where, as you mentioned, Databricks could have a logo in every single box, basically. And also there's the question of how you define what the categories are and what their boundaries are, because a lot of these different tools and products, you know, maybe started with a very narrow vision but have grown into encompassing some of the adjacent concerns. New capabilities and new categories have arisen. You know, maybe sometimes they're justified. Maybe sometimes they're a flash in the pan. And I'm wondering what you see as a useful exercise to figure out what are the useful categories to break these things into, particularly as, you know, it has gone from a very linear flow, where data starts in one place and ends up in the business intelligence dashboard, to where we are now, where it's become cyclical through things like reverse ETL and AI being used in data infrastructure, and it's a tangled web more than it is a linear flow, if it ever was one.
[00:15:21] Unknown:
Yeah. Absolutely. That's where you have to be somewhat opinionated and make calls. And look, we certainly don't pretend otherwise in this landscape. Again, it's an exercise rather than a definitive statement on how things are and, in many ways, a way of starting and generating conversations. But, yeah, look, you have to be willing to explore and you have to be agile on your feet. Sometimes you add a category and sometimes you kill a category. So, for example, we had the metric store that was added in the last landscape, and we decided to remove it in this landscape.
And, you know, what's interesting about this example is that the need for a metric store is very clear, and that's certainly very important. However, as a separate box, it felt less justified because, you know, dbt launched their own metric store, and then they acquired Transform. And then you had another company in the space called Supergrain that pivoted. And then you had another company in the space, in which I'm a proud investor, called Trace, that added a whole application layer on top of the metric store. And, yes, there are other companies that position as metric stores, but you end up with, like, one or two. And by the way, those companies do other things as well. So it sort of felt like killing the category as a separate box.
Again, fully acknowledging that it's an important functionality, killing the category as a separate box was the right thing to do.
[00:16:55] Unknown:
Another interesting element of this problem is that, you know, there are some problems that are solved by companies that have their own set of founders and their opinions. There are some categories that are largely dominated by open source projects that don't necessarily have a strong kind of corporate owner, though many of them do. And I'm wondering what you have seen as the overall impact of investment and venture capital on the evolution of the data landscape, and which problems are focused on and paid attention to, and, you know, which of the problems maybe deserve attention but aren't being funded for whatever reason, whether it's, you know, societal, economic, pick your reason, particularly in the situation of the past few years where capital was very cheap and plentiful and easy to come by, and some of the ways that that has impacted the way that this landscape has grown.
[00:17:56] Unknown:
Yeah. Look, I think we are coming out of a pretty frenetic period of time in the data infrastructure landscape, which was in particular accelerated by the Snowflake IPO, which to this day is the most successful and biggest software IPO ever. I think that had pros and cons. The pro was that, for a long time, we were in that let-a-thousand-flowers-bloom kind of mode, where if you were a technical founder with, you know, enough experience in cloud and a vision, you could get money, which is great. So people were able to just start companies left and right and experiment, and that was very exciting in many ways. There were, you know, plenty of interesting companies that were started all of a sudden.
The obvious drawback of this is that we ended up with categories that were overcrowded overnight. And, you know, overcrowded by companies that were all started at the same time. And, you know, everyone was in the mode of: okay, this is a real problem. This category needs to exist. There are maybe another one or two companies in the space, but they're all early. Therefore, if we start or fund another company, we'll have just as good a chance as anybody else. And, yeah, as a result, you know, now the music has stopped in terms of financing, and everybody's looking around and trying to figure out which of those companies are going to be able to survive. You know, we're certainly now in a phase where the market is not gonna be able to sustain all those companies, which again tend to be what I would call single-feature companies, through no fault of their own. It's just the nature of a young startup that you start with something. You start with a product that typically looks like a feature, and that's how you should be doing it.
But you need more time on earth to be able to turn that feature into a product, and a product that's truly enterprise ready and adopted and deployed in production by many customers. A lot of those companies were started in 2020 or 2021, or 2019 or 2018. But, at the end of the day, they are 2, 3, 4 year old companies that are below, say, 5 million in ARR, and in a bunch of different categories. And that's very much an uncomfortable place to be for those companies. So there's an oversupply of companies and an oversupply of products, and that is in a context where the buyers of technology, the customers, are under very clear pressure from their CFOs and CEOs to cut costs or to keep costs under control.
It's also a situation where the VCs are under their own pressure to focus on their portfolio and be very discerning in the bets that they make going forward. And, finally, a situation where the potential acquirers of those companies are going to very soon find themselves in a situation where they could pick any of 10 companies as the potential target that they will eventually acquire, which obviously will have a deflationary impact on the price of any of those potential acquisitions. So it's, you know, all of a sudden, a little bit of a tough situation.
Look, some companies will navigate this very well through a combination of skill, having money in the bank, and a bit of luck, and will emerge on the other side of this stronger, leaner, fitter. And, you know, I think that will work out great for the entrepreneurs and the VCs. But I think for everybody else, it's gonna be more challenging.
[00:22:33] Unknown:
Another interesting aspect of where we find ourselves today is the combination of capital not being distributed as freely as it has been the past few years, and it seems like a lot of the focus and hype has reoriented from the data infrastructure level to these AI and ML focused product categories, particularly with things like generative AI and the advent of transformer models and the general capabilities that are being built up there. And I'm wondering what you see as the interplay of the state of data infrastructure as a broad category of problems and the level of maturity that we've reached there, and the broad attention that's being paid, both in terms of enterprise and venture capital and technological investment, as well as the flashy headlines that are coming out with things like ChatGPT, etcetera, and how that's going to influence where you see new companies being founded.
[00:23:38] Unknown:
Yeah. So for sure, you know, the VC train has moved away from data infrastructure into generative AI. From a data infrastructure founder's perspective, it's both a curse and a blessing, I would argue. It's certainly a curse because it's gonna be tougher to raise that next round, especially if you're in a situation where you raised the prior round at a valuation that was way ahead of the reality of the business. At the same time, the blessing part, I think, is that you can now focus truly on the business, the product, and the customers with the relative comfort of knowing that you're not gonna wake up tomorrow morning to, you know, 3 announcements from new companies that were just founded in your space, or a competitor that decided overnight to get into your space after they raised yet another big round.
So that is helpful, I think, and the market is going to thin out. So if you are one of those companies that sticks around, that is, you know, in this kind of survival mode or sort of fitness mode, where you are truly efficient and really building product and selling to customers and making customers happy and all the things, then in many ways, if you're not completely over your skis in terms of valuation, you're in a challenging but pretty decent position. And, again, I think some companies will emerge from all of this as the leaders in their category, and we'll find that what's currently happening is the best thing that could possibly happen to them, as opposed to just getting more money for free from VCs and getting distracted and having more competitors and more noise.
So that's data infrastructure. If we look at the world of generative AI, and I don't know if that's a separate podcast or not, my concern is that the situation in data infrastructure that we are now in the process of untangling is forming again in generative AI. And, look, it's ultimately nobody's fault. You know, it's easy to blame the VCs or the press or Twitter or what have you. It's the logic of the capitalistic system that VCs and founders, everybody in technology, are looking for those disruptive moments when something very meaningful is happening.
And, clearly, that's the case in generative AI. So, clearly, that's gonna attract a lot of attention. It happens to be in the particular context of an otherwise very dire market. So generative AI is not just a major inflection point and possibly the next big platform of the future, but it's also the one bright spot in this, you know, challenging economy and certainly challenging tech world. So, yes, people are rushing into it. The net effect of this is that all the stuff that happened in data infrastructure is happening now. So you have a bunch of companies that probably should not be started.
You have a bunch of, you know, very technically strong machine learning engineers that are starting new ventures who, in an ideal world, would be excellent founding team members. But because they can, and because they are courted by VCs who tell them, hey, if you wanna start something, you know, here's 5 million or 10 million or whatever the amount is, those people are starting companies. And that's gonna take a couple of years to play through the ecosystem. You know, those companies are going to start. They are going to try and build a product that may or may not get to product-market fit, especially given all the noise. At some point, they will come to the conclusion that they will probably need to be part of other organizations. But, you know, when I say 2 years, it's probably more like 3, 4, or 5. So, you know, it is what it is.
It's the logic of the system to go through those booms and busts. I think, directionally, it's produced great companies. But, you know, for somebody like me, I've been at FirstMark for 10 years now. Just during those 10 years, that's my 3rd hype cycle in AI. The first one was up to 2012. The second one was sometime around 2014, 2015. It feels like another one now. And, you know, those things typically don't end as well as one would think. So there's a little bit of that feeling of, okay, here we go again. But, you know, what can you do? It is exciting, and, you know, I'm excited for the founders who start businesses in the field. I'm making investments in the field, and you have to play the moment.
Just, you know, one just needs to be ready for the inevitable backlash that will happen at one point or another. So, more than ever, build AI businesses for the sake of building a business and making customers happy, and not because, you know, you can do cool things with generative AI.
[00:29:22] Unknown:
And the other interesting thread between these 2 moments is that you can't build AI if you don't have solid data infrastructure and data engineering. And I'm wondering, from your conversations with people on both sides of that, how much you see people understanding that dependency chain, and if you have seen any concerning elements of people jumping straight into AI saying, well, all that stuff's done. I don't have to worry about that. I just throw data at it, and it's good.
[00:29:54] Unknown:
Yes. I've certainly heard that, and I've heard the flip side of, you know, what's really hard is the data engineering stuff. Like, the AI stuff is done. That's easy. That's just what you add on top. So, you know, it's a really good question because that's a question we've been asking ourselves every single year when we do this MAD landscape: whether we should keep everything on one chart, especially as the real estate has become more and more expensive, because there's only so many little logos we can fit on one page.
Whether you know, the question has been whether we should do 2 landscapes, 1 for machine learning AI and 1 for data infrastructure. We've decided to keep everything on 1 precisely because of that symbiotic relationship. But, that that's 1 of the reasons why, I am very excited about AI. There's certainly the generative AI part, for sure. But almost separately from generative AI, I think we also are at a phase of the cycle where a lot of companies are much closer to having their data house in order than ever. And, indeed, having your data house in order is the absolute requirement before you can do anything meaningful, with machine learning and AI.
So I'm excited to be at that phase of the cycle. And, of course, that's the result of the modern data stack, which we talked about. But the the the the rise of the data warehouses and the data lighthouses for the first time, has brought us to the level of maturity where, enterprise AI becomes truly a possibility at scale and in the ubiquitous manner. So, look, not to be the VC that, talks about these companies all the time, but, have been, a very, proud, investor and board member at Dataiku, which has now emerged as the leading enterprise AI data platform, and, the acceleration over the last few years of that company as a bellwether of the broader industry has been really interesting to to to watch. And, they have a close relationship with both Snowflake and now Databricks. And you can almost see it mechanically.
The companies that have deployed Snowflake at scale or Databricks at scale or any, you know, comparable situation have been turning their attention to enterprise AI, a part of which is generative AI, but most of which is actually not generated AI. Most of of which is, you know, fraud detection and churn prediction and, you know, supply chain management, inventory management, all those use cases that actually don't require, GPT or NLM or or or what have you. So, it's it's we're really at that phase now where, enterprise AI is going from being the the poor parent of BI, to being, 1st class citizen in the enterprise.
[00:33:12] Unknown:
In this phase of kind of adjustment or contraction or however you want to phrase it, I'm wondering if there are any general product categories that you see as being particularly ripe for consolidation or being swallowed up by adjacent concerns or adjacent problem spaces? And, if there are any kind of niche product areas that maybe look like they're an opportunity to be subsumed by other products or other companies that you see as actually likely to remain competitive as we move into this uncertain future?
[00:33:48] Unknown:
Yeah. There's plenty. And, by all means, a little bit like the metrics store I was describing earlier, I don't mean to say that any of those categories are not very important categories and that important companies will not emerge from those categories. Not at all. But having said that, there are certain categories which are clearly ripe for evolution, consolidation, all the things. One of them, I think, is the world of data observability. And to some extent, that's already happened. As you remember, we used to have different categories or subcategories: there was, like, a whole data lineage world. And then there was data quality. And within data quality, there's declarative data quality, and there's this, like, machine learning driven data quality, and then there's data observability, which covers some part of this or all of this at the same time. And we've already seen data lineage sort of disappear as a subcategory.
To me, all of this is more or less the same thing. And look, if you speak to the companies, as you do all the time, everybody is going in the same direction, which is to be the Datadog of data, which, again, is 100% a beautiful prize if you get it, and a very ripe opportunity to build important companies. But all of these are going to need to work together. And arguably, you could say that orchestration is a part of that whole discussion as well. Because if you're a customer, ultimately, what you want is not a data lineage vendor, a data quality vendor, a data observability vendor, and an orchestration vendor or open source project that you use.
Ultimately, what you want is for your data to be of a high quality. If there's an issue, you wanna hear about it quickly. You wanna know where the data issue comes from, and then you wanna be able to fix it. And all the things should kind of work together. And what I described is a combination, again, of, like, data quality, data lineage and, you know, orchestration. So I think this whole world needs to evolve towards simplification.
As painful as consolidation might be from a vendor perspective (so founders, VCs, and startups), from a customer perspective, consolidation is going to be generally very helpful and, I think, very welcome, because that's less technologies that you need to become fluent with as a user. That's less contracts that you need to manage. That's definitely cheaper, because you're not in a situation where every vendor needs to increase their revenue and increase their margin in order to get to the next round of financing. So I think the customers are gonna end up, you know, being the beneficiaries of a lot of this. So that's one example: data observability, data quality. Again, not to pick on them.
Another category for me clearly is MLOps. So I don't know if that falls into data infrastructure or AI and machine learning. That's another category where you've had dozens of companies founded over the last few years, and some of them are closed source, some of them are open source, and some of them started doing data model management, data model governance, or, you know, AI fairness or AI transparency, and everybody's coming from a different angle. But fast forward to today, especially in a context where VC financing is less abundant, everyone is realizing that, okay, well, you know, I have this product now, and I've been able to get to 2, 3, 4, 5, 10 million in revenue. But to grow into my valuation and win those categories, I'm going to need to not just be the best company in AI fairness, but also I'm going to need to expand and effectively become an MLOps platform, which, again, is a feature, not a bug. That's how you wanna grow as a startup. I just think this is going to get accelerated by the overall pressure on the category.
So this is already starting to happen, and you see companies starting to do lots of different things and evolving towards a platform. But the market is not going to be able to sustain having 30 different MLOps platforms. So something is going to happen to that category, for sure. But again, as always, some companies will do great, just not everyone.
[00:38:46] Unknown:
Another interesting element of our current point in time and space is the fact that because of things like the Snowflake IPO, because of the general kind of evolution from Hadoop of this is operationally very heavyweight and difficult to manage, but we want to be able to get this power of being able to compute across massive data to we can do this in a cloud data warehouse, but it doesn't solve everything. You can't use SQL for all of your business logic. You know, hence, we have the modern data stack as kind of the de facto architectural paradigm. I'm wondering if you foresee any impact on kind of what that de facto architecture looks like as we move into this phase of contraction, as we have explored a lot of the potential space and tried figuring out kind of what is the proper balance of cost versus compute and scalability versus storage space, etcetera.
[00:39:40] Unknown:
Yeah. In this last 18 months or so, I've seen the core principle of big data being challenged for the first time. And by core principle of big data, I mean this general idea that you should collect and store all of your data and, quite frankly, occasionally, just figure out what to do with it later, which was, you know, the whole Hadoop era. Like, stop throwing out your data, just dump it into this big bucket, and then magic will ensue. So, as we all know, it took a little bit of time for magic to actually ensue after that. But, you know, that big data logic very much translated or carried over to the world of data warehouses.
If you think of Snowflake, ultimately, the beauty of it is that it's this infinitely elastic warehouse in the cloud, and for data lakes or data lakehouses, it's, like, the same central idea. Of course, storage is one thing. Compute is another thing. As it turns out, if you dump a lot of data into those repositories and you try to compute all of it, or even a significant portion of it, at scale and repeatedly, it's going to cost a lot of money. We are now clearly in a world where it's not okay to spend a lot of money, except if you have a very clear ROI for it. The CFO of each customer is now breathing down the neck of the data teams and data engineering teams, so it's a different paradigm.
So I'm seeing an evolution of the conversation towards, do we need all this data? If we do need all this data, what is it for? And if we have a clear business objective for all of this data, then are there cheaper, faster, easier ways of processing it? At the high end, I'm seeing the beginnings of something new coming up. So when I say new, look, it's stuff that has been percolating over a long period of time that seems to be accelerating, and a kind of architecture I'm seeing discussed a lot these days is, okay, well, let's do S3 for storage, because that's not very expensive.
And then, to add a little bit of structure, let's add stuff like, you know, Iceberg or Hudi. And then for the bit of OLAP, you know, DuckDB is sort of emerging kind of out of nowhere as, like, everybody's favorite solution to at least talk about. And then for, you know, query, Trino. So, you know, possibly different tools, not necessarily those ones, but that's emerging as something that's a little different from your sort of centralized, let's-dump-all-the-data-in-Snowflake kind of architecture. So I'm seeing that at the upper end of the market, and the reason for that is that you need smart data engineers to be able to figure that out and connect all those things, stitch all those solutions together, and figure out how that works.
At the lower end of the market, which, by the way, is sort of, you know, 90% of companies, I'm seeing the acceleration, it seems, of the fully managed data platforms, which, you know, felt like arguably a weird idea 2 or 3 years ago and now seems pretty interesting and logical. So, you know, I don't know how big a market category that is, but I'm certainly seeing and hearing a lot. So by that, I mean the, you know, Y42s and Mozart Datas of the world, or Keboola, which is, you know, a different approach, but, like, with the same end result. The former, that's really this idea of, like, abstracting away the modern data stack and using all the usual suspect vendors, but sort of, like, stitching them together and offering the customer just one contract and just one relationship.
And, you know, ultimately, I assume, with the general goal of being able to then turn around to the underlying vendors and negotiate better prices. So I think that's interesting, and I'm seeing, you know, people finding that interesting in a context where people would want more simplicity, more convergence, and just don't have as much of a budget to hire a bunch of data engineers to do all the stitching together.
[00:44:31] Unknown:
As the landscape has evolved, as the overall interest in this space has evolved, as you have gained a greater appreciation for the kind of nuance and detail that's necessary to be able to differentiate between some of these different vendors, the different product categories, you know, why do we even care about this thing, how has that affected the way that you think about the presentation of that information in your collection for the MAD landscape and the ways that you think about communicating around those vendors and product categories and tools?
[00:45:06] Unknown:
I've tried to strike the right balance between evolving, but also keeping the broad architecture of the landscape consistent over the years to make it easier for people to sort of compare. And actually, it's fascinating. There's, like, a whole group of, you know, people in the community that take those landscapes and compare the images and all the things. So, like, I've, on the whole, tried to keep it reasonably consistent.
[00:45:33] Unknown:
In the process of building this landscape, investing in this space, working with some of these companies to help them understand kind of what are the competitive opportunities, what are the ways to think about positioning your tools and products, what are the problems that need to be solved, I'm wondering if there are any particularly interesting or innovative or unexpected entrants into the market that you have seen, whether as far as the way that they're attacking a particular problem space or the ways that they're thinking about trying to make themselves valuable to the broader ecosystem or indispensable in a particular way?
[00:46:08] Unknown:
I don't know if it's necessarily innovative or different, but what I've seen again and again work is this general approach of starting as a tool, and sometimes arguably even a toy, and evolving bit by bit into a product and then over time a platform. And what I've seen conversely, you know, often, is companies that try to be more of a platform early on, and that's challenging. So that's one thing: this, you know, starting with a game plan in mind, with something that may feel like a little thing at first, but, like, over time evolves as you get product market fit around that product. And that's true in general, but that's certainly been true in data infrastructure.
The second thing that comes to mind is there is a series of companies that have actually won their respective markets by doing something that's counterintuitive for a lot of deeply technical founders, which is building for the many and focusing on democratization and collaboration, as opposed to trying to build the most bells and whistles for the most technical users. Because, as it turns out, pretty often, the number of very technical users that will appreciate the fine nuances of all the features that you build for them can turn out to be pretty small in an enterprise. So, you know, the perfect example is, like, this whole generation of platforms that were built purely for data scientists.
That's, you know, great for initial product market fit. But as it turns out, to this day, it's very hard to hire data scientists. There's just not that many of them around. And then they, to this day, don't always have the biggest budget. Versus an approach of saying, hey, you know, data in the enterprise, it's not about the most technical users. It's actually a combination of tools and processes and humans. And it takes a village, and different people are going to be involved. Therefore, we're going to build tools that are approachable by many folks. And, you know, sometimes it's a combination of being very technical, but also no code.
And then we're going to empower people around the organization. And the two things I just mentioned, this evolution from tool to platform, and collaboration and democratization, those can be very related. You can start with a very technical tool and evolve towards a broadly democratized platform. But those are two of the strategies that I've seen succeed over the years.
[00:49:18] Unknown:
And in your own experience of working in this space and trying to kind of gain perspective and understanding of the problems that are being addressed and how to solve them in useful and economical ways. What are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:49:39] Unknown:
In the world of VC over the last few years, for people that focus on data infrastructure, the typical heuristic has been, hey, let's find founders who have built a platform within big tech company x, y, or z. So Uber, LinkedIn, Lyft, you know, and several others being the usual suspects. Let's take that product, open source it if it's not already, and turn it into a brand new company that we, VCs, will fund. What I've learned over the years is that that approach works, but doesn't work as systematically as the rapid funding around that model would lead one to believe.
I think it turns out that a lot of the problems that you experience at an Airbnb or LinkedIn or any of those companies are years ahead of where the market is. And, yes, you'll have the most thoughtful approach around the most, you know, vexing problems of getting to massive scale. But in terms of where the bulk of the market is, you're going to be just too disconnected. So that's one lesson, I think. So I'm not saying those deals are not great or those companies are not great. I'm just saying it's a much less systematically successful heuristic than one would believe. So that's one lesson.
Another lesson is that you can be right on the analysis, but you can be wrong on the outcome. Meaning that, time and again, it's very hard to predict where the market is going to be, and you can have the smartest founders building the most interesting product. Ultimately, this is an exercise in company creation, not in building the best product, even though you hope that one will follow the other. And that's why you sort of end up back on, you know, the most obvious cliche of venture capital, which is to invest in the best people you can find. As it turns out, the best people are not necessarily just the most technically strong people you can find, but people who are truly starting companies because they want to focus on making customers happy and truly enjoy this interaction with customers. And I think the VC market being as frothy as it was for a couple of years has led to the creation of a bunch of companies that people started because they could and because the technology was really great, but ultimately, you know, in part encouraged by the whole, like, PLG, bottoms-up kind of way of going to market, where you don't really need to talk to customers. You don't really need to get on that awkward sales call, and you hope that people just, like, show up magically through your self-service and all the things. I think that whole combination of the frothy market plus the PLG motion has led to the creation of a bunch of companies where people don't truly enjoy working hand in hand with customers.
So you can be wrong on the analysis and right on the outcome, or vice versa. Therefore, pick the best people. Therefore, lesson learned around the best people: not necessarily the most technically astute folks, but folks who are both technically very strong and truly, in their heart of hearts, enjoy working hand in hand with customers solving business problems.
[00:53:54] Unknown:
And as you continue to work in this space and invest in these categories, what do you have planned for the future of the MAD landscape? Either ways that you want to think about updating its presentation or content, or ways to make it a long term sustainable activity, either for yourself or putting it into the hands of kind of the broader ecosystem? Just kind of wondering what you have planned as you look forward.
[00:54:21] Unknown:
Yes. I don't even really wanna think about the MAD 2024, considering I'm just exiting the MAD 2023 period, which was an effort and a half. But look, as stated above, I very much want this to be a conversation. We are already going to make a second version of the MAD 2023 landscape based on all the feedback we got. So we created an email address for comments, thoughts, and suggestions, which is mad2023@firstmarkcap.com. We got hundreds of emails. We've been parsing through those. We're going to create, you know, a second version, as I said, of the landscape, trying to capture most of those comments, probably not all of them, but most of those comments. So that's the immediate future.
In terms of 2024, yes, I do like the idea of open sourcing this even more to the community. But, again, going back to the first principles of why I'm doing this, I'm doing this for the forcing function of me doing the work. And then, you know, through conversations like this, get the chance to complain about it. But, you know, I'm French, so complaining is one of the things I do best. So if I completely open sourced it to the community or had other people do it, then that would sort of defeat the purpose of really doing the work. So I'm probably mostly going to continue as is for the foreseeable future. Although, again, I'm open to all sorts of thoughts and suggestions.
[00:56:07] Unknown:
Are there any other aspects of your experience investing in and engaging with the data infrastructure landscape and your work on the MAD landscape for those purposes that we didn't discuss yet that you'd like to cover before we close out the show?
[00:56:21] Unknown:
No. I thought that was pretty thorough. I would leave on a message of hope, I guess. The market is what it is. I do think that data infrastructure in general is the gift that keeps on giving. I think we keep going from phase to phase, from, you know, the world of the old MPP databases to Hadoop to the modern data stack to whatever it is that we're doing today. There's always something new. It's always a very fertile area to start companies. I think you wanna be very careful in this market when starting something. You wanna make sure that you are in it for the right reasons. You wanna make sure that you truly enjoy working with customers on a daily basis and not just building product.
Having said all of that, another VC cliche, but one that's very true: this kind of market is a wonderful time to start a company. Talent is available. There's much less noise. You have more time to iterate. You have much less risk that 5 competitors are going to emerge overnight. So it's a great time to start a company. And, you know, while everybody's busy trying to build the next thin layer on top of ChatGPT, that gives you time to be thoughtful. It's always a great area to build a company.
[00:58:09] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:58:26] Unknown:
I think there's a really interesting set of opportunities around generative AI for data, a little bit as a perfect end to this whole conversation, that I find very interesting. I have seen a bunch of companies sort of jump into this opportunity. But going back a little bit to the other conversation around democratization and making data infrastructure available to a broad set of people within the enterprise, I think the opportunity for people beyond data analysts to interact with data analytics, in particular, and also, potentially, broader data infrastructure, through English or through natural language is really intriguing.
I don't know what it means yet. I don't think that it puts the jobs of data analysts at risk just yet. But I think it could be a very major unlock for the space. To this day, we're still very much in this situation where, if you think of the most basic output of this whole data infrastructure that we've been talking about, which is a BI dashboard, we're still very much in this world where it's the province of a handful of people in the enterprise. And, as we all know, if you are the CEO or a senior ranking member of an organization and you want some kind of BI analysis beyond the dashboard that's available to everyone, sure, the Tableau analyst or the Looker person in your organization will say, you know, right away, let me get on that, and you get the result within a few hours. For anybody else, which is really 95% of the organization, it's, you know, take a number and wait your turn. And that just doesn't feel like the best way of justifying all this investment in the tools. So, look, the dream of self-service analytics has been around, you know, forever.
But I think, you know, generative AI as an interface to all of this gives really new life to the idea. And I think that's a really interesting and fertile area.
[01:00:46] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experiences of compiling this landscape. Thank you for all of the work that you've put into making it a reality and going to the effort of presenting it to the broader community and ecosystem. So I appreciate all of the time and effort that you and your team have put into that, and I hope you enjoy the rest of your day.
Thanks, Tobias, for having me. Love the show.
[01:01:16] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Matt Turck's Journey into Data and AI
The MAD Landscape Report
Open Sourcing the Landscape
Evolution of the Data Ecosystem
Impact of Venture Capital on Data Landscape
Generative AI and Data Infrastructure
Consolidation in Data Product Categories
Challenges in Big Data Principles
Successful Strategies in Data Infrastructure
Future of the MAD Landscape
Opportunities in Generative AI for Data