Summary
Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Bob Muglia about his recent book about the idea of "Datapreneurs" and the role of data in the modern economy
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your concept of a "Datapreneur" is?
- How is this distinct from the common idea of an entrepreneur?
- What do you see as the key inflection points in data technologies and their impacts on business capabilities over the past ~30 years?
- In your role as the CEO of Snowflake you had a first-row seat for the rise of the "modern data stack". What do you see as the main positive and negative impacts of that paradigm?
- What are the key issues that are yet to be solved in that ecosystem?
- For technologists who are thinking about launching new ventures, what are the key pieces of advice that you would like to share?
- What do you see as the short/medium/long-term impact of AI on the technical, business, and societal arenas?
- What are the most interesting, innovative, or unexpected ways that you have seen business leaders use data to drive their vision?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Datapreneurs book?
- What are your key predictions for the future impact of data on the technical/economic/business landscapes?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Datapreneurs Book
- SQL Server
- Snowflake
- Z80 Processor
- Navigational Database
- System R
- Redshift
- Microsoft Fabric
- Databricks
- Looker
- Fivetran
- Databricks Unity Catalog
- RelationalAI
- 6th Normal Form
- Pinecone Vector DB
- Perplexity AI
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today I'm interviewing Bob Muglia about his recent book on the idea of datapreneurs and the role of data in the modern economy. So, Bob, can you start by introducing yourself? Hi, Tobias. My name is Bob Muglia. I have been involved in technology in the industry for over 40 years. Started at Microsoft in 1988.
[00:01:03] Unknown:
Was the first technical person on SQL Server at the time. Spent 23 years there. When I left, I was president of Server and Tools, running SQL Server, Windows Server, System Center, and our Visual Studio line of developer products. Spent a couple of years at Juniper and then 5 years in the very early days running Snowflake, from zero revenue (exactly zero, you know that number perfectly) to about 200 million in revenue. And the last number of years, I've been an investor and entrepreneur working with other entrepreneurs and helping them as they build their businesses. And in the last couple of years, I've written this book called The Datapreneurs, which is really about the history of data, and in particular, the people and the values that have created some of the great companies and some of the great products
[00:01:49] Unknown:
that we all use in our daily lives. Yeah. It's definitely a very impressive background, and you've touched on it briefly. But I'm wondering if you can just give a bit more color to how you first got started working in data.
[00:02:00] Unknown:
That's a great question. It goes all the way back to when I was in college, really. I went to the University of Michigan and got a computer science degree there. And while I was working, I was able to get a job at a company called Condor Computer, which was in Ann Arbor. Condor had been started by really the first datapreneur I worked with. His name is Malcolm Cohen, and he was a professor at Michigan. He was involved very early in relational databases. They actually built a relational database called Condor.
It ran on, well, this was 1978, '79, just to put a time frame on things. So very early. I mean, to give you an idea, relational was invented in the late sixties, early seventies, and the first SQL database came out of IBM in about that same time frame. And this was not a SQL database, but it was truly relational, and it ran on a Cromemco microcomputer with a Z80 processor and, if you can believe it, 16K of memory in it at the time. These big 8-inch floppy disks stored the data, and they were tiny too, but 16K. We had an upgrade while I was there to 32K. That was a big deal.
[00:03:19] Unknown:
Yeah. I have not been around for as much of the history of computing as you have, but it's still interesting and mind-boggling to see how things have grown. I can remember the first time I got a flash drive. It was 128 megabytes, and I was like, wow, I could fit so many songs on here.
[00:03:33] Unknown:
Yeah. I mean, think about that. That is not even enough to load the libraries. You know? I mean, that's not enough to load a tiny fraction of the base of anything in today's... Exactly.
[00:03:44] Unknown:
You've touched on this idea of a datapreneur, and another side-channel conversation I'd like to weave through here as well is your opinion of SQL as the lingua franca for data, but we'll put that to the side for now and focus on just getting your concept and description of this term datapreneur, what it means for somebody to be a datapreneur, and some of the distinctions from the common idea of an entrepreneur.
[00:04:09] Unknown:
Sure. Well, yeah, I mean, the term came from the work that we did when my coauthor, Steve Hamm, and I were writing the book. And, you know, in the very early days when we were putting the narrative together and realized we had a story for a book, I was thinking about this and I said, you know, Steve, I'd really like to tell the story from the perspective of the incredible people, the technologists I've worked with, the data entrepreneurs that have really made these technologies happen, because I mostly facilitated it. I mean, I was what Microsoft called a program manager, what many people would call a product manager historically. And I helped to facilitate the creation of the products, define what the product should be, work with customers.
And, you know, these technologists are the ones that created it. And so I said we should talk about these data entrepreneurs. And Steve said, you mean the datapreneurs? And that's where the term came from, and we immediately stuck with it. And before we get too far into some of the contents of that book, I would be interested to hear your opinions, as somebody who has been around for a lot of the history and evolution of SQL, on why you think it has been such a lasting
[00:05:16] Unknown:
way to interact with data, and maybe some of the missed opportunities that have come up along the way, either ways to make SQL better or alternatives to SQL that might have been. Well, it's interesting
[00:05:29] Unknown:
because, you know, what happened is basically, if you kind of go back historically, the earliest databases were what we call navigational databases, where there are actual physical linkages between data items within them, and those were always programmed very directly by the programmer. Those were hierarchical and network databases in the early days. And when relational came out, and it was invented by IBM, it was a pretty big revolution in data because it allowed you to create data of different shapes and to change things and to migrate it over time. And the other thing relational did a really good job of, and IBM did a particularly good job of, is they made it straightforward to do transactions, to actually ensure that information is accurately stored across multiple different accounts, to make sure that things stay consistent in financial systems and whatnot. So SQL made things much easier than the old systems, and it very rapidly took over in the 1980s as the primary type of database that people worked with. You know, it dominated then, and it still dominates today. I mean, it's been dominant the entire time, largely because of its overall flexibility and the fact that you can do so many things with it. However, over time, there have been new classes of data introduced that SQL does not work as well with. The one that is probably the most predominant is semi-structured data, which came into being in the early 2000s as these Internet systems began to generate vast amounts of data that described what the processes were doing.
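As a rough illustration of the transactional guarantee described here, the following is a minimal sketch using Python's standard-library `sqlite3` module. The `accounts` table, the account names, and the `transfer` helper are hypothetical and only serve the example; the point is that either both balance updates are stored or neither is.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single all-or-nothing unit."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds or unknown account")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        # The debit above was rolled back, so the data stays consistent.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After both calls the total across accounts is unchanged, because the failed transfer left no partial debit behind.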
The semi-structured data is really focused on how, and it's data created to talk about what's happening inside these applications. And structured databases, SQL databases, were never particularly good at that because they're focused on handling tables. So SQL has dominated a lot, but we've augmented it in a number of ways. We have new classes of databases, like document databases, that are more naturally able to handle semi-structured data. And at least in the analytics space, SQL has been extended to allow you to work with and do analytic computations on different types of data, but it has a number of limitations. You know, those limitations go way back to the early days of the 1970s and to System R, which was the system that IBM created, and it was the first instantiation of SQL.
And again, I come back to that, you know, that 16K of memory that I had in those Cromemco microcomputers, that's about what IBM had too in those initial big machines. Maybe they had a little more in the mainframes. But they were working with very limited amounts of capacity in computing systems, certainly by today's standards. And so one of the reasons that SQL is the way it is is because of the limitations of that time. And probably the most important of those is that SQL works with tables. And for a lot of data, that's a very natural way of working with data, but it's not the only way. So, you know, there's a number of limitations. SQL is also a language that's been developed by a number of committees, you know, committees of different vendors. So there's a lot of vendor specificity in the language, and it's evolved over time. You know, it's not the most elegant language by any means, but it's certainly served a lot of people well. I do think that SQL's dominance and its focus is somewhat limited in time. I think that while probably more SQL will be written in the future than has been written in the past, so that says it's not going away anytime soon, we are going to see other languages introduced for data. And also the new language that's by far the most important for working with data is English. And I think that's going to become very predominant, and SQL will move to become more of an intermediate language, which is a very important concept because, you know, many of your viewers are very good SQL developers. They know SQL inside and out.
You know, I think we're gonna more and more see that as a target, an intermediate target, where the primary things are being written in other languages like English. And certainly for business intelligence, we'll see that shift over the next few years. Yeah. And that evolution is happening quite rapidly with the rapid rise of large language models and things like ChatGPT, and being able to go from natural language to SQL as that intermediate representation to then be executed on the different target systems. And we've got a lot of announcements. I mean, a lot of people are announcing products and capabilities in that regard.
Probably the most widely used of which is the announcements Microsoft made with Power BI to incorporate natural language into the BI process there.
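To make the mismatch between semi-structured data and tables a little more concrete, here is a small hypothetical sketch in Python. The two JSON events and the `flatten` helper are invented for illustration; they show how records with different nesting have to be forced into one wide set of columns, with gaps wherever a record lacks a field.

```python
import json

# Two application events with different shapes (invented sample data).
events = [
    '{"event": "login", "user": {"id": 1, "plan": "free"}}',
    '{"event": "purchase", "user": {"id": 2}, "items": [{"sku": "A1", "qty": 2}]}',
]


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one level with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = value
    return flat


rows = [flatten(json.loads(line)) for line in events]

# A table needs one fixed set of columns, so take the union of all keys
# and fill the cells a given event does not have with None.
columns = sorted({key for row in rows for key in row})
table = [[row.get(col) for col in columns] for row in rows]
```

Every new event shape widens the column list, which is one reason document-oriented stores and semi-structured column types became attractive for this kind of data.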
[00:10:06] Unknown:
And now going back to the book and your reflection on some of the history of the ways that data has been pivotal in these different shifts in the technology landscape
[00:10:27] Unknown:
30 plus years. Yeah. Sure. So I would say, okay, the first one I would say, if you go back, and these are chronicled in the book. In the book, I talk about something called an arc of data innovation, which is a series of progress that's happened over the last 50 years. And in there, I describe the most important innovations in data that have occurred. And I also talk about the new types of data that have been introduced over time. I mentioned structured data, which is what IBM was working with with SQL. Semi-structured data came into much more usage in the early 2000s. And now we're beginning to work with complex data, which is pictures, videos, audio, documents, things like that, and we're breaking those open. The other data element that I didn't mention, which is very important, is text, and how text and paragraphs have emerged as important sources of data. So historically, if I do this chronologically over time, I would say that, you know, the advent of relational technology in the seventies and the popularization of that with a wide variety of products in the eighties was an important inflection point. I think that 2 things happened in the 1990s that were very important. One was the democratization of those products, largely by Microsoft. I feel like I was very much a part of that, in lowering the cost of those systems and building them on, you know, personal-computer-class devices, which really opened up business computing and databases to literally millions of small businesses around the world. I mean, if you went back to the early 1990s, most small businesses kept their records in pen, paper, and pencil. And that all changed, and it changed with business systems that were largely built on Microsoft products. So that was a major change. Another major change in the 1990s and 2000s was the advent of the Internet.
You know, Bill talked about this, and I describe this in some detail in the book, about a vision that Bill, Bill Gates, I mean, laid out in 1990 called Information at Your Fingertips, where Bill really envisioned a world... You know, you've got to go back to 1990, when really most people didn't have email and people used computers in a much more casual way. You know, Bill envisioned a world where we would get all of our information almost instantly from computers.
And, I mean, while the vehicles that he was building and the things we were creating at Microsoft did not pan out the way we thought they would, the vision very much did pan out, and it's the Internet world and the world of search that we have today. You know, I look at the advent of the Internet, and in particular Google, as the fulfillment of the Information at Your Fingertips vision. So those are a couple of things that happened in the 1990s. You know, the 2000s was the spreading of the Internet and the creation of new data sources, like semi-structured data, that were rich sources of information to mine. That period, however, was rife with very poor tools. Hadoop was a misery for people in general, and most business analysis was done on a desktop, and by far the tool of choice was called Excel.
You know, the really cool cats were using Tableau back then, and Tableau was a really innovative product. But that was a rough period for data analysis because the data was everywhere. You know, in the last 10 years, the modern data stack has collected all of that data and put it together in cloud data systems, which make it capable of being analyzed at much larger scale and, essentially, allow a single copy of data for the enterprise in a single source of truth. And I see that as a huge set of progress to happen. And now, of course, you know, we're in this world where all of this data and information, including that complex data that I described, those movies and videos and documents and things, become data sources that can be fed into these artificial intelligence models, which are really bottled intelligence for the first time. I mean, we've never had that. In my entire career, we've never had the idea of intelligence that mimics human capability in a machine, and now for the first time we have that. So it's been quite a history, and all of these things have built. I mean, the point of the Datapreneurs book is all of these are innovations that were built up over time, largely the creation of great people, people who worked with a set of values and ran teams and motivated teams to do things.
[00:14:45] Unknown:
That's the story of the book, and it's really the history of the technology world that we're living in. And an interesting aspect of what you were saying too about the relatively recent rise of cloud data systems and the power that they bring is the fact that there's an uneven distribution of usage of those capabilities depending on the size of the business, their appetite for risk, where they are in terms of their overall technical maturity, where they are in terms of their regulatory requirements, and then there's also the cost factors that come into play of capital versus operational expenditures.
And I'm curious what you see as some of the current limiting factors for businesses that are maybe interested in adopting those cloud native capabilities, but are impeded from doing so, whether it's for regulatory or business
[00:15:39] Unknown:
or risk appetites or capital reasons? Well, what I would say is that those reasons are disappearing very rapidly over time. I mean, I started at Snowflake in 2014, and that was really before the existence of the modern data stack. You know, the only cloud data warehouse in the market was called Redshift, you know, from AWS. And Redshift was really a milestone product because it did something that only Amazon could really do. It demonstrated the viability of a data warehouse in the cloud, and it was at an incredibly competitive price point relative to the on-premises products that were available in the market. So it was a revolution in terms of people being able to have a price reduction and capability.
Fortunately, from Snowflake's perspective, Redshift had the attribute that it did not scale, because it was an on-premises product hosted in the cloud. It did not have the built-for-the-cloud attributes that Snowflake had. So it didn't scale like Snowflake did, and we wound up benefiting as companies grew up. What I would say is the issues that, you know, we saw existing, the regulatory issues, the concerns about security, those have largely faded over time. And right now we're in a world where, I mean, 2 major data analysis companies did announcements this week. Snowflake had their summit and Databricks had their conference. And just a few weeks ago, we watched Microsoft announce Microsoft Fabric. Essentially, we now have 5 viable cloud data platforms that customers can choose from: Snowflake and Databricks, which are the 2 cross-cloud products, and then products from all 3 of the major cloud vendors, AWS with their analytic offerings, Google with their BigQuery offerings, and now Microsoft with their more integrated Fabric offering. And, you know, we have 5 platforms that are roughly equivalent in terms of the kind of capabilities they're offering, and they're converging in terms of their capabilities.
They come from different places, so they all have different strengths and weaknesses, but all of them are viable. And certainly, I think one of those is viable for just about any organization. In fact, one of those is viable for every organization on the planet right now. You know, the one that's probably the most difficult to deal with is some of the security constraints associated with federal, and some of the very high constraints there. But then, you know, Microsoft is pretty good with that stuff. So there are product offerings even for the government customers that have the highest security requirements.
Amazon and Microsoft both have strong offerings in that area. So I don't think there's anybody that should... I mean, JPMorgan's going to the cloud, Goldman's going to the cloud, you know, everybody's going to the cloud. I mean, at this point, why aren't you going to the cloud is what I would ask. What's the real reason? And there aren't a lot of great reasons. I mean, there are some, but not a lot. I mean, business systems, core operational business systems, there isn't as strong a motivation to move those to the cloud, and many of these highly regulated companies will continue to run those at scale. But data analytics, it's just the nature of it, the variable compute required benefits tremendously from the fundamentals of cloud computing, where you have essentially infinite resources at your disposal that you can take and give up, you know. So the cost
[00:19:02] Unknown:
advantages are there too. So I think everybody should go. I think everybody should go. Absolutely. And the term modern data stack, you've thrown that out a few times. And in your role as the CEO of Snowflake, you had a first row seat for the rise, creation, and evolution of that. And I'm wondering if you can talk to the main positive and negative impacts of that paradigm that you have seen from your perspective at the helm of Snowflake and now in your role as an investor and adviser to these companies that are
[00:19:35] Unknown:
working in the space? Yeah. So the term didn't exist when,
[00:19:37] Unknown:
when I started Snowflake, and it sort of evolved over time. And it was really created by a number of companies. Like, say, Amazon played a major role in it. Snowflake, certainly. Fivetran. Looker played a big role. Our first 2 big partners were Fivetran and Looker, and, of course, Amazon. And collectively, that was a full product suite, you know, a full analytic offering for companies. So that developed over time. You know, I define the modern data stack pretty cleanly. It is data analytics that is delivered as a software service, that leverages the public cloud for scale and low cost, and where data is modeled for SQL databases inside it. And that's pretty much what all 5 of these technology companies are building with their product offerings for the modern data stack.
And I guess, to answer the previous question, it was such a mess before this came out. I mean, data systems couldn't hold a relatively large proportion of the data for a large enterprise. So enterprises needed to have tens or hundreds of these different databases to solve their problem. Data was scattered everywhere, there was no consistency. I don't see any downside to moving to the modern data stack, to be honest. I think it's all upside, and I think everyone will move to this. Now, I don't think the modern data stack is perfect. There are many elements that are missing in it, and some are suboptimal in some ways. You know, there's a lot of different products you have to buy. It's not a perfect offering by any means, but it's infinitely better than anything that came before it. And I think, frankly, it's infinitely better than anything else that you can do. The only alternative really is to do it yourself. And I don't know why anyone would do that. I mean, why wouldn't you leverage the incredible IQ that is being put into companies like Snowflake and Databricks and the cloud companies? I don't know why you wouldn't leverage that.
[00:21:35] Unknown:
And an interesting reaggregation by adding a facade on top of all of those underlying components. And then there has long been the pendulum shift of disaggregated versus fully integrated product stacks. And I'm curious what your sense is of maybe where we are in that pendulum swing and what you see as the longest-term viable approach to managing the constantly evolving means of building that end-to-end solution.
[00:22:16] Unknown:
I think this is, you know, a classic life-cycle-of-a-new-technology conversation. You know, the modern data stack has a lot of moving parts to it. It's complicated. You know, you have pipelines to bring data in. You've got transformation products to transform data into whatever shape you want. You have SQL databases to do analysis on it. You've got all this machine learning work that's being done. There's data quality tools. There's tools to get results back, reverse ETL tools to get results back. There's BI tools. There's just a lot of different pieces to it. The nature of technology is that as a new platform emerges, many, many organizations build components of that technology, and they establish, you know, best-in-class capabilities over time. And then, typically, as these technologies mature, the larger players, you know, of which I think we've identified the 5 players, I think the 5 players are pretty well established. I mean, those all have somewhere between tens of billions and hundreds of billions of dollars, and the rest of us don't. Let's put it that way. And because of those resources, I think they'll continue to be in a strong position, those 5 platforms. And I think we will see consolidation.
I think, you know, you're gonna see a number of these companies buy some of these smaller companies and build out a more complete and holistic end-to-end product offering. I do think we're beginning to enter that phase where, you know, the solutions from the 5 vendors will become more complete over time. And meanwhile, let me say one thing. Meanwhile, there will still be best-in-breed products playing side by side with that. I think that we'll continue to see that coexisting. However,
[00:23:56] Unknown:
you know, will there be 10 offerings in data quality? Probably not. Probably not. You know, we'll see some of that consolidate down. Yeah. We're definitely starting to see some of that consolidation happening, and it'll be interesting over the next 2 to 5 years how that plays out and what are some of the new categories that start to get spun up as the different ecosystems coalesce together a bit more. And from your view as somebody who is advising these companies, what are some of the key issues that you see as the open problems that are not yet
[00:24:28] Unknown:
addressed in the overall ecosystem of the modern data stack? Governance is my number one issue. If you were to ask me one thing, I mean, what I always say is what you can do around governance and helping people to manage this. Probably the biggest challenge we've created with the modern data stack is a governance challenge, in the sense that there's this odd irony, which is that if your data is scattered all over the place and you can't find it, it's actually kind of secure in an interesting sort of way. Right? If nobody can find it, it's pretty secure. But when it's all in one place, it becomes a vulnerability threat potentially, and so it's about making sure that the right people have access. And it's not that easy. It's not that easy to structure the roles, to set up the roles and access control today to allow people to have access to just the information they should and not have other access. And it's super important from a regulatory perspective.
So the tools here are still suboptimal. You know, I was pleased to see Databricks' announcement this week of the Unity Catalog. I thought, of all the things Databricks announced this week, that was the one that caught my eye the most. And I think that seeing vendors take that on and help, maybe it'll be third parties, maybe it'll be some of these smaller folks, but seeing the big folks begin to take on and focus on that is one big thing. Related to that, my sort of favorite topic, you know, my holy grail that I continue to pursue, is the semantic model for the modern data stack. And really what I mean by that, to be clear, is the semantic model for the business. Because there is no single place today where we define a business model for a company. That model is scattered in so many different places. First and foremost, in the heads of the people that run the company, but also in a whole variety of different applications, in a huge number of different queries, all kinds. It's scattered all over the place, and the algorithms and things are all over, and there is no central canonical place to define that. And I think that's gonna change. I think we're gonna start to see knowledge graphs emerge that become the semantic model that defines the business process. And then subordinate to that will be the data model, because the data model is only there to reflect and support the business model. And today, that's all opaque and in people's heads, and Slack messages or whatever. And
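The role and access-control problem mentioned here can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical model: the role names, table names, and the `can_access` helper are assumptions for illustration and do not correspond to any vendor's governance API.

```python
# Hypothetical grants: each role maps to a set of (table, action) pairs.
ROLE_GRANTS = {
    "analyst": {("sales.orders", "read")},
    "finance": {("sales.orders", "read"), ("finance.payroll", "read")},
    "data_engineer": {("sales.orders", "read"), ("sales.orders", "write")},
}

# Hypothetical user-to-role assignments.
USER_ROLES = {
    "ana": {"analyst"},
    "fin": {"finance"},
    "eng": {"data_engineer"},
}


def can_access(user, table, action):
    """Return True only if one of the user's roles grants this exact access."""
    return any(
        (table, action) in ROLE_GRANTS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


analyst_reads_orders = can_access("ana", "sales.orders", "read")
analyst_reads_payroll = can_access("ana", "finance.payroll", "read")
unknown_user = can_access("guest", "sales.orders", "read")
```

Access is denied by default, so a user or role that is missing from the mappings gets nothing. The hard part in practice is keeping mappings like these correct across many tables, teams, and tools once all of the data sits in one place.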
[00:26:48] Unknown:
to that end, I wonder what you see as the role of these metadata stacks, data catalogs, whatever terminology you wanna call them. Like, is that the right place to build from for having this cohesive, unified business metrics, business model that is linked to and fed by all of the dependent systems for the overall organization? The right idea
[00:27:13] Unknown:
I mean, everybody has I mean, you see I watch across the industry, and many players are converging on a similar idea of having and building a semantic model. The challenge, and the 1 that I've been focused on, is there ain't no place to put it. I mean, where are you gonna store the semantic model? I mean, the only place you can really put it is in, like, an XML text file or JSON text files today. I mean, you can't really define it in a database. You know, in particular, SQL databases are just inappropriate. And this is where SQL yes. You know, we ask where SQL breaks, you know, SQL sort of ends. SQL is about modeling data, and it's about modeling data fundamentally in tables. I mean, that's the way it has always been structured.
And, yes, it's been extended to go beyond that in a few ways, but fundamentally, that's what it does. The data models required to model a business are ultimately some form of graph. And in fact, what they very specifically are is a knowledge graph: a graph of the knowledge of the different attributes of the business. Now there has never been a database built that effectively models that. There are several graph databases in existence today, but they are all navigational in their structure. In other words, the edges that connect the nodes of the graph are actually hard coded in some form of linkage that's stored inside the database. And while that can work in a number of ways, it's less appropriate for the analytic problems we face because the relationships between the items are undefined, and what's most interesting are the relationships that you haven't even discovered yet. So a navigational graph is just the wrong technology to build a knowledge graph, which is why I've been on this pursuit with my friends at RelationalAI to build a relational knowledge graph that leverages the fundamentals of relational mathematics to allow the full relational calculus to be expressed. If you look at what Codd designed from a mathematical basis, it was a very complete mathematical system, this relational model.
And SQL only supports a subset of that relational model, and that subset is largely defined by the constraints associated with the way the data is stored, which is the tabular format. When you break loose and think about relational as a knowledge graph, the way the RelationalAI guys have thought about this is to think about storing base relations, which is essentially storing data in sixth normal form. They call it graph normal form. So everything is stored as a fundamental element that has to be related through relational mathematics to everything else.
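To make the idea of base relations concrete, here is a minimal Python sketch. It is only an illustration of the general sixth-normal-form idea described above, not RelationalAI's storage format; the `customers` data and the helper names `to_base_relations` and `join_on_key` are hypothetical.

```python
# Hypothetical wide rows, as a SQL table would store them.
customers = [
    {"id": 1, "name": "Acme", "region": "EMEA", "tier": None},
    {"id": 2, "name": "Globex", "region": None, "tier": "gold"},
]


def to_base_relations(rows, key="id"):
    """Split wide rows into one binary relation per attribute.

    Each relation is a set of (key, value) facts. Missing values are
    simply absent instead of being stored as NULL.
    """
    relations = {}
    for row in rows:
        for attr, value in row.items():
            if attr == key or value is None:
                continue
            relations.setdefault(attr, set()).add((row[key], value))
    return relations


def join_on_key(relations, *attrs):
    """Recombine attributes by joining their relations on the shared key.

    Assumes each key has at most one value per attribute.
    """
    maps = [dict(relations[a]) for a in attrs]
    shared = set(maps[0]).intersection(*maps[1:])
    return sorted((k, *(m[k] for m in maps)) for k in shared)


base = to_base_relations(customers)
print(base["tier"])
print(join_on_key(base, "name", "region"))
```

In this sketch nothing is stored as a row. Any wider view of a customer has to be reassembled by relating the individual facts to each other, which is why the cost of joins matters so much in what follows.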
Now, the challenge in that is that the mathematical algorithms to do that have historically been very costly to execute. In fact, if you look at every single SQL database, from System R all the way to Snowflake, every single one has been built on a series of algorithms that are based on what's called a binary join. When you wanna join two items together, you take this table and this table, you join them together, and you get a result set. And if you have a complex query plan, it's a series of joins that are done one right after the other. For some relational operations, that is ungodly expensive. You create these intermediate result sets that are gigantic, and then you immediately throw them away on the next join. Well, guess what?
In 1976, that's all you could do when you had 16 KB of memory. You had to do these joins serially. But now we have gigabytes of memory available to us, and we can do multi-way joins. And in fact, there's a whole new set of relational mathematics. Probably the most exciting thing that I saw, and the thing that made me so interested in RelationalAI, is that the CEO, Molham Aref, has been working for over 10 years with over 20 research organizations around the world, building an entirely new set of relational algorithms.
Literally, they've created over 300 papers with new concepts of how to work with relational data in a multi-way form, and these algorithms are just beginning to be applied. And while they could help make SQL databases better, they most interestingly enable this new type of database, a relational knowledge graph, which allows you to essentially describe objects with any structure and relate them together.
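The cost difference between a chain of binary joins and a multi-way join can be sketched in a few lines of Python. This is a generic textbook-style illustration on a triangle query, assuming three made-up relations `R(a, b)`, `S(b, c)` and `T(a, c)`; it is not one of the algorithms from the research mentioned above, and the function names are hypothetical.

```python
from collections import defaultdict


def index(pairs):
    """Map each first component to the set of second components."""
    idx = defaultdict(set)
    for x, y in pairs:
        idx[x].add(y)
    return idx


def triangles_binary(R, S, T):
    """Binary plan: materialize R join S on b, then filter against T."""
    s_by_b = index(S)
    intermediate = [(a, b, c) for a, b in R for c in s_by_b.get(b, ())]
    t_set = set(T)
    result = {(a, b, c) for a, b, c in intermediate if (a, c) in t_set}
    return result, len(intermediate)


def triangles_multiway(R, S, T):
    """Multi-way plan: bind a, then b, then c by intersecting candidates."""
    r_by_a, s_by_b, t_by_a = index(R), index(S), index(T)
    result = set()
    for a in r_by_a.keys() & t_by_a.keys():
        for b in s_by_b.keys() & r_by_a[a]:
            # Only values of c that satisfy both S and T are ever produced.
            for c in s_by_b[b] & t_by_a[a]:
                result.add((a, b, c))
    return result


n = 100
R = [(i, 0) for i in range(n)]  # every a connects to b = 0
S = [(0, j) for j in range(n)]  # b = 0 connects to every c
T = [(i, i) for i in range(n)]  # but only a == c closes the triangle

binary_result, intermediate_rows = triangles_binary(R, S, T)
multiway_result = triangles_multiway(R, S, T)
print(intermediate_rows, len(binary_result), len(multiway_result))  # 10000 100 100
```

Both plans return the same 100 triangles, but the binary plan first builds 10,000 intermediate rows and then discards 99 percent of them, which is the "gigantic intermediate result set" problem described above.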
[00:31:55] Unknown:
And as you're describing it, in particular the fuzzy relations between different entities within that knowledge graph, it also puts me in mind of the work that's being done on vector databases and vector space similarity search. I'm wondering what your thoughts are on that and some of the applications to these relational mechanics and relational algebra.
[00:32:16] Unknown:
Right now, it's somewhat orthogonal. Okay? And I know it's not gonna be that way in the next 18 months or so. I'm working with a number of different companies in the space, including Edo Liberty at Pinecone, which is one of the leading vector databases, on understanding the evolution of vector databases and how they are going to relate to these relational knowledge graphs. I actually don't know the answer to that question, but I know there's a relationship, and I know that we will learn that in the next year or two. We're still at the discovery stage. This is the question I keep asking now: how do these vector databases relate to other databases? There's a natural connection between vector databases and document-oriented databases because the properties that you want to associate with the vectors are fairly naturally structured as a hierarchy.
And so there's some connection there, potentially, but we will see over time.
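As a rough picture of that connection, the sketch below stores each vector next to a nested, document-style metadata object and combines a metadata filter with a cosine similarity ranking. It is plain Python under stated assumptions: the records, the `meta` layout, and the `search` function are hypothetical and do not reflect Pinecone's or any other vendor's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


# Each record pairs a vector with hierarchical, document-like properties.
records = [
    {"id": "doc-1", "vector": [1.0, 0.0],
     "meta": {"source": {"system": "crm", "table": "accounts"}}},
    {"id": "doc-2", "vector": [0.0, 1.0],
     "meta": {"source": {"system": "erp", "table": "orders"}}},
    {"id": "doc-3", "vector": [0.8, 0.6],
     "meta": {"source": {"system": "crm", "table": "contacts"}}},
]


def search(query, records, system=None, top_k=2):
    """Filter on a nested metadata field, then rank by similarity."""
    candidates = [
        r for r in records
        if system is None or r["meta"]["source"]["system"] == system
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:top_k]]


print(search([1.0, 0.0], records, system="crm"))  # ['doc-1', 'doc-3']
```

The filter step is ordinary document-style querying and the ranking step is similarity search. How that combination should relate to a relational knowledge graph is the open question raised above.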
[00:33:18] Unknown:
And as you mentioned in your introduction, a lot of your work these days is as an investor and as an adviser, working with other entrepreneurs. And so for technologists who are thinking about launching new ventures, I'm wondering what are some of the most useful pieces of advice or insight that you'd like to share?
[00:33:27] Unknown:
I think the most useful thing for anyone to do is to gain expertise in a topic, and then use technology and artificial intelligence to reinvent that area. Over my long career, I've seen major technological revolutions be introduced into society, and then watched them play out over time, because it really takes a decade or more for those technologies to have their long-term impact on society.
I saw the PC, which was really about the lowering of the cost of computing. Right? Because until PCs came out, computers cost hundreds of thousands of dollars, and all of a sudden they cost a couple thousand dollars and became accessible to a much, much larger community of people. So there was that democratization that happened. There was the Internet. Some of these things benefit only the incumbents. Sometimes there are lots of new opportunities. Heck, there are always some new opportunities that get created. This is a time where every application is going to be reconsidered with intelligence added into it. And so the interesting thing now is if you have domain expertise, you can apply that expertise. You can take your expertise and bottle it. That's what I love, this idea that you can take intelligence that people have and actually inject it into these systems for the first time, and allow the systems to perform that capability. I think that opportunity exists in every application, and that's where I see the biggest opportunity.
I mean, all the heat is in OpenAI and Anthropic and these big foundation models that are being created, and that's great. But what's interesting is what people will do with this technology, and that means applications. And this technology, perhaps as much as any I've ever seen, is an opportunity to reinvent every application. So entrepreneurs have an incredible opportunity right now.
[00:35:40] Unknown:
To your point too about the major catalysts in this space right now being things like OpenAI and Anthropic and some of the notable large language models, obviously ChatGPT being the one that everybody's talking about, an element of that for people who are thinking about building a business is the concept of platform risk. If I'm going to bet my business and all of my livelihood on this one venture of building on top of this other system, what are the risks that I'm taking on? What are the capabilities that I might need that don't exist yet? Or what are the pieces that I don't understand but will need to understand about that system? I'm just curious what your thoughts are on some of those elements of people deciding how much to rely on these third-party models.
[00:36:22] Unknown:
Almost certainly, you're relying on some model. Right? And it doesn't have to be commercial models like you described. Many of the startups and small companies I'm working with are not developing on GPT-4. They're developing on open source models. What we've seen since the introduction of GPT-3.5 and ChatGPT, besides the continued improvement in the large commercial models with the introduction of GPT-4, is an explosion of innovation, almost on a daily or weekly basis. There's a new model that somebody created from scratch, bespoke models that don't have commercial connections to them and that are quite capable.
And some standard tests have emerged to actually compare these models to each other, so we can see how they rate against each other. And while there's no question that GPT-4 is way ahead, these open source models have become good enough to do many of the tasks that you need to do inside the enterprise. So we're gonna see a variety of different approaches to solving this problem. And in some cases, there are customers that have regulatory concerns where they don't wanna send their data to an OpenAI. In that case, some of these more focused open source models are perhaps more appropriate, and we'll see a variety of different ways to do that.
The interesting thing is that right now, if you're an entrepreneur building something, hopefully you've got some data science people that can help you figure out these models and decide which model is most appropriate for a given problem set. What's happening rapidly is that while today it's a wild west where you have to kinda build all of this yourself, within a year, certainly, that will not be true. We saw announcements this week from both Snowflake and Databricks, providing really complete end-to-end solutions for people to build applications.
Sometimes it involves some third-party software, but in either case, it's a full platform where people will put things together. And all of these vendors are focusing on making it easy for enterprises to take their internal applications and add intelligence to them in the form of these new models. And building that will not be a one-off, I'm-gonna-do-this-on-my-own effort. There will literally be recipes, well-defined processes that people can follow to build their applications on a given platform. In that case, and this gets right to the point you asked, you're choosing your platform vendor. If you're an enterprise, I think you've already done that. You've already chosen the platform vendor you're gonna rely on. And if you're an ISV, what's interesting is whether you can build your product in a way that supports multiple of these platforms, so that you don't have to completely bet on one. There is work, certainly, to move between them, but I'm now beginning to think of Snowflake as a cloud, with a relatively complete set of cloud capabilities for an application developer, or at least an intelligent data application developer. It's like an alternative platform now to Google or AWS or Azure, and you can target that. But some ISVs, certainly some of the ones I'm working with, are focused on targeting multiple of these platforms.
And from your perspective, I'm wondering what you see as the short, medium, and long term impact of AI on the technical, business, and societal arenas.
Well, it's a very big impact. It's the most significant technology of my lifetime. I continue to say that. I've lived through a lot of major technological revolutions, and this is the one that most impresses me. Because my gosh, computers can understand and respond to English now. English has become the most important programming language. And it is an API. All these things make my head explode. If you had asked me these questions a year and a half ago, I would never have said this was going to happen. And now it is happening, and it's gonna happen in a very rapid way. In the short term, we're gonna see intelligence introduced into everything, and we're gonna have these copilots or agents that assist us and help us. I think that's the first wave of the technology that we're going to see, and that's the next 3 years or so. After that, we're going to begin to see progressively more intelligent systems, including robotic systems that begin to put this intelligence in devices that move and are part of our society and part of our lives. And I very much think that that's what the 2030s are going to be about. I think the 2030s are the era of robotics.
And I predict that by the end of the 2030s, we're gonna have humanoid robots, not that unlike the robots of Asimovian science fiction, that will live amongst us and help us, help care for us, and help us in our daily tasks. I totally believe that by 2040 we'll see that, which I think is absolutely crazy to think about. If you had asked me 5 years ago when that would happen, I would have told you 2100. That's the biggest change that happened while I was writing the book: my horizon for artificial general intelligence moved in by about 50 years.
[00:41:59] Unknown:
Yeah.
[00:42:00] Unknown:
It's definitely a crazy world we're living in. And in your work on the book and in reflecting on your experiences over your career, I'm wondering if you can talk to some of the most interesting or innovative or unexpected ways that you've seen business leaders and technical founders using data to drive their vision.
[00:42:21] Unknown:
Well, to me, the biggest thing was what does it mean to become a data-driven organization, and how do you change the culture of people? I guess the thing I learned through that process is that the most difficult problem is the cultural transition for people in making organizations data driven, which is one of the reasons why I focus so hard on values throughout the book. I found values to be an important part of running teams and leading organizations, and a crucial tool, frankly, in managing people and making the right things happen.
And with data, you change the way you think about things. I would get to the point where you'd be in a meeting at Snowflake, a couple years in, when we had our data in our centralized data warehouse and we had everything together. People are talking about things, and you say, what does the data say? And literally, somebody could run a query in that meeting, and 2 minutes later you'd have the information, you'd put a chart up on the screen, and everybody would be talking about it. Now everybody converges. I do think data can help people reach agreements on things. And I talk about data and knowledge: data is raw information, and knowledge is data that has been analyzed so that a conclusion has been reached.
And the goal is always to get to knowledge, to have data translate into knowledge that allows you to make better business decisions. That's always been a human thing in the past, and now it's becoming a machine thing. Artificial intelligence can help to do the analysis and create knowledge, but it can also apply the knowledge and take action based on it. So it's becoming all the more important, and values are integrated into all of these things. To answer your question, to me it's always the people element that is the most challenging aspect of any of these things. But I have found that data is a tool that can facilitate agreement amongst people, if you believe the data. You gotta believe the data. And that's where the modern data stack has really helped. Because before, 3 different people could be in a meeting and they could all have data and they could all have 3 different answers, and that really sucks. It's hard to get agreement when 3 people are quite convinced that they have data and it's all different. That's the way life used to be, and the modern data stack has really helped to correct that.
[00:45:06] Unknown:
Yeah. And that's also where the semantic modeling becomes more critical, because even if you're in the modern data stack and you are all working on the same data, if you do the math differently or in a different sequence of operations, then you might still come to different answers.
[00:45:14] Unknown:
Absolutely. At least you can reconcile them. Right? At least there's a process where it can be reconciled. But this idea of having a semantic model for the business means there's a single canonical algorithm for any given thing, and you can apply that directly. So, again, it eliminates some of that confusion. It has the potential to bring this to another level. The other thing that I fundamentally believe, and I haven't proven this yet, this is still theory to be determined, is that in order for these new models that we're creating, these new intelligences, to learn from us, they need to really know what we think. And a fully defined mathematical semantic model has got to be something that would be very helpful for them. It has to be something that they can train on to learn more about a business. You can imagine how much information could be transferred into these models if you actually had your entire business process in a semantic model. That would be helpful for people, but it will also be helpful for these new algorithms.
[00:46:23] Unknown:
Yeah. Absolutely. And then that also brings in some of the work that's being done around causal modeling and causal statistics, but that's a whole other conversation. And so as you continue to work in this space and work with entrepreneurs and invest in these different companies, I'm wondering if you can talk to the key predictions that you have for the future impact of data on the technical, economic, and business landscapes.
[00:46:46] Unknown:
Well, I think it defines it, for all practical purposes. I think the new world is a data world. Data is an asset in every business, in every organization. And if you're not collecting that data and mining it and leveraging it, you're not fully taking advantage of your business capability. A really interesting example of this is the health care organizations. The hospitals and hospital systems are basically sitting on gigantic gold mines of data. They have all of this patient information, and that is incredibly valuable to pharmaceutical companies. And it'll help people. Right? When we can take statistics on the diseases and challenges people have had and leverage that to understand the causes, we can do new drug discovery and new procedure discovery to make people's lives better. So data is really gonna affect everything. And so many jobs are gonna change because of these assistants. I really think that having an assistant that is tuned to your needs, for your job and your life, is gonna change things immensely for people in terms of the way we live and work.
Technology has done that already to such a huge extent. When I joined Microsoft, getting a dial-up link was a challenging thing. This was 1988, and I remember my first ISDN line. I had 128k coming in. It felt giant, and I found all of a sudden I could work at home. Until then, you couldn't work at home. Just imagine that. Look at how Zoom has changed our lives in the last few years. So much of our social connection is now through video conferences, and so much of our business is that way. That wasn't true 10 years ago, or even 5 years ago. The damn stuff never worked. Every time I was on a video call for 10 years, we spent the first 15 minutes trying to get the freaking thing to work, and then we'd give up and do a conference call. And now it just works. Yep. It just works. What's so good about Zoom? It just works. That's the thing. So many people don't understand that making a product reach the point where it works for people and solves a problem is the real breakthrough.
And clearly, this technology is beginning to do that. Let me give you an area that I think is gonna change completely: search. Internet search is gonna be totally different now. I'm going to answer bots now to ask questions that I got very poor answers from Google on, and I get very good answers from Perplexity. And I think we're gonna start to see that the way many of our tools work with us is totally different, because they're gonna be tuned to what we are looking for, because we're gonna have these agents that are gonna know that.
[00:49:43] Unknown:
Are there any other aspects of the work that you've done on your book, this concept of datapreneurs, or the evolution of the data capabilities that we have had both up till now and into the future, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:51] Unknown:
Yeah. Sure. I think we should talk about data engineering and how data engineering is gonna change, because that's sort of the title of this show. I've watched it change dramatically over time. First of all, there's a first question: what are you engineering for? And I think now everybody sort of knows you're largely engineering for SQL databases. A large part of it is getting data prepared to be worked with in SQL databases. And there's a variety of tools people use, Spark being amongst the major ones. We've watched SQL take on a broader role in data engineering with tools like dbt in the last few years. But I think the role of data engineering is gonna change in the sense that the data models in the future are gonna be created and suggested by these agents.
And, again, the data engineer will be assisted by these tools that will help them define the data model. I think that will also be foundational as to how data engineering will change into business engineering, if you ask me. It will shift from focusing on just the way the data is shaped to making sure you understand the entire business process and the relationship between the business process and the data. And here again, one of the things I've become convinced of with these knowledge graphs that will get created is that one of the biggest challenges is just understanding how to define the semantics of a business.
And I've come to the conclusion that people aren't, generally speaking, that good at that. That is a skill set that is still to be learned. And I'm quite convinced that it's going to happen in connection with these large language models and these tools. The tools will play a bigger role in defining the business semantics, as well as the data semantics. So I think people's jobs will change a lot, from being the creators of these things to essentially being the manager, in a sense, of the creation of these things, if you wanna think about it that way.
[00:51:51] Unknown:
Yeah. Absolutely. To some extent, data engineering has always been that aspect of business engineering. It's just that it's been cloaked in the facade of the mechanical operations of moving bits from A to B. It's always been necessary to have that business context if you want to be a good data engineer and effective at your role. But because of all of the technologies that are involved, that has also colored the types of personalities that are attracted to the role. So it'll definitely be interesting to see how that evolves as the specific responsibilities shift away from the very core of writing software and moving bits from A to B, to actually interacting more with the business concepts and the requirements that they have, and being able to translate that into a representation the systems are able to understand and operate on.
Yeah. So here's a prediction for you: in 3 years, this job is gonna transition to become business engineering.
Alright. Well, we'll have to bring you back in 3 years to see how that plays out. For anybody who is interested in getting in touch with you and following along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Recognizing, of course, that we've already touched on that a bit, but maybe just summarize it into a pithy statement.
[00:53:20] Unknown:
I'd say the biggest thing is controlling and governing the data, and making sure that only the people inside the organization that should have access to the data have access, and those that shouldn't don't. I think that's probably the biggest challenge people have right now. And I do think that's a problem that's gonna get solved in the next year or two.
[00:53:39] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share some of your history and perspectives on this very fascinating and fast-moving space. I appreciate the time and energy you've put into writing the book and sharing your expertise with us. So thank you again for that, and I hope you enjoy the rest of your day.
Great. It's good to be here. Thanks, Tobias.
[00:54:04] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Background
The Concept of Datapreneurs
The Evolution and Impact of SQL
Historical Innovations in Data
Adoption of Cloud Data Systems
Modern Data Stack: Pros and Cons
Consolidation in the Data Ecosystem
The Role of Metadata and Semantic Models
Advice for Technologists and Entrepreneurs
Impact of AI on Business and Society
Data-Driven Organizations and Cultural Shifts
Future Predictions for Data and Technology
The Future of Data Engineering