Summary
Data is an increasingly sought-after raw material for business in the modern economy. One of the factors driving this trend is the growth of machine learning and AI applications, which require large quantities of information to work from. As the demand for data becomes more widespread, the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O'Reilly AI conference and an investor in data-driven businesses, Roger Chen is well versed in the challenges and solutions facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- A few announcements:
- The O'Reilly AI Conference is coming up. Happening April 29th and 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
- If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
- Your host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies
Interview
- Introduction
- How did you get involved in the area of data management?
- You wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. Can you explain what you mean by that term?
- What are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?
- Can you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?
- Many companies view their data as a strategic asset and are therefore loath to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?
- What kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?
- What do you view as being the privacy implications from creating and sharing these larger pools of data inventory?
- What do you view as some of the technical challenges associated with identifying and separating shared data from the data that is specific to the business model of the organization?
- With broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?
Contact Info
- @rgrchen on Twitter
- AngelList
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Electrical Engineering
- Berkeley
- Silicon Nanophotonics
- Data Liquidity In The Age Of Inference
- Data Silos
- Example of a Data Commons Cooperative
- Google Maps Moat: An article describing how Google Maps has refined raw data to create a new product
- Genomics
- Phenomics
- ImageNet
- Open Data
- Data Brokerage
- Smart Contracts
- IPFS
- Dat Protocol
- Homomorphic Encryption
- FileCoin
- Data Programming
- Snorkel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. I've got a couple of announcements before we start the show.
The O'Reilly AI Conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective.
To save 60% off your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey, and today I'm interviewing Roger Chen about data liquidity and its impact on our future economies. So, Roger, could you start by introducing yourself? Hey. Yeah. Thanks for having me online with you. So,
[00:01:48] Unknown:
you know, I've been interested in the area of data for quite a while now, in many different forms and fashions: originally as a double-E, electrical engineer by training, then eventually as more of an applied physicist during my PhD research at Berkeley, and then subsequently as an investor in startups working within the realm of data, machine learning, and AI, and most recently in terms of starting our own company, which we're founding around a data network.
[00:02:19] Unknown:
And how did you first get involved in the area of data management?
[00:02:23] Unknown:
Yeah. It's interesting. So I'd say the first kind of foray was actually when I was a research scientist, and this is when I was studying the field of nanophotonics. In academic research, just getting access to open science data, stuff that could augment your own research, stuff that feels very non-zero-sum for everybody in a field, and kind of feeling the pains of seemingly not having enough access to that kind of data, right? That was what first opened my eyes to the new importance of data. And, you know, it was kind of a weird, windy road, too long a story to tell here, but basically I ended up in the startup ecosystem pretty quickly while I was in graduate school, and ended up at a venture capital firm working full time right after my PhD. And, sadly, as a venture capitalist you don't go around investing in silicon nanophotonics companies all too often, so that's when I pivoted myself into thinking about data in the sense of software and machine learning and AI, since a lot of the companies that were most exciting and interesting were emerging in that area. That's when I really got into the area of data management as a focus area. And it's kind of funny too, because if you think about it, I saw the same patterns and problems when it came to AI and data startups that I saw when I was a researcher myself, which is that there are just all these data silos that prevent you from making progress that should be able to be made, in the sense that you could have great algorithms, you could have great ideas, but you don't necessarily have the data to train against or use to fulfill and explore some of your ideas. And I just think that's an incredible problem to solve, and that's something that I've kind of set out to do now as well.
[00:04:18] Unknown:
And so a little while ago, I came across an essay that you wrote discussing how the increasing usage of machine learning and artificial intelligence is going to result in a demand for data that necessitates what you refer to as an increase in data liquidity and broader access to more fundamental datasets. So I'm wondering if you can discuss a bit about what you mean by the term data liquidity and some of the ideas that you were uncovering in the article in question.
[00:04:51] Unknown:
Sure. I mean, when I think about liquidity, there are really two ways I think about it. One is obviously the notion of having data that can move freely between different entities in order to unlock discovery and development and all sorts of innovation. But the other way to think of liquidity is that it's a marketplace and finance concept. Right? If you can create a marketplace where you have assets that are liquid, they tend to be healthier marketplaces where that asset gets to be shared, which leads to productivity and growth. And one thing I found really interesting about the data ecosystem, from that perspective, is that data is a world where there's a lot of illiquidity. Right? You have a lot of data silos that prevent this kind of movement of data to spur progress.
And, you know, on the one hand, I fully understand that, because in this day and age having proprietary access to data is a linchpin to maintaining a competitive edge as a business. But at the same time, I don't think that's true for all data. I think in addition to a company having proprietary data, there is plenty of data where the fragments of data by themselves are not enough to create value for any owners of those fragments, and where a data commons across a bunch of businesses can unlock value for everybody, grow the pie, if you will, for everybody. And in addition to that, having that data commons layer could actually even augment the proprietary data that some businesses might have. But in order for us to get there, I really do think it ties back to this concept of, one, mechanically, can you create liquidity in the sense that we can actually move data around and make it accessible and easily attainable? But also, do we have the ability to create the incentive layer? Right? The ability to motivate, and to bring marketplace dynamics into this data ecosystem, so that people want to trade data and exchange data.
[00:06:55] Unknown:
Yeah. One of the common discussions is around the concept of data capital, or data as the new oil, where in order to be able to benefit from the data that's being amassed, you need to be able to refine it into additional data products. And there's actually a really interesting article I read a little while ago about some of the ways that Google Maps has been able to take that refinement to create products such as all of the area-of-interest information that they have in their mapping, so I'll add a link to that in the show notes. But I'm wondering if you have any thoughts or examples of the types of data that you envision as being foundational to multiple organizations, that you would want to see consolidated into a data commons, and some of the problem domains that they might be applicable to.
[00:07:43] Unknown:
Yeah. I think there are tons, and I'll just touch on a couple here. One, I think, is health data in general. Right now there's a lot of outsized interest in the idea of personal health, or precision medicine, and part of what's driving that is the rise of genomics and the ability to do low cost sequencing. But what's interesting to me is that precision medicine really only becomes a reality if you're able to unify data. It's almost ironic: in order to have a precise understanding of a person's health disposition, you actually have to have broad human population data to train against in order to make those sorts of discoveries.
And that's one form of data commons that I think would actually unlock all sorts of new products, whether they're drugs or otherwise, or even just wellness products, as well as maybe even creating new industries. But to get there, again, you need to get to that basic foundation of having access to enough critical density of genomics data, as well as phenomics data, to make that possible. And on top of that, I still think, whether you're a life science company, a biopharmaceutical company, a biotech company, or a wellness company of any other kind, you can still have your own proprietary datasets that you build on top of those data commons, and let that data commons unlock discovery within your own dataset. You know, other things I think about beyond health care are just: what are the kinds of data that, if they were publicly crowdsourced and publicly available, would create a lot of value for everybody by helping them contextualize whatever it is that they're working on? So let's use another example: say, weather data. Weather data should not be something that's proprietary. If it were a public good and publicly shareable, it could enable a lot of industries, whether that's farming and agriculture or whatever else. And then the last thing I would put out there is even just the notion of having publicly accessible training data for AI.
And that might sound funny or a little bit idealistic, but I can give a very concrete example. I mean, if you think about what ImageNet and some of these other datasets have been able to do to help spur innovation and results in that field, then you've got to wonder: what if you could have the ImageNet for all sorts of other data types, whether it's for autonomous vehicles or natural language or whatever else? All of these sorts of datasets, if you're able to create public repos to make them available for people to train off of, I just think that can unlock tremendous progress on an academic level, on a research level, but also on an enterprise level as well.
[00:10:37] Unknown:
One of the things that I was most curious about when I was reading your article is the idea of the different types of organizational and technical structures that can be built around this idea of a data commons. One of the most obvious is the idea of open datasets, which are proliferating as the storage capacity for those datasets becomes more accessible and less expensive, and also as there becomes a greater awareness of, and push for, them. But you also call out some additional structures, such as having a sort of federated dataset. I'm wondering if you can talk about some of the different ways that these common datasets can be structured, both from the business and organizational side as well as at the technical level of how the data is actually located and accessed.
[00:11:31] Unknown:
Yeah. It's a great question. I think there are certainly a lot of different models for sharing data, and the way I break it down, there are kind of three buckets, but even within those buckets there are so many different permutations of how you share or exchange data. Largely, though, I do think there are three categories that capture it. One, there's the open data movement, where the idea is that we collect data, crowdsource it, obtain it however we do, and publicly contribute it to the Internet for public consumption. And I think it's fascinating, because that kind of behavior and sentiment is very additive and unlocks a lot of value for everyone who tries to get that data. I mean, think about how many times you've just gone on Google and snagged a free photo that people contributed, and hopefully had Creative Commons licensing, and used it for a project of your own. Right?
I think the challenge with open data is that it tends to be extremely unstructured and uncoordinated. That's fine for a lot of things, but for large scale data training projects for machine learning or AI, kind of like the genomics and precision medicine application I described earlier, having coherence, cohesiveness, and structure in large datasets really does matter. And that's where I think open data doesn't necessarily cut it. I think part of the reason why is that people who contribute to open data projects do so purely out of this desire to do good and to share.
And part of the challenge of that is that it tends to happen once, in these one-off kind of benevolent projects. But because of the lack of ongoing incentive, they tend to peter out, and they tend to remain fragmented on their own. There's another model where I think you can get much more cohesive, large scale, and sustainable data sharing happening, and that's through the form of data brokerage. The idea here is that a bunch of people will collect data that they think is valuable, and they'll use that to exchange with other datasets, to broker deals with other companies that collect data, all in pursuit of profit, because as we collect and aggregate and exchange this data, we can now resell this collective package at a higher cost. And, frankly speaking, this is how a lot of the modern ecommerce world and the modern online financial world works. You have a lot of companies who actually take your click data from browsing around the web, or take your credit data from various applications, and use that to power new sorts of applications and products for consumers.
And that works fantastically well for a lot of applications. However, the challenge is that it tends to be a little bit more of an opaque market and something that happens on the back end. That opacity means that if you're one of these data brokers trading financial data, great, but if you're not, you don't really have access to that sort of data. The third category would be the idea of what I think of as a data cooperative. The idea here is that you market publicly that this data cooperative exists, so it's not this opaque, hidden thing that you have to be an insider to access, but something that's publicly available. But in order for you to be part of this data cooperative, and to be able to get access to the data from other contributors, you yourself have to be a contributor.
And I think that model is interesting, because what it does is allow everybody to benefit from that sort of membership mentality, which gives access to not only your own data but everybody else's data as a member of that cooperative. But it does so in a way that's publicly accessible to others who might want to join. I think the challenge with the idea of data cooperatives, though, is that there's a cold start problem. If you want to publicly invite people to share data, there's a little bit of a "you first" kind of mentality.
And I think that leads to the notion of data cooperatives being hard to get off the ground in the early days. So when I think about these three different models, open data, data brokerages, and data cooperatives, they all have their virtues and benefits, but they also all have their challenges and drawbacks.
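To make the contribute-to-access rule that distinguishes a cooperative concrete, here is a minimal sketch in Python. Everything in it, the class, the membership policy, and the record format, is a hypothetical illustration of the model described above, not any real cooperative's implementation:

```python
class DataCooperative:
    """Toy data cooperative: members may only query the pooled data
    if they have contributed records of their own."""

    def __init__(self):
        self.pool = []            # the shared data commons
        self.contributions = {}   # member -> number of records contributed

    def contribute(self, member, records):
        self.pool.extend(records)
        self.contributions[member] = self.contributions.get(member, 0) + len(records)

    def query(self, member):
        if self.contributions.get(member, 0) == 0:
            raise PermissionError(f"{member} must contribute before accessing the pool")
        return list(self.pool)


coop = DataCooperative()
coop.contribute("hospital_a", [{"record": 1}, {"record": 2}])
coop.contribute("hospital_b", [{"record": 3}])
print(len(coop.query("hospital_a")))  # 3 -- members see everyone's data
# coop.query("freeloader")            # PermissionError: no contribution yet
```

A real cooperative would layer governance on top of this, metering access by contribution rather than treating it as binary, which is exactly the question raised next.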
[00:16:25] Unknown:
And the data cooperative in particular has the question of what sort of governance model you put in place: who is going to be in charge of providing the access to the data? Is there going to be any sort of metering involved between the companies, where providing a certain amount of data gets you a certain amount of usage, or can somebody just pay to be able to access the dataset without necessarily contributing additional information to it?
[00:16:55] Unknown:
Yeah, that's a really great question. I think governance is probably really at the heart of a lot of these things. Good governance, in the sense of fair rules and a fair game, is one of the best things you can do to actually encourage proper data sharing. I think we've seen governance up to now come in the form of industry consortiums, for example, and it's worked pretty well for certain applications. So, for example, the Semiconductor Research Corporation forming a consortium between different types of semiconductor research and manufacturing companies: they're thinking about the collective technology challenges they need to solve and what sort of knowledge they can share to help them all move forward, even though some of those members might be competitors, because they recognize that everybody needs to benefit from solving some technical hurdle for them all, as an industry, to continue to grow and progress.
And I think if you have very strong governing bodies, which maybe are comprised of a committee that is representative of the different members, or however else, that can work well. But we're also seeing a new kind of governance model emerge now, which is exciting and at the same time kind of scary. It's sort of an Internet-first model, and that's the idea of using cryptographic protocol networks to govern this, where you have Internet-based rules for saying what you get for what sort of contribution, and Internet-based rules for saying whether or not you've properly contributed the data that you said you have contributed.
And just to be more overt, I think what's happening with blockchains right now, and the ability to coordinate efforts across different people when it comes to data projects, is fascinating to me.
[00:18:58] Unknown:
Yeah. That's one of the interesting things I've seen as well: using the idea of blockchains and smart contracts, and the idea of decentralized authority, as a means of managing the access to federated data networks, so that you can verify that somebody who claims to have access is only gaining the access that they are entitled to and nothing more. So it in some ways prevents the sort of widespread abuse of a system, but it also can potentially, depending on how the smart contracts are implemented, introduce some sort of imbalance in terms of the level of access that's available to people, depending on their fundamental capability of being able to participate in that network, whether that's because of infrastructural issues where they're located, or technical acumen, or anything along those lines. So it's in some ways more egalitarian, but in other ways it's just another way to potentially exclude people, whether on purpose or by accident.
[00:20:12] Unknown:
Yeah. I think it's fascinating. You know, earlier I talked about how data cooperatives suffer from a cold start problem, where different organizations might wait for the other to put in first before they're willing to contribute as well. And what I think is interesting about decentralized networks is that suddenly you're taking trust in any single organization going first out of the equation. Implicitly there's still some trust in the organization that's creating the decentralized network, but really what you're saying is this protocol is going to give financial remuneration and payout to people who are willing to take that first step, and usually in a way that rewards them more than people who follow later on. And so suddenly you change the dynamic a little bit: you now give this carrot to solve this cold start problem around sharing. Because even without another participant coming in, you have some financial incentive for someone to actually upload their data into a network. So for me, it's a really interesting way to kick start a new kind of network.
And I also think it's a way to create, like you said, a much more egalitarian model, where you don't have to be some sort of branded, trusted organization; as any small organization, an unknown one, or even an individual, you can have permissionless access to be part of this data network and to profit off of it. But I do think that itself also comes with a lot of challenges. One challenge with these crypto token protocol networks is the fact that you have to be able to design something that is egalitarian yet has the flexibility to evolve over time. If you have a governing body that's a consortium of members for some sort of industry, they might be able to recognize three years in that maybe something should be recognized more, maybe they should head in a certain direction, or maybe contributions should be readjusted in certain ways. And I think for protocol networks that can be a challenge sometimes, because you're literally baking into software code how some of those things shake out. But we're seeing a lot of interesting things happen. There's a lot of innovation happening right now when it comes to crypto token protocols, and a lot of people are figuring out ways to create really market-driven, elegant designs that create that kind of egalitarian governance structure while still maintaining enough flexibility to adapt and evolve over time. So I'm just really excited about what that's going to unlock for data marketplaces, whether it's in health care or anywhere else, for that matter.
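One hedged sketch of the carrot described here: a protocol that pays out more to earlier contributors, so someone has a reason to go first. The payout curve and numbers are invented for illustration; real token protocols encode far more elaborate reward and validation rules:

```python
ledger = []  # (contributor, tokens awarded), in order of contribution

def reward(contribution_index, base=100.0, decay=0.95):
    """Earlier contributions earn more tokens, decaying geometrically."""
    return base * (decay ** contribution_index)

def contribute(contributor):
    tokens = reward(len(ledger))
    ledger.append((contributor, tokens))
    return tokens

print(contribute("alice"))  # 100.0  -- the first mover earns the most
print(contribute("bob"))    # 95.0
print(contribute("carol"))  # 90.25
```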
[00:23:11] Unknown:
And one of the other challenges associated with having these common data layers, particularly the idea of a cooperative, and even open data access, is the challenge of being able to store and transmit the data, and the infrastructure required to do so. Particularly if there are multiple organizational entities who want to be able to take advantage of these sets, it may create a fair amount of strain on the underlying hardware systems that are necessary for being able to provide that access. So I don't know if you have any thoughts on the design of the way that data is stored and distributed, and some of the economic structure that's necessary to be able to support those fundamental requirements.
[00:24:02] Unknown:
Yeah. So for storage and transmission infrastructure, I think you can sort of break that down into a spectrum. On one end of the spectrum, you have local, client-based, end-to-end transmission. This is literally a local client, as in your laptop, or, to put "local" in quotation marks here, maybe your Dropbox or your Google Drive: storage units that you have personal ownership over, doing end-to-end transfer of the data in your local store to whoever the recipient might be. So that's one end of the spectrum, and that's, I guess, more of the traditional model of how data is transmitted these days.
And on the other end of the spectrum, I think we're seeing some exciting new technologies evolving around fully distributed, decentralized storage: the idea that maybe we can take that file and, instead of storing it locally on your hard drive, chop it up into a bunch of pieces and put it out on the web for random hosts to store. And then, when the time comes to transmit it, we can resync that and deliver it to the right place for computation. I think basically what we're going to see is an evolution from the former to the latter. The reason why is that a lot of the Internet infrastructure is already in place for you to store a file in the cloud, or on your computer, and just ship it from point A to point B. That's ready to happen today.
So in terms of scalability and pragmatism, I think that kind of data storage and transmission infrastructure will be realized in some of these early data networks. However, I do think that over time, what I described around this exciting notion of decentralized storage, decentralized transmission, and even decentralized computing will have to happen for some data sharing applications, if only because data privacy for certain applications is paramount. So, for example, when it comes to health data, that's not the kind of data you want to just send out to anybody that wants access to it. That's not the kind of data you want to be replicated and disseminated across the entire web. And that's the kind of data where you would want some sort of decentralized network of storage as well as computation, so that you can run training algorithms against that kind of dataset without ever seeing the raw data. And that ties into what I'm seeing now as an exciting new trend, and I think this isn't a next-few-years trend, I think this is the next decade or two, but it's an inherently very powerful trend, and that is the notion of secure multiparty computing. The idea here is that all of the local clients that compute on the data in a multiparty computing model never actually see the full raw data, and therefore the data as a whole remains private.
So that kind of new storage and transmission infrastructure, I think, is on its way, and part of that is just the fact that we have fantastic compute resources and cloud infrastructure that are going to enable that to happen more and more over time. So I'm really curious to see, as that becomes a reality, and, to be clear here, there are a lot of technical hurdles before it can fully happen, but as that happens, I wonder what kind of interesting data sharing applications will emerge. Because now suddenly you can talk about the idea of sharing the most sensitive data and having confidence that that data will not be co-opted by someone else.
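One building block behind the secure multiparty computation mentioned above is additive secret sharing: each party splits its private value into random shares, parties aggregate shares locally, and only the final sum is ever reconstructed, so no one sees another party's raw input. A minimal sketch follows, with the three-hospital scenario and modulus chosen purely for illustration; production protocols add authenticated shares, multiplication, and defenses against malicious parties:

```python
import random

Q = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a value into n random shares that sum to the secret mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Three hospitals each hold a private patient count.
private_counts = [120, 340, 95]
all_shares = [share(v, 3) for v in private_counts]

# Party j receives one share of every value and sums them locally;
# an individual share is uniformly random and reveals nothing on its own.
partial_sums = [sum(col) % Q for col in zip(*all_shares)]

# Only the combined total is ever revealed, never any single input.
assert reconstruct(partial_sums) == sum(private_counts)  # 555
```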
[00:28:27] Unknown:
Yeah. And the whole idea of data privacy and cleaning datasets is a complicated and nuanced one, because of some of the examples that we've seen where datasets that have been ostensibly scrubbed of personally identifiable information can still be used to actually find individuals within a large dataset, even though there isn't any sort of address or name information, just because of the implicit biases or the implicit information that's fundamentally linked to the way that the data was created. And one of the interesting approaches that I've heard about in recent months and years is the idea of homomorphic encryption as a way to try to prevent some of that, where you don't actually have any direct access to the underlying data because of the way that it's encrypted, but you can still run machine learning algorithms against it, because there is enough data in aggregate to be able to actually gain some information from it. But when you want to then delve into the individual data points, there's no way to do that because of the way that it's structured.
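The episode doesn't name a specific scheme, but the Paillier cryptosystem is a classic example of additively homomorphic encryption and shows the core trick: multiplying two ciphertexts yields an encryption of the sum of their plaintexts, so an aggregator can total values it can never read. This is a from-scratch teaching sketch with toy key sizes; real systems use vetted libraries, large keys, and, for running ML over encrypted data, fully homomorphic schemes:

```python
import math
import random

def keygen(p, q):
    """Paillier keypair from two primes (toy sizes; real keys are 2048+ bits)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)   # requires Python 3.9+
    mu = pow(lam, -1, n)           # valid because we use g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """E(m) = (n+1)^m * r^n mod n^2, for random r coprime to n."""
    n2 = n * n
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    n2 = n * n
    ell = (pow(c, lam, n2) - 1) // n   # the Paillier "L" function
    return (ell * mu) % n

n, priv = keygen(1000003, 1000033)     # two small primes, for demonstration
a, b = encrypt(n, 42), encrypt(n, 58)
total = (a * b) % (n * n)              # multiply ciphertexts...
print(decrypt(n, priv, total))         # ...to add plaintexts: prints 100
```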
And then, going to your point about distributed data storage, some of the technologies that are interesting, where I'm curious to see how they pan out within the broader ecosystem, are the InterPlanetary File System, for being able to do fully decentralized data storage where everybody can take part in the network, and then the Dat protocol, which is similar in terms of being able to distribute the data, but also has the concept of versioning and is currently a write-once, read-many system where only one entity is able to actually update the datasets, and everyone else who is part of the peer network is able to just read what's published. So that's interesting as well from the concept of a data commons, where it's one way to ensure that somebody isn't inadvertently polluting the dataset by accidentally writing back to it.
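The common primitive underneath both IPFS and Dat is content addressing: data is identified by the hash of its bytes, so any peer can serve a chunk and any reader can verify it. Here is a from-scratch sketch of that idea; the chunk size, manifest format, and in-memory store are invented for illustration and are not the actual IPFS or Dat wire formats:

```python
import hashlib

store = {}  # address -> bytes; in a real network, chunks live on many peers

def address(data):
    """Content address: the SHA-256 hash of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()

def put(data, chunk_size=1024):
    """Chunk a blob, store each chunk by hash, return the manifest's address."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    chunk_ids = []
    for c in chunks:
        cid = address(c)
        store[cid] = c
        chunk_ids.append(cid)
    manifest = "\n".join(chunk_ids).encode()  # the manifest is itself addressed
    root = address(manifest)
    store[root] = manifest
    return root

def get(root):
    """Fetch the manifest, then reassemble the chunks it names."""
    chunk_ids = store[root].decode().split("\n")
    return b"".join(store[cid] for cid in chunk_ids)

blob = b"example dataset bytes " * 500
root = put(blob)
assert get(root) == blob  # anyone holding the root hash can fetch and verify
```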
[00:30:37] Unknown:
Yeah. I think both of those projects are fascinating. IPFS, I mean, it's a real thing: people are using it, it's up and live, and it's hosting a ton of files as we speak. And Max Ogden and his work on Dat, along with the rest of the team there, it's brilliant. These are core necessities in the world of data, for all sorts of different applications. And I think as we start layering on other sorts of primitives alongside and on top of the capabilities that they've pioneered, around privacy or computation or this or that, I just think that will unlock and enable all sorts of exciting new applications where data privacy and trust has traditionally been a bit of a hurdle, because you can now remove those hurdles. And it gets to this notion of: man, if we can have this, and it's funny, it's like this universal computer because it's decentralized, handle a lot of these compute jobs in a way where we can, in a very high trust way, contribute any kind of data, no matter how sensitive. What are the sorts of things that we can get people to do? What sorts of research projects, and what sorts of data, can we get people to contribute?
And I think we're just now seeing the early innings of that,
[00:32:08] Unknown:
with projects like IPFS, like Dat, like Filecoin, and so forth. And as we move into the near future, where these common datasets are becoming more widely available and more prevalent and more robust, I'm wondering if you have any thoughts as to the types of businesses or products that are going to become possible, whether for smaller organizations or for organizations of a larger size, simply because of the availability of that data, and some of the ways that it may impact the future trending and growth of global economies.
[00:32:48] Unknown:
Yeah. I think there are lots of things that will be enabled from a business and industry perspective, but there are two that stand out in particular to me. One is just the idea of enabling data commerce. Now you can actually create this notion of a data marketplace. Everyone talks about how data is an asset, whether they call it the new oil or whatever else, but for the most part data has been this thing that kind of gets copied and pasted, and either you're an insider and you have access or you don't.
But if we get to a point where we can actually get data to be publicly available, and for you to have a single place where you can provably show that you have ownership and sell data to people that want it, I think that notion is incredibly powerful: this notion of ascribing data ownership, and therefore being able to enforce profits and rewards for contributing that data when it's bought by certain people. I do think, for data markets, the one hurdle is that data replication, the ability to copy and paste, is still an issue, but there are actually some very practical things you can do about that. One is to just continue contributing data that is new, that always refreshes. Always refresh that data network. One kind of data that I talked about earlier is weather data. Weather data is always changing. It's streaming. It's real time. What are all the other kinds of streaming datasets that you can pipe together into this public repo?
It doesn't matter if someone copied and pasted last week's weather data, because they're still going to need this week's weather data, and they're still going to need next week's weather data. And suddenly, the idea or the notion of buying data as a service from that kind of thing is actually very viable. And then the other way to refresh data is to just continue collecting new data that wasn't there. So in the case of precision medicine, are you continuing every single week to scale by bringing in not only new genomics data from new people, but, for each person that's already part of this data network, adding new kinds of data? From genomics to phenomics to proteomics to all sorts of other health data per person. There are ways for you to continue to scale and refresh the data that's on a network to make commerce a viable thing. So that's one: data marketplaces.
Two, the second part is just what happens because you have these unified datasets. This is where, as a biopharmaceutical company, I now have this incredible public resource where I can go and access and get data that I couldn't get before, and use that to contextualize my internal drug discovery pipeline, and maybe create blockbuster drugs that I just wouldn't have been able to discover or develop before without some of this contextual data. There are tremendous applications in finance as well, and tremendous applications in other areas that I think AI is ready to unlock as soon as there's enough data, whether that's autonomous vehicles or natural language.
And those will be interesting, because those will be things that fall out in an almost indirect way: new businesses that come out, indirectly, from the fact that you will have these public data commons that let people do the research that leads to some discovery, that leads to some other discovery, that leads to some sort of product in those spaces down the line. So I think those are the two main things, and it's an incredibly exciting future.
[00:36:37] Unknown:
And I think one other aspect of the broader availability of these fundamental datasets is new businesses oriented around data integration and data enrichment, where you provide value by having your own access to these underlying datasets and then having a means by which to combine them together to create a new dataset that wasn't necessarily accessible by accessing each of them in isolation, similar to the Google Maps article, where they were able to combine their street view data with their satellite imagery to be able to provide these areas of interest, because of the machine learning applications to each of the sets in isolation then being combined via another mechanism. So I think the area of data enrichment and integration, and then selling access to those secondary and tertiary data sources, is another way that the broader access to data will create a new, economically viable model for future businesses.
[00:37:44] Unknown:
Yeah, it's fascinating. I want to tie this back to the earlier discussion about crypto networks, because if you think about crypto networks, there's a fascinating business model innovation that occurred, in the sense that when it came to Bitcoin, you had a network of people whose job is literally to mine Bitcoin. And in Ethereum and other networks, well, not yet for Ethereum, but in the future for Ethereum and other networks, you have validators stake whatever cryptocurrency is part of that network in order to do validation jobs or other forms of work on that network.
And it's fascinating to me because it spells out this new sort of business model where people can get paid for doing work that they previously would have had a hard time getting paid for, but that is extremely valuable. In the case of Bitcoin and Ethereum, that's maintaining this universal ledger that is relatively immutable, that people can build off of. And then, tying in to what you hit on when it comes to data: well, what if we had the ability to create a network of people whose job is to assemble these datasets? What if you had a network that could track contributions and, more than that, is full of people who are ready to be mobilized to crowdsource any sort of data that you want on demand? How incredibly powerful would that be? And that itself is actually an interesting new kind of economy, a new kind of business that you can get into. You could literally be a solo entrepreneur somewhere in the middle of the United States and realize, hey, I'm actually really good at getting all this data, and I'm really good at proving that this is the right data, and maybe you can make a living off of that. So we'll see. I do think that is a lofty vision I'm supremely excited about.
I do think a lot remains to be seen with how viable a lot of these token economies will be. But for me, on a rational level, it makes a lot of sense to give proper payout and compensation to some of these people who do this really valuable work, which in the traditional world kind of just gets captured in these free, open source projects, but which maybe in this new world can be captured in an open source project where they get their dues for the hard work that they put in, to enable all sorts of other projects to build on top of that open source project.
[00:40:11] Unknown:
And as a final question, I'd just like to get your perspective on what the biggest gap is in the tooling or technology that's available for data management today.
[00:40:22] Unknown:
I think it's incredibly important to remember, in this day and age where everyone talks about AI and machine learning, that proper data management is what's going to allow you to actually capitalize on great AI algorithms. And related to that, there is specifically a big gap I see now in having the necessary tooling and technology to be able to structure data in the right way for training purposes. I think that's such an incredibly important but underappreciated task that makes a lot of these models successful.
And we're starting to see some early progress there. There's this work out of Stanford around the notion of data programming that I think is fascinating. I believe the project is under this open source repo called Snorkel. So the idea is that now maybe we can programmatically take a bunch of data, which might not be clean and might be unstructured in many different ways, but have a framework to scalably go through it and figure out how to munge it and structure it in ways that are appropriate to then run through a model for training.
And so I think that's something that ought to be solved in the near future, and when it does get solved, I think it'll unlock a lot of new capabilities in the machine learning world. And there are some really smart people working on it right now, so I'm excited for that.
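To make the data programming idea concrete, here is a from-scratch sketch of its core mechanism: many noisy, heuristic labeling functions vote on each example, and the votes are combined into training labels. The task, functions, and majority-vote combiner are invented for illustration; the actual Snorkel library instead fits a generative model of each function's accuracy and correlations to denoise the votes:

```python
# Toy task: label support tickets as URGENT (1) or NOT_URGENT (0).
ABSTAIN, NOT_URGENT, URGENT = -1, 0, 1

def lf_mentions_outage(text):
    return URGENT if "outage" in text.lower() else ABSTAIN

def lf_mentions_refund(text):
    return NOT_URGENT if "refund" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return URGENT if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_outage, lf_mentions_refund, lf_many_exclamations]

def weak_label(text):
    """Majority vote over non-abstaining heuristics (ties broken arbitrarily)."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

tickets = ["Total outage in region X!!!", "How do I request a refund?"]
print([weak_label(t) for t in tickets])  # [1, 0] -- labels without hand-annotation
```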
[00:41:56] Unknown:
Yeah, I definitely agree that the area of automatic dark data extraction and enrichment of data from domain experts is a very interesting and valuable subject area, and I actually had one of the members of the Snorkel project on a past episode, so I'll add a link to that in the show notes as well. Oh, cool. So, with that, I would just like to thank you for taking the time out of your day to join me and talk about your ideas on how the future of the data economy might pan out, and some of the challenges that we need to face in the near to mid term. So thank you for that, and I hope you enjoy the rest of your evening. It was a pleasure. I had fun, and thank you very much
[00:42:45] Unknown:
for having me.
Introduction and Announcements
Guest Introduction: Roger Chen
Roger's Journey into Data Management
Understanding Data Liquidity
Foundational Data and Data Commons
Models for Data Sharing
Governance in Data Cooperatives
Decentralized Data Networks and Blockchain
Infrastructure for Data Storage and Transmission
Data Privacy and Homomorphic Encryption
Future Business Models Enabled by Data Commons
Data Integration and Enrichment
Biggest Gaps in Data Management Tooling
Closing Remarks