Summary
Data is an increasingly sought-after raw material for business in the modern economy. One of the factors driving this trend is the growth of machine learning and AI applications, which require large quantities of information to work from. As the demand for data becomes more widespread, the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O'Reilly AI conference and an investor in data-driven businesses, Roger Chen is well versed in the challenges and solutions facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- A few announcements:
- The O'Reilly AI Conference is coming up. Happening April 29th and 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
- If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
- Your host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies
Interview
- Introduction
- How did you get involved in the area of data management?
- You wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. Can you explain what you mean by that term?
- What are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?
- Can you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?
- Many companies view their data as a strategic asset and are therefore loath to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?
- What kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?
- What do you view as being the privacy implications from creating and sharing these larger pools of data inventory?
- What do you view as some of the technical challenges associated with identifying and separating shared data from the data that is specific to the business model of the organization?
- With broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?
Contact Info
- @rgrchen on Twitter
- AngelList
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Electrical Engineering
- Berkeley
- Silicon Nanophotonics
- Data Liquidity In The Age Of Inference
- Data Silos
- Example of a Data Commons Cooperative
- Google Maps Moat: An article describing how Google Maps has refined raw data to create a new product
- Genomics
- Phenomics
- ImageNet
- Open Data
- Data Brokerage
- Smart Contracts
- IPFS
- Dat Protocol
- Homomorphic Encryption
- FileCoin
- Data Programming
- Snorkel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. I've got a couple of announcements before we start the show.
The O'Reilly AI Conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective.
To save 60% off your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey, and today I'm interviewing Roger Chen about data liquidity and its impact on our future economies. So, Roger, could you start by introducing yourself? Hey. Yeah. Thanks for having me online with you. So,
[00:01:48] Unknown:
you know, I've been interested in the area of data for quite a while now, in many different forms and fashions: originally as a double-E, electrical engineer by training, then eventually as more of an applied physicist during my PhD research at Berkeley, and then subsequently as an investor in startups working within the realm of data, machine learning, and AI, and most recently in terms of starting our own company, which we're founding around a data network.
[00:02:19] Unknown:
And how did you first get involved in the area of data management?
[00:02:23] Unknown:
Yeah. It's interesting. So I'd say the first kind of foray was actually when I was a research scientist, and this is when I was studying the field of nanophotonics. In academic research, just getting access to open science data, stuff that could augment your own research, stuff that feels very non-zero-sum for everybody in a field, and kind of feeling the pains of seemingly not having enough access to that kind of data, right? That was what first opened my eyes to the new importance of data. And, you know, it was kind of a weird, windy road, too long a story to tell here, but basically I ended up in the startup ecosystem pretty quickly while I was in graduate school, and ended up at a venture capital firm working full time right after my PhD. And, sadly, as a venture capitalist you don't go around investing in silicon nanophotonics companies all too often, so that's when I pivoted myself into thinking about data in the sense of software and machine learning and AI, since a lot of the companies that were most exciting and interesting were emerging in that area. That's when I really got into the area of data management as a focus area. And it's kind of funny too, because if you think about it, I saw the same patterns and problems when it came to AI and data startups that I saw when I was a researcher myself, which is that there are just all these data silos that prevent you from making progress that should be able to be made, in the sense that you could have great algorithms, you could have great ideas, but you don't necessarily have the data to train against or use to fulfill and explore some of your ideas. And I just think that's an incredible problem to solve, and that's something that I've kind of set out to do now as well.
[00:04:18] Unknown:
And so a little while ago, I came across an essay that you wrote discussing how the increasing usage of machine learning and artificial intelligence is going to result in a demand for data that necessitates what you refer to as an increase in data liquidity and broader access to more fundamental datasets. So I'm wondering if you can discuss a bit about what you mean by the term data liquidity and some of the ideas that you were uncovering in the article in question.
[00:04:51] Unknown:
Sure. I mean, when I think about liquidity, there are really two ways I think about it. One is obviously the notion of having data that can move freely between different entities in order to unlock discovery and development and all sorts of innovation. But the other way to think of liquidity is that it's a marketplace and finance concept. Right? If you can create a marketplace where you have assets that are liquid, they tend to be healthier marketplaces where that asset gets to be shared, which leads to productivity and growth. And one thing I found really interesting about the data ecosystem, from that perspective, is that data is a world where there's a lot of illiquidity. Right? You have a lot of data silos that prevent this kind of movement of data to spur progress.
And, you know, on the one hand, I fully understand that, because in this day and age having proprietary access to data is a linchpin to maintaining a competitive edge as a business. But at the same time, I don't think that's true for all data. I think in addition to a company having proprietary data, there is plenty of data where the fragments of data by themselves are not enough to create value for any owners of those fragments, and where a data commons across a bunch of businesses can unlock value for everybody, grow the pie, if you will, for everybody. And in addition to that, having that data commons layer could actually even augment the proprietary data that some businesses might have. But in order for us to get there, I really do think it ties back to this concept of, one, mechanically, can you create liquidity in the sense that we can actually move data around and make it accessible and easily attainable? But also, do we have the ability to create the incentive layer? Right? The ability to motivate, and to bring marketplace dynamics into this data ecosystem, so that people want to trade data and exchange data.
[00:06:55] Unknown:
Yeah. One of the common discussions is around the concept of data capital, or data as the new oil, where in order to be able to benefit from the data that's being amassed, you need to be able to refine it into additional data products. And there's actually a really interesting article I read a little while ago about some of the ways that Google Maps has been able to take that refinement to create products such as all of the area-of-interest information that they have in their mapping, so I'll add a link to that in the show notes. But I'm wondering if you have any thoughts or examples of the types of data that you envision as being foundational to multiple organizations, that you would want to see consolidated into a data commons, and some of the problem domains that they might be applicable to.
[00:07:43] Unknown:
Yeah. I think there are tons, and I'll just touch on a couple here. One, I think, is health data in general. Right now there's a lot of outsized interest in the idea of personal health, or precision medicine, and part of what's driving that is the rise of genomics and the ability to do low cost sequencing. But what's interesting to me is that precision medicine really only becomes a reality if you're able to unify data. It's almost ironic: in order to have a precise understanding of a person's health disposition, you actually have to have broad human population data to train against in order to make those sorts of discoveries.
And that's one form of data commons that I think would actually unlock all sorts of new products, whether they're drugs or otherwise, or even just wellness products, as well as maybe even creating new industries. But to get there, again, you need to get to that basic foundation of having access to enough critical density of genomics data, as well as phenomics data, to make that possible. And on top of that, I still think, whether you're a life science company, a biopharmaceutical company, a biotech company, or a wellness company of any other kind, you can still have your own proprietary datasets that you build on top of those data commons, and let that data commons unlock discovery within your own dataset. You know, other things I think about beyond health care are just: what are the kinds of data that, if they were publicly crowdsourced and publicly available, would create a lot of value for everybody by helping them contextualize whatever it is that they're working on? So let's use another example: say, weather data. Weather data should not be something that's proprietary. If it were a public good and publicly shareable, it could enable a lot of industries, whether that's farming and agriculture or whatever else. And then the last thing I would put out there is even just the notion of having publicly accessible training data for AI.
And that might sound funny or a little bit idealistic, but I can give a very concrete example. I mean, if you think about what ImageNet and some of these other datasets have been able to do to help spur innovation and results in that field, then you've got to wonder: what if you could have the ImageNet for all sorts of other data types, whether it's for autonomous vehicles or natural language or whatever else? All of these sorts of datasets, if you're able to create public repos to make them available for people to train off of, I just think that can unlock tremendous progress on an academic level, on a research level, but also on an enterprise level as well.
[00:10:37] Unknown:
One of the things that I was most curious about when I was reading your article is the idea of the different types of organizational and technical structures that can be built around this idea of a data commons. One of the most obvious is the idea of open datasets, which are proliferating as the storage capacity for those datasets becomes more accessible and less expensive, and also as there becomes a greater awareness of, and push for, them. But you also call out some additional structures, such as having a sort of federated dataset. I'm wondering if you can talk about some of the different ways that these common datasets can be structured, both from the business and organizational side as well as at the technical level of how the data is actually located and accessed.
[00:11:31] Unknown:
Yeah. It's a great question. I think there are certainly a lot of different models for sharing data, and the way I break it down, there are kind of three buckets, but even within those buckets there are so many different permutations of how you share or exchange data. Largely, though, I do think there are three categories that capture it. One, there's the open data movement, where the idea is that we collect data, crowdsource it, obtain it however we do, and publicly contribute it to the Internet for public consumption. And I think it's fascinating, because that kind of behavior and sentiment is very additive and unlocks a lot of value for everyone who tries to get that data. I mean, think about how many times you've just gone on Google and snagged a free photo that people contributed, and hopefully had Creative Commons licensing, and used it for a project of your own. Right?
I think the challenge with open data is that it tends to be extremely unstructured and uncoordinated. That's fine for a lot of things, but for large scale data training projects for machine learning or AI, kind of like the genomics and precision medicine application I described earlier, having coherence, cohesiveness, and structure in large datasets really does matter. And that's where I think open data doesn't necessarily cut it. I think part of the reason why is that people who contribute to open data projects do so purely out of this desire to do good and to share.
And part of the challenge of that is that it tends to happen once, in these one-off kind of benevolent projects. But because of the lack of ongoing incentive, they tend to peter out, and they tend to remain fragmented on their own. There's another model where I think you can get much more cohesive, large scale, and sustainable data sharing happening, and that's through the form of data brokerage. The idea here is that a bunch of people will collect data that they think is valuable, and they'll use that to exchange with other datasets, to broker deals with other companies that collect data, all in pursuit of profit, because as we collect and aggregate and exchange this data, we can now resell this collective package at a higher cost. And, frankly speaking, this is how a lot of the modern ecommerce world and the modern online financial world works. You have a lot of companies who actually take your click data from browsing around the web, or take your credit data from various applications, and use that to power new sorts of applications and products for consumers.
And that works fantastically well for a lot of applications. However, the challenge is that it tends to be a little bit more of an opaque market and something that happens on the back end. That opacity means that if you're one of these data brokers trading financial data, great, but if you're not, you don't really have access to that sort of data. The third category would be the idea of what I think of as a data cooperative. The idea here is that you market publicly that this data cooperative exists, so it's not this opaque, hidden thing that you have to be an insider to access, but something that's publicly available. But in order for you to be part of this data cooperative, and to be able to get access to the data from other contributors, you yourself have to be a contributor.
And I think that model is interesting, because what it does is allow everybody to benefit from that sort of membership mentality, which gives access to not only your own data but everybody else's data as a member of that cooperative. But it does so in a way that's publicly accessible to others who might want to join. I think the challenge with the idea of data cooperatives, though, is that there's a cold start problem. If you want to publicly invite people to share data, there's a little bit of a "you first" kind of mentality.
And I think that leads to the notion of data cooperatives being hard to get off the ground in the early days. So when I think about these three different models, open data, data brokerages, and data cooperatives, they all have their virtues and benefits, but they also all have their challenges and drawbacks.
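To make the contribute-to-access rule that distinguishes a cooperative concrete, here is a minimal sketch in Python. Everything in it, the class, the membership policy, and the record format, is a hypothetical illustration of the model described above, not any real cooperative's implementation:

```python
class DataCooperative:
    """Toy data cooperative: members may only query the pooled data
    if they have contributed records of their own."""

    def __init__(self):
        self.pool = []            # the shared data commons
        self.contributions = {}   # member -> number of records contributed

    def contribute(self, member, records):
        self.pool.extend(records)
        self.contributions[member] = self.contributions.get(member, 0) + len(records)

    def query(self, member):
        if self.contributions.get(member, 0) == 0:
            raise PermissionError(f"{member} must contribute before accessing the pool")
        return list(self.pool)


coop = DataCooperative()
coop.contribute("hospital_a", [{"record": 1}, {"record": 2}])
coop.contribute("hospital_b", [{"record": 3}])
print(len(coop.query("hospital_a")))  # 3 -- members see everyone's data
# coop.query("freeloader")            # PermissionError: no contribution yet
```

A real cooperative would layer governance on top of this, metering access by contribution rather than treating it as binary, which is exactly the question raised next.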
[00:16:25] Unknown:
And the data cooperative in particular has the question of what sort of governance model you put in place: who is going to be in charge of providing the access to the data? Is there going to be any sort of metering involved between the companies, where providing a certain amount of data gets you a certain amount of usage, or can somebody just pay to be able to access the dataset without necessarily contributing additional information to it?
[00:16:55] Unknown:
Yeah, that's a really great question. I think governance is probably really at the heart of a lot of these things. Good governance, in the sense of fair rules and a fair game, is one of the best things you can do to actually encourage proper data sharing. I think we've seen governance up to now come in the form of industry consortiums, for example, and it's worked pretty well for certain applications. So, for example, the Semiconductor Research Corporation forming a consortium between different types of semiconductor research and manufacturing companies: they're thinking about the collective technology challenges they need to solve and what sort of knowledge they can share to help them all move forward, even though some of those members might be competitors, because they recognize that everybody needs to benefit from solving some technical hurdle for them all, as an industry, to continue to grow and progress.
And I think if you have very strong governing bodies, which maybe are comprised of a committee that is representative of the different members, or however else, that can work well. But we're also seeing a new kind of governance model emerge now, which is exciting and at the same time kind of scary. It's sort of an Internet-first model, and that's the idea of using cryptographic protocol networks to govern this, where you have Internet-based rules for saying what you get for what sort of contribution, and Internet-based rules for saying whether or not you've properly contributed the data that you said you have contributed.
And just to be more overt, I think what's happening with blockchains right now, and the ability to coordinate efforts across different people when it comes to data projects, is fascinating to me.
[00:18:58] Unknown:
Yeah. That's one of the interesting things I've seen as well: using the idea of blockchains and smart contracts, and the idea of decentralized authority, as a means of managing the access to federated data networks, so that you can verify that somebody who claims to have access is only gaining the access that they are entitled to and nothing more. So it in some ways prevents the sort of widespread abuse of a system, but it also can potentially, depending on how the smart contracts are implemented, introduce some sort of imbalance in terms of the level of access that's available to people, depending on their fundamental capability of being able to participate in that network, whether that's because of infrastructural issues where they're located, or technical acumen, or anything along those lines. So it's in some ways more egalitarian, but in other ways it's just another way to potentially exclude people, whether on purpose or by accident.
[00:20:12] Unknown:
Yeah. I think it's fascinating. You know, earlier I talked about how data cooperatives suffer from a cold start problem, where different organizations might wait for the other to put in first before they're willing to contribute as well. And what I think is interesting about decentralized networks is that suddenly you're taking trust in any single organization going first out of the equation. Implicitly there's still some trust in the organization that's creating the decentralized network, but really what you're saying is this protocol is going to give financial remuneration and payout to people who are willing to take that first step, and usually in a way that rewards them more than people who follow later on. And so suddenly you change the dynamic a little bit: you now give this carrot to solve this cold start problem around sharing. Because even without another participant coming in, you have some financial incentive for someone to actually upload their data into a network. So for me, it's a really interesting way to kick start a new kind of network.
And I also think it's a way to create, like you said, a much more egalitarian model, where you don't have to be some sort of branded, trusted organization; as any small organization, an unknown one, or even an individual, you can have permissionless access to be part of this data network and to profit off of it. But I do think that itself also comes with a lot of challenges. One challenge with these crypto token protocol networks is the fact that you have to be able to design something that is egalitarian yet has the flexibility to evolve over time. If you have a governing body that's a consortium of members for some sort of industry, they might be able to recognize three years in that maybe something should be recognized more, maybe they should head in a certain direction, or maybe contributions should be readjusted in certain ways. And I think for protocol networks that can be a challenge sometimes, because you're literally baking into software code how some of those things shake out. But we're seeing a lot of interesting things happen. There's a lot of innovation happening right now when it comes to crypto token protocols, and a lot of people are figuring out ways to create really market-driven, elegant designs that create that kind of egalitarian governance structure while still maintaining enough flexibility to adapt and evolve over time. So I'm just really excited about what that's going to unlock for data marketplaces, whether it's in health care or anywhere else, for that matter.
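One hedged sketch of the carrot described here: a protocol that pays out more to earlier contributors, so someone has a reason to go first. The payout curve and numbers are invented for illustration; real token protocols encode far more elaborate reward and validation rules:

```python
ledger = []  # (contributor, tokens awarded), in order of contribution

def reward(contribution_index, base=100.0, decay=0.95):
    """Earlier contributions earn more tokens, decaying geometrically."""
    return base * (decay ** contribution_index)

def contribute(contributor):
    tokens = reward(len(ledger))
    ledger.append((contributor, tokens))
    return tokens

print(contribute("alice"))  # 100.0  -- the first mover earns the most
print(contribute("bob"))    # 95.0
print(contribute("carol"))  # 90.25
```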
[00:23:11] Unknown:
And one of the other challenges associated with having these common data layers, particularly the idea of a cooperative, and even open data access, is the challenge of being able to store and transmit the data, and the infrastructure required to do so. Particularly if there are multiple organizational entities who want to be able to take advantage of these sets, it may create a fair amount of strain on the underlying hardware systems that are necessary for being able to provide that access. So I don't know if you have any thoughts on the design of the way that data is stored and distributed, and some of the economic structure that's necessary to be able to support those fundamental requirements.
[00:24:02] Unknown:
Yeah. So for storage and transmission infrastructure, I think you can sort of break that down into a spectrum. On one end of the spectrum, you have local, client-based, end-to-end transmission. This is literally a local client, as in your laptop, or, to put "local" in quotation marks here, maybe your Dropbox or your Google Drive: storage units that you have personal ownership over, doing end-to-end transfer of the data in your local store to whoever the recipient might be. So that's one end of the spectrum, and that's, I guess, more of the traditional model of how data is transmitted these days.
And on the other end of the spectrum, I think we're seeing some exciting new technologies evolving around fully distributed, decentralized storage: the idea that maybe we can take that file and, instead of storing it locally on your hard drive, chop it up into a bunch of pieces and put it out on the web for random hosts to store. And then, when the time comes to transmit it, we can resync that and deliver it to the right place for computation. I think basically what we're going to see is an evolution from the former to the latter. The reason why is that a lot of the Internet infrastructure is already in place for you to store a file in the cloud, or on your computer, and just ship it from point A to point B. That's ready to happen today.
So in terms of scalability and pragmatism, I think that kind of data storage and transmission infrastructure will be realized in some of these early data networks. However, I do think that over time, what I described around this exciting notion of decentralized storage, decentralized transmission, and even decentralized computing will have to happen for some data sharing applications, if only because data privacy for certain applications is paramount. So, for example, when it comes to health data, that's not the kind of data you want to just send out to anybody that wants access to it. That's not the kind of data you want to be replicated and disseminated across the entire web. And that's the kind of data where you would want some sort of decentralized network of storage as well as computation, so that you can run training algorithms against that kind of dataset without ever seeing the raw data. And that ties into what I'm seeing now as an exciting new trend, and I think this isn't a next-few-years trend, I think this is the next decade or two, but it's an inherently very powerful trend, and that is the notion of secure multiparty computing. The idea here is that all of the local clients that compute on the data in a multiparty computing model never actually see the full raw data, and therefore the data as a whole remains private.
So that kind of new storage and transmission infrastructure, I think, is on its way, and part of that is just the fact that we have fantastic compute resources and cloud infrastructure that are going to enable that to happen more and more over time. So I'm really curious to see, as that becomes a reality, and, to be clear here, there are a lot of technical hurdles before it can fully happen, but as that happens, I wonder what kind of interesting data sharing applications will emerge. Because now suddenly you can talk about the idea of sharing the most sensitive data and having confidence that that data will not be co-opted by someone else.
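One building block behind the secure multiparty computation mentioned above is additive secret sharing: each party splits its private value into random shares, parties aggregate shares locally, and only the final sum is ever reconstructed, so no one sees another party's raw input. A minimal sketch follows, with the three-hospital scenario and modulus chosen purely for illustration; production protocols add authenticated shares, multiplication, and defenses against malicious parties:

```python
import random

Q = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a value into n random shares that sum to the secret mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Three hospitals each hold a private patient count.
private_counts = [120, 340, 95]
all_shares = [share(v, 3) for v in private_counts]

# Party j receives one share of every value and sums them locally;
# an individual share is uniformly random and reveals nothing on its own.
partial_sums = [sum(col) % Q for col in zip(*all_shares)]

# Only the combined total is ever revealed, never any single input.
assert reconstruct(partial_sums) == sum(private_counts)  # 555
```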
[00:28:27] Unknown:
Yeah. And the whole idea of data privacy and cleaning datasets is a complicated and nuanced one, because of some of the examples that we've seen where datasets that have been ostensibly scrubbed of personally identifiable information can still be used to actually find individuals within a large dataset, even though there isn't any sort of address or name information, just because of the implicit biases or the implicit information that's fundamentally linked to the way that the data was created. And one of the interesting approaches that I've heard about in recent months and years is the idea of homomorphic encryption as a way to try to prevent some of that, where you don't actually have any direct access to the underlying data because of the way that it's encrypted, but you can still run machine learning algorithms against it, because there is enough data in aggregate to be able to actually gain some information from it. But when you want to then delve into the individual data points, there's no way to do that because of the way that it's structured.
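The episode doesn't name a specific scheme, but the Paillier cryptosystem is a classic example of additively homomorphic encryption and shows the core trick: multiplying two ciphertexts yields an encryption of the sum of their plaintexts, so an aggregator can total values it can never read. This is a from-scratch teaching sketch with toy key sizes; real systems use vetted libraries, large keys, and, for running ML over encrypted data, fully homomorphic schemes:

```python
import math
import random

def keygen(p, q):
    """Paillier keypair from two primes (toy sizes; real keys are 2048+ bits)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)   # requires Python 3.9+
    mu = pow(lam, -1, n)           # valid because we use g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """E(m) = (n+1)^m * r^n mod n^2, for random r coprime to n."""
    n2 = n * n
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    n2 = n * n
    ell = (pow(c, lam, n2) - 1) // n   # the Paillier "L" function
    return (ell * mu) % n

n, priv = keygen(1000003, 1000033)     # two small primes, for demonstration
a, b = encrypt(n, 42), encrypt(n, 58)
total = (a * b) % (n * n)              # multiply ciphertexts...
print(decrypt(n, priv, total))         # ...to add plaintexts: prints 100
```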
And then, going to your point about distributed data storage, some of the technologies that are interesting, where I'm curious to see how they pan out within the broader ecosystem, are the InterPlanetary File System, for being able to do fully decentralized data storage where everybody can take part in the network, and then the Dat protocol, which is similar in terms of being able to distribute the data, but also has the concept of versioning and is currently a write-once, read-many system where only one entity is able to actually update the datasets, and everyone else who is part of the peer network is able to just read what's published. So that's interesting as well from the concept of a data commons, where it's one way to ensure that somebody isn't inadvertently polluting the dataset by accidentally writing back to it.
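The common primitive underneath both IPFS and Dat is content addressing: data is identified by the hash of its bytes, so any peer can serve a chunk and any reader can verify it. Here is a from-scratch sketch of that idea; the chunk size, manifest format, and in-memory store are invented for illustration and are not the actual IPFS or Dat wire formats:

```python
import hashlib

store = {}  # address -> bytes; in a real network, chunks live on many peers

def address(data):
    """Content address: the SHA-256 hash of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()

def put(data, chunk_size=1024):
    """Chunk a blob, store each chunk by hash, return the manifest's address."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    chunk_ids = []
    for c in chunks:
        cid = address(c)
        store[cid] = c
        chunk_ids.append(cid)
    manifest = "\n".join(chunk_ids).encode()  # the manifest is itself addressed
    root = address(manifest)
    store[root] = manifest
    return root

def get(root):
    """Fetch the manifest, then reassemble the chunks it names."""
    chunk_ids = store[root].decode().split("\n")
    return b"".join(store[cid] for cid in chunk_ids)

blob = b"example dataset bytes " * 500
root = put(blob)
assert get(root) == blob  # anyone holding the root hash can fetch and verify
```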
[00:30:37] Unknown:
Yeah. I think both of those projects are fascinating. IPFS, I mean, it's a real thing: people are using it, it's up and live, and it's hosting a ton of files as we speak. And Max Ogden and his work on Dat, along with the rest of the team there, it's brilliant. These are core necessities in the world of data, for all sorts of different applications. And I think as we start layering on other sorts of primitives alongside and on top of the capabilities that they've pioneered, around privacy or computation or this or that, I just think that will unlock and enable all sorts of exciting new applications where data privacy and trust has traditionally been a bit of a hurdle, because you can now remove those hurdles. And it gets to this notion of: man, if we can have this, and it's funny, it's like this universal computer because it's decentralized, handle a lot of these compute jobs in a way where we can, in a very high trust way, contribute any kind of data, no matter how sensitive. What are the sorts of things that we can get people to do? What sorts of research projects, and what sorts of data, can we get people to contribute?
And I think we're just now seeing the early innings of that,
[00:32:08] Unknown:
with projects like IPFS, like Dat, like Filecoin, and so forth. And as we move into the near future, where these common datasets are becoming more widely available and more prevalent and more robust, I'm wondering if you have any thoughts as to the types of businesses or products that are going to become possible, whether for smaller organizations or for organizations of a larger size, simply because of the availability of that data, and some of the ways that it may impact the future trending and growth of global economies.
[00:32:48] Unknown:
Yeah. I think there are lots of things that will be enabled from a business and industry perspective, but there are two that stand out in particular to me. One is just the idea of enabling data commerce. Now you can actually create this notion of a data marketplace. Everyone talks about how data is an asset, whether they call it the new oil or whatever else, but for the most part data has been this thing that kind of gets copied and pasted, and either you're an insider and you have access or you don't.
But if we get to a point where we can actually get data to be publicly available, and for you to have a single place where you can provably show that you have ownership and sell data to people that want it, I think that notion is incredibly powerful: this notion of ascribing data ownership, and therefore being able to enforce profits and rewards for contributing that data when it's bought by certain people. I do think, for data markets, the one hurdle is that data replication, the ability to copy and paste, is still an issue, but there are actually some very practical things you can do about that. One is to just continue contributing data that is new, that always refreshes. Always refresh that data network. One kind of data that I talked about earlier is weather data. Weather data is always changing. It's streaming. It's real time. What are all the other kinds of streaming datasets that you can pipe together into this public repo?
It doesn't matter if someone copied and pasted last week's weather data, because they're still going to need this week's weather data, and they're still going to need next week's weather data. And suddenly, the idea or the notion of buying data as a service from that kind of thing is actually very viable. And then the other way to refresh data is to just continue collecting new data that wasn't there. So in the case of precision medicine, are you continuing every single week to scale by bringing in not only new genomics data from new people, but, for each person that's already part of this data network, adding new kinds of data? From genomics to phenomics to proteomics to all sorts of other health data per person. There are ways for you to continue to scale and refresh the data that's on a network to make commerce a viable thing. So that's one: data marketplaces.
Two, the second part is just what happens because you have these unified datasets. This is where, as a biopharmaceutical company, I now have this incredible public resource where I can go and access and get data that I couldn't get before, and use that to contextualize my internal drug discovery pipeline, and maybe create blockbuster drugs that I just wouldn't have been able to discover or develop before without some of this contextual data. There are tremendous applications in finance as well, and tremendous applications in other areas that I think AI is ready to unlock as soon as there's enough data, whether that's autonomous vehicles or natural language.
And those will be interesting, because those will be things that fall out in an almost indirect way: new businesses that come out, indirectly, from the fact that you will have these public data commons that let people do the research that leads to some discovery, that leads to some other discovery, that leads to some sort of product in those spaces down the line. So I think those are the two main things, and it's an incredibly exciting future.
[00:36:37] Unknown:
And I think one other aspect of the broader availability of these fundamental datasets is new businesses oriented around data integration and data enrichment, where you provide value by having your own access to these underlying datasets and then having a means by which to combine them together to create a new dataset that wasn't necessarily accessible by accessing each of them in isolation, similar to the Google Maps article, where they were able to combine their street view data with their satellite imagery to be able to provide these areas of interest, because of the machine learning applications to each of the sets in isolation then being combined via another mechanism. So I think the area of data enrichment and integration, and then selling access to those secondary and tertiary data sources, is another way that the broader access to data will create a new, economically viable model for future businesses.
[00:37:44] Unknown:
Yeah, it's fascinating. I want to tie this back to the earlier discussion about crypto networks, because if you think about crypto networks, there's a fascinating business model innovation that occurred, in the sense that when it came to Bitcoin, you had a network of people whose job is literally to mine Bitcoin. And in Ethereum and other networks, well, not yet for Ethereum, but in the future for Ethereum and other networks, you have validators stake whatever cryptocurrency is part of that network in order to do validation jobs or other forms of work on that network.
And it's fascinating to me because it spells out this new sort of business model where people can get paid for doing work that they previously would have had a hard time getting paid for, but that is extremely valuable. In the case of Bitcoin and Ethereum, that's maintaining this universal ledger that is relatively immutable, that people can build off of. And then, tying in to what you hit on when it comes to data: well, what if we had the ability to create a network of people whose job is to assemble these datasets? What if you had a network that could track contributions and, more than that, is full of people who are ready to be mobilized to crowdsource any sort of data that you want on demand? How incredibly powerful would that be? And that itself is actually an interesting new kind of economy, a new kind of business that you can get into. You could literally be a solo entrepreneur somewhere in the middle of the United States and realize, hey, I'm actually really good at getting all this data, and I'm really good at proving that this is the right data, and maybe you can make a living off of that. So we'll see. I do think that is a lofty vision I'm supremely excited about.
I do think a lot remains to be seen with how viable a lot of these token economies will be. But for me, on a rational level, it makes a lot of sense to give proper payout and compensation to some of these people who do this really valuable work, which in the traditional world kind of just gets captured in these free, open source projects, but which maybe in this new world can be captured in an open source project where they get their dues for the hard work that they put in, to enable all sorts of other projects to build on top of that open source project.
[00:40:11] Unknown:
And as a final question, I'd just like to get your perspective on what the biggest gap is in the tooling or technology that's available for data management today.
[00:40:22] Unknown:
I think it's incredibly important to remember, in this day and age where everyone talks about AI and machine learning, that proper data management is what's going to allow you to actually capitalize on great AI algorithms. And related to that, there is specifically a big gap I see now in having the necessary tooling and technology to be able to structure data in the right way for training purposes. I think that's such an incredibly important but underappreciated task that makes a lot of these models successful.
And we're starting to see some early progress there. There's this work out of Stanford around the notion of data programming that I think is fascinating. I believe the project is under this open source repo called Snorkel. So the idea is that now maybe we can programmatically take a bunch of data, which might not be clean and might be unstructured in many different ways, but have a framework to scalably go through it and figure out how to munge it and structure it in ways that are appropriate to then run through a model for training.
And so I think that's something that ought to be solved in the near future, and when it does get solved, I think it'll unlock a lot of new capabilities in the machine learning world. And there are some really smart people working on it right now, so I'm excited for that.
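To make the data programming idea concrete, here is a from-scratch sketch of its core mechanism: many noisy, heuristic labeling functions vote on each example, and the votes are combined into training labels. The task, functions, and majority-vote combiner are invented for illustration; the actual Snorkel library instead fits a generative model of each function's accuracy and correlations to denoise the votes:

```python
# Toy task: label support tickets as URGENT (1) or NOT_URGENT (0).
ABSTAIN, NOT_URGENT, URGENT = -1, 0, 1

def lf_mentions_outage(text):
    return URGENT if "outage" in text.lower() else ABSTAIN

def lf_mentions_refund(text):
    return NOT_URGENT if "refund" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return URGENT if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_outage, lf_mentions_refund, lf_many_exclamations]

def weak_label(text):
    """Majority vote over non-abstaining heuristics (ties broken arbitrarily)."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

tickets = ["Total outage in region X!!!", "How do I request a refund?"]
print([weak_label(t) for t in tickets])  # [1, 0] -- labels without hand-annotation
```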
[00:41:56] Unknown:
Yeah, I definitely agree that the area of automatic dark data extraction and enrichment of data from domain experts is a very interesting and valuable subject area, and I actually had one of the members of the Snorkel project on a past episode, so I'll add a link to that in the show notes as well. Oh, cool. So, with that, I would just like to thank you for taking the time out of your day to join me and talk about your ideas on how the future of the data economy might pan out, and some of the challenges that we need to face in the near to mid term. So thank you for that, and I hope you enjoy the rest of your evening. It was a pleasure. I had fun, and thank you very much
[00:42:45] Unknown:
for having me.
Introduction and Announcements
Guest Introduction: Roger Chen
Roger's Journey into Data Management
Understanding Data Liquidity
Foundational Data and Data Commons
Models for Data Sharing
Governance in Data Cooperatives
Decentralized Data Networks and Blockchain
Infrastructure for Data Storage and Transmission
Data Privacy and Homomorphic Encryption
Future Business Models Enabled by Data Commons
Data Integration and Enrichment
Biggest Gaps in Data Management Tooling
Closing Remarks