Summary
Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation?
- What is the current state of the ecosystem for data sharing protocols/practices/platforms?
- What are some of the main challenges/shortcomings that teams/organizations experience with these options?
- What are the technical capabilities that need to be present for an effective data sharing solution?
- How does that change as a function of the type of data? (e.g. tabular, image, etc.)
- What are the requirements around governance and auditability of data access that need to be addressed when sharing data?
- What are the typical boundaries along which data access requires special consideration for how the sharing is managed?
- Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform?
- What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing?
- When is Bobsled the wrong choice?
- What do you have planned for the future of data sharing?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Bobsled
- OLAP == OnLine Analytical Processing
- Cassandra
- Neo4J
- FTP == File Transfer Protocol
- S3 Access Points
- Snowflake Sharing
- BigQuery Sharing
- Databricks Delta Sharing
- DuckDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises.
And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Your host is Tobias Macey, and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing. Can you start by introducing yourself? Yeah. Hi, Tobias. I'm Andy. I'm the CTO at Bobsled.
[00:01:50] Unknown:
We're a Series A startup solving the problem of data sharing for enterprises in the cloud. And do you remember how you first got started working in data? For me, software engineering has always kind of been about moving and processing data, whether it's getting a tweet or an iMessage from my phone onto your phone, or whether it's control software in a power plant or a chemical plant that's taking input data from sensors and things, processing that data, and then creating output data that signals to control software and pumps and things like that. But I got started relatively late, during my PhD.
I was doing a PhD in chemical engineering, and I started with work on control software. And one thing I really enjoyed about that was that you were actually building software that did something very tangible and real, that worked in the real world. From that, I got into computer modeling during my PhD. And then a little later, I actually quit my PhD to join a software startup, because I found I was enjoying the software engineering more than I was enjoying the welding together of tons of stainless steel. From there, my first job was as the database administrator on a SQL Server, Microsoft SQL Server as it was then. So my very first job in software engineering was in the data realm. There was a lot of administration.
I got firsthand experience of the move from on-prem to cloud, as that was back in the day when Microsoft SQL Server on Azure was first launched. I remember testing it out and being really excited about no longer having SQL Servers that initially sat next to me and later were in a data center, although I used to go and visit them. I got the chance to see that transition into cloud and really understand the benefits of no longer having to babysit physical boxes, with the things that could go wrong with them, and doing the upgrades and all that kind of stuff. And, yeah, from there I've had what I think is a pretty exciting and fun career so far, moving through different kinds of things. From there I moved to a company doing an OLAP database built on top of Cassandra, at the time when NoSQL was a big thing. So NoSQL databases were the new hot thing, OLAP was very fashionable, and a company called Acunu was building their OLAP solution on top of Cassandra. We worked with some large ride-sharing firms and things there, so I got a chance to see some of the power of big data and things you could do at real scale as well. And from Acunu, I went to Apple iCloud. That's where I got my first experience doing data sharing, in consumer data sharing.
So at iCloud, I was working on some of the protocols and database stuff related to how you share, not just between devices, so synchronizing things like photos and updates between different Apple devices, but we also worked on the first protocols for doing sharing where you could do stuff like share photos to another iPad user. That was pretty awesome. We also did some big data processing work there, doing stuff like reference counting at scale. When you're sharing things, you need to keep track of how many people have a share of something, and if all of the people who've shared it have deleted it, then you can garbage collect it. We were doing that with very large Hadoop jobs, as that was the thing of its time. I'm sure it'd probably be Spark today.
After working at Apple, I went on to work at a company called Tractable, where we were building an AI solution, and there I was working on building data infrastructure again, this time for training deep neural networks to do computer vision. And after that, I worked at Neo4j, which is a graph database. So my career has spanned quite a range of databases and data technologies. At Neo4j, I worked on both the database-as-a-service product, so managing Neo4j clusters as a service for users, and on the clustering algorithms, doing things like the Raft implementation and working on scaling and distributed systems algorithms for Neo4j Aura, so you could scale up your clusters to thousands of nodes if you wanted to do big data graph processing. So, yeah, quite a range. And it's actually at Neo4j where I met Jake, who's my cofounder at Bobsled.
[00:06:14] Unknown:
And now for the context of this conversation, I'm wondering if you can start by giving some scope and framing around what we mean when we say data sharing, because that can mean a broad variety of things, so that we have the proper framing for what we wanna discuss during the rest of this conversation.
[00:06:35] Unknown:
I think we've been sharing data for years. As I talked about, consumers have been sharing data for a long time, and you can even think about something like a tweet as a form of data sharing: you write some data and then you share it with the world on Twitter. And businesses have been doing this for years. You can go way back to businesses that used to post sets of data around on CDs. Thinking back to my very first role, I was in the UK, in London, and we used to get a CD from a company like the post office, which in the US would be like USPS, which had all the ZIP codes and the mapping of all the ZIP codes into addresses and regions. Right? That used to be something that people provided on a CD. You signed up, you paid money, and it got delivered to you in the mail. So data sharing between organizations has been going on for years in lots of ways, from CDs in the mail through to APIs and different kinds of cloud-based sharing techniques. A lot of people use APIs as a way of sharing data; invoices are a common example. I have data,
and if you call my API, I tell you about some of the data I have. We're sharing that data and then you're doing something with it. For this conversation, we're concerned with data sharing between businesses, and data that's being shared for use in analytics, so think OLAP rather than OLTP. We're thinking about cases where you have a fairly large amount of data you're sharing with someone else so that they can use that data in their analytics. The typical usage involves things like joining that data to other data that the recipient has and then doing analytics on that. It's pretty rare that someone just says, hey, give me some data that I can analyze in isolation. You can think of some things like in the financial world, where maybe you say, give me all of the stock ticker data, and I'm just gonna analyze it in isolation and try to use it to make predictions about what stock prices will be. But in reality, even that use case is quite rare. In most other use cases, you're saying, let's share data between our two enterprises, and it might be one way or two way, but the recipient is then gonna join that data with some stuff they have or use it in their own applications. But it is a broad scope.
Really, it's any analytics data moving between organizations.
[00:08:56] Unknown:
And so given that context of: I, at organization A, want to be able to send data to organization B, or I need to be able to request data from organization B to use for purposes of some sort of partnership agreement or whatever the case might be. What is the current state of the art and state of the ecosystem for being able to enable data sharing across organizational boundaries, whether that is separate businesses or just different business units within an enterprise, and some of the complexities that arise because of that current state of the ecosystem?
[00:09:29] Unknown:
Yeah, it is really quite broad. We do see a lot of internal organization data sharing as well. We speak to a number of people who have problems, particularly in larger organizations, either with geography or when they've done things like acquire a few business units who have different platforms and things. So it's a broad class. But what's the current state of the art here? I started off talking about sending CDs in the mail. The follow-on technology from that really is FTP, SFTP, and sharing CSVs. And we see this is actually the dominant mechanism today where data is shared between organizations.
So someone maintains an FTP server, they put CSV files on it, and there's usually some dance involving sharing RSA keys so that you can connect over SSH to someone's FTP server and read the data off it. And FTP can operate in a push or a pull orientation, so you could push data onto my FTP server, or I could pull data from your FTP server. That's really the dominant thing, and I think some of the shortcomings of it are kind of obvious, particularly as businesses move into cloud-native kinds of environments, but it's what people know.
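To make that dominant pattern concrete, here is a minimal sketch of the SFTP pull flow described above, using the paramiko library. The host, username, key path, and remote file path are hypothetical stand-ins, not anything from the conversation.

```python
import csv
import io

import paramiko


def pull_csv_over_sftp(host: str, username: str, key_path: str, remote_path: str) -> list[dict]:
    """Authenticate with an SSH key and read a shared CSV off the provider's server."""
    key = paramiko.RSAKey.from_private_key_file(key_path)
    transport = paramiko.Transport((host, 22))
    transport.connect(username=username, pkey=key)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        with sftp.open(remote_path, "r") as handle:
            text = handle.read().decode("utf-8")
    finally:
        sftp.close()
        transport.close()
    return list(csv.DictReader(io.StringIO(text)))


# Example call (hypothetical host and paths):
# rows = pull_csv_over_sftp("sftp.provider.example", "consumer",
#                           "/home/me/.ssh/id_rsa", "/exports/zipcodes.csv")
```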
The follow-on from that is really kind of data APIs. This, I think, was CSV via HTTP; it's not radically different from CSV on FTP, but there are various data APIs out there, particularly common with things like SaaS businesses. You make an API call, often with some query parameters, which is just necessary because you can't pack that much data into a single HTTP response. So you say, hey, give me some data, here are some parameters that scope it down to a reasonably small chunk, and I make the API call and you return that data. It's often, again, ultimately CSV or JSON formatted. In some scenarios you make an API call and you get back a Parquet file or something, but that's less common. In terms of limitations, bulk data transfer is just really not something that HTTP was built for, and there is a cottage industry of home-built tools that people have for scraping these APIs and then reconstructing complete tables via lots and lots of queries, things like that. That was really, I think, a case of people saying, we have a hammer: the hammer that people had is a REST API that serves JSON data, and they kinda just took that hammer and applied it to sharing analytics data.
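Here is a hedged sketch of that "scrape the data API" pattern: page through a hypothetical JSON endpoint and reconstruct a complete table locally. The URL, parameter names, and response shape are all assumptions for illustration.

```python
import requests


def fetch_full_table(base_url: str, page_size: int = 1000) -> list[dict]:
    """Page through the endpoint until a short page signals the end."""
    rows, offset = [], 0
    while True:
        resp = requests.get(
            base_url,
            params={"limit": page_size, "offset": offset},  # assumed pagination scheme
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()  # assumed: the endpoint returns a JSON array of records
        rows.extend(batch)
        if len(batch) < page_size:
            break
        offset += page_size
    return rows


# table = fetch_full_table("https://api.provider.example/v1/trades")
```

Reconstructing a large table this way means one HTTP round trip per page, which is exactly the inefficiency described above.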
Then the major state of the art at the moment is connectors. There are companies like Fivetran and Stitch and others who provide connectors, either as a service, or as software or open source software that you can run yourself. And that broadly helps you, as a consumer of data, to pull in data that is shared with you from a range of different sources. The connectors can connect to data APIs, they can connect to things like FTP, and they're based on a kind of pull principle. So the consumer of the data takes responsibility, and they use a connector for getting hold of the data that's been shared to them. They're moving the bytes using a connector and then putting the data somewhere, whether that's file storage or a data warehouse.
And connectors, in an ecosystem where there's lots of data sharing, are inherently quite resource inefficient, because every consumer has their own connector running and their own copy of the data. So there's inefficiency, there's latency, and there's a lot of duplicated compute: lots of different people make the same API request for the same data into different places. It puts a responsibility on the consumer of the data to operate and maintain and run the system. And then there is in-place, cloud-native sharing. Pretty much every major data platform or cloud platform offers that today, whether it's something like S3, which has a feature called access points that is particularly designed for sharing data between S3 buckets, whether it's Snowflake sharing, BigQuery sharing through Analytics Hub, Databricks with Delta Sharing, or Azure with Azure Data Share. All the platforms offer these things, and they do what we call in-place sharing. The key thing of in-place sharing is the data isn't duplicated.
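As a concrete illustration of in-place sharing, here is a minimal sketch of the provider-side Snowflake flow: create a share over an existing table and grant it to a consumer account. The database, schema, table, and account names are made up for the example.

```python
import snowflake.connector

# Provider-side DDL; object and account names are illustrative.
statements = [
    "CREATE SHARE IF NOT EXISTS trades_share",
    "GRANT USAGE ON DATABASE analytics TO SHARE trades_share",
    "GRANT USAGE ON SCHEMA analytics.public TO SHARE trades_share",
    "GRANT SELECT ON TABLE analytics.public.trades TO SHARE trades_share",
    # The consumer mounts the share in their own account; no bytes are copied.
    "ALTER SHARE trades_share ADD ACCOUNTS = consumer_org.consumer_account",
]

conn = snowflake.connector.connect(
    account="provider_account", user="provider_user", password="..."
)
try:
    cur = conn.cursor()
    for stmt in statements:
        cur.execute(stmt)
finally:
    conn.close()
```

Note that nothing here moves data: the grants just make the existing table visible to the consumer account, which is the efficiency point made below.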
So you immediately get a huge efficiency bonus, particularly as you start to share the same data, or the same parts of the data, to different users. And on the data warehouses, they'll allow you to share the data you have exactly as you see it. So you can share tables, and you can share not just the data but things like the foreign key constraints and the indexes and the views and all that stuff. So it's really rich, as well as more efficient and more straightforward. When we talk through all these things, what it shows is that data sharing isn't just a purely technical concern. A huge amount is tied up in the kind of business sociotechnical arrangement.
Right? When we share data, who takes responsibility for what? Who takes responsibility for paying the compute costs? Who takes responsibility for maintaining the structure of the data and the indexes and the foreign key constraints and things like that? That's all tied up in an approach. When I talk about connectors, there's an implicit expectation, in the pull-based thing, that the consumer of the data will be paying for a lot of the compute and stuff that happens. If we talk about push FTP, that's a reversed expectation of who has responsibility for things, because the consumer maintains an FTP server but the provider pushes the data into it. And that's also true with in-place sharing. We think that in-place sharing provides some of the best splits of these responsibilities.
Right? It allows things like: the person who's analyzing and running compute on the data pays for the compute, but the person who's providing the data generally pays for things like the storage. And it works incredibly well for people in terms of ease of use, because we have basically eliminated ETL. Right? By just saying: here's my table in Snowflake, I'd like to share it to you. That's it. There are no ETL pipelines, no compute, or anything else. You can just start analyzing straight away. So, yeah, that's really the current state of the art.
[00:15:44] Unknown:
And in terms of those sociotechnical elements of data sharing and the methods and motivations behind it, one of the other complexities also comes into the compliance question, where as the providing organization, I need to make sure that I am eliding or masking certain pieces of information because I can't share it externally, or I need to ensure that there are appropriate controls on that data as it is being shared so that it is not accessible by some man in the middle or a third party that is not supposed to be involved in this sharing. And then there are also questions of public data sharing, where I, as an organization, want to be able to create and publish a public dataset that anybody can use, but I don't wanna have to pay millions of dollars because somebody else is using all of my compute to do analysis. And I'm wondering if you can talk to some of the ways that those considerations factor into when and how businesses decide that they actually want to engage in these data sharing agreements, and some of the ways that those considerations will maybe prevent what would otherwise be an amenable relationship.
[00:16:52] Unknown:
Yeah. So there are a couple of things in there. The compliance and privacy and sensitivity management has some strong technical aspects. It's also good to think about the breadth that we see in data sharing arrangements. We see a real range, from things like supplier-consumer relationships in manufacturing, where the kind of data being shared is stuff like how much stock a manufacturer who's providing parts to an assembler has. We see this in the automotive industry, where the automotive buyer has huge power, and they actually have arrangements with their providers where they say, hey, you've got to tell us how much stock you have and how many parts you have on the shelf, so we can manage our supply chain risk and things like that. That's sensitive commercial information that they wanna keep between themselves, but it's also not subject to the kind of compliance that you might see at the other end of the scale, when you get into things like healthcare data that's covered by HIPAA, say a health insurance company in the United States that wants to share data with a pharmacy company or a hospital organization.
And that's a very different concern. With healthcare data, you start to have concerns not just about what data is accessible, but also how access to that data is audited and tracked. And in Europe, you also have things like right to be forgotten, where maybe you don't just need to track who's accessed data, but you need to have a way to redact data. On the technical side of these requirements, in-place sharing helps with quite a lot of them, because the cloud platforms provide capabilities that allow you to do things like audit who's read the data. And if you remove data from an in-place share, you know it's gone, whereas if someone's copied your CSV files over to somewhere else, then you absolutely need to have more process in place. But a lot of this comes down to the relationships, contractually, between organizations as well. Organizations have to make sure they've got the right legal things and the right compliance in place before they can do data sharing, based on their industry and the kind of data they wanna share. Does that answer the question?
[00:19:12] Unknown:
Yeah. And digging more into the mechanical aspects of data sharing, as you mentioned, there are a few different ways to think about it. One is: I have this data, I am going to extract it from the system that I use to maintain it, I'm gonna push it into some other system, whether that's S3 or FTP, and you can take it and do whatever it is you want with it; I have no more visibility or control over that data. Versus, on the other end of the spectrum, you have the Snowflake and BigQuery approach of: I have this table, I'm going to make it available to you, and as long as you have an account with that same provider, you can query it, do whatever you want, and I have some level of visibility into how it's being utilized. But I also still don't maintain control over it once you use it, because maybe you're extracting it elsewhere. And I'm wondering if you can talk to what are maybe some of the shortcomings of even that more sophisticated approach of sharing the entire table and its context and history, and some of the technical capabilities that need to be present for the data sharing solution to be effective, whatever effective might mean given the context.
[00:20:17] Unknown:
Yeah. You mean particularly around sensitive data and things of that nature. It is a business area as much as a technical one, so some of it comes down to just the contractual agreements. As you say, there is a level of trust and legal enforcement in place, where you just have to agree not to do certain things, because outside of getting into data clean rooms and differential privacy, there's not a lot you can do to technically prevent people from extracting data from systems like Snowflake if they've got access to a share. Some of the things that you can do technically that are interesting with the cloud: you can make use of things like views. This is reasonably common and straightforward. You can have views that restrict what is then shared, so you share the view rather than the underlying data. And using in-place sharing means you can do a lot more of that than you can with older techniques, because the data doesn't need to be duplicated for every view. When you do sharing with something like an extracted CSV file, if you wanna share different views of the data, you have to extract all the different possible combinations into different sets of CSV files for different consumers.
And that obviously means it uses a lot of resources and compute and storage and so on, whereas in Snowflake or Databricks or BigQuery, you can create a view that is exactly the data that needs to be seen. And you can even use that to apply things like obfuscation, or some of the kind of differential privacy you might do when you exchange things like tokens. So one of the things that we see some people do is two-way sharing, where I share with you an obfuscated token that allows you to identify data that you have that you would then share back to me. You join on the obfuscated token, and then you share back to me only the rows that match that obfuscated token. It gets a little technical, but it means that where we have data related to the same things, we can ensure that we share the join of those data without necessarily sharing the details of what we know about those individuals.
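To unpack that, here is a hedged sketch of the obfuscated-token exchange: both sides derive the same opaque token from a shared identifier with a keyed hash, so they can join on it without revealing the raw identifier. The shared secret and field names are illustrative assumptions.

```python
import hashlib
import hmac

SHARED_SECRET = b"agreed-out-of-band"  # assumption: exchanged under the contract


def tokenize(identifier: str) -> str:
    """Derive an opaque, deterministic token from a raw identifier."""
    return hmac.new(SHARED_SECRET, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Provider shares tokens, not raw emails.
provider_tokens = {tokenize("alice@example.com"), tokenize("bob@example.com")}

# Consumer joins its own rows on the token and shares back only the matches.
consumer_rows = [
    {"email": "alice@example.com", "spend": 120.0},
    {"email": "carol@example.com", "spend": 80.0},
]
matched = [
    {"token": tokenize(r["email"]), "spend": r["spend"]}
    for r in consumer_rows
    if tokenize(r["email"]) in provider_tokens
]
# Only rows matching the shared tokens flow back; neither side exposes
# the rest of what it knows about individuals.
```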
So there are some things that can be done there. But at the limit, you get into clean rooms. Right? Once you get beyond your confidence to operate with another business based on a contractual agreement and a legal framework that's in place, you get into the realm of clean rooms, which are fully controlled environments, often maintained by the provider of the data. And a clean room solution is a little bit different from most of what we do, where you're actually saying: we've set up this environment, you log into it, and you have very controlled access over what you can do in that environment and whether you can extract data out of it.
[00:22:51] Unknown:
And in terms of the work that you're doing at Bobsled and some of the specific problem areas that you're trying to solve for, what is the unique set of capabilities that you're enabling that aren't present in these other platforms, or some of the ways that you're approaching the problem that is maybe vendor agnostic or removes the constraint of everybody having to use the same technology platform?
[00:23:20] Unknown:
Yeah. Effectively, the largest problem, challenge, and shortcoming that people face using in-place, cloud-native sharing is the requirement that the recipient who's receiving data and the person sharing it both have to be using the same cloud platform, and in almost all cases the same region as well. We talk about things like Snowflake, who are a real leader here, but Snowflake sharing is only truly straightforward if we're both in the same Snowflake region on the same platform, so, you know, we're both in us-east-1 on AWS. If you're using Snowflake on GCP in EU Central, it is possible, but you then have to do database replication with Snowflake, and it's no longer "I want to share this table to you"; it's actually a whole process where we have to replicate the database and then do a share in the other region. So even within the same platform there are challenges, but the major challenge and shortcoming of cloud-based in-place sharing is that we have to agree on a platform that we're gonna use, and that's extremely difficult in practice. It's almost never practical, if you're the person providing an analytical dataset, to relocate your data operation into another cloud. Data is often the result of a whole process of collecting and storing that data; it's tied up with other things and assets you have in place. So you don't just relocate your data onto Azure because you've got someone who wants it on Azure. And it's almost never particularly practical for a recipient of data to relocate their usage either. As we said, it's very rare for data to be used in isolation. You want to join it with other data you have and feed it into your existing processes, whether they're analytical or transactional.
And so you're not gonna relocate your application to a different platform just to receive some data that's being shared to you. In this kind of many-to-many environment, the cardinality is really high, particularly when you take the regions into account. You could be on Snowflake, you could be on Databricks, you could be on BigQuery, and then you could easily be in different regions on the same platform. There's this huge problem that just isn't solved for you unless someone makes a move to a different cloud platform. And that's one of the massive things that we're solving for in Bobsled. Our aim is to provide that really simple, straightforward experience.
You say: I want to share these specific views or tables, or this specific data from my object storage, to this person. And with Bobsled, you say where you want to share it to. So you can say, I wanna share it to Databricks, I wanna share it to BigQuery, I wanna share it to Azure Blob Storage. What the recipient experiences with Bobsled is just that same straightforward share, in the cloud-native way of the platform that they're on. And what the provider experiences is that we either access their data directly or we access their data via a simple share. And we solve the problem of how the data moves from one place to the other,
and how we maintain efficient sharing if you're sharing to multiple people in the same destination and region, without doing things like replicating all of your data. So, yeah, this allows people to maintain that kind of shift-left simplicity: taking responsibility for structuring their data and making it analytics-ready and usable, with the ability to straightforwardly share it to someone else, without anyone having to think about all the ETL and so on that's involved. That's what Bobsled does under the hood. And then another
[00:26:53] Unknown:
challenge to this data sharing question that we touched on a little bit is that question of auditability and governance when you are sending data to another entity, because at a certain point there's no way for you to maintain control anymore: once somebody has access to the data, even if you want them to be analyzing it in situ, there's always the possibility that they're going to extract it and do some other thing with it. And I'm wondering how that factors into the ways that the sociotechnical aspect comes into play with some of the sharing agreements, and some of the regulation and compliance aspects of doing data sharing, particularly when you're dealing with something like healthcare data and you're maybe a medical provider sharing patient data with a medical researcher for being able to develop some new sort of therapy, etcetera, and some of the ways that the sharing protocol maybe can and should incorporate that audit and access control and governance.
[00:27:56] Unknown:
A lot of it comes down to the agreements, but what the protocol can do is help people be very clear and have a shared understanding of the agreement. So, for example, with something like right to be forgotten, we can help people to standardize on the way that they communicate things that need to be deleted. If we are both signed up and compliant with the European data processing rules around that, and you're a subprocessor for my data, then if I pass on a right-to-be-forgotten request to you, you need to process that and delete the relevant data. And if that's contracted between us, we get into the protocol level.
You get into practical questions of, well, okay, how do we do that? How do we communicate the information to you so that we can be reasonably confident that we've given it to you, that you know what to do with it, and that you then process the deletes? And there are some interesting challenges with that particular thing: how do you keep track of the fact that you have deleted something, and of what it was you deleted? You have to be able to say, we know that we did delete this and can prove we deleted it, but we don't actually have the data, because we deleted it. So one thing we can do is help people to standardize around how they communicate and share that data, and whether that's something people also want to expose via an API. Maybe we don't just want "here are the token identifiers you need to delete, in a table"; maybe there's also an API call you can make to trigger the processing. And we're also about sharing being richer than just the data.
So some of the things that you can share on a data warehouse platform are things like user-defined functions, stored procedures, native apps, and things like that. There you can help people to share a function that can do things like carry out a delete of a particular entity, and you can ask: did you run this function and generate the output? And two-way sharing can be something that people use as part of a compliance process, to say: can you share back to us either data or something that's computed over the data, like a checksum? Can you provide us some kind of receipt, by sharing data back, that shows you've carried out certain actions? And that's somewhere we can really help, because providing the ability to go from Snowflake to Databricks, and from Databricks back again to Snowflake, means that each person can operate where they're at, where they have the confidence and the expertise, but you can still provide a flow that produces those receipts.
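Here is a hedged sketch of that deletion-receipt idea: the subprocessor deletes the requested rows, then shares back a verifiable receipt (one-way hashes of what was deleted, plus a checksum over the set) without retaining the underlying data. All names and the receipt format are invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def process_forget_request(store: dict[str, dict], identifiers: list[str]) -> dict:
    """Delete the requested records and build a receipt to share back."""
    deleted_hashes = []
    for ident in identifiers:
        if store.pop(ident, None) is not None:
            # Keep only a one-way hash: evidence we acted on this identifier,
            # without keeping the identifier or the row itself.
            deleted_hashes.append(hashlib.sha256(ident.encode()).hexdigest())
    receipt = {
        "deleted": sorted(deleted_hashes),
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }
    # Checksum over the receipt body lets the provider verify integrity.
    receipt["checksum"] = hashlib.sha256(
        json.dumps(receipt, sort_keys=True).encode()
    ).hexdigest()
    return receipt  # this is what gets shared back to the provider


store = {"user-42": {"name": "Alice"}, "user-7": {"name": "Bob"}}
print(process_forget_request(store, ["user-42"]))
```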
We also do abstractions over things like the telemetry and the audits. So if you're sharing data to someone else, you can go into Bobsled and get your audit logs of the access to that data, as far as it's available depending on the destination platform, but you get it as a single Bobsled abstraction. You don't have to be spelunking the audit logs of four or five different platforms if you need to verify some question around who's accessed the data. Part of our plans, although it's not yet something we do, is communicating governance rules and requirements. As I just talked about, one thing is that you have to agree contractually, and say we're gonna abide by some rules around how we're gonna process this data. And if you're subject to something like HIPAA, I will know that you get audited. Right? We're ISO 27001 and SOC 2, and we're working towards HIPAA compliance. So if I'm a HIPAA-compliant organization and you're HIPAA compliant, although I can't necessarily audit you directly and say, look at what you've done, I can have confidence that you are audited. One thing that we're looking at is providing a way for people to say what the governance requirements are and have that clearly passed along with the data, so that the recipients of the data can clearly see the governance requirements and attest that, yes, we meet those requirements, helping to make that part of the data sharing protocol and tying it up with that business sociotechnical system.
[00:32:01] Unknown:
And at what point do point-to-point connections for data sharing reach their limitation, so that you need to step into the situation of having a data brokerage for escrowing certain datasets that multiple organizations need to be able to have access to? And what are some of the ways that the data sharing protocols can maybe also help to reduce the friction of populating those datasets and consuming those datasets?
[00:32:28] Unknown:
Yeah. So one of the things we do with Bobsled is combine, hopefully, the best of in-place sharing with doing some of the work of moving data around and achieving efficiency. I talked about how, when you have a kind of shift-right approach, every consumer of the data has their own copy of the data and their own compute doing ETL and so on. When we do a data share from one platform to another, some ETL has to happen. Obviously, in-place sharing requires the data to be in place: if you want it in-place shared in Google Cloud, it's gotta be in Google Cloud; if you want it in-place shared in Snowflake, it's gotta be in place in Snowflake. If you want both at once, then you have to have two copies of the data. What we can do is make sure that if you're sharing the same data to, say, ten people in AWS us-east-1, there is only one copy of the data in AWS us-east-1, and all of those ten people are consuming a view on that data.
So we can ensure that you're getting the best possible efficiency out of what's being shared. And as we think about things like access as well, we can provide a very simple way for people to do things like revoke access. As you start to think about the challenges that people face trying to achieve data sharing into multiple clouds and multiple platforms: if you wanna say, okay, we wanna revoke access from someone now, the work involved can be quite significant. You have to go into each platform where the data might be used and individually figure out how to revoke access in that particular platform for that particular person.
So with Bobsled, you can just go into Bobsled, say revoke access, and we'll make sure that that particular person's access is revoked. And where you've got multiple people sharing from the same dataset, say ten consumers in one particular region, we can handle things like the garbage collection and reference counting. That is, we're maintaining that data in that location until it's not being used by anyone in that location, and then it no longer needs to be maintained and updated. And there are obviously a number of technical challenges in terms of orchestration, permissions management, and things like that for us to do there.
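To illustrate the reference-counting idea, here is a minimal sketch, under my own assumptions about the bookkeeping: keep one materialized copy of a dataset per cloud and region, track its consumers, and garbage-collect the copy when the last consumer is revoked.

```python
from collections import defaultdict


class SharedCopyRegistry:
    """Tracks consumers of each regional copy of a shared dataset."""

    def __init__(self) -> None:
        self._consumers: dict[tuple[str, str, str], set[str]] = defaultdict(set)

    def grant(self, dataset: str, cloud: str, region: str, consumer: str) -> bool:
        key = (dataset, cloud, region)
        first = not self._consumers[key]  # first consumer triggers materialization
        self._consumers[key].add(consumer)
        return first

    def revoke(self, dataset: str, cloud: str, region: str, consumer: str) -> bool:
        key = (dataset, cloud, region)
        self._consumers[key].discard(consumer)
        return not self._consumers[key]  # True -> safe to delete the regional copy


registry = SharedCopyRegistry()
registry.grant("trades", "aws", "us-east-1", "acme")    # materialize one copy
registry.grant("trades", "aws", "us-east-1", "globex")  # reuse the existing copy
registry.revoke("trades", "aws", "us-east-1", "acme")
print(registry.revoke("trades", "aws", "us-east-1", "globex"))  # True: garbage collect
```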
You mentioned escrow as well. There are some different understandings of that, and it is a use case that we've talked to various people about. At the moment, one of the ways we can help there is ensuring that data is available in different places and synchronized. But often, if you want to escrow data against certain conditions, the best thing to do is to use cryptography and then manage the cryptography keys. So we can share the encrypted data between a bunch of places. The place where I've experienced this personally is around source code escrow. When you're a startup working with some enterprises, they'll say: if you go bankrupt, or go out of business, or stop the service in some way, we'd like the possibility of continuing to operate the service, so you need to put your source code in escrow. There are some companies who provide that kind of service, but you can also do the sort of DIY thing we've been discussing: if you want a large amount of data in escrow, you can encrypt it, and Bobsled can help you move that encrypted data around.
And then the escrow process can just focus on the keys. A key is a small piece of data, and you can work with your law firm or accountancy or one of these people who provide that service to say they'll hold the keys in escrow, and then everyone just has the encrypted data.
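Here is a hedged sketch of that DIY escrow pattern using the `cryptography` package: encrypt the large dataset once, distribute the ciphertext freely, and place only the small key with the escrow agent. File names are illustrative, and Fernet reads the whole file into memory, so a real implementation of large-file escrow would chunk or stream.

```python
from cryptography.fernet import Fernet

# Provider side: encrypt the dataset once.
key = Fernet.generate_key()  # this small secret is what goes to the escrow agent
with open("dataset.parquet", "rb") as f:
    ciphertext = Fernet(key).encrypt(f.read())
with open("dataset.parquet.enc", "wb") as f:
    f.write(ciphertext)

# Anyone can hold dataset.parquet.enc; it is useless without the key.
# On a contractual escrow-release event, the agent hands over `key`:
plaintext = Fernet(key).decrypt(ciphertext)
```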
[00:35:53] Unknown:
In terms of the boundaries that you're crossing with these data transfers, the technical arrangements and the organizational arrangements, what are some of the typical situations in which you encounter those types of boundaries and the ways that they are defined and delineated? I imagine most of that is just purely organizational, but what are the cases where technical requirements actually necessitate these data transfer systems versus just being able to do direct integration between them?
[00:36:28] Unknown:
Yeah. We see the within-an-organization boundary as well as the between-organizations boundary that you talked about before. Sometimes it can be different regions within an organization: for a large enterprise, you might have the UK office on one system and the South African office on another system. And sometimes that can be necessitated by things like regional rules or regional availability of services. Another thing we can sometimes see driving technical requirements is AI processing availability and stuff like that. So people's choices about where they want to analyze their data may not just be driven by the myriad reasons that you choose a cloud, but might be related to specific AI or blockchain technical requirements. If you want to use certain OpenAI things, you may need to be on a Microsoft platform.
And if you wanna use certain blockchain systems, you may be better on another platform or in another location. The other thing that drives a lot of regional concerns is compliance and the rules around that: you want to keep data within certain geographical boundaries. One of the things we allow people to do is control which regions and platforms they allow data to be shared to. So you can have data on Bobsled, and Bobsled could make that data shareable to any region and any cloud-native platform, or you can limit it down and say, this data is only allowed to be shared within the EU: it could go to any platform, but only to the regions of those platforms that are EU based. So we see boundaries that are geographical and regulatory. There are also regulatory boundaries in the cloud, as you have GovCloud services and in some cases healthcare clouds; Snowflake has a separate Snowflake cloud for healthcare data. So there are obviously some compliance boundaries there.
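To make the destination-policy idea concrete, here is a minimal sketch of an EU-only rule like the one described: before fulfilling a share, check the requested destination against the dataset's allowed clouds and regions. The region lists and dataset names are illustrative, not complete.

```python
# Illustrative, incomplete list of EU regions across the major clouds.
EU_REGIONS = {
    ("aws", "eu-central-1"), ("aws", "eu-west-1"),
    ("gcp", "europe-west1"), ("azure", "westeurope"),
}

POLICIES = {
    "patient_stats": {"allowed": EU_REGIONS},  # EU-only dataset
    "zip_codes": {"allowed": None},            # None means shareable anywhere
}


def destination_allowed(dataset: str, cloud: str, region: str) -> bool:
    """Return True if this dataset may be shared to the requested destination."""
    allowed = POLICIES[dataset]["allowed"]
    return allowed is None or (cloud, region) in allowed


assert destination_allowed("patient_stats", "gcp", "europe-west1")
assert not destination_allowed("patient_stats", "aws", "us-east-1")
```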
Trying to think through what other things we've seen that might come into this: there are also non-cloud-to-cloud boundaries. This is something that we're aware of and keep in mind for the future, where you have a real asymmetry between organizations. You maybe have quite a small organization, with limited capacity to do sophisticated things, working with a very large organization that wants to do, or can do, very sophisticated things. That creates a kind of technical boundary around what kinds of solutions might be used. Your small organization might wanna be using something like Google Sheets. It's not something we support today, but it is something BigQuery can do, to say: I'd like to share from a Google Sheet into BigQuery, and then from BigQuery onwards into anywhere. So, yeah, we can see those kinds of things, where someone says, I wanna go from a really different kind of system. We've talked about the big cloud-based warehouses and stuff, but someone says, I wanna send data from a very different system into one of those cloud-based warehouses.
[00:39:27] Unknown:
And in your experience of working in this space of data sharing and the sociotechnical aspects that come into play, what are some of the most interesting or innovative or unexpected applications of that protocol and capability that you've seen?
[00:39:43] Unknown:
A really interesting thing we've seen from customers is auto-fulfillment from a CRM. We allow driving all of this through a single API, so you can call the Bobsled API and set up a share or a transfer or make a change. And we've seen really innovative stuff from customers where they're directly connecting a CRM into Bobsled. So you do some activities in your system, in something like Salesforce, and Salesforce can send information directly to Bobsled; Bobsled can do a share; Bobsled can make webhook notifications back to your CRM. And you can actually achieve auto-fulfillment, with a salesperson or account manager using the system that they're familiar with, having a data share be created in action, and actually getting updates back in their CRM, without the person leaving the CRM and without the company really building bespoke software. They don't have to run some separate software and separate servers; they're able to build it into the extensibility of a platform like Salesforce, which is really cool to see. People are able to get these things up and running without having to build their own server and run a major development process.
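A hedged sketch of that auto-fulfillment flow follows. The endpoint path, payload fields, and webhook shape here are hypothetical, invented for illustration; only the overall pattern (CRM event triggers an API call, status flows back via webhook) comes from the conversation.

```python
import requests


def on_deal_closed(crm_event: dict) -> None:
    """Called by the CRM (e.g. a Salesforce flow) when a data deal closes."""
    resp = requests.post(
        "https://api.bobsled.example/v1/shares",  # hypothetical endpoint
        json={
            "dataset": crm_event["product_id"],
            "destination_platform": crm_event["customer_platform"],  # e.g. "snowflake"
            "destination_region": crm_event["customer_region"],
            # Fulfillment status flows back into the CRM via this webhook:
            "webhook_url": "https://crm.example/hooks/share-status",
        },
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
```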
Another thing that I think is very cool, that we do internally, is Bobsled-to-Bobsled. We can send data from one place in Bobsled to another place, and then use that destination as a source for further onward Bobsledding. We use that internally, including for things like sharing data back to our customers about things like their usage. So if you want to get data about your usage of Bobsled, we're working on providing that as a Bobsled share that you can then consume in BigQuery or Snowflake or so on. Another thing that we've seen people do is take data they've got as a source in something like CSV. We support loading CSVs into data warehouses.
So they use Bobsled to load the data from CSVs into Snowflake or Databricks or BigQuery, they then set that Snowflake or Databricks or BigQuery up as a Bobsled source, and they use the capabilities of Snowflake to make views and so on over what they previously had as CSV. And then they use that Snowflake, as a Bobsled source, to do further onward sharing. So they're actually using Bobsled to do an ETL process and bootstrap themselves from a non-cloud-native sharing world into a cloud-native sharing world: basically using Bobsled to bootstrap into Snowflake, and then doing onward sharing from that Snowflake with Bobsled.
[00:42:24] Unknown:
And in your experience of building Bobsled and working closely in this context of organizational data sharing, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:37] Unknown:
There are always a lot of challenging lessons from operating as a startup; as a founder, you're often dealing with whatever the most serious problem in the business is at any given time. But one of the biggest challenges we've seen is the complexity of building an abstraction over all the different cloud systems. I was talking to Jake, my cofounder, and we were observing that Bobsled, as a product, is kind of a simple concept. In certain ways, it's one of the simplest concepts that either of us has worked on in our careers. Compare it to something like a graph database: conceptually, a graph database is a really complex thing, whereas Bobsled is a very straightforward product proposition. You can share data from your cloud storage or warehouse to another cloud storage or warehouse.
And what we see is a real tension between the simplicity of the concept and the challenge of building an abstraction over all the different clouds and warehouses and platforms. One of the things for me here is that we don't own this stack all the way down. So what we have to work with aren't the kind of theoretical limitations you might have when building something like Raft. If you're building a Raft system, you can go and read the PhD papers and so on related to it, and you can understand the constraints: there's the CAP theorem, there's the speed of light, and you are basically up against those challenges. You can build against that; you can control it and understand it. We don't have that kind of deep tech or hard tech challenge. There's not, at our core, a really hard AI problem or a challenging CAP theorem distributed systems problem that we're solving for people in a really smart way.
What we're challenged with is all of these different abstractions that are present in Azure, Databricks, Snowflake, and BigQuery, which superficially are quite similar. But as you get into trying to manage and work with all of them, you discover that they are different, and the devil of data engineering lies in these details. So, yeah, that's been a not entirely unexpected challenge, but it's where we've discovered a lot of the difficulty of actually building an abstraction, even across object storage. We find that in AWS, you have this access points abstraction, which is really great, but it's not present in the other clouds. So you build something on AWS, then you realize you can't really build a comparable abstraction for Google Cloud Storage; you have to do something slightly different.
Or take executing serverless functions to do our work: we execute serverless functions to do work in AWS, GCP, and Azure, so we had to build out an abstraction for managing serverless functions running on different clouds. And that's the kind of challenge that, in itself, is some companies' main business; if you're a serverless platform, you're building a technology that allows people to do exactly that. For us, it's just one of the problems we have to solve internally, so that we can say: to do object storage sharing to all the different clouds, we need an abstraction that means we can run some serverless compute, we can make certain assumptions about how data is stored, and we have straightforward ways of saying how we grant or revoke access to a share. Each of those ends up being surprisingly different and nuanced between different platforms.
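As one way to picture that kind of abstraction, here is a hedged sketch: one interface for "invoke this function with this payload", with thin per-cloud adapters behind it. The SDK calls are simplified (real invocations need auth and session setup), and the function names, project, and region are illustrative.

```python
from abc import ABC, abstractmethod
import json


class FunctionRunner(ABC):
    """One interface for invoking a named serverless function, per cloud."""

    @abstractmethod
    def invoke(self, name: str, payload: dict) -> dict: ...


class AwsLambdaRunner(FunctionRunner):
    def invoke(self, name: str, payload: dict) -> dict:
        import boto3
        resp = boto3.client("lambda").invoke(
            FunctionName=name, Payload=json.dumps(payload).encode()
        )
        return json.loads(resp["Payload"].read())


class GcpFunctionRunner(FunctionRunner):
    def invoke(self, name: str, payload: dict) -> dict:
        import requests
        # HTTP-triggered Cloud Functions; ID-token auth omitted for brevity.
        url = f"https://europe-west1-my-project.cloudfunctions.net/{name}"
        resp = requests.post(url, json=payload, timeout=60)
        resp.raise_for_status()
        return resp.json()


def grant_object_storage_share(runner: FunctionRunner, request: dict) -> dict:
    # The orchestrator doesn't care which cloud executes the grant.
    return runner.invoke("grant-share", request)
```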
[00:45:52] Unknown:
And for people who are exploring the problem space of being able to send data from one system to another, whether that's across organizational boundaries or across technical boundaries, what are the cases where Bobsled is the wrong choice, or what are the cases where they should just reconsider the entire application and avoid data sharing entirely? I think the biggest time when Bobsled is the wrong choice is when the right choice is to perhaps just not do it: you just don't need to.
[00:46:24] Unknown:
I'm a huge fan of identifying when we don't need to do things. You can often find situations where you feel like you need to do something because that's how it's done, that's how other people do it, and so on, but a bit of analysis can say, oh, maybe we don't. One of the major cases where Bobsled is the wrong choice is when something like a data clean room is the right choice. That's when the reassurances you want around what's visible to someone, what's done with the data, and whether or not it can have been extracted are so stringent that you need to make use of a clean room. And there are some really cool technologies in that space, around things like differential privacy, where you can have systems that allow you to make aggregate queries that don't reveal the underlying data but let you query the data in aggregate. For all of those, Bobsled is the wrong choice, or would have to be part of a much more complicated solution architecture. We also talk to people who want to do a migration.
At the moment, we also talk to people who want to do a migration. They want to migrate from GCP to AWS, say, and they ask, well, that's something Bobsled can do, Bobsled can move data from a data store in 1 place to another, so can we use Bobsled for migration? At the moment, that's a case where we would generally say Bobsled is not the right choice. If you're going from exactly 1 place to exactly 1 other place, there's probably already a tool in the destination that's good enough for what you want to achieve. Right? Azure Data Factory or something like that. If you're just concerned with getting data into just 1 platform, then you can probably use the native tooling on that platform. And we're huge advocates of using the native tooling. Right? Use the native sharing protocol of the thing. Don't build extra stuff. And then I was trying to think of any situations we've come across where we've said, do you even need to do data sharing? Perhaps you should reconsider. I think there are cases where the reverse is true, where you might ask, should this be an API? There are situations where people have a JSON REST API hammer, so they're going to solve everything with JSON over HTTP.
There can be an inverse case where you ask, should this use case be something that you're managing with analytic data sharing, or should it actually be an API, a webhook, or something like a pub/sub notification? Right? Another way you can do cross-organizational synchronization in some of the clouds is using things like pub/sub streams, where you can set up cross-organizational listeners on Pub/Sub or SNS or things like that. So, yeah, sometimes there are other cloud-native, cross-organizational technologies that might be better suited.
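A hedged sketch of that cross-organizational pub/sub pattern, using AWS SNS as the example: the producer account grants another account permission to subscribe to its topic. Account IDs and ARNs here are placeholders, and a real setup would scope the policy more tightly.

```python
# Illustrative cross-account SNS sharing: the producer allows a consumer
# account to subscribe to its topic.
import json
import boto3


def allow_cross_account_subscribe(topic_arn: str, consumer_account_id: str) -> None:
    sns = boto3.client("sns")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
            "Action": ["sns:Subscribe"],
            "Resource": topic_arn,
        }],
    }
    sns.set_topic_attributes(
        TopicArn=topic_arn,
        AttributeName="Policy",
        AttributeValue=json.dumps(policy),
    )


# The consumer then subscribes from their own account, e.g. with an SQS queue:
#   sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
```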
[00:49:06] Unknown:
And as you continue to build and iterate on the Bobsled product and explore the overall space of data sharing, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
[00:49:16] Unknown:
So I might give 2 answers to this. 1 thing I'm really excited about, or kind of passionate about, is tackling something that we call modern data stack fatigue. You're probably familiar with this. There's a whole raft of technologies that go into the modern data stack. I have a controversial relationship with Kubernetes, and the Kubernetes ecosystem has this poster, which is quite cliche at this point, showing all of the technologies in the stack. I don't know if you're familiar with it, but it's incredibly dense. It's unreadable unless your screen is 6 foot wide. And the modern data stack is going in a similar direction, where there are thousands and thousands of separate tools for each different individual thing that you might do. For the past decade in the world of data and data engineering and analytics, we've had an explosion of all of these different tools and technologies and services, infrastructure as a service, platform as a service, everything else. And now we're in a situation where there's fatigue. Our imaginary protagonist says, well, on my team I have to run 6 or 7 different technologies. And then you get into things like hiring, where you say, well, our stack is this particular combination.
When you're hiring, you say, we want to hire someone who has this exact combination of experience, and that person doesn't exist, because there are so many different possible combinations that no 1 has used your exact combination. And, you know, we're in a macroeconomic environment where there isn't necessarily budget for everyone to have every single tool. Right? People are a bit more focused on being lean. Like, do you really need a bunch of services running for this? We think something like DuckDB is really cool. Right? DuckDB has this minimalist approach, which says you don't necessarily need a bunch of services running; you can just use it on your, you know, M3 laptop.
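That minimalist point in practice, as a small assumed example: DuckDB is an embedded engine, so a query over a local Parquet file (the file name here is hypothetical) needs no running services at all.

```python
# Embedded analytics with zero services: DuckDB runs in-process.
import duckdb

result = duckdb.sql("""
    SELECT customer_id, sum(amount) AS total
    FROM 'orders.parquet'          -- hypothetical local file
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
result.show()
```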
And the thing that's really exciting for us is that we can help people with some of that, because we don't have a horse in any of these races. Right? Within that fatigue, there are different philosophical holy wars, the Emacs versus Vim type of thing: should you have a lakehouse or a warehouse or whatever? We can help people do data sharing regardless of where they stand or what technology choices they've made. And our aim here is to help people achieve simplicity in the face of all this complexity of options.
And, yeah, I'm really excited to see what we can do to cover more of these bases and help people actually have a simple experience, incorporating things like DuckDB into Bobsled itself. There are use cases where people could potentially analyze data directly in Bobsled. Right? We talked about how you might need to move the data into Google Cloud so you can analyze it with BigQuery. 1 of the things that I think would be very cool is: do you need to move the data into the destination cloud to analyze it, or can you issue a query directly, using something like DuckDB, on the source data, so we never need to do the ETL part at all? Can we help people shortcut and simplify some of their work? That's 1 of the things I'm excited about.
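A sketch of what that query-in-place idea could look like, assuming DuckDB's httpfs extension, a placeholder bucket, and S3 credentials already configured in the environment. This is a speculative illustration of the pattern, not a Bobsled feature.

```python
# Query shared data where it lives instead of ETL-ing it into a warehouse first.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # enables s3:// paths
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")  # credentials assumed in env

# Aggregate over the shared Parquet files without copying them anywhere:
con.sql("""
    SELECT date_trunc('month', order_date) AS month, count(*) AS orders
    FROM read_parquet('s3://shared-bucket/orders/*.parquet')
    GROUP BY month
""").show()
```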
I'm also excited about things like 2 way sharing, as I mentioned. There are various use cases there, and they're all quite interesting: people say, I want to share something with you, and then you, for example, enrich it. You attach it to some data you have, or you perform some analysis or processing on it, and then you send something back to me that is meaningfully transformed or enriched. That's 1 of the things I'm looking forward to us getting into, because it starts to unlock high levels of value for people and enables them to collaborate. And as I talked about at the beginning, we can help the industry do things more efficiently. You end up duplicating data if you have data being copied from 1 place to another and then processed and so on.
So, helping the industry to be efficient, but also to achieve higher value. Right? For us, sharing is part of a collaboration process, and if we can do 2 way sharing, we can help unlock higher-value collaboration. 1 of our founding convictions is that enabling collaboration between organizations is a net beneficial thing. It helps improve efficiency, and it helps organizations make better decisions. And those things are broadly in the interest of the customers, consumers, and users of those services.
[00:53:57] Unknown:
And are there any other aspects of the overall space of data sharing, the technical aspects, the organizational challenges, or the ways that you're approaching it at Bobsled, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:14] Unknown:
I think there's stuff I love talking about around this, like the shift left and shift right mentality, and who has the responsibility for the data. In the data industry, shift left is something we talk about as being a broadly good thing. It's this idea of moving some of the responsibility and the work closer to the source of the data, and saying, let's help the people at the source own it. This is often applied inside an organization: we say, should we shift left to a central data team, a team who ensures that data is clean, that it's well set up, that it's easy to query, that it's optimized with things like indexes.
And that drives efficiency, compared with the shift right mentality, where you say, we just provide the data and all the consumers have to figure out how they're going to use it and how they're going to compute over it. And 1 thing we see we can help facilitate, not so much within an organization, where you're generally on 1 platform, but across organizations, and within those more complex organizations that do have boundaries, is going further into that shift left approach: the people who are best placed to structure the data, set up optimizations, and generally manage the ongoing life of the data should do so. Right? The evolution of the schema, the appending of new data, and all those challenges that really make up a lot of the work of data engineering and analytics.
All that nitty-gritty stuff, like, oh no, what happens when a stock splits, or when a country changes its ZIP codes, or something like that. The curve balls you have to deal with when you're managing a schema. We can help that be centralized, which is more efficient and more robust, within data sharing between organizations. And this idea that we can help shift left, and that we can help reduce ETL, tackles 2 of the major pain points of a lot of data engineering: I spend so much time on ETL before I get to my analysis, or I spend so much time on data cleaning and processing before I actually get into my analysis. I think the work we do can really help tackle those for a range of organizations, in what is a really challenging realm where, for a lot of the people we work with, the alternative is some major DIY project, trying to build some subset of this functionality themselves, or trying to persuade a commercial partner to make some pretty significant commitment, like going and doing that work on a different cloud.
[00:56:41] Unknown:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:03] Unknown:
That's an interesting question in the context of, you know, the modern data stack. We have a huge range of tools, and in some ways my passion is about reducing and simplifying that. So, you know, I have some experience in the AI space, and obviously that's extremely hot and very busy right now. I'm not an expert, but 1 thing that I think there isn't yet is a really good approach to vector databases and embeddings. There are obviously a lot of people attempting to build out good solutions around vector databases and managing embeddings. I've spoken to quite a lot of startup founders and founding engineers, based on my previous experience, who are trying to do things like similarity search and k-nearest-neighbors and Levenshtein distance, all these fairly standard data analytics operations, over vectors being produced from AI processes: large language models, deep neural networks doing computer vision. And that's an area where I spoke to a lot of people about what they're doing to tackle it, and most of the existing tools, which are all very new, are either very pricey or very inefficient beyond very small toy projects.
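For a sense of the "fairly standard" vector operations being described, here is a self-contained, brute-force k-nearest-neighbor search by cosine similarity. The random embeddings and dimensions are stand-ins for real model output; at scale this is exactly the workload that outgrows a single machine.

```python
# Brute-force k-NN by cosine similarity over embedding vectors.
import numpy as np


def knn_cosine(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    # Normalize so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    # Indices of the k most similar vectors, best first.
    return np.argsort(-scores)[:k]


embeddings = np.random.rand(10_000, 384)  # e.g. 10k documents, 384-dim embeddings
print(knn_cosine(np.random.rand(384), embeddings))
```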
And a lot of the people I spoke to are rolling their own instead of buying. That's often why they're talking to me: it's about the data infrastructure management. Like, how do we manage the infrastructure and build our own thing on top of something like Spark to start running these algorithms at scale over vectors? So, yeah, I guess maybe it's the obvious startup founder answer, but AI and vector database solutions is somewhere I think there isn't yet a really good tool.
[00:58:47] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experiences of working in this space of data transfer and enabling that for different organizations, making that a simpler problem to solve. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks. Thanks so much, Tobias. It's been a pleasure.
[00:59:15] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview of Data Lakes
Interview with Andy Jefferson: Background and Career
Defining Data Sharing
Current State of Data Sharing Ecosystem
Compliance and Privacy in Data Sharing
Technical Capabilities for Effective Data Sharing
Bobsled's Unique Approach to Data Sharing
Auditability and Governance in Data Sharing
Boundaries and Technical Requirements in Data Sharing
Innovative Applications of Data Sharing Protocols
Challenges and Lessons Learned in Building Bobsled
When Bobsled is the Wrong Choice
Future Plans and Exciting Projects at Bobsled
Final Thoughts and Closing Remarks