Data Engineering Podcast


This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Support the show!

29 January 2018

Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16

Summary

Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • A few announcements:
    • There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
    • The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
    • If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
  • Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about the Dat project, a distributed data sharing protocol for building applications of the future

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is the Dat project and how did it get started?
  • How have the grants to the Dat project influenced the focus and pace of development that was possible?
    • Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?
  • Can you explain how the Dat protocol is designed and how it has evolved since it was first started?
  • How does Dat manage conflict resolution and data versioning when replicating between multiple machines?
  • One of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions?
  • One of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made?
  • How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via Dat, versus the common three-tier architecture oriented around persistent databases?
  • What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default?
  • For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network?
  • What have been the most challenging aspects of building and promoting Dat?
  • What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript

Tobias Macey 00:13

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to launch your next project, you’ll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media. I’ve got a couple of announcements before we start the show. There’s still time to register for the O’Reilly Strata Conference in San Jose, California, happening from March 5th to the 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets. The O’Reilly AI Conference is also coming up, happening April 29th to the 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey, and today I’m interviewing Danielle Robinson and Joe Hand about the Dat project, a distributed data sharing protocol for building applications of the future. So Danielle, could you start by introducing yourself?

Danielle Robinson 02:10

Sure. My name is Danielle Robinson, and I’m the co-executive director of Code for Science & Society, which is the nonprofit that supports the Dat project. I’ve been working on Dat-related projects, first as partnerships director, for about a year now. And I’m here with my colleague, Joe Hand. Take it away, Joe.

Joe Hand 02:32

I’m Joe Hand, and I’m the other co-executive director and the director of operations at Code for Science & Society. I’ve been a core contributor to Dat for about two years now.

Tobias Macey 02:42

And Danielle, starting with you again, can you talk about how you first got involved and interested in the area of data management?

Danielle Robinson 02:48

Sure. So I have a PhD in neuroscience. I finished that about a year and a half ago, and during my PhD my research was focused on cell biology. Without getting into the weeds too much, I spent a lot of time on microscopes collecting medium-sized imaging data. During that process, I became pretty frustrated with the academic and publishing systems that seemed to be limiting people’s access to the results of taxpayer-funded research. Publications are behind paywalls, and data is either not published along with the paper, or sometimes is published but not well archived and becomes inaccessible over time. Compounding this, code has traditionally not really been thought of as an academic or scholarly work, which is a whole other conversation. But even though these things are changing, data and code aren’t shared consistently, and are pretty inconsistently managed within labs, I think that’s fair to say. What that does is make it really hard to reproduce or replicate other people’s research, which is important for the scientific process. So during my PhD, I got really active in the OpenCon and Mozilla Science communities, which I encourage your listeners to check out. These communities build interdisciplinary connections between the open source world and the open education, open access, and open data communities. That’s really important to build things that people will actually use, and to make big cultural and policy changes that will make it easier to access research and share data. So I got involved partly because of the technical challenge, but I’m also interested in the people problems: the changes to the incentive structure and the culture of research that are needed to make data management better on a day-to-day basis and make our research infrastructure stronger and longer lasting.

Tobias Macey 04:54

And Joe, how did you get involved in data management?

Joe Hand 04:57

Yeah, I’ve sort of gone back and forth between the more academic or research data management side and the more traditional software side. I really got started in data management when I was at a data visualization agency. We basically built pretty, web-based interactive visualizations for a variety of clients. This was cool because it allowed me to see a large variety of data management techniques, from the small scale, with people manually updating data in spreadsheets and sending that off to us to visualize, up to big Fortune 500 companies that had data warehouses and full internal APIs that we got access to. So it was really cool to see that variety of data collection and data usage across all those organizations. That was also good because it helped me understand how to use data effectively, and that really means telling a story around it: in order to use data, you have to use either some math or some visual representation, and the best stories around data combine a bit of both of those. From there, I moved to a research institute, and we were tasked with building a data platform for an international NGO. That group basically does census data collection in slums all over the world. As a research group, we were interested in using that data for research, but we also had to help them figure out how to collect that data. Before we came in on that project, they’d been doing 30 years of data collection on paper, sometimes manually entering that data into spreadsheets and then trying to share it around through thumb drives or Dropbox or whatever tools they had access to. This was cool because it gave me a great opportunity to see the other side of data management and analysis. We had worked with the corporate clients, which have lots of resources, compute, and cloud servers, and this was the other side, where there are very few resources, most of the data analysis happens offline, and a lot of the data transfer happens offline. It was really interesting to see that a lot of the tools I’d been taking for granted couldn’t be applied in those areas. And then on the research side of things, I saw that scientists and governments were just as haphazard in organizing data. I was trying to collect and download census data from about 30 countries, and we had to email and fax people, and we got different CDs and paper documents and PDFs in other languages. That really illustrated that there’s a lot of data out there managed in ways I wasn’t totally familiar with, and everybody manages their data in a different way. That’s along what I like to call the long tail of data management: people who don’t use traditional databases and manage data in their own unique ways. Most people managing data in that way probably wouldn’t call it data; it’s just what they use to get their job done. So once I started to look at alternatives for managing that research data, I found Dat, basically, and was hooked and started to contribute. That’s how I found Dat.

Tobias Macey 08:16

So that leads us nicely into talking about what the Dat project is, and as much of the origin story as each of you might be aware of. Joe, you already mentioned how you got involved in the project, but Danielle, if you could also share your involvement and how you got started with it as well.

Danielle Robinson 08:33

Yeah, I can tell the origin story. The Dat project is an open source community building a protocol for peer-to-peer data sharing. As a protocol, it’s similar to HTTP in how it’s used today, but Dat adds extra security and automatic versioning and allows users to connect in a decentralized network. You can store the data anywhere, either in a cloud or on a local computer, and it does work offline. Dat is built to make it easy for developers to build decentralized applications without worrying about moving data around. The people who originally developed it, Mathias, Max, and Chris, were scratching their own itch, building software to share and archive public and research data. That’s how Joe got involved, like he was saying before. It originally started as an open source project, and then it got a grant from the Knight Foundation in 2013, a prototype grant focusing on government data. That was followed up in 2014 by a grant from the Alfred P. Sloan Foundation, which focused more on scientific research and allowed the project to put a little more effort into working with researchers. Since then, we’ve been working to solve research data management problems by developing software on top of the Dat protocol. The most recent project is funded by the Gordon and Betty Moore Foundation; that project started in 2016, it’s called Dat in the Lab, and I can get you a link to it on our blog. It supports us to work with the California Digital Library and research groups in the University of California system to make it easier to move files around, version data sets, and support researchers through automated archiving. It’s a really cool project, because we get to work directly with researchers, do the kind of participatory software design we enjoy doing, and create things that people will actually use. We also get to learn about really exciting research, very different from the research I did during my PhD; one of the labs we’re working with studies sea star wasting disease. So it’s really fascinating stuff, and we get to work right with them to make things that are going to fit into their workflows. I started working with Dat in the summer, right before that grant was funded, so maybe six months before. I came on as a consultant initially to help write grants and start talking about how to work directly with researchers and what to build that will really help them move their data around and version control it. So that’s how I became involved, and then in the fall I transitioned to a partnerships position, and then to the executive director position in the last month.

Tobias Macey 11:27

And you mentioned that a lot of the sort of boost to the project has come in the form of grants from a few different foundations. So I’m wondering if you can talk a bit about how those different grants have influenced the focus and pace of the development that was possible for the project?

Joe Hand 11:42

Yeah, Dat really occupies a unique position in the open source world with that grant funding. For the first few years, it was closer to a research project than a traditional product-focused startup. Other open source projects like that might be done part-time as a side project, or just for fun, but the grant funding allowed the original developers to sign on and work full-time, solving harder problems than they might have been able to otherwise. Since we got those grants, we’ve been able to toe the line between a more user-facing product and research software. The grants gave us the opportunity to toe that line, but also to get out in the field and connect with researchers and end users, so we can innovate with technical solutions but really ground those in reality with specific scientific use cases. This balance is really only possible because of that grant funding, which gives us more flexibility and maybe a little longer timeline than VC money or a pure open source side project. But now we’re really at a critical juncture, I’d say, where grant funding is not quite enough to cover what we want to do. We’re lucky that the protocol is getting into a more stable position, and we’re starting to look at those user-facing products on top and starting to build those around the core protocol.

Tobias Macey 13:10

And the fact that you have received so many different rounds of grant funding lends credence to the fact that you’re solving a critical problem that lots of people are coming up against. I’m wondering if there are any other projects, companies, or organizations that are trying to tackle similar or related problems that you view as collaborators or competitors in the space, and where you think the Dat project is uniquely positioned to solve the specific problems that it’s addressing?

Joe Hand 13:44

Yeah, I would say there are other similar use cases and tools, and a lot of that is around sharing open data sets and the publishing of data, which Danielle might be able to talk more about. On the technical side, I guess the biggest competitor or most similar thing might be IPFS, which is another decentralized protocol for sharing and storing data in different ways. But we’re actually excited to work with these various companies. IPFS is more of a storage-focused format; it basically allows content-addressed storage on a distributed network. Dat is really more about the transfer protocol and being very interoperable with all these other solutions. So that’s what we’re more excited about: trying to understand how we can use Dat in collaboration with all these other groups.

Danielle Robinson 14:41

I’ll just close on what Joe said. Through my time coming up in the OpenCon community and the Mozilla Science community, I’ve met a lot of people trying to improve access to data broadly, and most of the people I know in the space really take a collaboration, not competition, approach, because there are a lot of different ways to solve the problem depending on what the end user wants. There are a lot of great projects working in the space. I would agree with Joe that IPFS is the thing people most often ask about; I’ll be at an event and someone will say, what’s the difference between Dat and IPFS, and I answer pretty much how Joe just answered. But it’s important to note that we know those people and have good relationships with them, and we’ve actually just been emailing with them about some kind of collaboration over the next year. There are a lot of really great projects in the open data and improving-access-to-data space, and I basically support them all. There’s so much work to be done that I think there’s room for all the people in the space.

Tobias Macey 15:58

And now that you have established a nonprofit organization around Dat, are there any particular plans that you have to support future sustainability and growth of the project?

Danielle Robinson 16:09

Yes, future sustainability and growth of the project is what we wake up and think about every day, sometimes in the middle of the night. That’s the most important thing. Incorporating the nonprofit was a big step that happened, I think, at the end of 2016, and it’s critical as we move towards a self-sustaining future. Importantly, it will also allow us to continue to support and incubate other open source projects in the space, which is something that I’m really excited about. For Dat, our goal is to support a core group of top contributors through grants, revenue sharing, and donations. Over the next 12 months we’ll be pursuing grants and corporate donations, as well as rolling out an Open Collective page to help facilitate smaller donations, and continuing to develop products with an eye towards things that can generate revenue and support the Dat ecosystem. At the same time, we’re also focusing on sustainability within the project itself, and what I mean by that is governance and community management. We are right now working with the developer community to formalize the technical process on the protocol through a working group; those are really great calls, and lots of great people are involved. We really want to make sure that protocol decisions are made transparently and that a wider group of the community can be involved in the process. We also want to make the path to participation, involvement, and community leadership clear for newcomers. By supporting the developer community, we hope to encourage new and exciting implementations of the Dat protocol. Some of the stuff that happened in 2017, from my perspective coming from the sciences, sort of came out of nowhere, with people building amazing new social networks based on Dat, and it was really fun and exciting. Keeping the community healthy and making sure that the technical process and how decisions get made are really clear and transparent is going to facilitate even more of that. Just another comment about being a nonprofit: because Code for Science & Society is a nonprofit, we also act as a fiscal sponsor. What that means is that like-minded projects that get grant funding but are not nonprofits themselves, and so can’t accept the grant on their own, can accept their grant through us. We take a small percentage of that grant and use it to help those projects by linking them up with our community; we work with them on grant writing, fundraising, and strategy, support their own community engagement efforts, and sometimes offer technical support. We see this as really important to the ecosystem and a way to help smaller projects develop and succeed. Right now we do that with two projects. One of them is called Stencila, and I can send a link for that, and the other one is called Science Fair. Stencila is an open source reproducible-documents project funded by the Alfred P. Sloan Foundation, looking to support researchers from data collection to document authoring. Science Fair is a peer-to-peer library built on Dat, designed to make it easy for scholars to curate collections of research on a certain topic, annotate them, and share them with their colleagues. That project was funded by a prototype grant from the publisher eLife, and they’re looking for additional funding. So we’re working with both of them.
In the first quarter of this year, Joe and I are working to formalize the process of how we work with these other projects and what we can offer them, and hopefully we’ll be in a position to take on additional projects later this year. I really enjoy that work. I went through the Mozilla Fellowship, which was a 10-month-long, crazy period where Mozilla invested a lot in me, making sure I was meeting people, learning how to write grants, learning how to give good talks, all kinds of awesome investment. For a person who goes through a program like that, or a person who has a side project, there’s a need for groups in the space who can incubate those projects and help them as they develop from the incubator stage to the middle stage, before they scale up. So as a fiscal sponsor, we’re hoping to be able to support projects in that space.

Tobias Macey 20:32

And digging into the Dat protocol itself: when I was looking through the documentation, it mentioned that the protocol is agnostic to the implementation, and I know that the current reference implementation is done in JavaScript. So I’m wondering if you could describe a bit about how the protocol itself is designed, how the reference implementation is done, and how the overall protocol has evolved since it was first started, as well as what your approach is to versioning the protocol itself, to ensure that people who are implementing it in other technologies or formats are able to stay compliant with specific versions of the protocol as it evolves.

Joe Hand 21:19

Yeah, so Dat is basically a combination of ideas from Git, BitTorrent, and the web in general. There are a few key properties that any implementation basically has to recreate: content integrity, decentralized mirroring of the data sets, network privacy, incremental versioning, and random access to the data. We have a white paper that explains all of these in depth, but I’ll explain how they work in a basic use case. Let’s say I want to send some data to Danielle, which I do all the time, and I have a spreadsheet where I keep track of my coffee intake. I want to live-sync it to Danielle’s computer so she can make sure I’m not over-caffeinating myself. Similar to how you get started with Git, I would put my spreadsheet in a folder and create a new Dat. Whenever I create a new Dat, it makes a new key pair: one public key and one private key. The public key is basically the Dat link, kind of like a URL, so you can use it in anything that speaks the Dat protocol and just open that up and look at all the files inside of that Dat. The private key allows me to write files to that Dat, and it’s used to sign any new changes. The signature allows Danielle to verify that the changes actually came from me, and that somebody else wasn’t trying to fake my data or man-in-the-middle my data while I was transferring it to her. So I add my spreadsheet to the Dat. What Dat does is break that file into little chunks, hash all those chunks, and create a Merkle tree from them. That Merkle tree has lots of cool properties and is one of the key features of Dat. The Merkle tree allows us to sparsely replicate data: if we have a really big data set and you only want one file, we can use the Merkle tree to download just that one file and still verify the integrity of that content, even with an incomplete data set. The other part that allows us to do that is the register. All the file contents are stored in one register, and all the metadata is stored in another register, and these registers are basically append-only ledgers. They’re also known as secure registers; Google has a project called Certificate Transparency that has similar ideas. Whenever a file changes, you append that change to the metadata register, and that register stores basic information about the structure of the file system, what version it is, and any other metadata, like the creation time or the change time of that file. Right now, as you said, Tobias, we’re very flexible on how things are implemented, but we basically store the files as files. That allows people to see the files normally and interact with them normally. The cool part is that the on-disk file storage can be really flexible: as long as the implementation has random access, it can store the data in any different way. For example, we have a storage model built for servers that stores all of the files as a single file, which lets you keep fewer file descriptors open and constrains the file I/O to one file.
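To make the register idea concrete, here is a toy Node.js sketch of the structure described above: a file is split into chunks, each chunk is hashed into a content register, and a metadata entry describing the file is appended to a metadata register. This is only an illustration of the concept; the real implementation (hypercore) uses signed Merkle trees and a binary wire format, and all of the names below are invented for the example.

    // Toy model of Dat's two append-only registers (illustrative only, not the real API).
    const crypto = require('crypto');
    const fs = require('fs');

    const CHUNK_SIZE = 64 * 1024;   // arbitrary chunk size chosen for the sketch

    const contentRegister = [];     // append-only log of content chunk hashes
    const metadataRegister = [];    // append-only log of file metadata entries

    function addFile(path) {
      const data = fs.readFileSync(path);
      const chunkHashes = [];

      // Break the file into chunks and hash each chunk into the content register.
      for (let offset = 0; offset < data.length; offset += CHUNK_SIZE) {
        const chunk = data.slice(offset, offset + CHUNK_SIZE);
        const hash = crypto.createHash('sha256').update(chunk).digest('hex');
        contentRegister.push(hash);
        chunkHashes.push(hash);
      }

      // Append a metadata entry describing the file and which chunks belong to it.
      metadataRegister.push({
        path,
        version: metadataRegister.length + 1,  // every append bumps the version
        size: data.length,
        chunks: chunkHashes,
        mtime: fs.statSync(path).mtimeMs,
      });
    }

    addFile('coffee-intake.csv');
    console.log(JSON.stringify(metadataRegister, null, 2));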
So once my file gets added, I can share my link privately with Danielle; I can send it over chat or just paste it somewhere, and then she can clone my Dat using our command-line tool, the desktop tool, or the Beaker browser. When she clones my Dat, our computers basically connect directly to each other. We use a variety of mechanisms to try and make that connection; that’s been one of the challenges that I can talk about later, how to connect peer to peer and the challenges around that. Once we do connect, we’ll transfer the data over either TCP or UDP; those are the default network protocols that we use right now, but Dat can be implemented basically on top of any other protocol. I think Mathias once said that if you could implement it over carrier pigeon, that would work fine, as long as you had a lot of pigeons. So we’re really open to how the protocol information gets transferred, and we’re working on a Dat-over-HTTP implementation too. That wouldn’t be peer to peer, but it would allow a traditional server fallback if no peers are online, or for services that don’t want to run peer to peer for whatever reason. Once Danielle clones my Dat, she can open the file just like a normal file and plug it into R or Python or whatever, and use her equation to measure my caffeine level. Then let’s say I drink another cup of coffee and update my spreadsheet: the changes will automatically be synced to her as long as she’s still connected to me, and they’ll be synced throughout the network to anybody else that’s connected to me. The metadata register stores the updated file information, and the content register stores just the changed file blocks, so Danielle only has to sync the delta of that content change rather than the whole data set again. This is really useful for big data sets. We’ve had to design each of these pieces to be as modular as possible, both within our JavaScript implementation and in the protocol in general. Right now developers can swap in other network protocols or data storage. For example, if you want to use Dat in the browser, you can use WebRTC for the networking and discovery and then use IndexedDB for data storage; IndexedDB has random access, so you can plug it directly into Dat, and we have some modules for those that should be working. We did have a WebRTC implementation we were supporting for a while, but we found it a bit inconsistent for our use cases, which are more around large file sharing; it might still be okay for chat and other more text-based things. So yeah, all of our implementation is in Node right now.
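For a sense of what that sharing workflow looks like in code, here is a sketch roughly in the style of the dat-node library’s examples from around the time of this episode. The function names (Dat, importFiles, joinNetwork) and the callback shape are from memory and may not match the current release exactly.

    // Sketch of sharing a folder with the dat-node library (API details from memory; may differ).
    const Dat = require('dat-node');

    Dat('./coffee-data', (err, dat) => {
      if (err) throw err;

      dat.importFiles();   // watch the folder and add or update files in the archive
      dat.joinNetwork();   // announce ourselves so peers can discover and sync from us

      // The public key doubles as the share link; send this to a collaborator.
      console.log('Share this link: dat://' + dat.key.toString('hex'));
    });

A collaborator could then clone that link with the command-line tool or open it in Beaker, and any subsequent changes to the folder would sync to them as long as both peers are online.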

Keeping the implementation in Node was both for usability and developer friendliness, and also for being able to work in the browser and across platforms. We can distribute a binary of Dat pretty easily now, and you can run Dat in the browser or build Dat tools on Electron, so it allows a wide range of developer tools to be built on top of it. But we have a few community members now working on different implementations; Rust and C, I think, are the two that are going right now. As far as protocol versioning, that was actually one of the big conversations we were having in the last working group meeting, and it’s to be decided, basically. Through the stages we’ve gone through, we’ve broken it quite a few times, and now we’re finally in a place where we want to make sure not to break it moving forward. There’s space in the protocol for information like the version of the protocol, so we’ll probably use that to signal the version and figure out how the tools that implement it can fall back to the latest version. Before all the file-based stuff, Dat went through a few different stages; it started really as more of a versioned, decentralized database. Then, as Max and Mathias and Karissa moved to the scientific use cases, they removed more and more of the database architecture as it matured. That transition was really driven by user feedback and watching researchers work. We realized that so much of research data is still kept in files and basically moved manually between machines, so even if we were going to build a special database, a lot of researchers still wouldn’t be able to use it, because that requires more infrastructure than they have time to support. So we really just kept working to build a general-purpose solution that allows other people to build tools to solve those more specific problems. The last point is that right now all Dat transfer is basically one-way, so only one person can update the source. This is really useful for a lot of our research cases where they’re getting data from lab equipment, where there’s a specific source and you just want to disseminate that information to various computers, but it really doesn’t work for collaboration. That’s the next thing we’re working on, but we really want to make sure to solve this one-way problem before we move to the harder problem of collaborative data sets. This last major iteration is sort of the hardest, and that’s what we’re working on right now: it allows multiple users to write to the same Dat, and with that we get into problems like conflict resolution, duplicate updates, and other harder distributed computing problems.

Tobias Macey 30:24

And that partially answers one of the next questions I had, which was to ask about conflict resolution. But if there’s only one source that’s allowed to update the information, then that solves a lot of the problems that might arise by syncing all these data sets between multiple machines, because there aren’t going to be multiple parties changing the data concurrently, so you don’t have to worry about how to handle those use cases. Another question that I had from what you were talking about is the cryptography aspect. It sounds as though when you initialize the Dat, it just automatically generates the public/private key pair, and so that private key is cryptographically linked with that particular data set. But is there any way to use, for instance, Keybase or GPG to sign the source Dat in addition to the generated key, to establish your identity for when you’re trying to share that information publicly, and not necessarily via some channel that already has established trust?

Joe Hand 31:27

Yeah, you could do that within the Dat, but we don’t really have any mechanism for doing that on top of Dat, so we’re sort of throwing that into userland right now. But yeah, that’s a good question, and we’ve had some people experimenting with different identity systems and how to solve that problem. I think we’re pretty excited about the new Wire app, because it’s open source, uses end-to-end encryption, and has an identity system, and we’re trying to see if we can build that on top of Wire. So that’s one of the things that we’re experimenting with.

Tobias Macey 32:09

And one of the primary use cases that is mentioned in the documentation and on the website for Dat is being able to host and distribute open data sets, with a focus on researchers and academic use cases. So I’m wondering if you can talk some more about how Dat helps with that particular effort and what improvements it offers over some of the existing solutions that researchers were using previously.

Danielle Robinson 32:33

There are solutions for both hosting and distributing data. In terms of hosting and distribution, there’s a lot of great work focused on data publication and making sure that data associated with publications is available online; I’m thinking about Zenodo and Dryad or Dataverse. There are also other data hosting platforms such as CKAN or data.world. We really love the work these people do, and we’ve collaborated with some of them or are involved in the same communities; the Open Source Alliance for Open Scholarship, for example, has some people from Dryad who are involved in it. So it’s nice to work with them, and we’d love to work with them to use Dat to upload and distribute data. But right now, if researchers need to share files between many machines and keep them updated and versioned, for example if there’s a large, live-updating data set, there really aren’t great solutions that address data versioning and sharing. In terms of sharing and transferring, lots of researchers still manually copy files between machines and servers, or use tools like rsync or FTP, which is how I handled it during my PhD. Other software such as Globus or even Dropbox can require more IT infrastructure than a small research group may have; researchers are all operating on limited grant funding, and they also depend on the IT structure of their institution to get them access to certain things. A researcher like me might spend all day collecting a terabyte of data on a microscope and then wait for hours, or overnight, to move it to another location. The ideal situation from a data management perspective is that the raw data is automatically archived to a web server and sent to the researcher’s computer for processing, so you have an archived copy of the raw data that came off of the equipment. In the process, files also need to be archived at each step, so you need archives of the imaging files, in this case, at each stage of processing. Then when a publication is ready, in order for the data processing pipeline to be fully reproducible, you’ll need the code and you’ll need the data at the different stages, and even without access to the computer or the cluster where the analysis was done, a person should be able to repeat it. And I say ideally, because this isn’t really how it’s happening now.

Some of the things that stop data from being archived at the different steps are just the cost of storage, the availability of storage, and researcher habits. I definitely know some researchers who kept data on hard drives in Tupperware to protect them in case the sprinklers ever went off, which isn’t really a long-term solution, true facts. So Dat can automate these archiving steps at different checkpoints and make the backups easier for researchers. As a former researcher, I’m interested in anything that makes better data management automatic for researchers. We’re also interested in versioned compute environments to help labs avoid the drawer-full-of-Jaz-drives problem, which is sadly a quote from a senior scientist who was describing a bunch of data collected by her lab that she can no longer access: she has the drawer, she has the Jaz drives, she can’t get into them, and that data is essentially lost. Researchers are really motivated to make sure that when things are archived, they’re archived in a form where they can actually be accessed, but because researchers are so busy, it’s really hard to know when that is. Because we’re so focused on filling in the gaps between the services that researchers already use, in ways that work well for them, and on automating things, I think Dat is in a really good position to solve some of these problems. Among the researchers we’re working with now, I’m thinking of one person who has a large data set and a bioinformatics pipeline, and he’s at a UC lab, and he wants to get all the information to his collaborator right here in Washington State. It’s taken months, and he has not been able to do it; he just can’t move that data across institutional lines. That’s a much longer conversation as to why exactly that isn’t working, but we’re working with him to try to make it possible for him to move the data and create a versioned emulation of his compute environment, so that his collaborator can just do what he was doing and not need to spend four months worrying about dependencies and such. So yeah, hopefully that answers the question.

Tobias Macey 37:39

And one of the other difficult aspects of building a peer-to-peer protocol is the fact that in order for there to be sufficient value in the protocol itself, there needs to be a network of people behind it to share that information with and to share the bandwidth requirements of distributing that information. So I’m wondering how you have approached the effort of building up that network, and how much progress you feel you have made in that effort?

Joe Hand 38:08

Yeah, I’m not sure we really view Dat as a traditional peer-to-peer protocol, in the sense of relying on network effects to scale. As Danielle said, we’re just trying to get data from A to B, so our critical mass is basically two users on a given data set. Obviously, we want to first build something that offers better tools for those two users than the traditional cloud or client-server model. If I’m transferring files to another researcher using Dropbox, we have to transfer files via a third party and a third computer before they can get to the other computer. So rather than going direct between two computers, we have to take a detour, and this has implications for speed, but also for security, bandwidth usage, and even something like energy usage. By cutting out that third computer, we feel like we’re already adding value to the network. We’re hoping that researchers who are doing these HTTP transfers can see the value of going directly, and of using something that is versioned and can be live-synced, over existing tools like rsync or FTP, or the commercial services that might store data in the cloud. And we really don’t have anything against the centralized services; we recognize that they’re very useful sometimes, but they also aren’t the answer to everything. Depending on the use case, a decentralized system might make more sense than a centralized one, and we want to offer developers and users the option to make that choice, which we don’t really have right now. But in order to do that, we really have to start with peer-to-peer tools first. Once we have that decentralized network, we can limit the network to one server peer and many clients, and then all of a sudden it’s centralized. We understand that it’s easy to go from decentralized to centralized, but it’s harder to go the other way around, so we have to start with a peer-to-peer network in order to solve all these different problems. The other thing is that we know file systems are not going away, we know that web browsers will continue to support static files, and we also know that people will want to move these things between computers, back them up, archive them, and share them to different computers. So we know files are going to be transferred a lot in the future, and that’s something we can depend on. People probably even want to do this in a secure way sometimes, and maybe in an offline environment or on a local network. So we’re basically trying to build from those basic principles, using peer-to-peer transfer as the bedrock of all that, and that’s how we got to where we are now with the peer-to-peer network. But we’re not really worried that we need a certain critical mass of users to add value, because we feel like by building the right tools with these principles, we can start adding value whether it’s a decentralized network or a centralized network.

Tobias Macey 40:59

And one of the other use cases that’s been built on top of Dat is being able to build websites and applications that can be viewed in a web browser and distributed peer to peer in that manner. So I’m wondering how much uptake you’ve seen in usage for that particular application of the protocol, and how much development effort is being focused on that particular use case?

Joe Hand 41:20

Yeah, so if I open my Beaker browser right now, which is the main web implementation we have, the one that Paul Frazee and Tara Vancil are working on, I think I usually have 50 to 100, or sometimes 200, peers that I connect to right away. That’s through some of the social network apps, like Rotonde or Fritter, and then just some personal sites. We’ve been working with the Beaker browser folks probably for two years now, co-developing the protocol and seeing what they need support for in Beaker. But it comes back to that basic principle that a lot of websites are static files, and if we can just support static files in the best way possible, then you can browse a lot of websites. That even gives you the benefit, for things that are more interactive, that we know they have to be developed so they work offline too. So both Rotonde and Twitter can work offline, and then once you get back online, you can just sync the data seamlessly. That’s the most exciting part about those.

Danielle Robinson 42:29

You mean Fritter, not Twitter.

Fritter is the Twitter clone that Tara Vancil and Paul made. Beaker’s a lot of fun, and if you’ve never played around with it, I would encourage you to download it; I think it’s just beakerbrowser.com. I’m not a developer by trade, but I have seriously enjoyed playing around in Beaker, and I think some of the more frivolous things like Fritter that have come out of it are a lot of fun and really speak to the potential of peer-to-peer networks in today’s era, as people are becoming increasingly frustrated with the centralized platforms.

Tobias Macey 43:13

And given that the content being distributed via Dat in the browser is primarily static in nature, I’m wondering how that affects the architectural patterns that people are used to with the common three-tier architecture. You’ve already mentioned a couple of social network applications that have been built on top of it, but I’m wondering if there are any others, built on top of and delivered via Dat, that you’re aware of and could talk about, that speak to some of the ways people are taking advantage of Dat in more of the consumer space?

Joe Hand 43:47

Yeah, I think one of the big shifts that has made this easier is having databases in the browser, things like IndexedDB or other local storage databases, and then being able to sync those to other computers. As long as you know that you’re only writing to your own database, you can build on that; I think people are trying to build games off this. You could build a chess game where I write to my local database, you have some logic for determining whether a move is valid or not, and then you sync that to your competitor. It’s a more constrained environment, but I think that also gives you the benefit of being able to constrain your development and not require these external services or external database calls or whatever. I’ve tried a few times to develop projects, just fun little things, and it is a challenge, because you have to think differently about how those things work, and you can’t necessarily rely on external services, whether that’s something as simple as loading fonts from an external service, or CSS styles, or external JavaScript; you want that all to be packaged within one Dat if you want to ensure it’s all going to work. So you have to think a little differently, even on those simple things. But yeah, it does constrain bigger applications. I think the other area where we could see development is more in Electron applications, so maybe not in Beaker, but in Electron, using that framework as a platform for other types of applications that might need those more flexible models. Science Fair, which is one of our hosted projects, is a really good example of how to use Dat to distribute data but still have a full application. You can distribute all the data for the application over Dat and keep it updated through the live syncing, and users can download just the PDFs they need to read, or the journals or the figures they want. Letting users download whatever they want gives developers a flexible model where you can distribute things peer to peer, have the live syncing, and still fetch only the data each user needs, with Dat providing the framework for that data management.
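As a rough sketch of the local-first pattern described here, for something like the chess example: each player appends moves to their own single-writer log, validates incoming moves, and leaves replication to the sync layer (a Dat archive, an in-browser database that gets replicated, or anything similar). The code below is a generic illustration, not tied to any particular Dat library, and the validation is deliberately stubbed out.

    // Minimal local-first game state: each peer owns an append-only log of its own
    // moves and merges the opponent's log whenever the sync layer delivers it.
    const myMoves = [];      // moves I have made, in order (only I write here)
    const theirMoves = [];   // opponent's moves, replicated to me by the sync layer

    function isValidMove(move, history) {
      // Real chess rules would go here; the sketch only checks the basic shape.
      return typeof move.from === 'string' && typeof move.to === 'string';
    }

    function makeMove(move) {
      const history = myMoves.concat(theirMoves);
      if (!isValidMove(move, history)) throw new Error('illegal move');
      myMoves.push({ ...move, seq: myMoves.length, ts: Date.now() });
      // The sync layer is responsible for delivering myMoves to the opponent;
      // because only one peer writes to each log, there is nothing to reconcile.
    }

    function receiveOpponentMoves(log) {
      // Append only the entries we have not seen yet.
      for (const move of log.slice(theirMoves.length)) {
        if (isValidMove(move, myMoves.concat(theirMoves))) theirMoves.push(move);
      }
    }

    makeMove({ from: 'e2', to: 'e4' });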

Tobias Macey 46:15

And one of the other challenges that’s posed, particularly for this public distribution use case, is content discovery, because by default the URLs that are generated are private and unguessable, since they’re essentially just hashes of the content. So I’m wondering if there are any particular mechanisms that you have built, planned, or started discussing for facilitating content discovery of the information that’s being distributed by these different networks?

Joe Hand 46:50

Yeah, this is definitely an open question. I’ll fall back on my common answer, which is that it depends on the tool we’re using and the different communities; there are going to be different approaches, some more decentralized and some more centralized. For example, with data set discovery, there are a lot of good centralized services for data set publishing, as Danielle mentioned, like Zenodo or Dataverse. These are places that already have discovery engines, I guess we’ll say, and they publish data sets, so you could similarly publish the Dat URL along with those data sets so that people have an alternative way to download them. That’s one way we’ve been thinking about discovery: leveraging these existing solutions that are doing a really good job in their domain and trying to work with them to start using Dat for their data management. Another, sort of hacky, solution is using existing domains and DNS. Basically, you can publish a regular HTTP site on your URL and give it a specific well-known file that points to your Dat address, and then the Beaker browser can find that file and tell you that a peer-to-peer version of that site is available. So we’re basically leveraging the existing DNS infrastructure to start to discover content just with existing URLs. And I think a lot of the discovery will be more community-based. In Fritter and Rotonde, for example, people are starting to build crawlers or search bots to discover users or search content, so it’s basically a matter of looking at where there is a need and identifying different types of crawlers to build and how to connect those communities in different ways. We’re really excited to see what ideas pop up in that area, and they’ll probably come in a decentralized way, we hope.
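Here is a small sketch of the well-known-file idea mentioned above: a regular HTTPS site serves a file mapping its domain to a Dat key, and a client fetches and parses it. The path and format used here (/.well-known/dat containing a dat:// URL on the first line) follow Beaker’s convention as best recalled, so treat them as an assumption rather than a spec reference.

    // Sketch of resolving a domain to a Dat key via a well-known file.
    // Assumes the /.well-known/dat path and a "dat://<hex key>" first line;
    // the exact convention may differ from what Beaker actually shipped.
    async function resolveDatForDomain(domain) {
      const res = await fetch(`https://${domain}/.well-known/dat`);
      if (!res.ok) throw new Error(`no well-known dat file found on ${domain}`);

      const firstLine = (await res.text()).split('\n')[0].trim();
      const match = firstLine.match(/^dat:\/\/([0-9a-f]{64})$/);
      if (!match) throw new Error('well-known file did not contain a dat key');

      return match[1]; // hex public key identifying the archive for this domain
    }

    // Usage (Node 18+ or a browser, both of which provide fetch):
    // resolveDatForDomain('example.com').then(key => console.log(key));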

Tobias Macey 48:46

And for somebody who wants to start using Dat, what is involved in creating and/or consuming the content that’s available on the network? And are there any particular resources available to get somebody up to speed on how it works and some of the different uses that they could put it to?

Danielle Robinson 49:05

Sure, I can take that, and Joe, just chime in if you think of anything else. We built a tutorial for our work with the labs and for MozFest this year; that’s at try-dat.com. The tutorial takes you through how to work with the command-line tool and some basics about Beaker. Please tell us if you find a bug; there may be bugs remaining, but it was working pretty well when I used it last, and it runs in the browser. It spins up a little virtual machine, so you can either share data with yourself or do it with a friend and share data with your friend. Beaker is also super easy for a user who wants to get started: you can visit pages over Dat just like you would a normal web page. For example, you can go to this website, and we’ll give Tobias the link to it, and just change the http to dat, so it looks like dat://jhand.space. Beaker also has this fun thing that lets you create a new site with a single click, and you can fork sites, edit them, and make your own copies of things, which is fun if you’re learning about how to build websites. You can go to beakerbrowser.com and learn about that. I think we’ve already talked about Rotonde and Fritter, and we’ll add links for people who want to learn more about those. Then for data-focused users, you can use Dat for sharing or transferring files, either with the desktop application or the command-line interface. If you’re interested, we encourage you to play around; the community is really friendly and helpful to new people. Joe and I are always on the IRC channel or on Twitter, so if you have questions, feel free to ask. We love talking to new people, because that’s how all the exciting stuff happens in this community.
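On the consuming side, downloading an existing Dat from code looked roughly like this with the dat-node library; as with the earlier sketch, the option names and callback shape are from memory and may not match the current release, and the link shown is a placeholder.

    // Sketch of cloning an existing Dat archive with dat-node (API details from memory).
    const Dat = require('dat-node');

    const link = 'dat://<64-character-hex-key>';  // placeholder; use a real share link

    Dat('./downloaded-data', { key: link }, (err, dat) => {
      if (err) throw err;

      dat.joinNetwork((err) => {
        if (err) throw err;
        if (!dat.network.connected) console.log('No peers found yet, still looking...');
      });

      // Files appear in ./downloaded-data as they sync, and they stay live-updated
      // for as long as the process keeps running and the source is online.
    });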

Tobias Macey 50:58

And what have been some of the most challenging aspects of building the project and the community, and of promoting the use cases and capabilities of the project?

Danielle Robinson 51:10

I can speak a little bit to promoting it in academic research. In academic research, probably similar to many of the industries where your listeners work, software decisions are not always made for entirely rational reasons. There’s tension between what your boss wants, what the IT department has approved, what the institutional data security needs are, and then the perceived time cost of developing a new workflow and getting used to a new protocol. We try to work directly with researchers to make sure the things we build are easy and secure, but it takes a lot of promotion and outreach to get scientists to try a new workflow. They’re really busy, and the incentives are all: get more grants, do more projects, publish more papers. So even if something will eventually make your life easier, it’s hard to sink in time up front. One thing I’ve noticed, and this is probably common to all industries, is that I’ll be talking to someone and they’ll say, oh, archiving the data from my research group is not a problem for me, and then they’ll proceed to describe a super problematic data management workflow. It’s not a problem for them anymore because they’re used to it, so it doesn’t hurt day to day. But doing things like waiting until the point of publication and then trying to go back and archive all the raw data, where maybe some of it was collected by a postdoc who’s now gone and other data was collected by a summer student who used a non-standard naming scheme for all the files, there are just a million ways that stuff can go wrong. So for now, we’re focusing on developing real-world use cases and participating in community education around data management. We want to build stuff that’s meaningful for researchers and others who work with data, and we think that working with people and doing the nonprofit thing, with grants, is going to be the way to get us there. Joe, do you want to talk a little bit about building?

Joe Hand 53:03

Yeah, sure. In terms of building it, I haven't done too much work on the core protocol, so I can't say much about the difficult design decisions there. I'm the main developer on the command line tool, and most of the challenging decisions there are about user interfaces, not necessarily technical problems. So as Danielle said, it's as much about people as it is about software. But I think one of the most challenging things that we've run into a lot is basically network issues. In a peer-to-peer network, you have to figure out how to connect to peers directly on a network that might not be set up to allow that. I think a lot of that comes from BitTorrent making different institutions restrict peer-to-peer networking in different ways. So we're having to fight that battle against these existing restrictions, trying to find out how these networks are restricted and how we can continue to have success in connecting peers directly rather than through a third-party server. And it's funny, or maybe not funny, but some of the strictest networks we've found are actually in academic institutions. For example, at one of the UC campuses, I think we found out that computers can never connect directly to other computers on the same network. So if we wanted to transfer data between two computers sitting right next to each other, we basically had to go through an external cloud server just to get it to the computer sitting right next to it, or, you know, use something like a hard drive or a thumb drive. All these different network configurations are one of the hardest parts, both in terms of implementation and in terms of testing, since we can't readily get into these UC campuses to see what the network setup is. So we're trying to create more tooling around networks, both testing networks in the wild and using virtual networks to simulate different types of network setups, and leveraging those two things combined to try and get around all these network connection issues. So yeah, I would love to have Mathias answer this question around the design decisions in the core protocol, but I can't really say much about that, unfortunately.

Tobias Macey 55:29

And are there any particularly interesting or inspiring uses of Dat that you're aware of that you'd like to share?

Danielle Robinson 55:36

Sure, I can share a couple of things that we were involved in. Last January, in 2017, we were involved in the Data Rescue and Libraries+ Network community, which was the movement to archive government funded research at trusted public institutions like libraries and archives. As a part of that, we got to work with some of the really awesome people at the California Digital Library. The California Digital Library is really cool because it's a digital library with a mandate to preserve, archive, and steward the data that's produced in the UC system, it supports the entire UC system, and the people are great. We worked with them to make the first ever backup of data.gov, and I think my colleague had 40 terabytes of metadata sitting in his living room for a while as we were working up to the transfer. That was a really cool project, and it produced a useful thing. We got to work with some of the data.gov people to make it happen, and they were like, really, it has never been backed up? So it was a good time to do it. But believe it or not, it's actually pretty hard to find funding for that work, and we have more work we'd like to do in that space. Archiving copies of federally funded research at trusted institutions is a really critical step towards ensuring the long term preservation of the research that gets done in this country, so hopefully 2018 will see those projects funded, or new collaborations in that space. It's also a fantastic community, because it's a lot of really interesting librarians and archivists who have great perspective on long term data preservation, and I love working with them, so hopefully we can do something else there. Then the other thing that I'm really excited about is working on the Dat in the Lab project and the container piece of that. I know we're running a little over time, so I don't know how much I should go into this, but we've learned a lot about really interesting research. We're working to develop a container-based simulation of a research computing cluster that can run on any machine or in the cloud. By creating a container that includes the complete software environment of the cluster, researchers across the UC system can quickly get the analysis pipelines they're working on usable in other locations. And this, believe it or not, is a big problem. I was sort of surprised when one researcher told me she had been working for four months to get a pipeline running at UC Merced that had been developed at UCLA. You could drive back and forth between Merced and UCLA a bunch of times in four months, but it's this little stuff that really slows research down, so I'm really excited about the potential there. We've written a couple of blog posts on that, so I can add the links to those in the follow-up.

Joe Hand 58:36

And I'd say the most novel use that I'm excited about is called Hypervision. It's basically video streaming built on Dat. Mathias Buus, one of the lead developers on Dat, is prototyping something similar with Danish public TV, and they basically want to live stream their channels over the peer-to-peer network. I'm excited about that because I'd really love to get more public television and public radio distributing content peer-to-peer, so we can reduce their infrastructure costs and hopefully allow for more of that great content to come out.

Tobias Macey 59:09

Are there any other topics that we didn’t discuss yet? What do you think we should talk about before we close out the show?

Danielle Robinson 59:15

Um, I think I’m feeling pretty good. What about you, Joe?

Joe Hand 59:18

Yeah, I think that’s it for me. Okay.

Tobias Macey 59:20

So for anybody who wants to keep up to date with the work you're doing or get in touch, we'll have you each add your preferred contact information to the show notes. And as a final question, to give the listeners something else to think about, from your perspective, what is the biggest gap in the tooling or technology that's available for data management today?

Joe Hand 59:42

I'd say transferring files, which feels really funny to say, but to me it's still a problem that's not really well solved: how do you get files from A to B in a consistent and easy to use manner? We especially want a solution that doesn't require a command line, is still secure, and hopefully doesn't go through a third party service, because hopefully that also means it works offline. A lot of what I saw in the developing world is the need for data management that works offline, and I think that's one of the biggest gaps that we don't really address yet. There are a lot of great data management tools out there, but I think they're aimed more at data scientists or software-focused users that might use managed databases or something like Hadoop. There's really a ton of users out there that don't have tools. Most of the world is still offline or has inconsistent internet, and putting everything through servers in the cloud isn't really feasible. But the alternatives now require careful, manual data management if you don't want to lose all your data. So we really hope to find a good balance between those two needs and those two use cases.

Danielle Robinson 01:00:48

Plus one to what Joe said: transferring files. It does feel funny to say, but it is still a problem in a lot of industries, and especially where I come from in research science. From my perspective, I guess the other issue is that the people problems are always as hard or harder than the technical problems. If people don't think that it's important to share data or archive data in an accessible and usable form, we could have the world's best, easiest to use tool and it wouldn't impact the landscape or the accessibility of data. Similarly, if people are sharing data that's not usable, because it's missing experimental context, or it's in a proprietary format, or because it's shared under a restrictive license, it's also not going to impact the landscape or be useful to the scientific community or the public. So we want to build great tools, but I also want to work to change the incentive structure in research to ensure that good data management practices are rewarded and that data is shared in a usable form. That's really key. I'll add a link in the show notes to the FAIR data principles, which say data should be findable, accessible, interoperable, and reusable, something that your listeners might want to check out if they're not familiar with it. It's a framework developed in academia, but I'm not sure actually how much impact it's had outside of that sphere, so it would be interesting to talk with your listeners a little bit about that. And yeah, I'll put my contact info in the show notes. I'd love to connect with anyone and answer any further questions about Dat and what we're going to try to do with Code for Science & Society over the next year. So thanks a lot, Tobias, for inviting us.

Tobias Macey 01:02:30

Yeah, absolutely. Thank you both for taking the time out of your days to join me and talk about the work you’re doing. It’s definitely a very interesting project with a lot of useful potential. And so I’m excited to see where you go from now into the future. So thank you both for your time and I hope you enjoy the rest of your evening.

Unknown Speaker 01:02:48

Thank you. Thank you.

Transcribed by https://otter.ai
