Summary
We have tools and platforms for collaborating on software projects and linking them together, so wouldn't it be nice to have the same capabilities for data? The team at data.world is building a platform to host and share datasets, for public and private use, that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, its mission, and how they have built and evolved their technical infrastructure.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- This is your host Tobias Macey and today I’m interviewing Bryon Jacob about the technology and purpose that drive data.world
Interview
- Introduction
- How did you first get involved in the area of data management?
- What is data.world, what is its mission, and how does your status as a B Corporation tie into that?
- The platform that you have built provides hosting for a large variety of data sizes and types. What does the technical infrastructure consist of and how has that architecture evolved from when you first launched?
- What are some of the scaling problems that you have had to deal with as the amount and variety of data that you host has increased?
- What are some of the technical challenges that you have been faced with that are unique to the task of hosting a heterogeneous assortment of data sets that are intended for shared use?
- How do you deal with issues of privacy or compliance associated with data sets that are submitted to the platform?
- What are some of the improvements or new capabilities that you are planning to implement as part of the data.world platform?
- What are the projects or companies that you consider to be your competitors?
- What are some of the most interesting or unexpected uses of the data.world platform that you are aware of?
Contact Information
- @bryonjacob on Twitter
- bryonjacob on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- data.world
- HomeAway
- Semantic Web
- Knowledge Engineering
- Ontology
- Open Data
- RDF
- CSVW
- SPARQL
- DBPedia
- Triplestore
- Header Dictionary Triples
- Apache Jena
- Tabula
- Tableau Connector
- Excel Connector
- Data For Democracy
- Jonathan Morgan
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. This is your host Tobias Macey, and today I'm interviewing Bryon Jacob about the technology and purpose that drive data.world. So, Bryon, could you please introduce yourself?
[00:01:23] Bryon Jacob:
Sure, yeah, thanks for having me on the podcast. My name is Bryon Jacob. I'm the CTO and one of the four cofounders at data.world. The four of us had all known each other and worked together in various combinations at a number of companies here in Austin over the last 20 years. So we brought a lot of different experiences, our networks, and our history of starting and working at other companies to bear on data.world, which I think we would all four agree is definitely the most ambitious company we've started.
[00:02:00] Tobias Macey:
And how did you first get involved in the area of data management?
[00:02:04] Bryon Jacob:
Yeah, for me it's mostly been as a consumer and an implementer of data management solutions. Earliest on, in grad school, I was studying artificial intelligence, but 20 years ago, so a lot of the things that are popular now, but in a much different form then. Through AI I got interested in a lot of the representational work that really became what is the semantic web. After leaving grad school I worked for a few different companies. I worked for a company called Trilogy here in Austin, which was a big enterprise software company that did a lot of configuration management, and I gravitated towards solving some of the hairier data integration problems we had integrating that technology into various data management layers at our customers.

I worked for a short while at Amazon, on their drop ship team, which was basically again data integration: solving the integration between Amazon's inventory systems and the third party warehouses that serviced some of their goods. And then I spent the last 10 years before starting data.world at a company called HomeAway, which is an online marketplace for vacation rentals. What was technically interesting about HomeAway, and really what drove me through a lot of different projects there over 10 years, was that we started the company as a roll-up. There were a lot of smaller regional vacation rental listing sites, and the business really started by aggregating those into a single online marketplace. Over the time I was there, we made over 30 acquisitions.

So I got to really deal with the problems inherent in integrating 30 existing platforms, and it was over the same period of time, 2006 to 2016, that we were standing up a data lake, building big data solutions, and then building a data science team. I got a lot of experience through just the practice of getting that platform pulled together.
[00:04:17] Tobias Macey:
Yeah, it sounds like it prepared you quite nicely for the work that you're doing with data.world, as far as being able to merge and incorporate and host disparate datasets that aren't necessarily in the same formats or even sharing any common ground at the foundational layer.
[00:04:35] Bryon Jacob:
Yeah, a lot of the same problems just kept coming up over and over again. While we were very practically solving these data integration problems and trying to keep up with the best practices and the state of the art, I was always kind of in the back of my head watching the semantic web world and wishing that there were more of those tools in the mainstream, that there was more of it at scale. There's really powerful technology there for normalizing and standardizing data, and after a while it seemed like it was getting further and further from the mainstream even as the technology was developing, because so many other techniques that were shorter-term, easier to consume, and worked better with the existing standard best practices kept coming along, and more and more people were getting into data. So there were increasingly more tools aiming at the lowest common denominator.

Where some of the early ideas for data.world came from was really just thinking about how even people in analytics and data science roles are dealing with CSV data and spreadsheet data, the kind of things that are extracted from bigger systems or pulled from APIs, and that data is almost completely unmanaged the way it's worked with today. We started seeing opportunities where putting in a standardizing layer that lets you embed some knowledge and some metadata into that data could be a way to really level up the way people were working with data, in a way that leveraged that powerful semantic web technology without having to hit everyone over the head with it and make everyone become an expert in this fairly esoteric technology in order to benefit from it.
[00:06:30] Tobias Macey:
And that fits nicely into what you're doing with data.world as far as providing a foundational data layer for a lot of different types of open data to be hosted and analyzed and linked together. So I'm wondering if you can describe a bit more about the work that you're doing there, the mission of the company, and how your status as a B Corporation ties into that.
[00:06:55] Bryon Jacob:
Yeah. As we started brainstorming the business ideas, it evolved from "it'd be great if there was a platform that did this" to "what does it mean? What would the business be? What are the implications of it?" We really started hitting on the idea that there's a flywheel effect you could get going. There's not a lot of data that exists in this linked format, where you can effortlessly move from one dataset to the next because the metadata tells you that they're about the same topics and you've got those semantic connections. There'd be a lot of value that could be unlocked if there was more data that worked like that.

And people do this work all the time. It's a common meme that 80% or 90% or 70%, whatever percentage people throw around, of the work people do in data is just finding datasets, cleaning them up, and making them work together. The insight we started having was: that's knowledge engineering. That's the same kind of work that ontologists are doing when they're curating data and linking it together, but people don't call it that, they don't necessarily know that's what they're doing, and that work often gets thrown away. At best, maybe that knowledge work is in some reusable Python or R code somewhere, checked into a Git repository so that a team can reuse it. But by and large, that metadata is not getting attached back to the data in a useful way, so the next person who comes along to use that data asset can't pick up where the previous person left off and benefit from all the work that had already been done. So we started seeing that the data would be more valuable if it was better linked, and that people are actually doing the work required to link it, but we're not capturing the output of that work. If we could build tools that really understand the workflows people have with data and iteratively improve those workflows, rather than trying to be something completely different, just trying to be a better way of managing that data workflow, then we could get in the middle of some of those pieces of work, capture that knowledge, and feed it back into the data, which makes the data more valuable. Then that flywheel could start to go, where you really get the network effect, both in terms of connecting up teams of people working on data and better connecting the data itself.

As we started thinking about that, it became pretty quickly evident that this would be a lot more powerful, and a lot more valuable an asset for the world, if it was done in an open fashion, where we're linking up the world's open data: the data coming from governments, the data coming from academia, and the data that companies are starting to open up and share, which I think is going to be a really big force going forward.

We thought that if we could find an economic model that let most of the users on the platform use it for free to work on open data, forever, then we could really level up the way people work with data, level up the data itself, and make the open data of the world more valuable. And that would provide huge business value to folks, because it gets your company's private data adjacent to all that open data, in a format where you can find the relevant open data that helps solve the problems you need to solve, just by the fact that your data is about the same topics and it's linked. All of those things would converge. Then, finally, coming back to the B Corp: we started looking at the B Corp structure, and it was a pretty easy decision for us once we educated ourselves about it. I have to give a lot of credit to our investors, because when we first started the company we hadn't really had the insight about a B Corp; we just incorporated as a standard C corporation.

Then we got really excited about the idea of a B Corp and thought it made a lot of sense, and that required all of our investors to educate themselves and get on board with making that change. But again, it wasn't hard. We explained how it aligned with the mission of the company and how it gave us a lot of credibility, a lot of standing, to be able to say that this is still a for-profit company, but we have a social mission that's going to be a big part of the decisions we make. And it was pretty easy to get everyone on board.
[00:11:17] Tobias Macey:
And the underlying platform that you've built for hosting all of the different datasets, both open and private, and linking them together has surely evolved a lot of complexity given the different sizes and types of data. So I'm wondering if you can describe what that architecture looks like and the different tooling that you're using to support those different datasets.
[00:11:42] Bryon Jacob:
Sure. Yeah. Something we homed in on almost from the very beginning was that we didn't want to be closed off as to what we consider data. We've spent a lot of time optimizing around how we deal with structured or semi-structured data and turning that into a queryable database, but there's a lot of value that people get from putting things like images, PDFs, even audio files into a dataset alongside that structured data to help other people understand it. So we first optimized around making sure that, from a certain layer, you could look at a dataset as just storage: you can put bytes in and get those same bytes out, and those bytes can mean whatever you want them to mean. Then we can observe what file types people seem to be using and getting value from, and handle those to the best of our ability. We started with a certain set of structured file types, and we've been adding more as we see more signal.

Similarly, on unstructured things, we noticed people were putting in a lot of source code, so we added syntax highlighting and some things to help people visualize that within the platform, and so on. But regarding structured data, the way our architecture works around making all of that queryable is, as I mentioned, everything is based on semantic web technology. What we do is build an RDF model of any structured data that comes in. If you're putting in data that is already in one of the serialized forms of RDF, we can just load that up directly.

If you're putting in other data that looks graph-like, we can convert that to RDF. For data that's tabular in nature, which is by far the majority of what we see (CSV files, Excel spreadsheets, relational database dumps, some geo formats that have a tabular structure in them), we parse those files and build an RDF model based on the CSVW specification, which is a spec put out by the W3C for modeling tabular data in RDF. What this lets us do is, logically, you can look at data.world as a single queryable database where each dataset represents a graph, if you're thinking of it as a graph database.
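To make that ingest model concrete, here's a minimal sketch in Python using the rdflib library: each cell of a tabular file becomes an atomic RDF statement about its row, loosely in the spirit of the W3C CSVW mapping. The namespace and column URIs are invented for illustration; this is not data.world's actual ingest code.

```python
# Minimal sketch: decompose one CSV table into RDF triples, loosely in the
# spirit of CSVW. Namespace and URIs are hypothetical, for illustration only.
import csv
import io

from rdflib import Graph, Literal, Namespace

CSV_TEXT = "city,zip_code\nAustin,78701\nDallas,75201\n"

DS = Namespace("https://example.org/my-dataset/")  # hypothetical dataset namespace

g = Graph()
for row_num, row in enumerate(csv.DictReader(io.StringIO(CSV_TEXT))):
    row_node = DS[f"cities/row-{row_num}"]  # one subject per table row
    for column, value in row.items():
        # Each cell becomes an atomic statement: (row, column, value).
        g.add((row_node, DS[f"cities/col-{column}"], Literal(value)))

print(g.serialize(format="turtle"))
```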
At any point, you as a user have access to some portion of that graph. You have access to your own private data and any private data that another user has shared with you. If you're a member of an organization, you'll have access to that organization's datasets, and if somebody has added you as a collaborator, you have access to theirs. You can think of it as very similar to a collaboration platform like GitHub: you might have organizations you're a member of, you might have one-off collaborations that you participate in, and then there's all of the open data on the platform. And you can query that data as though you have access to all of those graphs.

The native query language of the semantic web, and therefore of our platform, is a language called SPARQL (S-P-A-R-Q-L). That's another W3C standard query language, for the graph database structure of RDF. One of the nice features of SPARQL is that it has really great support built in for federated queries. So you can query across any datasets in our platform, but also against public SPARQL endpoints. There are a bunch of examples of those; the easiest one to explain, for an audience that doesn't do a lot of linked data, is one called DBpedia, which is basically a linked database of structured data extracted from the pages that people edit on Wikipedia.

So if you think about all the information on Wikipedia that you could click through and read as a person, imagine all that data pulled out into a structured, searchable database that's open on the Internet. You can federate queries between data.world datasets and that database. And then one thing that we've done, which is, you know, I get really excited talking about the SPARQL endpoint and all this RDF, and me and tens of other people in the world care, but a lot of the data in our platform is tabular in nature, and a lot more people understand SQL than SPARQL. So one of the things we've built is basically a SQL-to-SPARQL transpiler. Just as you can use our platform as though it's a big distributed graph database where you have access to a portion of the graph, if there are tables stored in each of those datasets, you can query them with SQL the same way you would a large distributed relational database.
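As a rough illustration of those two query paths, the sketch below uses the datadotworld Python SDK to run a federated SPARQL query that reaches out to DBpedia via a SERVICE clause, and then a plain SQL query that the platform transpiles to SPARQL behind the scenes. The dataset key, table, and column names are hypothetical, and an API token is assumed to be configured.

```python
# Sketch of the two query paths described above, via the datadotworld SDK.
# Dataset key, table, and column names are hypothetical.
import datadotworld as dw

# SPARQL path: federate between a data.world dataset and the public
# DBpedia endpoint using the standard SERVICE clause.
SPARQL_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?abstract WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    ?city a dbo:City ;
          dbo:abstract ?abstract .
    FILTER (lang(?abstract) = "en")
  }
} LIMIT 5
"""
sparql_results = dw.query("someuser/cities", SPARQL_QUERY, query_type="sparql")

# SQL path: tabular files in the dataset are queryable as relational tables.
sql_results = dw.query("someuser/cities", "SELECT city, zip_code FROM cities LIMIT 5")

print(sparql_results.dataframe.head())
print(sql_results.dataframe.head())
```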
[00:16:39] Tobias Macey:
And how has that architecture evolved from when you first launched, and what are some of the scaling problems that you've had to deal with as you have grown the platform and the variety and amount of data that you're dealing with?

[00:16:59] Bryon Jacob:
Well, we were very fortunate to start this with a pretty experienced team, so we've been really deliberate about what we chose to scale. Early on, we made a point of making sure that the thing we optimized for first was account- and resource-level scale. We wanted to make sure that whatever we did would be able to quickly scale up to support any number of users and any number of datasets. We didn't want to suddenly have the good fortune of finding a great viral connection to a community, growing by a huge number of users, and having everything fall over, especially since we knew we had the resources to avoid that. The trade-off we made was that we started with relatively small limits per dataset. We kept the size of each dataset small, and we've been incrementally increasing that limit as we evolve the tech.
When we first launched, we were using an off-the-shelf open source triplestore, an RDF database, and we had written a little bit of custom orchestration code to load an elastic cluster of those servers. We found pretty quickly that it was going to be really hard to maintain that account-level scale. The query engine was great, but the loading of data in bulk as things changed was getting too unpredictable, the spikes were hard to manage, and it wasn't something we could just push to being completely asynchronous. So we did a major revisit of the query architecture, and we ended up using a really interesting technology called HDT, which stands for Header, Dictionary, Triples. It's a binary RDF serialization.

It has the nice characteristic that, because of the way the files are organized, they're queryable in place. What that meant for us was that we were able to build, I say a custom query layer, which is true, but it's custom in the sense of orchestrating together a couple of really great open source components. As we ingest data, we convert it to RDF and serialize it out to HDT files, which we store on really cheap cloud storage. We have a network file system layer that we use as our cache, where we load data onto that fast NFS layer when it's accessed. And then we have a custom query layer that uses Apache's Jena, an open source SPARQL engine, as the query engine over those files.

This has given us a query layer that is inexpensive and easy to operate at the level of network economics, and that will scale out linearly very easily.
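For a feel of what "queryable in place" means, here's a small sketch using rdflib with the rdflib-hdt extension, which memory-maps an HDT file and answers SPARQL against it without a bulk load step. This only illustrates the HDT format; per the discussion above, data.world's actual layer orchestrates Apache Jena on the JVM, and the file name here is hypothetical.

```python
# Sketch: query an HDT file in place, assuming the rdflib-hdt package.
# Illustrates the HDT idea only; not data.world's internal (Jena-based) layer.
from rdflib import Graph
from rdflib_hdt import HDTStore

# The HDT file is memory-mapped and queried directly; no import/load phase.
store = HDTStore("example.hdt")  # hypothetical local file
g = Graph(store=store)

results = g.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
for s, p, o in results:
    print(s, p, o)
```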
[00:20:00] Tobias Macey:
And what are some of the technical challenges you've faced that are unique to the task of hosting such a heterogeneous assortment of datasets and their intended shared use?
[00:20:14] Bryon Jacob:
One of the reasons we chose the semantic web tech and RDF for this is that it's a universal data format. If you think about how triplestores, databases like this, are organized, everything is broken down logically into statements of the form subject, predicate, object. It's all broken down into atomic facts. So as we think about adding new types of files, in a certain sense it's as simple as: how do I parse that file format and build a model of it in that RDF graph? That's enabled us to support a lot of file formats, and there are a lot more we could continue to support that way. The things that approach doesn't handle are things like spatially indexed databases or free-text search. So one of the things we're starting to experiment with is how we layer something like a spatially indexed database alongside this SPARQL-queryable graph database. When we parse the geospatial files we support today, we're able to pull out a pretty useful amount of information: lat/longs, bounding boxes, the sorts of geospatial queries that are relatively straightforward to answer, and all the tabular topical data. But if you want to do true polygon matching, you really just need a spatially indexed database; there's not much else you can do. And then you also mentioned the shared use.

Something we've spent a lot of time on, and I'm really happy with what we have, I think we have a good, flexible platform for this, but it's something we keep working on with groups of users, is the access control model. Thinking about what a distributed access control model with groups looks like, and how you optimize both for making it really easy to share data with groups of people and distribute it in as open a manner as possible, while at the same time realizing that data security and data privacy are some of the most important things on people's minds when they're working with data. When you need to keep a set of data private, or secure in some other way, that's going to be one of the most important things on your mind. So how do you support both?
[00:22:59] Tobias Macey:
And that leads nicely into the topic of managing the privacy or compliance of some of the datasets that are loaded onto your platform, particularly if they're intended to be hosted privately. But if they are hosted publicly, how do you ensure that there isn't personally identifiable information in the data that needs to be cleaned out?
[00:23:21] Bryon Jacob:
Yeah, it's one of the questions we get asked the most, and I'll throw one more facet into the question before I answer it, which is licensing, something that's particularly important for open data. There's no one-size-fits-all open data. It would be great if all the open data out there were simply public domain or Creative Commons Zero, but that's not necessarily the case. A lot of data is open but still comes with some amount of restriction on its use. We think about these things through two lenses. One is that, with respect to issues of privacy, compliance, security, and access control, we are doing the same things as any cloud hosting provider who's managing user data on users' behalf. With respect to the concerns about whether you can trust that we're keeping your data private from other users, keeping it secure, and making sure it's not going to leak into other hands, we could go through all the things we do there, but it would be the same things any company would say they're doing to protect that. The issue really comes down to whether we are guaranteeing that one of our users isn't posting data that contains people's private information, or that isn't compliant with licenses or other usage restrictions.

And we look at that like any kind of Internet hosting provider would: I cannot necessarily control or stop everything that somebody publishes using my platform, the same way a web hosting provider couldn't stop somebody from taking that same dataset, dumping it as a CSV, and posting it on a web page somewhere else. We have a process, and all the steps we need in place, for taking things down when we find things that are posted outside of usage restrictions or that contain PII that shouldn't be there. But we basically do pass the responsibility for posting things, and keeping them in compliance with the rules of use of that data, to the users.
[00:25:35] Tobias Macey:
And for the data types or formats that people upload that can't be automatically converted to RDF, thinking particularly of things like PDFs with embedded tabular data, prose text, or audio formats that are difficult to parse in an automated fashion, what are the ways that you expose for people to add additional information to make those pieces of information discoverable, and to add contextual relevance to the structured data that they're uploading?
[00:26:11] Bryon Jacob:
One thing we've seen happen quite a few times on the platform, and it really goes back to the core idea that data can be iteratively made smarter and more contextual and have more knowledge embedded in it, is that somebody posts, say, a PDF or a set of images that contain some structured information but that wasn't easy for that person to pull out in an automated fashion. Then another user comes along and says: this is great, this PDF or this set of images has tables of data that I was really looking for. And they go through the process of either applying some OCR software, or using an open source project like Tabula to extract tabular data from the PDF, or in some cases just hand-transcribing the data, and they share that structured version of the data asset back. So you have this world where somebody had the data and shared it because they thought it might be useful to somebody, and somebody else came along, found it, and said: that's useful to me, but only if I turn it into structured data.

And I've now done that; let me share it back. So you have both of those assets sitting alongside one another. That provenance, that data lineage, is there, so if somebody ever questions whether the structured version is a true representation of what was in the original, the ability to go back and reproduce it is there for anyone with the wherewithal. Over time, one thing we will likely end up doing is automatically handling some of those cases. We've experimented a bit with making something like Tabula part of our ingest pipeline, so that if we can find some easily identifiable tabular data in a PDF, we can extract it.

You can do that with some degree of success, but you can also end up with a lot of really questionable result data; extracting bad copies of the data is almost less useful than doing nothing. So it's something I think we'll keep experimenting with. Something I would certainly like to do more of is natural language processing on free-text data, PDF documents, or even just the textual descriptions that people add, to do more topical analysis and understand some of the entities and categories represented by that data. Again, one nice thing about having the backing of every dataset be a graph database is that as we start doing that kind of metadata extraction, it can just naturally get added to the dataset and make search across that data a little bit smarter, one piece at a time. We don't ever have to do a big bang of "now we support this whole new feature." There are certain kinds of metadata you can extract from natural language, and as we get better at extracting it, we have a place to put it where it's immediately queryable.
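For readers who want to try the extraction workflow described above, here's a minimal sketch using tabula-py, the Python wrapper around Tabula. The PDF path is hypothetical, and, as noted, the extracted tables usually need manual review before the structured version is shared back.

```python
# Minimal sketch of the Tabula workflow described above, via tabula-py.
# The PDF path is hypothetical; extracted tables usually need manual review.
import tabula

# Returns a list of pandas DataFrames, one per table Tabula detects.
tables = tabula.read_pdf("scanned_report.pdf", pages="all")

for i, df in enumerate(tables):
    print(f"table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
    df.to_csv(f"scanned_report_table_{i}.csv", index=False)  # structured copy to share back
```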
[00:29:20] Tobias Macey:
And for people who are running analysis against the data on your platform, are they generally interacting with it via an API if they're trying to do something more programmatic than just using the SPARQL query language to fetch data? Or do you have some capability for people to submit a programming or analysis task to a set of executors that run colocated with the data?
[00:29:47] Bryon Jacob:
That's a great question. We don't have any sort of hosted computation that runs colocated with the data. It definitely comes up, and it's something we'll keep looking at, especially as we start accumulating larger and larger datasets. What we do have are APIs for accessing the data, and most people doing analytical work on the platform are using those APIs. We have a set of web APIs that cover basically all the functionality of the platform and provide a number of different ways of getting data both in and out. And we have SDK wrappers for those in Python and R, which a lot of users use with just personal access tokens to do ad hoc work. So if you're working in Python or R, you can pull in those SDKs, treat any of the tables in any dataset as a data frame, act on it in place, push data back up into the dataset as a result of a computation, or pull data frames down.

You have access to that SQL and SPARQL layer through those APIs, so if you want to do joins across multiple datasets, pull those into data frames, and operate on them, you can do that. Beyond the Python SDK, there are other language targets that access those web APIs directly, and quite a few other connectors are available. One of the most popular is the Tableau connector, so folks who want to do analysis in Tableau can connect directly to data.world datasets and produce reports and visualizations that way. And we just launched an Excel connector in the Microsoft App Store.

One of the things we started observing is that while you can just take an Excel spreadsheet and drop it into a data.world dataset, and that works great for clean, well-organized datasets in Excel format, a lot of spreadsheets don't really follow that norm. A lot of the time your spreadsheet is as much an ad hoc database interface that you've evolved as it is a place to store the data. So when we built the Excel connector, we made it let you work with richer, more complicated spreadsheets, in a way where you can bind the data tables that exist in those spreadsheets back to data.world datasets in a much more natural fashion. And we make all of these connectors open source.

So we've already started to see a number of places where community members have looked at the connectors we've published and used them as a jumping-off point to take some of their favorite platforms and build connectors for those.
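As a concrete taste of that SDK workflow, here's a short sketch with the datadotworld Python package: pull a table down as a data frame, compute a derived column, and push the result back as a new file. The dataset key, table name, and columns are hypothetical, and an API token is assumed to be configured.

```python
# Sketch of the data-frame round trip described above, via the datadotworld
# SDK. Dataset key, table, and column names are hypothetical.
import datadotworld as dw

# Download (and locally cache) a dataset, then grab one table as a DataFrame.
dataset = dw.load_dataset("someuser/city-stats")
cities = dataset.dataframes["cities"]

# Do some local computation...
cities["name_length"] = cities["city"].str.len()

# ...and write the derived table back up to the same dataset.
with dw.open_remote_file("someuser/city-stats", "cities_enriched.csv") as f:
    cities.to_csv(f, index=False)
```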
[00:32:50] Tobias Macey:
And what are some of the new capabilities or improvements that you're planning to implement in the platform?
[00:32:58] Bryon Jacob:
Internally, we're organized around three principles, and the engineering teams are centered around them. We think that working with data should be collaborative, so we need great features for people doing team data work. It should be linked, so that we're always increasing the connectedness of the data itself and the number of datasets each dataset is connected to. And it should be connected to external tools. The idea there is that there's such a huge amount of investment going into data tools today, and for a platform like ours, which wants to be the collaborative hub for how people work together on data assets, we need to interact with as many of those tools and platforms as possible, because we don't want to be in the business of trying to completely change how you work with data. We just want to enrich it and help you make your data smarter over time. Across those three: the team working on collaboration is building the tools that help a team of people work together, and a lot of that right now is what our organizational access control layer looks like and what features you focus on when you're looking at your team's dashboard and your team's library of data.

Our data team is making good on the promise of building all of this data into a linked RDF database: as you're working with data, we're capturing the results of that work and turning it into rich linked data, taxonomies, ontologies, basically highly structured descriptions of what your data means, and starting to surface those features back out to you. So we're really fulfilling that flywheel I talked about earlier. If you've uploaded a spreadsheet that has a column of five-digit numbers, we can look at those numbers and say: these aren't just five-digit numbers, these are ZIP codes. And ZIP codes are a real-world entity with a geospatial extent, so we can start to find you related data that's broken down by the same ZIP codes. You could, for example, traverse from your data to the US Census, and from the fact that your data is broken down by ZIP code, we can help you aggregate and break it down by racial demographics or economics.
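A toy version of that ZIP code inference might look like the sketch below: pandas code that flags a column of five-digit strings as probable ZIP codes and, once the entity type is known, joins it against a census-style table keyed the same way. All names and data here are invented for illustration.

```python
# Toy sketch of the ZIP-code inference and census linking described above.
# All column names and data are invented for illustration.
import pandas as pd

user_data = pd.DataFrame({"zip": ["78701", "78702"], "sales": [120, 95]})
census = pd.DataFrame({"zip": ["78701", "78702"], "population": [9000, 23000]})

def looks_like_zip_codes(col: pd.Series) -> bool:
    """Heuristic: every non-null value is exactly five ASCII digits."""
    values = col.dropna().astype(str)
    return not values.empty and values.str.fullmatch(r"\d{5}").all()

for name in user_data.columns:
    if user_data[name].dtype == object and looks_like_zip_codes(user_data[name]):
        # Knowing the column is a real-world entity unlocks joins to open data.
        print(user_data.merge(census, left_on=name, right_on="zip"))
```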
A lot of those features are the things that are going to be coming out in the near future. And with respect to being connected to outside tools, what's coming down the line is really fleshing out the whole data life cycle, so that if you're working in an external tool like Tableau or Excel, you can do more and more of your work there. If you've linked those tools to data.world, you can stay in your tool of choice and come back to the data.world web platform only when you need to, maybe to kick off new work or interact with some of the tools there.
[00:36:27] Tobias Macey:
And going back briefly to one of the things you mentioned about tracking data provenance, I'm wondering if you are versioning the datasets as they are updated or modified by the people who maintain them.
[00:36:43] Bryon Jacob:
Yes, we are. We're versioning every change, both at the granularity of the datasets and also for a lot of the individual assets, so saved queries and the insights that you publish alongside the data. And this ties into some of the work coming out of the team working on the social features: we don't surface a lot of the versioning right now, and a lot of that is really just working through user tests, figuring out what model of versioning makes the most sense for teams that are doing this kind of data work.

We're capturing all that information, and we've got full replayability in the model we're using. But what we're finding through user tests is that if we just expose something as rich and complex, with all the same kinds of features, as Git, it's not really the way that non-programmers find it easy to think about versioning. So we're looking at some other models of how that data lineage can be displayed. And it's not just displaying the lineage, but what are the operations you can do with that lineage? Do you want to be able to roll back to a point in time? Do you want to be able to tag a particular version of a dataset so you can do reproducible queries at that version? I think yes to both of those things; we just need to figure out how to communicate that to users in a way they can understand.
[00:38:14] Tobias Macey:
And given the breadth of the capabilities and the target that you've got for data.world, what are some of the other projects or companies that you consider to be competitors in the space?
[00:38:27] Bryon Jacob:
Yeah, we think about this a lot. We look at the different segments of data tools, and we could see ourselves to some degree intersecting across a lot of those segments. One thing that I think is fairly unique, and is kind of the focal point of what we do, is the social network aspect, the fact that it is an open platform. We can build closed networks within that open platform, so teams within companies are forming organizations and working in private while still having access to that broader network. The tools we definitely don't want to consider ourselves in competition with are the visualization and reporting analytics tools; there are so many great tools there that we really want to be as compatible as possible with them. Then there are platforms for data cataloging and metadata management, and those have some similarity and overlap with what we do.

But, and I hate saying this, I don't know that we have something I consider a direct competitor. For almost every company we look at, I can think of potential customers who might choose another tool instead of ours in a given case, but I can see ways for a data.world organization to be a good complement to almost any data tool out there.
[00:40:10] Tobias Macey:
And with the variety of data that you have hosted on the platform and the broad number of people who are interacting with and enriching those datasets, what are some of the most notable, interesting, or unexpected uses of the data.world platform, and analyses that have come out of it, that you're aware of?
[00:40:29] Bryon Jacob:
One thing that I was really surprised by, and really happy with: around the end of last year, a group of folks who were motivated by a lot of the developments in our political climate started an organization of data workers to pick up a lot of social causes and try to address social projects with data and analytics skills. It's called Data for Democracy, founded by a pretty well known data scientist named Jonathan Morgan. Early on, they connected with us, and a lot of the teams working on Data for Democracy projects started working on data.world, when we were really only a couple of months old as a platform and they were just getting started as a network. Within about six months, that organization grew to well over a thousand people, and I think it's over 2,000 people now.

I can't quote you the exact number of their members who are on data.world at this point, but I know that when they were about a thousand people, we had about 650 to 700 of them as active users, and as far as I know that percentage has stayed pretty intact. A lot of these people are students; a lot are professional data scientists or business analysts in their day jobs; and a lot of folks were interested in learning data science and using this as an opportunity to get some real hands-on experience and work with some smart people who could teach them. So we've seen a lot of really great groups of people coming together to work on projects, and it was fantastic for us in the early days to have such an active group of people giving us feedback on the platform and helping us understand what data teamwork looks like.
[00:42:45] Tobias Macey:
Yeah, that definitely sounds like a very interesting and worthwhile use of the platform. And you mentioned Jonathan, whom I have had the pleasure of meeting and speaking with before. For those who aren't aware, he is one of the hosts of the Partially Derivative podcast, which is another one worth listening to, so thank you for bringing him up as well. And a fellow Austinite, if I'm not mistaken. (That's right.) Are there any other topics that you think we should cover, or anything about data.world that we didn't touch on, before we close out the show?
[00:43:22] Bryon Jacob:
No, I really appreciate you inviting me on the podcast. I think you asked some great questions and really covered the ground of what we're doing. I think we're doing something pretty unique in the data world, and there are a lot of interesting tools out there; I think we're finding a good place in that pantheon of data tools. It's a lot of fun, it's definitely a really rewarding thing to be doing every day, and I really appreciate the chance to come on and talk about it.
[00:43:56] Tobias Macey:
Alright. For anybody who wants to keep in touch with you and follow the work that you're up to at data.world and elsewhere, I'll have you add your preferred contact information to the show notes. And as a final parting question: from your perspective, what is the biggest gap in the tooling or technology for data management today?
[00:44:17] Bryon Jacob:
Well, unsurprisingly, given everything I've said about what data.world is, I think the biggest gap is in keeping context and knowledge with data. Most of the data out there really has only a very loose notion of what it actually means, and it requires a lot of extrinsic information, trapped in human heads, to make sense of it. We focus a lot on the semantic meaning of data and how we capture that knowledge and that semantic meaning alongside the data. In a lot of ways, it's what machine learning folks are thinking about too; they're just using different words and inventing different technologies for it.

I think there's going to be a real convergence in the next couple of years around really leveraging this idea of semantics and knowledge management, both for data that's being worked on by people, and also in furtherance of making data more accessible and more usable by machines.
[00:45:25] Tobias Macey:
Alright. I'll let the listeners think about that as they go about their day, and I want to thank you for taking the time out of yours to join me and talk about the work that you're doing. data.world is definitely a very interesting platform. I joined it a little while ago, so I've been getting the regular newsletters of new datasets as they're added, and there are a lot of interesting things that, given the time, I would like to do some analysis of. I'd also like to point people to the blog that you maintain, because there have been some very interesting articles coming out of it. I just want to thank you again, and I hope you enjoy the rest of your evening.
[00:45:58] Bryon Jacob:
Thank you very much, I appreciate it. And I'll take this moment to remind folks that data.world is free to join and sign up, and it always will be. So if you're interested, please just go to the website, data.world, and create an account. Thanks very much for inviting me on.
Introduction and Guest Introduction
Bryon Jacob's Background in Data Management
Challenges in Data Integration
Mission and Vision of data.world
data.world's Architecture and Tooling
Scaling and Technical Challenges
Privacy, Compliance, and Licensing
Handling Unstructured Data
APIs and External Tool Integration
Future Capabilities and Improvements
Versioning and Data Provenance
Competitors and Unique Positioning
Notable Uses and Community Impact
Closing Remarks and Final Thoughts