Summary
Most databases are designed to work with textual data, with some special-purpose engines that support domain-specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shift in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB Cloud to embed authorization into the storage engine while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how it enables a more flexible way to store and interact with data, powering better data sharing and new opportunities for blending specialized domains.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what TileDB is and the problem that you are trying to solve with it?
- What was your motivation for building it?
- What are the main use cases or problem domains that you are trying to solve for?
- What are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?
- What are the benefits of using matrices for data processing and domain modeling?
- What are the challenges that you have faced in storing and processing sparse matrices efficiently?
- How does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?
- What are the benefits of unbundling the storage engine from the processing layer?
- Can you describe how TileDB embedded is architected?
- How has the design evolved since you first began working on it?
- What is your approach to integrating with the broader ecosystem of data storage and processing utilities?
- What does the workflow look like for someone using TileDB?
- What is required to deploy TileDB in a production context?
- How is the built in data versioning implemented?
- What is the user experience for interacting with different versions of datasets?
- How do you manage the lifecycle of versioned data to allow garbage collection?
- How are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?
- What are the most interesting, unexpected, or innovative ways that you have seen TileDB used?
- What have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?
- What features or capabilities are you consciously deciding not to implement?
- When is TileDB the wrong choice?
- What do you have planned for the future of TileDB?
Contact Info
- stavrospapadopoulos on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- TileDB
- Data Frames
- TileDB Cloud
- MIT
- Intel
- Sparse Linear Algebra
- Sparse Matrices
- HDF5
- Dask
- Spark
- MariaDB
- PrestoDB
- GDAL
- PDAL
- Turing Complete
- Clustered Index
- Parquet File Format
- Serializability
- Delta Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey. And today, I'm interviewing Stavros Papadopoulos about TileDB, the universal storage engine. So, Stavros, can you start by introducing yourself?
[00:01:49] Unknown:
Absolutely, Tobias. Thank you very much for having me. I'm Stavros Papadopoulos. I'm the CEO and founder of TileDB. I'm a computer scientist, and I'm excited to talk about TileDB and everything we have done. And do you remember how you first got involved in the area of data management? I did my PhD in databases, so that's how I started. I have always been in the databases space. I was at the time focusing mostly on multidimensional data structures, data privacy, and cryptography. But then in 2014, I joined Intel Labs and MIT where I worked on a big data initiative alongside some database gurus at MIT as well as some high performance computing ninjas at Intel Labs, and here we are. This is where everything started.
[00:02:27] Unknown:
And so you have recently started building the TileDB project. Can you give a bit of an overview about what it is and some of the problems that you're trying to solve with it? I'm gonna start by explaining in a nutshell what it is. So it is a novel engine, novel kind of database,
[00:02:43] Unknown:
which allows you to store any kind of data, not just tables like traditional databases do. So it can be genomic variants. It can be geospatial imaging. It can be data frames as well. It can be tables, but it is more than that. And it has its own universal storage format to be able to do this. And then it allows you to manage this data, so you can define access policies. You can share the data with anybody in the world. You can log everything. And, of course, you can access this data with any language or tool. So it goes beyond the traditional SQL that you find in databases. It consists of 2 components, so that we can drive the conversation later. It has an open source component, which we call TileDB embedded.
And this is the storage engine that is based on this universal format that uses multidimensional arrays, and we're gonna discuss this a little bit later. And this contains all the language APIs as well as the tool integrations, plus everything that has to do with cloud optimized storage as well as data versioning. And there is a private offering, which we call TileDB Cloud. And this is a SaaS platform which allows you to share your TileDB data with anybody on the planet and allows you to define arbitrary user defined functions with dependencies and dispatch them to a cloud service. And the most important thing about this cloud service is that it is all serverless. We do that at extreme scale, and it is built from the ground up to be serverless.
[00:04:03] Unknown:
And you mentioned that you've been working on and with databases for a number of years now. I'm curious what you are drawing inspiration from as far as some of the systems that you've worked with that you're using to direct your designs on TileDB and some of your motivation for building a new database engine that is drastically different than most of the ones that I've had experience with anyway? So, TileDB has a long history. It started at the end of 2014,
[00:04:30] Unknown:
the beginning of 2015 when I was working at MIT and Intel. At the time, I was just looking for a research project to work on under this big umbrella of big data, I mean, this initiative we were working on at the time. And I was a C++ programmer. So I had 2 different types of influences. Right? The MIT people who were building traditional commercial database systems, and then Intel Labs who were building high performance computing software. And a lot of it was around linear algebra, which is at the core of machine learning, deep learning, and all advanced analytics. So I was looking for a way to combine these 2 areas. And from a research perspective, what I wanted to do was mostly sparse linear algebra, which essentially means linear algebra with matrices that have a lot of zeros or empty cells. Right? And these are more peculiar from a performance perspective, and they need careful handling.
And, also, I was very much influenced by geospatial data from my time during my PhD years. So, frankly, I was looking for a way to store sparse matrices so that I can do very fast sparse linear algebra, and at the same time, I can capture some of the geospatial use cases. Again, everything completely research oriented. So I had a couple of requirements as I was building this engine for sparse arrays. The first requirement, of course, was that it had to handle sparsity and ideally dense arrays as well, so that it is a unified engine. A dense array has values everywhere, so the number of zeros is not as big as in sparse arrays. The second requirement was that whatever we were building, it had to work very, very well on the cloud because we saw a big shift to the cloud.
So the storage engine should work on AWS S3, Google Cloud Storage, Azure Blob Storage, or any other object store in the cloud. Another requirement was that it had to be an embedded library. So it had to be built from scratch by definition, because it was the storage layer and I couldn't use any other component from established databases. So I wanted to build it from scratch in C++ and in an embedded way so that you don't have to set up a server to use it. And the fourth requirement, at least for me, was that it should be built in C++. First, for speed. Second, because I was good at C++. But finally, because I had the longer vision that these libraries should interoperate with other languages as well, so having a C++ library may make this a little bit easier.
Now I have to mention that at the time, there were such storage engines, like HDF5, for example, a very popular dense array engine. But that was architected around dense arrays, so I couldn't use it for my sparse problems. And second, it was not built for the cloud, because it's been around for decades and the cloud gained popularity only recently. So it was not architected to work very well on S3, for example. So that's how it started. That's what motivated the storage engine. So I built it in a way that handles both dense and sparse arrays in a unified way, because if I architected it to handle sparse arrays, there are tons of similarities in handling dense arrays. So let's identify what is different, spell this out, and handle both in a very, very efficient way. And at the same time, I was very fortunate that Intel was working with a prominent genomics institute, and they presented me with a very important and difficult problem around storing genomic variants. So huge data in essentially a sparse format. The genomics data is very, very sparse. So the solution that I presented was very relevant.
We created the proof of concept. It went very well, and it got adopted. So we said, okay. This storage engine probably is very meaningful for more use cases than I had originally thought. So let's give it a chance and start building it up. And this is what made TileDB embedded. That's the open source system that I created at the time. And, of course, it evolved, and we can discuss later about how. But that's entirely the motivation behind the TileDB embedded storage engine, which is the only system that handles both dense and sparse multidimensional arrays in a unified way. And what was the motivation behind TileDB Cloud? Now at the time also, we were discussing with a lot of scientists. Again, of course, I had the databases perspective from MIT, but I was talking to other groups and other scientists from geosciences, from genomics, and other scientific domains.
And I observed a couple of similarities. The first thing that I observed is that every single domain has its own crazy data format. It is a file format which is very domain specific. And it's crazy in the sense that it has a lot of jargon. Although, at the end of the day, it's just data. And I'm gonna explain and clarify a little bit what I mean by that. And a big similarity there was that regardless of what format you choose for a specific domain, custom made for your application, no matter how good it is, and you can make it very, very good, all hell breaks loose when you have updates, data versioning, and access control.
Right? A single file works great, but not so much if you start updating this file or you're adding more files. You end up analyzing thousands of files. And that was the same in genomics as well as geospatial, the exact same thing. Another thing that I observed was that every domain preferred different languages and tools. For example, 1 group in bioinformatics really liked R, another group liked Python, and in geospatial you would find somebody who liked Java as well. So a lot of different preferences in terms of what languages you want to use in order to access your data. And then that goes back to the original decision that we build everything in C++ so that we can build APIs for every language.
Again, regardless of the domain, the scientists always wanted to share their data, of course, with access policies and everything, and code for reproducibility. Right? So just sharing files was not going to cut it. So, eventually, the biggest observation of all was that the data management principles, the data management features that we have in databases, I couldn't find them in domains like genomics and geospatial. And later, we found that that was true for other domains as well. So data management was a problem. It was not the science behind those domains that was creating all the problems.
So we kind of lucked out in the fact that the other observation was that all data, regardless of the vertical, can be efficiently modeled as a dense or a sparse multidimensional array. For example, an image is a dense 2D array. Genomics is a sparse 2D array. LiDAR point clouds are sparse 3D arrays. Even key values can be considered as a sparse 1-dimensional vector where the keys are string values in the string domain. So even that, I can prove to you, essentially boils down to sparse arrays. So a lot of common things across the verticals, and we already had TileDB embedded, which addressed the issue of storing everything as multidimensional arrays and addressed the issue of interoperability: everybody can access the data from their favorite tool and their favorite language.
What we needed was to try to scale the other data management features, like access control at the global scale, which did not exist, try to do everything serverless, because that alleviates the pain of setting up clusters and addresses certain issues with scalability, and also create user defined functions with arbitrary dependencies as task graphs and deploy them in the cloud. And that effectively gave rise to TileDB Cloud, which is the SaaS platform we built for the cloud, which handles data management, so access control and logging at planet scale, as well as serverless compute in the form of task graphs.
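To make the array-modeling idea concrete, here is a minimal sketch using the TileDB Python API (the `tiledb` and `numpy` packages); the array names, domains, and tile sizes are illustrative assumptions, not details from the episode.

```python
import numpy as np
import tiledb

# A dense 2D array for an image: only attribute values are stored;
# pixel coordinates are never materialized.
image_dom = tiledb.Domain(
    tiledb.Dim(name="y", domain=(0, 1023), tile=256, dtype=np.uint32),
    tiledb.Dim(name="x", domain=(0, 1023), tile=256, dtype=np.uint32),
)
image_schema = tiledb.ArraySchema(
    domain=image_dom,
    sparse=False,
    attrs=[tiledb.Attr(name=c, dtype=np.uint8) for c in ("r", "g", "b")],
)
tiledb.Array.create("image_array", image_schema)

# A sparse 2D array (think genomic variants): only non-empty cells are
# materialized, so their coordinates are stored alongside the values.
points_dom = tiledb.Domain(
    tiledb.Dim(name="row", domain=(0, 2**32), tile=10_000, dtype=np.uint64),
    tiledb.Dim(name="col", domain=(0, 2**32), tile=10_000, dtype=np.uint64),
)
points_schema = tiledb.ArraySchema(
    domain=points_dom,
    sparse=True,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
tiledb.Array.create("points_array", points_schema)
```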
[00:12:31] Unknown:
And an interesting thing to note too is that, as you said, all of these different specific domains have their own custom file formats that they've been using for years, which means that a lot of these people who are working and researching in these domains or who are building applications probably have piles of data lying around in those formats. I'm curious what you have seen as far as the approach to being able to translate that information from those legacy formats into TileDB, or from TileDB into those legacy formats,
[00:13:06] Unknown:
to be able to fit with their existing tooling? This is where we spend the majority of our time, admittedly. Right? Because, again, we're a storage first company, so we spend most of our time understanding each vertical and each file format. And, of course, we had to bring some brilliant people onto our team who had this knowledge, or we were working very closely with customers, which, of course, provided us with this knowledge. So, essentially, what we had to do was understand the data format and try to map it into a multidimensional dense or sparse array depending on the access patterns. Right? So it took a little bit of back and forth in order to understand what the best modeling is. But at the end of the day, it was an array. Then we created ingestors that were reading from those legacy formats into the TileDB format, and then everything fit in place. The reason is that once you get your data into the TileDB format, then you inherit everything we build on top, regardless of your vertical. For example, if you're in genomics and you store the data as arrays, you get our integration with Dask, Spark, MariaDB, PrestoDB, the 6 APIs we have. You get the whole ecosystem, and our whole mantra in the company is that we are going to integrate with pretty much everything that exists out there. So once you put the data into TileDB, you get this versatility, this flexibility to process your data with anything you like, including your own tools. For example, for the geospatial verticals, we did integrate with popular geospatial libraries like PDAL and GDAL. And, of course, we're happy to do the same elsewhere; in genomics, for example, it is in our plans to integrate with a popular library called Hail.
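As one illustration of the ingestor path for geospatial data: GDAL (3.0 and later) ships a TileDB raster driver, so a translation can be a one-liner. The file names and destination URI below are hypothetical, and the available driver options depend on how GDAL was built.

```python
from osgeo import gdal

gdal.UseExceptions()

# Translate a GeoTIFF into a TileDB array via GDAL's TileDB driver.
# "input.tif" and the destination name are placeholders.
gdal.Translate(
    destName="landsat_scene_array",
    srcDS="input.tif",
    format="TileDB",
)
```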
[00:14:43] Unknown:
So because of the fact that you have this universal data format that can model all of these different problem domains, and you're focused on being able to store the information efficiently and have these versatile interfaces for all the different computation layers, I'm curious what you have seen as far as the challenges of designing the APIs to make it easy to actually use all of these different computation layers on top of TileDB, because you mentioned things like Spark and Presto and MariaDB. So you're working in Turing-complete languages. You're also working in SQL. I'm curious what some of the challenges are as far as being able to make the access patterns intuitive and efficient for all those different use cases. Yes. This is a great question.
[00:15:32] Unknown:
Again, we kind of lucked out in that respect. In the past years, let's start with the databases, and then we're gonna explain about everything else, all the other computation tools. The databases recently shifted to a framework where they support pluggable storage. Right? Before, they were monolithic; they handled all the layers in the stack, from parsing the query down to storing the data on the back end. And most recently, they just unbundled the storage. Right? So they created their own APIs that allow you to plug your own storage engine, your own storage format, in there. So that made it very easy for us to just go into MariaDB, for example, or PrestoDB or Spark, which has data connectors by definition, and just plug it in. It was a lot of work to do it, because we had to understand how every single tool does it. So it's a time issue rather than a complexity issue, because those guys did a good job of exposing clean APIs to do that. And then, fortunately, for the databases, we have a 1-to-1 mapping between a data frame and an array. And this is done by pretty much selecting a subset of your columns to become your dimensions, and those are your fast indexable columns. These are the columns that TileDB will allow you to slice very fast on. So for databases, we lucked out because they were already doing it and we just plugged TileDB into them. For Spark also, it was easy because they had data connectors. For Dask, the same thing. They have data connectors. They don't bind their storage to a particular library. So that was easy to do. And pretty much it's the same story for the rest of the tools like GDAL and PDAL. But we needed to have people that have done it before in order to do that very, very efficiently, both in terms of time as well as performance. And, again, we have people on our team that are specialized in doing exactly that. So it was not that much of a challenge from an engineering perspective.
It was just a time investment, which we happily did because that completes the vision of being a universal data engine, and we will continue doing that.
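A hedged sketch of that data-frame-to-array mapping from the Python side: the columns you choose as the index become the dimensions (the fast, clustered-index-like columns) and the rest become attributes. The column names are made up, and the exact keyword arguments and index handling of `tiledb.from_pandas` can vary between releases.

```python
import pandas as pd
import tiledb

df = pd.DataFrame({
    "sensor_id": [17, 42, 17],
    "ts": pd.to_datetime(["2020-06-01", "2020-06-01", "2020-06-02"]),
    "reading": [0.41, 0.73, 0.39],
})

# Pick the columns you slice on most often as the index; with sparse=True
# they are written as TileDB dimensions, and "reading" becomes an attribute.
tiledb.from_pandas(
    "sensor_readings",                   # array URI: local path or s3://...
    df.set_index(["sensor_id", "ts"]),
    sparse=True,
)
```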
[00:17:49] Unknown:
And particularly for things like a SQL interface that's used to working with a 2-dimensional array, I'm curious how you represent an n-dimensional array. Is it just a series of different tables for the axes that you then join across, and then TileDB handles translating that into the multidimensional array on the back end? Or was there some other level of abstraction that you needed to add to make it easier for people to process and analyze these multidimensional structures?
[00:18:21] Unknown:
Yeah. So let's clarify this a bit. We work directly with vanilla SQL. Right? SQL on tables, not specific adaptations for matrices. At least we haven't done that just yet. We may do it in the future. But as of today, you can use, for example, MariaDB with TileDB plugged in, and you can run any SQL query as you would on MariaDB alone, any ANSI SQL query, and it's gonna work. The only thing that you substitute is in the FROM clause: you put an array URI, a TileDB array URI, which could be local, on S3, on Google Cloud, on Azure, pretty much anywhere. And the whole query is just gonna work. So there is nothing to be done by the user in order for the SQL to work.
The only thing that the user should know, from a performance perspective, is which of the columns we marked as dimensions in the TileDB world. Because if you have a predicate in the workload that does a range query or equality query on those particular columns, you're gonna get a very fast query time. That's the only thing the user should know, that those columns are special. Essentially, TileDB acts like a clustered index on those particular columns. So you're gonna get a lot of performance from that. And similarly, if you are the 1 who constructs the table, even from SQL, we have added configuration options that allow you to say, okay, this particular column is a dimension. So in the CREATE TABLE statement, you can mark which of the columns are dimensions, and you should think of those as a clustered index. That's the best way to think about it.
And everything works like in the SQL world.
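To ground the SQL path, here is a hedged sketch of what a query could look like against a MariaDB server that has the TileDB (MyTile) storage plugin enabled, issued through the standard `mysql.connector` client. The bucket path, column names, and the exact FROM-clause quoting are illustrative assumptions rather than details from the episode.

```python
import mysql.connector

# Connect to a MariaDB server with the TileDB/MyTile storage engine loaded.
conn = mysql.connector.connect(host="localhost", user="analyst", password="...")
cur = conn.cursor()

# The FROM clause points directly at a TileDB array URI (local, S3, GCS, Azure).
# Predicates on columns that were marked as dimensions are served like a
# clustered-index lookup, so this range/equality query stays fast.
cur.execute(
    "SELECT symbol, ts, price "
    "FROM `s3://my-bucket/trades` "
    "WHERE symbol = 'AAPL' AND ts BETWEEN '2020-06-01' AND '2020-06-30'"
)
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```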
[00:20:11] Unknown:
So can you dig a bit more into the actual on disk format of the multidimensional arrays and how they're stored by TileDB for being able to then query and analyze them and just some of the ways that users of TileDB need to think about data modeling that might be different than the ways that they're used to using either relational structures or graph databases or some of the custom file formats that they might be coming from?
[00:20:39] Unknown:
So we're gonna make a categorization, because every category has its peculiarities. So let's take, for example, the dense array case. Let's take an image. Okay? Say you want to store an image, each pixel, in a database table with a standard traditional database and be able to slice it multidimensionally, for example, put a range on 1 axis, put a range on the other axis, and get the slice. We call this a slice. Right? A multidimensional slice. And arrays are pretty good at giving you these slices very, very fast. That's why you use arrays. Right? So if you want to alternatively store this in a traditional database, the very first thing that you should do is create 1 record per pixel.
So instead of storing just the value of the pixel, right, the RGB or whatever it is, you have to explicitly store the coordinates of that pixel in separate columns. It's gonna be 1 column for the 1 dimension, another for the other, and then perhaps 3 columns for R, G, and B. Right? So that when you issue a SQL query, a standard SQL engine is gonna understand, okay, the first predicate is on the first column, the second predicate is on the second. And I can even create a clustered index, and there you go. Everything works very, very fast. Right? The problem is that you are introducing those 2 extra columns, and dense arrays do not explicitly store the coordinates of the pixels in the dense case. And that's a very important difference versus the sparse case. So going back to your question, for dense arrays, we don't store the pixel coordinates.
We just impose a 1-dimensional order on those 2-dimensional or n-dimensional values. And there are ways to do that. We give you a lot of flexibility to impose this order by chunking into tiles, hence the name TileDB. So, essentially, we impose an order. Then, based on some explicit tile capacity, we chunk those values, and this chunk is called a tile in TileDB. And then these values are serialized in a file, 1 per attribute. So it is a columnar format like Parquet, for example. Right? All the R values are gonna be stored in 1 file, all the values along G are gonna be stored in another, and B in another. But not the coordinates.
That's a very important distinction versus sparse arrays as well as traditional tables. Right? Because for tables, if you don't store the indices, how are you going to slice on that? Tables do not have any semantics for serializing a 2-dimensional space into a 1-dimensional curve. There are no such semantics in the database. But in a dense array storage engine, like TileDB or HDF5, that's exactly what these storage engines do very, very well. Okay? So this is the on-disk format. We serialize the multidimensional objects into a single dimensional order. So, essentially, we sort in a particular order.
We chunk. We compress each chunk individually. We put them in 1 file per column, per attribute, and then we store them in a subdirectory in an array directory, which is timestamped, and it is called a fragment. And this fragment is immutable. After it is stored, it will never be changed. And this is a very important architectural decision we took for data versioning as well as for working very, very well on cloud object stores when there are updates. So that's the dense case. The sparse case is almost identical, with the difference that now, since we don't know exactly which cell is empty and which cell has a value, and because we don't materialize the empty values or the 0 values for 2-dimensional matrices, for example, we need to explicitly store the coordinates of the non-empty cells. And imagine that, again, there is a 1-dimensional order imposed on the multidimensional space with some specific configurations.
And again, we do tiling. Again, we put the coordinates along each dimension in a separate file, then the attributes in separate files as well. And then we put everything into a subdirectory in the array directory. And specifically for the sparse case, we employ multidimensional indexes, like R-trees, for fast pruning and fast slicing. That's what we use as the in-memory structures when opening an array, to be able to slice fast and find the non-empty cells. And this pretty much summarizes what the on-disk format is.
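A small sketch of how batch writes map onto that on-disk layout, using the TileDB Python API; the array name and data are placeholders, and the fragment-inspection helper (`tiledb.array_fragments`) is an assumption about recent releases.

```python
import numpy as np
import tiledb

uri = "dense_example"  # hypothetical local array
dom = tiledb.Domain(
    tiledb.Dim(name="row", domain=(0, 999), tile=100, dtype=np.uint32),
    tiledb.Dim(name="col", domain=(0, 999), tile=100, dtype=np.uint32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="val", dtype=np.float32)],
)
tiledb.Array.create(uri, schema)

# Each batch write becomes an immutable, timestamped fragment: a subdirectory
# holding one tiled, compressed data file per attribute.
with tiledb.open(uri, mode="w") as A:
    A[0:500, 0:500] = np.random.rand(500, 500).astype(np.float32)
with tiledb.open(uri, mode="w") as A:
    A[500:1000, 0:500] = np.random.rand(500, 500).astype(np.float32)

# Two writes -> two fragments, each with its own timestamp range.
for frag in tiledb.array_fragments(uri):
    print(frag.uri, frag.timestamp_range)
```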
[00:25:23] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt.
And for people who are trying to determine how they want to structure the data that they're storing, what are some of the data modeling considerations that they should be thinking about or fundamental concepts that they need to be able to understand to be able to employ TileDB to the best effect?
[00:26:17] Unknown:
Yeah. This is very, very similar to fine tuning a database. Like, what kind of indexes are you going to use on which columns? What kind of page size are we gonna use, and all those configuration parameters. It is equally difficult. Let me start by saying that it is equally difficult. Of course, we have guidelines about performance. For example, how the tile extent affects performance, how the order affects performance, so on and so forth. For most cases, it could be straightforward. For example, for dense images, it could be straightforward because dense images are, for example, 2 dimensional. It's fairly natural to think in terms of arrays.
You know which dimension corresponds to the width, which corresponds to the height. Then you do some reasonable chunking such that each tile is, for example, 10 kilobytes or 100 kilobytes or 1 megabyte, because this affects how much data you're fetching from the cloud, or from any back end, when you're slicing. So that could be a little bit easier. It becomes a little bit more complex for sparse arrays, even for database tables, because, first of all, you need to select a subset of your columns to be your dimensions. So you need to look at the workloads that you have and say, okay, I slice on stock and time for this asset trading dataset, for example. Right? So I better make those the dimensions, because TileDB is gonna give me this performance boost whenever I have a predicate on either of those 2 dimensions.
And then, of course, there's gonna be some trial and error. And for other use cases, like genomics, we do them vertically. For example, for a specific genomic variant use case, we did a lot of benchmarks. We got the access patterns from customers and users. We said, okay, this should be a dimension, that should be a dimension, that should be the order, that should be the chunking. And we fine tuned all the other configuration parameters, and the customization we built specifically for genomics hides those. Of course, it exposes the configurations for the user to set, but we have figured out 90% of everything that you need to do so that you can start using it immediately. It is a difficult problem, though, and that's why we're around. We're always happy to help with the users' use cases. They contact us frequently, and we're extremely interested to dive in and optimize for them.
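A back-of-the-envelope illustration of that tile-sizing guideline (the numbers are assumptions, not figures from the episode): the tile extents times the bytes per cell determine roughly how much data one slice pulls per attribute from object storage for each tile it touches.

```python
import numpy as np

cell_bytes = np.dtype(np.float32).itemsize   # one float32 attribute: 4 bytes per cell
tile_extent = (512, 512)                     # cells per tile along each dimension
tile_bytes = tile_extent[0] * tile_extent[1] * cell_bytes

# ~1 MiB fetched per attribute for every tile a slice overlaps.
print(f"{tile_bytes / 2**20:.1f} MiB per tile")
```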
[00:28:41] Unknown:
And for somebody who is going to start using TileDB both for the embedded and for the cloud use case, what does the overall workflow look like, and what are some of the benefits that you're seeing of unbundling the storage layer from the computation for being able to interface with that storage engine from multiple different libraries and run times?
[00:29:06] Unknown:
I would like to separate those 2 questions, if I may. So on the first 1, regarding what the workflow is, here it is. For TileDB embedded, it's very easy to install any of the integrations and any of the APIs you'd like. So that's the first thing you do. The second thing that you need to do is, depending on your use case, use a particular ingestor to ingest the data from the format that you have it in into TileDB. And this is what we're here to help with. We have created most of the ingestors. For example, for all geospatial formats, through our integration with GDAL, we do a translation to TileDB.
So you just use a GDAL command and that's it. You can store any geospatial format into TileDB. For genomics, we built our own. And for CSV files, we rely on the pandas CSV ingestor, for example. And the list of ingestors grows. So you need to ingest your data from whatever format you have it in into the TileDB format. And, again, you need to do it through some ingestor. But from that point onwards, you can use any of the APIs we expose directly from TileDB for direct access, and this is the fastest way you can interface with your data. Or you can just use SQL so you don't change your workloads whatsoever, or you use PDAL and GDAL in geospatial. And, again, you don't change your workloads at all. Or you use Spark in the same way that you would use Spark with Parquet. You can use Spark with TileDB.
And the same is true for Dask. So we're trying to incur as little friction as possible when it comes to using the data directly. And this is true for TileDB embedded. For TileDB Cloud, it is even easier. You can just sign up, sign in, and go. We host Jupyter notebooks there; with a single click, we can just spin up a Jupyter notebook, and we have all the dependencies. Everything is installed. Of course, in the future, we're gonna allow you to install anything you like, but it's a JupyterLab notebook. And we have tons of examples there, with example notebooks for multiple use cases, and you can start writing code immediately. You can start ingesting your data, or you can start working directly on public data that we have ingested for everybody on TileDB Cloud. And we will keep on adding datasets there. We will keep on adding notebooks there. So once again, the best way to learn is to go check out those notebooks, even download them if you like to work on them locally. But without installing anything, you just sign up, sign in, and go.
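For the CSV path mentioned above, a minimal sketch with the pandas-backed ingestor in the TileDB Python API; the file and array names are placeholders, and the keyword arguments accepted by `tiledb.from_csv` vary a bit by version.

```python
import tiledb

# Ingest a CSV into a TileDB array, then read it back as a pandas DataFrame.
tiledb.from_csv("trips_array", "yellow_tripdata.csv")

with tiledb.open("trips_array") as A:
    df = A.df[:]          # the .df indexer returns a pandas DataFrame slice
    print(df.head())
```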
Now that was the first question. The second question is about unplugging storage from the processing tools. And this is exactly what is gonna help me clarify a little bit the vision of TileDB. So the benefit for databases like MariaDB, for example, or PrestoDB, of unbundling, even Spark, even Dask, even computational frameworks, so it expands beyond the databases. The benefit of unbundling storage is that you can effectively separate storage from compute, and that allows you to scale storage and compute separately. This is 1 of the biggest benefits that I personally see. Right? For example, in the past, you had to pay licenses for enterprise grade databases based on the amount of data you store in the database.
Right? But that's not truly reasonable when it comes to genomics, where you hit petabytes of data, because the licenses are gonna become extraordinarily expensive. Then it depends on where you store the data. And if you don't store the data in a cloud object store, then, of course, you need to pay for that storage, and it is extremely expensive. And finally, you end up not using all the data at the same time 24/7. Of course, you do analysis frequently, but not scanning the whole terabyte, for example, 24/7. So why would you pay for the whole petabyte, or for compute for the whole petabyte, 24/7? So there are economic benefits from separating storage from compute. And now the question is, after you do that, what do you do? You need to store the data somewhere. So there has to be some kind of data format which can lie on an object store like AWS S3 or Google Cloud Storage or Azure Blob Storage. And then whenever I want, I can spin up a database server or I can spin up a serverless function, and I can access this data.
So the first benefit is economical. The second 1 has to do with interoperability. If you store the data in a format which is understood by multiple tools, you can do a SQL operation on the same data. But at the same time, you can spin up perhaps a Python user defined function or an R user defined function to do some statistical analysis on the same data, which is something that a database or at least the database that you're using could not do. So the second 1 has to do with flexibility and functionality. But the last thing that I want to mention is that if you just unplug storage from a database, it solves 1 of your problems or 2 of your problems, which is savings as well as interoperability and flexibility.
But you start introducing new problems, like data management problems. Okay, I stored my data in those files on S3. How do I impose access control on those? How do I impose access control in a way that, when I use SQL, these access policies are respected, and at the same time, when I don't use the SQL engine and I use something entirely different, through my Java API, or through Spark, or through Dask, I still get those access policies to be respected? And if those access policies are not file based, AWS S3 is not gonna help you. What if you have array semantics?
What if you want to define an access policy on a slice of your data? So what we did was exactly the opposite of what the databases did. A database unplugged the storage engine. We unplugged the compute. So we kept the storage. We kept the updates. We kept the versioning. We kept the access control. We kept the logging. The only thing that we unplugged was the processing, because we want you to be able to process the same data with a powerful SQL engine, and there are a lot out there, but also leverage the power of Spark, also leverage the power of Dask, also do something with a geospatial tool, or even write your own computational engine, without worrying about the data management hassles. So that's what we actually did differently to address this problem.
[00:35:59] Unknown:
Yeah. That's definitely the thing that stands out to me most about TileDB is, as you said, you still have a lot of the benefits that you get from a vertically integrated database as far as access control and versioning without having to go and reimplement that all on your own as you would if you were just using JSON files on s 3 or parquet files, where, as you said, you can manage access on the file level, but not on the per column level unless you have some other layer that everything has to go through. And so I'm curious if you can dig more into how TileDB itself is architected to be able to handle all of those additional benefits on top of just the raw bits and bytes storage?
[00:36:41] Unknown:
Yes. This is exactly where TileDB Cloud comes into the picture. So let's clarify again what you can do with each of the offerings. With embedded, you have a way to store any kind of data in a universal format as multidimensional arrays. The data versioning is built into this format. So, still in an embedded way and effectively serverless, you can take advantage of the versioning. You don't have to spin up a server to have serializable writes when you have concurrency. That's already handled. That's built into the format. That's how we architected TileDB embedded. So that's pushed down. So at least 1 of the data management aspects, which is handling updates and handling data versioning, is built into the format, and you get it, of course, for free. And you get it in the format so that you don't have to reinvent it for every single higher level application that you're using TileDB with.
So that's what you get from TileDB embedded. You get, again, the efficient storage in multidimensional arrays and the efficient slicing, compression, and all that nice stuff, the optimizations for the cloud, the parallelism, the integrations with all the tools that I mentioned, and, of course, the data versioning and the updates and all of that. You get that in an embedded way. You don't need to spin up anything, and this is not tied to any particular subset of the ecosystem. It's for the entire ecosystem. Now, if you want to do access control, especially at the scale that we're discussing, which is planet scale, you should be able to share any portion of your data set with anybody, anywhere, and with as many people as you like, even beyond your organization.
Right? This is exactly what TileDB Cloud was built to do, because that cannot be done in a completely decentralized way. There must be somebody who keeps a database with all the customers and all the access policies in order to be able to enforce them. And that's exactly what TileDB Cloud does. It enforces the access policies while keeping the rest of the code identical. Right? You have a SQL query. It's gonna work the same whether you're using TileDB Cloud or you're using TileDB embedded. But if you're using TileDB Cloud, then we know how to enforce any access policies that come along with that particular array. So that's how we built a universal access control layer, and that comes along also with logging. We log everything that is happening on your arrays or on somebody else's arrays.
And the reason why this is universal is because all the access policies are defined on this universal storage format. If we did not have a universal storage format, and we were an engine that supported Parquet and ORC and Zarr and HDF5, we would not be able to seamlessly define access policies in a single way and be able to scale access control to planet scale.
[00:39:42] Unknown:
And in terms of the evolution of the project, I'm curious what have been some of the ways that it has changed since you first began working on it and some of the assumptions that you had early on in the project that have had to be reconsidered as you started getting more people using TileDB and more different problem domains and technology stacks?
[00:40:03] Unknown:
Yeah. The original TileDB was just a research project. Right? There was a crazy dude writing some code and trying to convince people that this has a lot of value in all those domains. Right? The original designs remained more or less the same, and we lucked out in that respect. I'll give you an example. The original decision to work with immutable batches of writes, of written files, was an important architectural decision, because it allowed us, first, to do updates on sparse data, which are very, very difficult, because otherwise you would have to reorganize the whole dataset if you're just inserting data in random places. But most importantly, this object immutability is exactly what you want if you're working on an object store like S3 or Google Cloud Storage or Azure Blob Storage, because all those objects are immutable. Right? You cannot change just 4 bytes in a single file. You will have to rewrite the whole file. And that allowed us, of course, to become super optimized on the cloud. So that decision remained.
A lot of stuff in the core code got completely refactored, obviously, but not from an architectural point of view when it comes to the format. It's mostly the code, how optimized we made it. We made the protocol to S3 much less chatty, which allowed us to avoid certain latencies. So it was mostly around optimizations. But 1 of the biggest architectural decisions, or format decisions, that we made, which indeed was important to happen after we created the company and actually appeared only recently, a couple of months ago, with TileDB 2.0, was the feature that allows you to define any of your dimensions in a sparse array to have different data types.
I mean, in a traditional array definition, all dimensions probably have integral values. Right? It doesn't make sense to have a float coordinate, for example. Of course, we had supported float coordinates since the get go, but we wanted to make it possible for each of the dimensions to have a different data type if the user wants to, because that was the only way that we could capture data frames. Because for data frames, ideally, the user can choose any subset of the columns with any data types and say, this is my clustered index. Make sure that, despite the fact that those have different data types, the slicing is very, very fast on those dimensions. And that required a lot of refactoring.
And that's what TileDB 2.0 introduced. So that was an important technical refactoring that we did. And, of course, it starts to pay off massively because now we can handle generically any kind of data frame with duplicates and everything, stuff that the traditional array would just not be able to handle.
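To illustrate the heterogeneous-dimension feature introduced with TileDB 2.0, here is a hedged sketch of a sparse array whose two dimensions have different data types (a variable-length string symbol and a datetime), which is what makes the data-frame, clustered-index style of modeling possible; the names, domain bounds, and tile extents are illustrative.

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    # A string dimension and a datetime dimension: two dtypes in one domain.
    tiledb.Dim(name="symbol", domain=(None, None), tile=None, dtype="ascii"),
    tiledb.Dim(
        name="ts",
        domain=(np.datetime64("2000-01-01"), np.datetime64("2030-01-01")),
        tile=np.timedelta64(7, "D"),
        dtype="datetime64[s]",
    ),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    allows_duplicates=True,   # data frames may repeat index values
    attrs=[tiledb.Attr(name="price", dtype=np.float64)],
)
tiledb.Array.create("trades_array", schema)
```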
[00:42:59] Unknown:
And then another core element of the data format that we've mentioned a few times in passing is data versioning, which is particularly critical for things like machine learning workloads where you're doing a lot of different experimentation and generating different output datasets, and you need to be able to backtrack or figure out what version of code went with a particular set of data. So I'm wondering if you can dig a bit more into some of the versioning aspects of the file format and how it's implemented and some of the challenges that you're overcoming as far as being able to manage life cycle policies
[00:43:32] Unknown:
to handle things like cost optimization or garbage collection of old versions of data? This is 1 of the most powerful features in TileDB and the big differentiator from other formats as well. And, again, this is built into the format. So I don't know of any embedded storage engine that can do that. I mean, you can kind of do that with Parquet files, but you need to use something like Delta Lake on top in order to be able to pull it off. It's not the Parquet format itself that allows you to do versioning. You need to kind of hack it on top with a different piece of software in order to be able to do it. TileDB builds it into the format. Right? That's exactly how it is architected. But I would like to clarify a little bit what we mean by data versioning, so that people don't think that we have built some kind of Git for data. This is not exactly what TileDB is. Although, if there is enough interest, we may be able to build something like that. We do have the foundation for that. So what we mean by versioning is that when you perform a write, even parallel writes, it doesn't matter, this particular write is a batch write.
We usually tell users not to write 1 record or 1 value at a time. Just batch your values and then perform 1 write, because TileDB parallelizes everything. And it's very, very fast when it comes to batched writes. And each batched write creates a subdirectory, a timestamped subdirectory within the array directory. And all the files that pertain to that batch write are inside that subdirectory. So when you do multiple writes, and we timestamp every write, we give you the ability to time travel back. Right? To travel back in time and open the array in a state before some of the updates happened. For example, I do an update today, and I do 1 tomorrow, and 1 the day after. But then I feel that something is not right, and I wanna see what happened yesterday and what happened the day before. So we give you the ability to open the array at a particular timestamp and then get all the contents of the array, so you can issue any query, the same query if you want to, but see the state of the array as it was before the writes that happened after the timestamp that you provided.
And we have architected it in such a way that we provide excellent isolation. Right? Every fragment does not interfere with any other fragment. A fragment is the subdirectory. It's this batch. Right? So every fragment does not interfere with any other fragment. There's no locking. There's no central locking. No locking is needed, because the fragment name is unique across all the fragments. It carries a timestamp and a UUID, which is random. So serializability is guaranteed by default. So this is what we call data versioning. This is different from saying, okay, I'm gonna go back to a particular version, then I'm gonna fork it, which is something that you would do with Git. This is, again, doable, but we'd like to see more use cases in order to be able to build it. And I'm curious how this differs from things like Datomic as far as being able to handle the
[00:46:40] Unknown:
versioning of data across time and doing things like event sourcing so that you don't ever actually delete anything. You just mutate a record and keep the previous version so that you can be able to say, these are all the different changes that happened to a particular attribute. 1 of the canonical examples being you have a user who has an address, and they move to a new location. So the fact that they used to live at a particular point never ceases to be a fact. They just have a new fact as to their current location so that you can be able to go back through time and see what was the value at a particular point.
So, yeah, I'm just wondering if you can give a bit of comparison as to how the versioning in TileDB compares to something like Datomic for being able to handle the way that data is represented in a versioning capacity.
[00:47:37] Unknown:
Yeah. Data versioning in TileDB is more similar to what Delta Lake provides with Parquet files. And, of course, we don't have the same ACID guarantees that Delta Lake provides. This is a large topic, which we will discuss in future video tutorials. But what we do provide is write serializability without any kind of locking, everything serverless. With Delta Lake, you need to have a Spark cluster or a PrestoDB cluster in order for this to work. We don't need any cluster for this to work. And it's mostly batched writes, which can be done in parallel, and then you can open the array at any instant in time, ignoring all the updates that happened afterwards. We do not have any transactional semantics at the moment. That's not something we have optimized for up until now. And, also, at least in the embedded format, the TileDB embedded format, we don't keep logs of, you know, who accessed which attribute when. This is not the functionality you're gonna get. You get all the logs, very detailed logs, with TileDB Cloud, but that's not about data versioning. You just get tons of logs about pretty much everything that you have done, but we don't consider that part of the data versioning feature that we have. At least not today.
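A small sketch of the time-travel behavior described here, using the TileDB Python API; the array name is a placeholder, and the exact form of the `timestamp` argument to `tiledb.open` (milliseconds vs. datetime) varies slightly by version.

```python
import time
import numpy as np
import tiledb

uri = "versioned_example"  # hypothetical local array
dom = tiledb.Domain(tiledb.Dim(name="id", domain=(0, 99), tile=10, dtype=np.uint32))
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=[tiledb.Attr(name="val", dtype=np.float64)])
tiledb.Array.create(uri, schema)

# First batch write -> fragment 1.
with tiledb.open(uri, mode="w") as A:
    A[np.arange(5)] = {"val": np.ones(5)}
t_after_first = int(time.time() * 1000)   # milliseconds since the epoch
time.sleep(0.01)                           # make sure the next fragment is strictly later

# Second batch write -> fragment 2, overwriting the same cells.
with tiledb.open(uri, mode="w") as A:
    A[np.arange(5)] = {"val": np.full(5, 2.0)}

# Reading "now" sees the latest values; opening at the earlier timestamp
# ignores every fragment written after it.
with tiledb.open(uri) as A:
    print(A[:]["val"])                              # all 2.0
with tiledb.open(uri, timestamp=t_after_first) as A:
    print(A[:]["val"])                              # all 1.0
```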
[00:49:01] Unknown:
And then the other thing that I'm curious about is how you handle concurrency in access to the data and being able to resolve conflicts, particularly because of the fact that different batched writes will produce different versions of data. And so if you have people who read the data at the same time and then create different batched writes, how do you resolve those different updates?
[00:49:27] Unknown:
Now, we have architected TileDB in a way that can handle multiple such writers, and multiple interleaved readers as well, in the following manner. As I mentioned, every batch write creates a fragment, which is a subdirectory in the array directory, which does not interfere with anything else. And it will never collide, because the name is guaranteed to be different, because we have a random token in it. So only with negligible probability can you end up with a conflict there. So multiple writers can write at the same time, and there are gonna be no conflicts, no corruption whatsoever, even if 1 of the writes fails. If the write completes, we introduce another object, a special okay file, which says, okay, this subdirectory is good to go. And then we respect all the eventual consistency issues that, for example, S3 introduces. So it is architected to work with S3's eventual consistency, and therefore we inherit that model when it comes to consistency.
Okay? The reads do not conflict with the writes, because a read will never read a partially written fragment, and that's because of the absence of this okay file. If the reader, upon opening the array, doesn't see this okay object, it's going to completely ignore any partially written fragments. So this allows us to perform concurrent writes and reads without having a centralized service to manage any kind of conflict.
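And a hedged sketch of that lock-free model: several processes write to the same array concurrently, each producing its own uniquely named fragment, with no coordination service in the middle; the array name and data are illustrative.

```python
import multiprocessing as mp
import numpy as np
import tiledb

URI = "shared_sparse_example"  # hypothetical local array

def writer(offset: int) -> None:
    # One batch write per process; the resulting fragment gets a unique
    # timestamped name plus a random UUID, so no locking is needed.
    coords = np.arange(offset, offset + 1000)
    with tiledb.open(URI, mode="w") as A:
        A[coords] = {"val": np.random.rand(1000)}

if __name__ == "__main__":
    dom = tiledb.Domain(
        tiledb.Dim(name="id", domain=(0, 99_999), tile=1000, dtype=np.uint64)
    )
    schema = tiledb.ArraySchema(domain=dom, sparse=True,
                                attrs=[tiledb.Attr(name="val", dtype=np.float64)])
    tiledb.Array.create(URI, schema)

    # Four concurrent writers, no conflicts, no central lock.
    procs = [mp.Process(target=writer, args=(i * 1000,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # A reader only sees fully committed fragments (those with an OK marker).
    with tiledb.open(URI) as A:
        print(len(A[:]["val"]))   # 4000 cells once all writers finish
```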
[00:51:03] Unknown:
And then the other interesting element of this is the fact that the TileDB embedded project is open source and publicly available for free. And then you're also building a company around that and the cloud service on top of it. So I'm curious how you're managing governance and ongoing sustainability of the open source aspects of the project and the tensions of trying to be able to build a profitable business on top of that?
[00:51:32] Unknown:
TileDB Embedded is entirely open source, and we will maintain it as such. We do manage it as a team. We govern it. We welcome contributions from anybody. We're very happy to see contributions to it. We're very responsive, as you can see in the forums and the GitHub issues. And we will abide by this style. TileDB embedded and the integrations and the APIs are all going to be open source. The good news for us is that TileDB Cloud is completely orthogonal. It uses TileDB embedded. So all our servers that we spin up and that we do the serverless computations on, they all rely on TileDB embedded. We use the array format to define the access policies, the logs, and everything else. But all the TileDB Cloud functionality is completely orthogonal to what we do in TileDB Embedded. And that allows us to have a very clean separation of the 2, and this has not created problems for us so far.
[00:52:34] Unknown:
And as far as people who are using TileDB to build applications on top of it, what have you found to be some of the most interesting or unexpected or innovative ways that it's being used?
[00:52:45] Unknown:
We have seen very diverse applications of TileDB embedded, and most recently of TileDB Cloud as well. What I want to note mostly are the ones that I find admirable, because TileDB was used first in an important domain like genomics. Right? And some very high profile organizations trusted us to do that when we were just 4 people in the company. Right? And much earlier, when I was a single person in the lab at MIT trying to just create a very quick proof of concept. So I find this admirable because those are important use cases, and data management is a huge bottleneck for them. I mean, can you believe that data management is the bottleneck to the actual science? You cannot do analysis at scale, especially in genomics, where it is important to do it at scale.
And you cannot do it because you are blocked by data management. You're blocked by all those legacy formats. You are blocked by inefficient formats and inefficient data management in general. So this is what surprised me the most: not the fact that TileDB handled those cases. That was not what surprised me. But that certain people in certain high profile organizations trusted us to build this and improve it so that we could solve a very important problem in a very important domain.
[00:54:07] Unknown:
In terms of your own experience of building and growing the project and the business around TileDB, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:19] Unknown:
The most challenging part of building this piece of software is not the nearly 1,000,000 lines of code that I built. That's not it. That's the kind of easy part. The most difficult part is to start it from scratch and build a brilliant team around it. The most difficult part is to inspire some brilliant people to come and invest their time and put their passion into building this colossal vision. I mean, we are not delusional here, and this is something that I really want to stress strongly. We're not delusional. This is a tall order. This is a very bold vision. But this is what excites us the most. And the most challenging part is to convince the engineers to come and work with me, the people that are doing the marketing, the investors, of course, the consultants, even my managers at Intel and my colleagues at MIT, to even start this project. So that was the most challenging part.
And we are in good shape. I mean, we've been doing this for 3 years. We feel very confident about the team, very confident about the software. There is a long road ahead. But as long as we're excited and enthusiastic, I think the end result is going to reward everybody.
[00:55:38] Unknown:
And TileDB is an ambitious project, and you have a potentially huge scope of work to be done in terms of the core capabilities of the storage format, the cloud platform that you're building around it, and all the different integrations for all of the runtimes and compute interfaces. I'm curious, what are some of the features or capabilities that you're consciously deciding not to implement, or that you're deferring to other people to build out as part of your surrounding ecosystem?
[00:56:11] Unknown:
Great question. All the computational parts, as we explicitly state on the website as well. Right? We go for pluggable compute. But let me elaborate a little bit. The first thing that we don't wanna do is create another language to access the data. Right? We believe that would be catastrophic. People like to access data in so many different ways. They wanna access data directly through language APIs. They wanna access the data through already popular tools. So it would not be wise to just create our own thing and try to convince people to completely change the way they work every day.
So that's the first thing that I left out since day 1. And with that comes a query parser and all the technology that comes along with defining a new language. So definitely not a new language. The second thing is we're not building a new SQL engine. There are so many wonderful SQL engines out there. Our strategy is to partner with all those brilliant people that are building those engines. We can alleviate a lot of the storage problems that they're probably not interested in solving if they really want to work, for example, on query optimization. So we let those guys work on query optimization.
So we're not interested in building a SQL engine from scratch. What we are interested in doing in that respect is pushing down some of the compute primitives that the SQL engine could use. First, because if you push it down as close to the data as possible, it's probably gonna be faster, because you're gonna avoid certain copies of the data. We're doing a good job internally in the core to do everything multi-threaded, vectorized, and so on and so forth. So we are equipped with the knowledge and skills to do this efficiently. But also, most importantly, certain computational primitives you find in SQL engines exist in exactly the same form in other engines as well. A filter is a filter, even in PDAL or GDAL or Pandas.
It's the same. So why don't we push it down, do it very, very efficiently there, and then have all the other pieces of software that are plugged in on top utilize it? The same goes for group bys and merges. But we're not gonna rebuild a whole SQL engine, because a SQL engine is not just a couple of primitives put together. There is a lot of intelligence, a lot of sophistication there, and we are not there yet. We don't wanna do that. As I said, the original motivation was linear algebra, and we're very much interested in building all those distributed algorithms on TileDB Cloud. We focused so far mostly on the infrastructure: how do we create a serverless infrastructure to be able to dispatch any kind of user-defined function or task graph, so that eventually we ourselves, as well as other users, can build distributed algorithms, with linear algebra algorithms being part of those, on this infrastructure. So, again, we kind of delegated building those distributed algorithms to anybody who is equipped and capable and willing to build those algorithms.
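As one concrete instance of pushing a filter down into the storage engine rather than materializing everything and filtering in a client tool, here is a small sketch using TileDB-Py query conditions. The array layout and attribute name are made up, and the `cond` string syntax shown is an assumption that has varied across TileDB-Py releases, so consider it illustrative only.

```python
import numpy as np
import tiledb

uri = "filter_pushdown_example"  # hypothetical URI

# A 1-D sparse array with a single integer attribute "a".
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 999), tile=100, dtype=np.int64))
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[tiledb.Attr(name="a", dtype=np.int64)],
)
tiledb.Array.create(uri, schema)

# Populate the array with some values.
with tiledb.open(uri, mode="w") as A:
    coords = np.arange(1000, dtype=np.int64)
    A[coords] = {"a": coords % 7}

# The predicate "a > 4" is evaluated inside the engine, close to the data,
# so only the matching cells are returned to the caller rather than the
# full attribute.
with tiledb.open(uri, mode="r") as A:
    result = A.query(cond="a > 4")[:]
    print(result["a"][:10])
```

The same kind of pushdown is what an integrated SQL engine or a tool like Pandas can lean on, instead of re-implementing the filter on fully materialized data.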
[00:59:16] Unknown:
TileDB is definitely an interesting project and has a broad range of applications. But what are the cases when it's the wrong choice?
[00:59:25] Unknown:
Yeah. That's another great question. So TileDB is not a transactional database. Don't use TileDB as a transactional database. Theoretically, you can do transactions through MariaDB and its integration with TileDB, but that's not our thing. It's MariaDB's. The credit, if you do transactions, is gonna go to MariaDB. It's not going to us. We act as a data connector to that. If you want, for example, some of the ACID guarantees that you must have in order to be transactional through direct access from Python, you're not gonna get those today.
You can get some of those guarantees, and they can get you a long way for certain applications. But if you are a core transactional application, that's not something that you would use TileDB for, for sure. Another thing that you would not use TileDB for, or at least you wouldn't switch to TileDB for, is this: if you're using a data warehouse, if you're happy, if you're doing only SQL, if you don't care about interoperability, if you don't care that much about cloud storage and separating storage from compute, then you should probably stick to the data warehousing solution that you have, because these are not competitors to us. You would use TileDB even for data frames if you want to separate storage from compute, if you want to do user-defined functions in other languages, in any language, actually, because that's what we're trying to do. But not if you're sticking only to SQL. If you want SQL plus more, then TileDB is a great solution for that. And finally, we have not tested and we have not optimized for streaming scenarios, again, only because we didn't have use cases that demanded that. So you cannot consider us a core streaming solution. So for transactional and streaming use cases, I wouldn't consider TileDB.
[01:01:13] Unknown:
You've mentioned a few different things that you have planned for the future roadmap of TileDB. Are there any other aspects of the work that you're doing that you have planned for the upcoming releases that you wanna discuss, or any other aspects of TileDB and multidimensional storage that we didn't discuss that you'd like to cover before we close out the show?
[01:01:39] Unknown:
Yes. Absolutely. So the TileDB embedded engine we will always evolve. There are so many issues, even publicly on GitHub, that we're working hectically to get done, always on performance, always on added features. So TileDB embedded will always evolve, and we will always have several people working on TileDB embedded full time. But the biggest bet that we have, and the biggest investment of our time, is gonna go to cloud. So TileDB Cloud, again, allows you to share data and code with anybody, and it allows you to do everything serverlessly. And that's exactly what we want to focus our efforts on, because once you solve the storage issues, which we believe we did to a great extent, especially for the use cases that we work on, the next step is: how do we alleviate all the engineering hassles? Because, again, data scientists want to do scalable analysis. They want to get to insights very quickly, which can lead to scientific discoveries. That's what a scientist wants to do. Right? They don't want to spin up clusters. They don't wanna monitor clusters. They don't wanna debug clusters. So TileDB Cloud has the goal of alleviating all of this burden from the scientists that want to work with data at scale very, very easily.
So the plans for the future: double down on cloud. Tons of cool stuff are coming up. So stay tuned, and you're gonna see them in releases very, very soon.
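To give a feel for the serverless model being described, here is a rough sketch using the TileDB Cloud Python client. The token and the user-defined function are placeholders, and the exact entry points (`tiledb.cloud.login`, `tiledb.cloud.udf.exec`) are assumptions based on the client library, so check the current API before relying on them.

```python
import numpy as np
import tiledb.cloud

# Authenticate against TileDB Cloud (the token value is a placeholder).
tiledb.cloud.login(token="MY_API_TOKEN")

def column_mean(a: np.ndarray) -> float:
    # An arbitrary user-defined function. It executes server-side, so the
    # caller never provisions, monitors, or debugs a cluster.
    return float(np.mean(a))

# Dispatch the UDF serverlessly with its arguments; the result is returned
# to the client when the remote execution completes.
result = tiledb.cloud.udf.exec(column_mean, np.arange(100, dtype=np.float64))
print(result)
```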
[01:03:03] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:19] Unknown:
Yeah. There is a lot of brilliance and sophistication in data management today. That's not the problem that we saw at all. The biggest problem that we saw was that any data management solution out there, especially the very sophisticated ones, was architected around a single data type, for example tables, and a single query engine, for example SQL. If you use tables and SQL, there are tons of great solutions out there. But that was problematic, as I mentioned before, for other verticals. Right? So that was the biggest gap. The biggest gap was that there has not existed, so far, a system that can work on any data seamlessly, in a unified way; build all the data management features like access control and logging and updates and data versioning on a universal storage format that can capture all the data; and then interoperate with all the languages and all the tools out there, to give you the flexibility to operate on the same data without converting from one format to another. That system has never existed, and this is why we built TileDB as the universal data engine.
[01:04:32] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you're doing with TileDB. As I said, it's definitely a very interesting project and very forward-looking, and I'm interested to see where it goes in the future and some of the ways that the ecosystem grows around it. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day. Thank you very much. It's been a pleasure. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Project Overview
Interview with Stavros Papadopoulos
Stavros' Background and Involvement in Data Management
Overview of TileDB
Inspiration and Design of TileDB
Motivation Behind TileDB Cloud
Challenges in API Design for TileDB
On-Disk Format and Data Modeling in TileDB
Workflow and Benefits of TileDB
TileDB's Architecture and Data Management
Evolution and Assumptions of TileDB
Data Versioning in TileDB
Concurrency and Conflict Resolution
Open Source Governance and Business Sustainability
Innovative Uses of TileDB
Challenges in Building TileDB
Future Roadmap and Features
Closing Thoughts and Contact Information