Summary
Most databases are designed to work with textual data, with some special-purpose engines that support domain-specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shift in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB Cloud to embed authorization into the storage engine while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how it enables a more flexible way to store and interact with data, powering better data sharing and new opportunities for blending specialized domains.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what TileDB is and the problem that you are trying to solve with it?
- What was your motivation for building it?
- What are the main use cases or problem domains that you are trying to solve for?
- What are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?
- What are the benefits of using matrices for data processing and domain modeling?
- What are the challenges that you have faced in storing and processing sparse matrices efficiently?
- How does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?
- What are the benefits of unbundling the storage engine from the processing layer?
- Can you describe how TileDB embedded is architected?
- How has the design evolved since you first began working on it?
- What is your approach to integrating with the broader ecosystem of data storage and processing utilities?
- What does the workflow look like for someone using TileDB?
- What is required to deploy TileDB in a production context?
- How is the built in data versioning implemented?
- What is the user experience for interacting with different versions of datasets?
- How do you manage the lifecycle of versioned data to allow garbage collection?
- How are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?
- What are the most interesting, unexpected, or innovative ways that you have seen TileDB used?
- What have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?
- What features or capabilities are you consciously deciding not to implement?
- When is TileDB the wrong choice?
- What do you have planned for the future of TileDB?
Contact Info
- stavrospapadopoulos on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- TileDB
- Data Frames
- TileDB Cloud
- MIT
- Intel
- Sparse Linear Algebra
- Sparse Matrices
- HDF5
- Dask
- Spark
- MariaDB
- PrestoDB
- GDAL
- PDAL
- Turing Complete
- Clustered Index
- Parquet File Format
- Serializability
- Delta Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey. And today, I'm interviewing Stavros Papadopoulos about TileDB, the universal storage engine. So, Stavros, can you start by introducing yourself?
[00:01:49] Unknown:
Absolutely, Tobias. Thank you very much for having me. I'm Stavros Papadopoulos. I'm the CEO and founder of TileDB. I'm a computer scientist, and I'm excited to talk about TileDB and everything we have done. And do you remember how you first got involved in the area of data management? I did my PhD in databases, so that's how I started. I have always been in the databases space. I was at the time focusing mostly on multidimensional data structures, data privacy, and cryptography. But then in 2014, I joined Intel Labs and MIT where I worked on a big data initiative alongside some database gurus at MIT as well as some high performance computing ninjas at Intel Labs, and here we are. This is where everything started.
[00:02:27] Unknown:
And so you have recently started building the TileDB project. Can you give a bit of an overview about what it is and some of the problems that you're trying to solve with it? I'm gonna start by explaining in a nutshell what it is. So it is a novel engine, novel kind of database,
[00:02:43] Unknown:
which allows you to store any kind of data, not just tables like traditional databases do. So it can be genomic variants. It can be geospatial imaging. It can be data frames as well. It can be tables, but it is more than that. And it has its own universal storage format to be able to do this. And then it allows you to manage this data, so you can define access policies. You can share the data with anybody in the world. You can log everything. And, of course, you can access this data with any language or tool. So it goes beyond the traditional SQL that you find in databases. It consists of 2 components, so that we can drive the conversation later. It has an open source component, which we call TileDB embedded.
And this is the storage engine that is based on this universal format that uses multidimensional arrays, and we're gonna discuss this a little bit later. And this contains all the language APIs as well as the tool integrations, plus everything that has to do with cloud optimized storage as well as data versioning. And there is a private offering, which we call TileDB Cloud. And this is a SaaS platform which allows you to share your TileDB data with anybody on the planet and allows you to define arbitrary user defined functions with dependencies and dispatch them to a cloud service. And the most important thing about this cloud service is that it is all serverless. We do that at extreme scale, and it is built from the ground up to be serverless.
[00:04:03] Unknown:
And you mentioned that you've been working on and with databases for a number of years now. I'm curious what you are drawing inspiration from as far as some of the systems that you've worked with that you're using to direct your designs on TileDB and some of your motivation for building a new database engine that is drastically different than most of the ones that I've had experience with anyway? So, TileDB has a long history. It started at the end of 2014,
[00:04:30] Unknown:
the beginning of 2015 when I was working at MIT and Intel. At the time, I was just looking for a research project to work on under this big umbrella of big data, I mean, this initiative we were working on at the time. And I was a C++ programmer. So I had 2 different types of influences. Right? The MIT people who were building traditional commercial database systems, and then Intel Labs who were building high performance computing software. And a lot of it was around linear algebra, which is at the core of machine learning, deep learning, and all advanced analytics. So I was looking for a way to combine these 2 areas. And from a research perspective, what I wanted to do was mostly sparse linear algebra, which essentially means linear algebra with matrices that have a lot of zeros or empty cells. Right? And these are more peculiar from a performance perspective, and they need careful handling.
And, also, I was very much influenced by geospatial data from my time during my PhD years. So, frankly, I was looking for a way to store sparse matrices so that I can do very fast sparse linear algebra, and at the same time, I can capture some of the geospatial use cases. Again, everything completely research oriented. So I had a couple of requirements as I was building this engine for sparse arrays. The first requirement, of course, was that it had to handle sparsity and ideally dense arrays as well, so that it is a unified engine. A dense array has values everywhere, so the number of zeros is not as big as in sparse arrays. The second requirement was that whatever we were building, it had to work very, very well on the cloud because we saw a big shift to the cloud.
So the storage engine should work on AWS S3, Google Cloud Storage, Azure Blob Storage, or any other object store in the cloud. Another requirement was that it had to be an embedded library. So it had to be built from scratch by definition, because it was the storage layer and I couldn't use any other component from established databases. So I wanted to build it from scratch in C++ and in an embedded way so that you don't have to set up a server to use it. And the fourth requirement, at least for me, was that it should be built in C++. First, for speed. Second, because I was good at C++. But finally, because I had the longer vision that these libraries should interoperate with other languages as well, so having a C++ library may make this a little bit easier.
Now I have to mention that at the time, there were such storage engines, like HDF5, for example, a very popular dense array engine. But that was architected around dense arrays, so I couldn't use it for my sparse problems. And second, it was not built for the cloud, because it's been around for decades and the cloud gained popularity only recently. So it was not architected to work very well on S3, for example. So that's how it started. That's what motivated the storage engine. So I built it in a way that handles both dense and sparse arrays in a unified way, because if I architected it to handle sparse arrays, there are tons of similarities in handling dense arrays. So let's identify what is different, spell this out, and handle both in a very, very efficient way. And at the same time, I was very fortunate that Intel was working with a prominent genomics institute, and they presented me with a very important and difficult problem around storing genomic variants. So huge data in essentially a sparse format. The genomics data is very, very sparse. So the solution that I presented was very relevant.
We created the proof of concept. It went very well, and it got adopted. So we said, okay. This storage engine probably is very meaningful for more use cases than I had originally thought. So let's give it a chance and start building it up. And this is what made TileDB embedded. That's the open source system that I created at the time. And, of course, it evolved, and we can discuss later about how. But that's entirely the motivation behind the TileDB embedded storage engine, which is the only system that handles both dense and sparse multidimensional arrays in a unified way. And what was the motivation behind TileDB Cloud? Now at the time also, we were discussing with a lot of scientists. Again, of course, I had the databases perspective from MIT, but I was talking to other groups and other scientists from geosciences, from genomics, and other scientific domains.
And I observed a couple of similarities. The first thing that I observed is that every single domain has its own crazy data format. It is a file format which is very domain specific. And it's crazy in the sense that it has a lot of jargon. Although, at the end of the day, it's just data. And I'm gonna explain and clarify a little bit what I mean by that. And a big similarity there was that regardless of what format you choose for a specific domain, custom made for your application, no matter how good it is, and you can make it very, very good, all hell breaks loose when you have updates, data versioning, and access control.
Right? A single file works great, but not so much if you start updating this file or you're adding more files. You end up analyzing thousands of files. And that was the same in genomics as well as geospatial, the exact same thing. Another thing that I observed was that every domain preferred different languages and tools. For example, 1 group in bioinformatics really liked R, another group liked Python, and in geospatial you would find somebody who liked Java as well. So a lot of different preferences in terms of what languages you want to use in order to access your data. And then that goes back to the original decision that we build everything in C++ so that we can build APIs for every language.
Again, regardless of the domain, the scientists always wanted to share their data, of course, with access policies and everything, and code for reproducibility. Right? So just sharing files was not going to cut it. So, eventually, the biggest observation of all was that the data management principles, the data management features that we have in databases, I couldn't find them in domains like genomics and geospatial. And later, we found that that was true for other domains as well. So data management was a problem. It was not the science behind those domains that was creating all the problems.
So we kind of lucked out in the fact that the other observation was that all data, regardless of the vertical, can be efficiently modeled as a dense or a sparse multidimensional array. For example, an image is a dense 2D array. Genomics is a sparse 2D array. LiDAR point clouds are sparse 3D arrays. Even key values can be considered as a sparse 1-dimensional vector where the keys are string values in the string domain. So even that, I can prove to you, essentially boils down to sparse arrays. So a lot of common things across the verticals, and we already had TileDB embedded, which addressed the issue of storing everything as multidimensional arrays and addressed the issue of interoperability: everybody can access the data from their favorite tool and their favorite language.
What we needed was to try to scale the other data management features, like access control at the global scale, which did not exist, try to do everything serverless, because that alleviates the pain of setting up clusters and addresses certain issues with scalability, and also create user defined functions with arbitrary dependencies as task graphs and deploy them in the cloud. And that effectively gave rise to TileDB Cloud, which is the SaaS platform we built for the cloud, which handles data management, so access control and logging at planet scale, as well as serverless compute in the form of task graphs.
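To make the array-modeling idea concrete, here is a minimal sketch using the TileDB Python API (the `tiledb` and `numpy` packages); the array names, domains, and tile sizes are illustrative assumptions, not details from the episode.

```python
import numpy as np
import tiledb

# A dense 2D array for an image: only attribute values are stored;
# pixel coordinates are never materialized.
image_dom = tiledb.Domain(
    tiledb.Dim(name="y", domain=(0, 1023), tile=256, dtype=np.uint32),
    tiledb.Dim(name="x", domain=(0, 1023), tile=256, dtype=np.uint32),
)
image_schema = tiledb.ArraySchema(
    domain=image_dom,
    sparse=False,
    attrs=[tiledb.Attr(name=c, dtype=np.uint8) for c in ("r", "g", "b")],
)
tiledb.Array.create("image_array", image_schema)

# A sparse 2D array (think genomic variants): only non-empty cells are
# materialized, so their coordinates are stored alongside the values.
points_dom = tiledb.Domain(
    tiledb.Dim(name="row", domain=(0, 2**32), tile=10_000, dtype=np.uint64),
    tiledb.Dim(name="col", domain=(0, 2**32), tile=10_000, dtype=np.uint64),
)
points_schema = tiledb.ArraySchema(
    domain=points_dom,
    sparse=True,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
tiledb.Array.create("points_array", points_schema)
```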
[00:12:31] Unknown:
And an interesting thing to note too is that, as you said, all of these different specific domains have their own custom file formats that they've been using for years, which means that a lot of these people who are working and researching in these domains or who are building applications probably have piles of data lying around in those formats. I'm curious what you have seen as far as the approach to being able to translate that information from those legacy formats into TileDB, or from TileDB into those legacy formats,
[00:13:06] Unknown:
to be able to fit with their existing tooling? This is where we spend the majority of our time, admittedly. Right? Because, again, we're a storage first company, so we spend most of our time understanding each vertical and each file format. And, of course, we had to bring some brilliant people onto our team who had this knowledge, or we were working very closely with customers, which, of course, provided us with this knowledge. So, essentially, what we had to do was understand the data format and try to map it into a multidimensional dense or sparse array depending on the access patterns. Right? So it took a little bit of back and forth in order to understand what the best modeling is. But at the end of the day, it was an array. Then we created ingestors that were reading from those legacy formats into the TileDB format, and then everything fit in place. The reason is that once you get your data into the TileDB format, then you inherit everything we build on top, regardless of your vertical. For example, if you're in genomics and you store the data as arrays, you get our integration with Dask, Spark, MariaDB, PrestoDB, the 6 APIs we have. You get the whole ecosystem, and our whole mantra in the company is that we are going to integrate with pretty much everything that exists out there. So once you put the data into TileDB, you get this versatility, this flexibility to process your data with anything you like, including your own tools. For example, for the geospatial verticals, we did integrate with popular geospatial libraries like PDAL and GDAL. And, of course, we're happy to do the same elsewhere; in genomics, for example, it is in our plans to integrate with a popular library called Hail.
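As one illustration of the ingestor path for geospatial data: GDAL (3.0 and later) ships a TileDB raster driver, so a translation can be a one-liner. The file names and destination URI below are hypothetical, and the available driver options depend on how GDAL was built.

```python
from osgeo import gdal

gdal.UseExceptions()

# Translate a GeoTIFF into a TileDB array via GDAL's TileDB driver.
# "input.tif" and the destination name are placeholders.
gdal.Translate(
    destName="landsat_scene_array",
    srcDS="input.tif",
    format="TileDB",
)
```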
[00:14:43] Unknown:
So because of the fact that you have this universal data format that can model all of these different problem domains, and you're focused on being able to store the information efficiently and have these versatile interfaces for all the different computation layers, I'm curious what you have seen as far as the challenges of designing the APIs to make it easy to actually use all of these different computation layers on top of TileDB, because you mentioned things like Spark and Presto and MariaDB. So you're working in Turing-complete languages. You're also working in SQL. I'm curious what some of the challenges are as far as being able to make the access patterns intuitive and efficient for all those different use cases. Yes. This is a great question.
[00:15:32] Unknown:
Again, we kind of lucked out in that respect. In the past years, let's start with the databases, and then we're gonna explain about everything else, all the other computation tools. The databases recently shifted to a framework where they support pluggable storage. Right? Before, they were monolithic; they handled all the layers in the stack, from parsing the query down to storing the data on the back end. And most recently, they just unbundled the storage. Right? So they created their own APIs that allow you to plug your own storage engine, your own storage format, in there. So that made it very easy for us to just go into MariaDB, for example, or PrestoDB or Spark, which has data connectors by definition, and just plug it in. It was a lot of work to do it, because we had to understand how every single tool does it. So it's a time issue rather than a complexity issue, because those guys did a good job of exposing clean APIs to do that. And then, fortunately, for the databases, we have a 1-to-1 mapping between a data frame and an array. And this is done by pretty much selecting a subset of your columns to become your dimensions, and those are your fast indexable columns. These are the columns that TileDB will allow you to slice very fast on. So for databases, we lucked out because they were already doing it and we just plugged TileDB into them. For Spark also, it was easy because they had data connectors. For Dask, the same thing. They have data connectors. They don't bind their storage to a particular library. So that was easy to do. And pretty much it's the same story for the rest of the tools like GDAL and PDAL. But we needed to have people that have done it before in order to do that very, very efficiently, both in terms of time as well as performance. And, again, we have people on our team that are specialized in doing exactly that. So it was not that much of a challenge from an engineering perspective.
It was just a time investment, which we happily did because that completes the vision of being a universal data engine, and we will continue doing that.
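A hedged sketch of that data-frame-to-array mapping from the Python side: the columns you choose as the index become the dimensions (the fast, clustered-index-like columns) and the rest become attributes. The column names are made up, and the exact keyword arguments and index handling of `tiledb.from_pandas` can vary between releases.

```python
import pandas as pd
import tiledb

df = pd.DataFrame({
    "sensor_id": [17, 42, 17],
    "ts": pd.to_datetime(["2020-06-01", "2020-06-01", "2020-06-02"]),
    "reading": [0.41, 0.73, 0.39],
})

# Pick the columns you slice on most often as the index; with sparse=True
# they are written as TileDB dimensions, and "reading" becomes an attribute.
tiledb.from_pandas(
    "sensor_readings",                   # array URI: local path or s3://...
    df.set_index(["sensor_id", "ts"]),
    sparse=True,
)
```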
[00:17:49] Unknown:
And particularly for things like a SQL interface that's used to working with a 2-dimensional array, I'm curious how you represent an n-dimensional array. Is it just a series of different tables for the axes that you then join across, and then TileDB handles translating that into the multidimensional array on the back end? Or was there some other level of abstraction that you needed to add to make it easier for people to process and analyze these multidimensional structures?
[00:18:21] Unknown:
Yeah. So let's clarify this a bit. We work directly with vanilla SQL. Right? SQL on tables, not specific adaptations for matrices. At least we haven't done that just yet. We may do it in the future. But as of today, you can use, for example, MariaDB with TileDB plugged in, and you can run any SQL query as you would on MariaDB alone, any ANSI SQL query, and it's gonna work. The only thing that you substitute is in the FROM clause: you put an array URI, a TileDB array URI, which could be local, on S3, on Google Cloud, on Azure, pretty much anywhere. And the whole query is just gonna work. So there is nothing to be done by the user in order for the SQL to work.
The only thing that the user should know, from a performance perspective, is which of the columns we marked as dimensions in the TileDB world. Because if you have a predicate in the workload that does a range query or equality query on those particular columns, you're gonna get a very fast query time. That's the only thing the user should know, that those columns are special. Essentially, TileDB acts like a clustered index on those particular columns. So you're gonna get a lot of performance from that. And similarly, if you are the 1 who constructs the table, even from SQL, we have added configuration options that allow you to say, okay, this particular column is a dimension. So in the CREATE TABLE statement, you can mark which of the columns are dimensions, and you should think of those as a clustered index. That's the best way to think about it.
And everything works like in the SQL world.
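To ground the SQL path, here is a hedged sketch of what a query could look like against a MariaDB server that has the TileDB (MyTile) storage plugin enabled, issued through the standard `mysql.connector` client. The bucket path, column names, and the exact FROM-clause quoting are illustrative assumptions rather than details from the episode.

```python
import mysql.connector

# Connect to a MariaDB server with the TileDB/MyTile storage engine loaded.
conn = mysql.connector.connect(host="localhost", user="analyst", password="...")
cur = conn.cursor()

# The FROM clause points directly at a TileDB array URI (local, S3, GCS, Azure).
# Predicates on columns that were marked as dimensions are served like a
# clustered-index lookup, so this range/equality query stays fast.
cur.execute(
    "SELECT symbol, ts, price "
    "FROM `s3://my-bucket/trades` "
    "WHERE symbol = 'AAPL' AND ts BETWEEN '2020-06-01' AND '2020-06-30'"
)
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```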
[00:20:11] Unknown:
So can you dig a bit more into the actual on disk format of the multidimensional arrays and how they're stored by TileDB for being able to then query and analyze them and just some of the ways that users of TileDB need to think about data modeling that might be different than the ways that they're used to using either relational structures or graph databases or some of the custom file formats that they might be coming from?
[00:20:39] Unknown:
So we're gonna make a categorization, because every category has its peculiarities. So let's take, for example, the dense array case. Let's take an image. Okay? Say you want to store an image, each pixel, in a database table with a standard traditional database and be able to slice it multidimensionally, for example, put a range on 1 axis, put a range on the other axis, and get the slice. We call this a slice. Right? A multidimensional slice. And arrays are pretty good at giving you these slices very, very fast. That's why you use arrays. Right? So if you want to alternatively store this in a traditional database, the very first thing that you should do is create 1 record per pixel.
So instead of storing just the value of the pixel, right, the RGB or whatever it is, you have to explicitly store the coordinates of that pixel in separate columns. It's gonna be 1 column for the 1 dimension, another for the other, and then perhaps 3 columns for R, G, and B. Right? So that when you issue a SQL query, a standard SQL engine is gonna understand, okay, the first predicate is on the first column, the second predicate is on the second. And I can even create a clustered index, and there you go. Everything works very, very fast. Right? The problem is that you are introducing those 2 extra columns, and dense arrays do not explicitly store the coordinates of the pixels in the dense case. And that's a very important difference versus the sparse case. So going back to your question, for dense arrays, we don't store the pixel coordinates.
We just impose a 1-dimensional order on those 2-dimensional or n-dimensional values. And there are ways to do that. We give you a lot of flexibility to impose this order by chunking into tiles, hence the name TileDB. So, essentially, we impose an order. Then, based on some explicit tile capacity, we chunk those values, and this chunk is called a tile in TileDB. And then these values are serialized in a file, 1 per attribute. So it is a columnar format like Parquet, for example. Right? All the R values are gonna be stored in 1 file, all the values along G are gonna be stored in another, and B in another. But not the coordinates.
That's a very important distinction versus sparse arrays as well as traditional tables. Right? Because for tables, if you don't store the indices, how are you going to slice on that? Tables do not have any semantics for serializing a 2-dimensional space into a 1-dimensional curve. There are no such semantics in the database. But in a dense array storage engine, like TileDB or HDF5, that's exactly what these storage engines do very, very well. Okay? So this is the on-disk format. We serialize the multidimensional objects into a single dimensional order. So, essentially, we sort in a particular order.
We chunk. We compress each chunk individually. We put them in 1 file per column, per attribute, and then we store them in a subdirectory in an array directory, which is timestamped, and it is called a fragment. And this fragment is immutable. After it is stored, it will never be changed. And this is a very important architectural decision we took for data versioning as well as for working very, very well on cloud object stores when there are updates. So that's the dense case. The sparse case is almost identical, with the difference that now, since we don't know exactly which cell is empty and which cell has a value, and because we don't materialize the empty values or the 0 values for 2-dimensional matrices, for example, we need to explicitly store the coordinates of the non-empty cells. And imagine that, again, there is a 1-dimensional order imposed on the multidimensional space with some specific configurations.
And again, we do tiling. Again, we put the coordinates along each dimension in a separate file, then the attributes in separate files as well. And then we put everything into a subdirectory in the array directory. And specifically for the sparse case, we employ multidimensional indexes, like R-trees, for fast pruning and fast slicing. That's what we use as the in-memory structures when opening an array, to be able to slice fast and find the non-empty cells. And this pretty much summarizes what the on-disk format is.
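A small sketch of how batch writes map onto that on-disk layout, using the TileDB Python API; the array name and data are placeholders, and the fragment-inspection helper (`tiledb.array_fragments`) is an assumption about recent releases.

```python
import numpy as np
import tiledb

uri = "dense_example"  # hypothetical local array
dom = tiledb.Domain(
    tiledb.Dim(name="row", domain=(0, 999), tile=100, dtype=np.uint32),
    tiledb.Dim(name="col", domain=(0, 999), tile=100, dtype=np.uint32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="val", dtype=np.float32)],
)
tiledb.Array.create(uri, schema)

# Each batch write becomes an immutable, timestamped fragment: a subdirectory
# holding one tiled, compressed data file per attribute.
with tiledb.open(uri, mode="w") as A:
    A[0:500, 0:500] = np.random.rand(500, 500).astype(np.float32)
with tiledb.open(uri, mode="w") as A:
    A[500:1000, 0:500] = np.random.rand(500, 500).astype(np.float32)

# Two writes -> two fragments, each with its own timestamp range.
for frag in tiledb.array_fragments(uri):
    print(frag.uri, frag.timestamp_range)
```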
[00:25:23] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt.
And for people who are trying to determine how they want to structure the data that they're storing, what are some of the data modeling considerations that they should be thinking about or fundamental concepts that they need to be able to understand to be able to employ TileDB to the best effect?
[00:26:17] Unknown:
Yeah. This is very, very similar to fine tuning a database. Like, what kind of indexes are you going to use on which columns? What kind of page size are we gonna use, and all those configuration parameters. It is equally difficult. Let me start by saying that it is equally difficult. Of course, we have guidelines about performance. For example, how the tile extent affects performance, how the order affects performance, so on and so forth. For most cases, it could be straightforward. For example, for dense images, it could be straightforward because dense images are, for example, 2 dimensional. It's fairly natural to think in terms of arrays.
You know which dimension corresponds to the width, which corresponds to the height. Then you do some reasonable chunking such that each tile is, for example, 10 kilobytes or 100 kilobytes or 1 megabyte, because this affects how much data you're fetching from the cloud, or from any back end, when you're slicing. So that could be a little bit easier. It becomes a little bit more complex for sparse arrays, even for database tables, because, first of all, you need to select a subset of your columns to be your dimensions. So you need to look at the workloads that you have and say, okay, I slice on stock and time for this asset trading dataset, for example. Right? So I better make those the dimensions, because TileDB is gonna give me this performance boost whenever I have a predicate on either of those 2 dimensions.
And then, of course, there's gonna be some trial and error. And for other use cases, like genomics, we do them vertically. For example, for a specific genomic variant use case, we did a lot of benchmarks. We got the access patterns from customers and users. We said, okay, this should be a dimension, that should be a dimension, that should be the order, that should be the chunking. And we fine tuned all the other configuration parameters, and the customization we built specifically for genomics hides those. Of course, it exposes the configurations for the user to set, but we have figured out 90% of everything that you need to do so that you can start using it immediately. It is a difficult problem, though, and that's why we're around. We're always happy to help with the users' use cases. They contact us frequently, and we're extremely interested to dive in and optimize for them.
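A back-of-the-envelope illustration of that tile-sizing guideline (the numbers are assumptions, not figures from the episode): the tile extents times the bytes per cell determine roughly how much data one slice pulls per attribute from object storage for each tile it touches.

```python
import numpy as np

cell_bytes = np.dtype(np.float32).itemsize   # one float32 attribute: 4 bytes per cell
tile_extent = (512, 512)                     # cells per tile along each dimension
tile_bytes = tile_extent[0] * tile_extent[1] * cell_bytes

# ~1 MiB fetched per attribute for every tile a slice overlaps.
print(f"{tile_bytes / 2**20:.1f} MiB per tile")
```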
[00:28:41] Unknown:
And for somebody who is going to start using TileDB both for the embedded and for the cloud use case, what does the overall workflow look like, and what are some of the benefits that you're seeing of unbundling the storage layer from the computation for being able to interface with that storage engine from multiple different libraries and run times?
[00:29:06] Unknown:
I would like to separate those 2 questions, if I may. So on the first 1, regarding what the workflow is, here it is. For TileDB embedded, it's very easy to install any of the integrations and any of the APIs you'd like. So that's the first thing you do. The second thing that you need to do is, depending on your use case, use a particular ingestor to ingest the data from the format that you have it in into TileDB. And this is what we're here to help with. We have created most of the ingestors. For example, for all geospatial formats, through our integration with GDAL, we do a translation to TileDB.
So you just use a GDAL command and that's it. You can store any geospatial format into TileDB. For genomics, we built our own. And for CSV files, we rely on the pandas CSV ingestor, for example. And the list of ingestors grows. So you need to ingest your data from whatever format you have it in into the TileDB format. And, again, you need to do it through some ingestor. But from that point onwards, you can use any of the APIs we expose directly from TileDB for direct access, and this is the fastest way you can interface with your data. Or you can just use SQL so you don't change your workloads whatsoever, or you use PDAL and GDAL in geospatial. And, again, you don't change your workloads at all. Or you use Spark in the same way that you would use Spark with Parquet. You can use Spark with TileDB.
And the same is true for Dask. So we're trying to incur as little friction as possible when it comes to using the data directly. And this is true for TileDB embedded. For TileDB Cloud, it is even easier. You can just sign up, sign in, and go. We host Jupyter notebooks there; with a single click, we can just spin up a Jupyter notebook, and we have all the dependencies. Everything is installed. Of course, in the future, we're gonna allow you to install anything you like, but it's a JupyterLab notebook. And we have tons of examples there, with example notebooks for multiple use cases, and you can start writing code immediately. You can start ingesting your data, or you can start working directly on public data that we have ingested for everybody on TileDB Cloud. And we will keep on adding datasets there. We will keep on adding notebooks there. So once again, the best way to learn is to go check out those notebooks, even download them if you like to work on them locally. But without installing anything, you just sign up, sign in, and go.
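For the CSV path mentioned above, a minimal sketch with the pandas-backed ingestor in the TileDB Python API; the file and array names are placeholders, and the keyword arguments accepted by `tiledb.from_csv` vary a bit by version.

```python
import tiledb

# Ingest a CSV into a TileDB array, then read it back as a pandas DataFrame.
tiledb.from_csv("trips_array", "yellow_tripdata.csv")

with tiledb.open("trips_array") as A:
    df = A.df[:]          # the .df indexer returns a pandas DataFrame slice
    print(df.head())
```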
Now that was the first question. The second question is about unplugging storage from the processing tools. And this is exactly what is gonna help me clarify a little bit the vision of TileDB. So the benefit for databases like MariaDB, for example, or PrestoDB, of unbundling, even Spark, even Dask, even computational frameworks, so it expands beyond the databases. The benefit of unbundling storage is that you can effectively separate storage from compute, and that allows you to scale storage and compute separately. This is 1 of the biggest benefits that I personally see. Right? For example, in the past, you had to pay licenses for enterprise grade databases based on the amount of data you store in the database.
Right? But that's not truly reasonable when it comes to genomics, where you hit petabytes of data, because the licenses are gonna become extraordinarily expensive. Then it depends on where you store the data. And if you don't store the data in a cloud object store, then, of course, you need to pay for that storage, and it is extremely expensive. And finally, you end up not using all the data at the same time 24/7. Of course, you do analysis frequently, but not scanning the whole terabyte, for example, 24/7. So why would you pay for the whole petabyte, or for compute for the whole petabyte, 24/7? So there are economic benefits from separating storage from compute. And now the question is, after you do that, what do you do? You need to store the data somewhere. So there has to be some kind of data format which can lie on an object store like AWS S3 or Google Cloud Storage or Azure Blob Storage. And then whenever I want, I can spin up a database server or I can spin up a serverless function, and I can access this data.
So the first benefit is economical. The second 1 has to do with interoperability. If you store the data in a format which is understood by multiple tools, you can do a SQL operation on the same data. But at the same time, you can spin up perhaps a Python user defined function or an R user defined function to do some statistical analysis on the same data, which is something that a database or at least the database that you're using could not do. So the second 1 has to do with flexibility and functionality. But the last thing that I want to mention is that if you just unplug storage from a database, it solves 1 of your problems or 2 of your problems, which is savings as well as interoperability and flexibility.
But you start introducing new problems, like data management problems. Okay, I stored my data in those files on S3. How do I impose access control on those? How do I impose access control in a way that, when I use SQL, these access policies are respected, and at the same time, when I don't use the SQL engine and I use something entirely different, through my Java API, or through Spark, or through Dask, I still get those access policies to be respected? And if those access policies are not file based, AWS S3 is not gonna help you. What if you have array semantics?
What if you want to define an access policy on a slice of your data? So what we did was exactly the opposite of what the databases did. A database unplugged the storage engine. We unplugged the compute. So we kept the storage. We kept the updates. We kept the versioning. We kept the access control. We kept the logging. The only thing that we unplugged was the processing, because we want you to be able to process the same data with a powerful SQL engine, and there are a lot out there, but also leverage the power of Spark, also leverage the power of Dask, also do something with a geospatial tool, or even write your own computational engine, without worrying about the data management hassles. So that's what we actually did differently to address this problem.
[00:35:59] Unknown:
Yeah. That's definitely the thing that stands out to me most about TileDB is, as you said, you still have a lot of the benefits that you get from a vertically integrated database as far as access control and versioning without having to go and reimplement that all on your own as you would if you were just using JSON files on s 3 or parquet files, where, as you said, you can manage access on the file level, but not on the per column level unless you have some other layer that everything has to go through. And so I'm curious if you can dig more into how TileDB itself is architected to be able to handle all of those additional benefits on top of just the raw bits and bytes storage?
[00:36:41] Unknown:
Yes. This is exactly where TileDB Cloud comes into the picture. So let's clarify again what you can do with each of the offerings. With embedded, you have a way to store any kind of data in a universal format as multidimensional arrays. The data versioning is built into this format. So, still in an embedded way and effectively serverless, you can take advantage of the versioning. You don't have to spin up a server to have serializable writes when you have concurrency. That's already handled. That's built into the format. That's how we architected TileDB embedded. So that's pushed down. So at least 1 of the data management aspects, which is handling updates and handling data versioning, is built into the format, and you get it, of course, for free. And you get it in the format so that you don't have to reinvent it for every single higher level application that you're using TileDB with.
So that's what you get from TileDB embedded. You get, again, the efficient storage in multidimensional arrays and the efficient slicing, compression, and all that nice stuff, the optimizations for the cloud, the parallelism, the integrations with all the tools that I mentioned, and, of course, the data versioning and the updates and all of that. You get that in an embedded way. You don't need to spin up anything, and this is not tied to any particular subset of the ecosystem. It's for the entire ecosystem. Now, if you want to do access control, especially at the scale that we're discussing, which is planet scale, you should be able to share any portion of your data set with anybody, anywhere, and with as many people as you like, even beyond your organization.
Right? This is exactly what TileDB Cloud was built to do, because that cannot be done in a completely decentralized way. There must be somebody who keeps a database with all the customers and all the access policies in order to be able to enforce them. And that's exactly what TileDB Cloud does. It enforces the access policies while keeping the rest of the code identical. Right? You have a SQL query. It's gonna work the same whether you're using TileDB Cloud or you're using TileDB embedded. But if you're using TileDB Cloud, then we know how to enforce any access policies that come along with that particular array. So that's how we built a universal access control layer, and that comes along also with logging. We log everything that is happening on your arrays or on somebody else's arrays.
And the reason why this is universal is because all the access policies are defined on this universal storage format. If we did not have a universal storage format, and we were an engine that supported Parquet and ORC and Zarr and HDF5, we would not be able to seamlessly define access policies in a single way and be able to scale access control to planet scale.
[00:39:42] Unknown:
And in terms of the evolution of the project, I'm curious what have been some of the ways that it has changed since you first began working on it and some of the assumptions that you had early on in the project that have had to be reconsidered as you started getting more people using TileDB and more different problem domains and technology stacks?
[00:40:03] Unknown:
Yeah. The original TileDB was just a research project. Right? There was a crazy dude writing some code and trying to convince people that this has a lot of value in all those domains. Right? The original designs remained more or less the same, and we lucked out in that respect. I'll give you an example. The original decision to work with immutable batches of writes, of written files, was an important architectural decision, because it allowed us, first, to do updates on sparse data, which are very, very difficult, because otherwise you would have to reorganize the whole dataset if you're just inserting data in random places. But most importantly, this object immutability is exactly what you want if you're working on an object store like S3 or Google Cloud Storage or Azure Blob Storage, because all those objects are immutable. Right? You cannot change just 4 bytes in a single file. You will have to rewrite the whole file. And that allowed us, of course, to become super optimized on the cloud. So that decision remained.
A lot of stuff in the core code got completely refactored, obviously, but not from an architectural point of view when it comes to the format. It's mostly the code, how optimized we made it. We made the protocol to S3 much less chatty, which allowed us to avoid certain latencies. So it was mostly around optimizations. But 1 of the biggest architectural decisions, or format decisions, that we made, which indeed was important to happen after we created the company and actually appeared only recently, a couple of months ago, with TileDB 2.0, was the feature that allows you to define any of your dimensions in a sparse array to have different data types.
I mean, in a traditional array definition, all dimensions probably have integral values. Right? It doesn't make sense to have a float coordinate, for example. Of course, we had supported float coordinates since the get go, but we wanted to make it possible for each of the dimensions to have a different data type if the user wants to, because that was the only way that we could capture data frames. Because for data frames, ideally, the user can choose any subset of the columns with any data types and say, this is my clustered index. Make sure that, despite the fact that those have different data types, the slicing is very, very fast on those dimensions. And that required a lot of refactoring.
And that's what TileDB 2.0 introduced. So that was an important technical refactoring that we did. And, of course, it starts to pay off massively because now we can handle generically any kind of data frame with duplicates and everything, stuff that the traditional array would just not be able to handle.
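To illustrate the heterogeneous-dimension feature introduced with TileDB 2.0, here is a hedged sketch of a sparse array whose two dimensions have different data types (a variable-length string symbol and a datetime), which is what makes the data-frame, clustered-index style of modeling possible; the names, domain bounds, and tile extents are illustrative.

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    # A string dimension and a datetime dimension: two dtypes in one domain.
    tiledb.Dim(name="symbol", domain=(None, None), tile=None, dtype="ascii"),
    tiledb.Dim(
        name="ts",
        domain=(np.datetime64("2000-01-01"), np.datetime64("2030-01-01")),
        tile=np.timedelta64(7, "D"),
        dtype="datetime64[s]",
    ),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    allows_duplicates=True,   # data frames may repeat index values
    attrs=[tiledb.Attr(name="price", dtype=np.float64)],
)
tiledb.Array.create("trades_array", schema)
```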
[00:42:59] Unknown:
And then another core element of the data format that we've mentioned a few times in passing is data versioning, which is particularly critical for things like machine learning workloads where you're doing a lot of different experimentation and generating different output datasets, and you need to be able to backtrack or figure out what version of code went with a particular set of data. So I'm wondering if you can dig a bit more into some of the versioning aspects of the file format and how it's implemented and some of the challenges that you're overcoming as far as being able to manage life cycle policies
[00:43:32] Unknown:
to handle things like cost optimization or garbage collection of old versions of data? This is 1 of the most powerful features in TileDB and the big differentiator from other formats as well. And, again, this is built into the format. So I don't know of any embedded storage engine that can do that. I mean, you can kind of do that with Parquet files, but you need to use something like Delta Lake on top in order to be able to pull it off. It's not the Parquet format itself that allows you to do versioning. You need to kind of hack it on top with a different piece of software in order to be able to do it. TileDB builds it into the format. Right? That's exactly how it is architected. But I would like to clarify a little bit what we mean by data versioning, so that people don't think that we have built some kind of Git for data. This is not exactly what TileDB is. Although, if there is enough interest, we may be able to build something like that. We do have the foundation for that. So what we mean by versioning is that when you perform a write, even parallel writes, it doesn't matter, this particular write is a batch write.
We usually tell users not to write 1 record or 1 value at a time. Just batch your values and then perform 1 write, because TileDB parallelizes everything. And it's very, very fast when it comes to batched writes. And each batched write creates a subdirectory, a timestamped subdirectory within the array directory. And all the files that pertain to that batch write are inside that subdirectory. So when you do multiple writes, and we timestamp every write, we give you the ability to time travel back. Right? To travel back in time and open the array in a state before some of the updates happened. For example, I do an update today, and I do 1 tomorrow, and 1 the day after. But then I feel that something is not right, and I wanna see what happened yesterday and what happened the day before. So we give you the ability to open the array at a particular timestamp and then get all the contents of the array, so you can issue any query, the same query if you want to, but see the state of the array as it was before the writes that happened after the timestamp that you provided.
And we have architected it in such a way that we provide excellent isolation. Right? Every fragment does not interfere with any other fragment. A fragment is the subdirectory. It's this batch. Right? So every fragment does not interfere with any other fragment. There's no locking. There's no central locking. No locking is needed, because the fragment name is unique across all the fragments. It carries a timestamp and a UUID, which is random. So serializability is guaranteed by default. So this is what we call data versioning. This is different from saying, okay, I'm gonna go back to a particular version, then I'm gonna fork it, which is something that you would do with Git. This is, again, doable, but we'd like to see more use cases in order to be able to build it. And I'm curious how this differs from things like Datomic as far as being able to handle the
[00:46:40] Unknown:
versioning of data across time and doing things like event sourcing so that you don't ever actually delete anything. You just mutate a record and keep the previous version so that you can be able to say, these are all the different changes that happened to a particular attribute. 1 of the canonical examples being you have a user who has an address, and they move to a new location. So the fact that they used to live at a particular point never ceases to be a fact. They just have a new fact as to their current location so that you can be able to go back through time and see what was the value at a particular point.
So, yeah, I'm just wondering if you can give a bit of comparison as to how the versioning in TileDB compares to something like Datomic for being able to handle the way that data is represented in a versioning capacity.
[00:47:37] Unknown:
Yeah. Data versioning in TileDB is more similar to what Delta Lake provides with Parquet files. And, of course, we don't have the same ACID guarantees that Delta Lake provides. This is a large topic, which we will discuss in future video tutorials. But what we do provide is write serializability without any kind of locking, everything serverless. With Delta Lake, you need to have a Spark cluster or a PrestoDB cluster in order for this to work. We don't need any cluster for this to work. And it's mostly batched writes, which can be done in parallel, and then you can open the array at any instant in time, ignoring all the updates that happened afterwards. We do not have any transactional semantics at the moment. That's not something we have optimized for up until now. And, also, at least in the embedded format, the TileDB embedded format, we don't keep logs of, you know, who accessed which attribute when. This is not the functionality you're gonna get. You get all the logs, very detailed logs, with TileDB Cloud, but that's not about data versioning. You just get tons of logs about pretty much everything that you have done, but we don't consider that part of the data versioning feature that we have. At least not today.
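A small sketch of the time-travel behavior described here, using the TileDB Python API; the array name is a placeholder, and the exact form of the `timestamp` argument to `tiledb.open` (milliseconds vs. datetime) varies slightly by version.

```python
import time
import numpy as np
import tiledb

uri = "versioned_example"  # hypothetical local array
dom = tiledb.Domain(tiledb.Dim(name="id", domain=(0, 99), tile=10, dtype=np.uint32))
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=[tiledb.Attr(name="val", dtype=np.float64)])
tiledb.Array.create(uri, schema)

# First batch write -> fragment 1.
with tiledb.open(uri, mode="w") as A:
    A[np.arange(5)] = {"val": np.ones(5)}
t_after_first = int(time.time() * 1000)   # milliseconds since the epoch
time.sleep(0.01)                           # make sure the next fragment is strictly later

# Second batch write -> fragment 2, overwriting the same cells.
with tiledb.open(uri, mode="w") as A:
    A[np.arange(5)] = {"val": np.full(5, 2.0)}

# Reading "now" sees the latest values; opening at the earlier timestamp
# ignores every fragment written after it.
with tiledb.open(uri) as A:
    print(A[:]["val"])                              # all 2.0
with tiledb.open(uri, timestamp=t_after_first) as A:
    print(A[:]["val"])                              # all 1.0
```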
[00:49:01] Unknown:
And then the other thing that I'm curious about is how you handle concurrency in access to the data and being able to resolve conflicts, particularly because of the fact that different batched writes will produce different versions of data. And so if you have people who read the data at the same time and then create different batched writes, how do you resolve those different updates?
[00:49:27] Unknown:
Now, we have architected TileDB in a way that can handle multiple such writers, and multiple interleaved readers as well, in the following manner. As I mentioned, every batch write creates a fragment, which is a subdirectory in the array directory, which does not interfere with anything else. And it will never collide, because the name is guaranteed to be different, because we have a random token in it. So only with negligible probability can you end up with a conflict there. So multiple writers can write at the same time, and there are gonna be no conflicts, no corruption whatsoever, even if 1 of the writes fails. If the write completes, we introduce another object, a special okay file, which says, okay, this subdirectory is good to go. And then we respect all the eventual consistency issues that, for example, S3 introduces. So it is architected to work with S3's eventual consistency, and therefore we inherit that model when it comes to consistency.
Okay? The reads do not conflict with the writes, because a read will never read a partially written fragment, and that's because of the absence of this okay file. If the reader, upon opening the array, doesn't see this okay object, it's going to completely ignore any partially written fragments. So this allows us to perform concurrent writes and reads without having a centralized service to manage any kind of conflict.
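And a hedged sketch of that lock-free model: several processes write to the same array concurrently, each producing its own uniquely named fragment, with no coordination service in the middle; the array name and data are illustrative.

```python
import multiprocessing as mp
import numpy as np
import tiledb

URI = "shared_sparse_example"  # hypothetical local array

def writer(offset: int) -> None:
    # One batch write per process; the resulting fragment gets a unique
    # timestamped name plus a random UUID, so no locking is needed.
    coords = np.arange(offset, offset + 1000)
    with tiledb.open(URI, mode="w") as A:
        A[coords] = {"val": np.random.rand(1000)}

if __name__ == "__main__":
    dom = tiledb.Domain(
        tiledb.Dim(name="id", domain=(0, 99_999), tile=1000, dtype=np.uint64)
    )
    schema = tiledb.ArraySchema(domain=dom, sparse=True,
                                attrs=[tiledb.Attr(name="val", dtype=np.float64)])
    tiledb.Array.create(URI, schema)

    # Four concurrent writers, no conflicts, no central lock.
    procs = [mp.Process(target=writer, args=(i * 1000,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # A reader only sees fully committed fragments (those with an OK marker).
    with tiledb.open(URI) as A:
        print(len(A[:]["val"]))   # 4000 cells once all writers finish
```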
[00:51:03] Unknown:
And then the other interesting element of this is the fact that the TileDB embedded project is open source and publicly available for free. And then you're also building a company around that and the cloud service on top of it. So I'm curious how you're managing governance and ongoing sustainability of the open source aspects of the project and the tensions of trying to be able to build a profitable business on top of that?
[00:51:32] Unknown:
TileDB Embedded is entirely open source, and we will maintain it as such. We do manage it as a team. We govern it. We welcome contributions from anybody. We're very happy to see contributions to it. We're very responsive, as you can see in the forums and the GitHub issues. And we will abide by this style. TileDB embedded and the integrations and the APIs are all going to be open source. The good news for us is that TileDB Cloud is completely orthogonal. It uses TileDB embedded. So all our servers that we spin up and that we do the serverless computations on, they all rely on TileDB embedded. We use the array format to define the access policies, the logs, and everything else. But all the TileDB Cloud functionality is completely orthogonal to what we do in TileDB Embedded. And that allows us to have a very clean separation of the 2, and this has not created problems for us so far.
[00:52:34] Unknown:
And as far as people who are using TileDB to build applications on top of it, what have you found to be some of the most interesting or unexpected or innovative ways that it's being used?
[00:52:45] Unknown:
We have seen very diverse applications of TileDB embedded, and most recently of TileDB Cloud as well. What I want to note mostly are the ones that I find admirable, because TileDB was used first in an important domain like genomics. Right? And some very high profile organizations trusted us to do that when we were just 4 people in the company. Right? And much earlier, when I was a single person in the lab at MIT trying to just create a very quick proof of concept. So I find this admirable because those are important use cases, and data management is a huge bottleneck for them. I mean, can you believe that data management is the bottleneck to the actual science? You cannot do analysis at scale, especially in genomics, where it is important to do it at scale.
And you cannot do it because you are blocked by data management. You're blocked by all those legacy formats. You are blocked by inefficient formats and inefficient data management in general. So this is what surprised me the most: not the fact that TileDB handled those cases. That was not what surprised me. But that certain people in certain high profile organizations trusted us to build this and improve it so that we could solve a very important problem in a very important domain.
[00:54:07] Unknown:
In terms of your own experience of building and growing the project and the business around TileDB, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:19] Unknown:
The most challenging part of building this piece of software is not the nearly 1,000,000 lines of code that I built. That's not it. That's the kind of easy part. The most difficult part is to start it from scratch and build a brilliant team around it. The most difficult part is to inspire some brilliant people to come and invest their time and put their passion into building this colossal vision. I mean, we are not delusional here, and this is something that I really want to stress strongly. We're not delusional. This is a tall order. This is a very bold vision. But this is what excites us the most. And the most challenging part is to convince the engineers to come and work with me, the people that are doing the marketing, the investors, of course, the consultants, even my managers at Intel and my colleagues at MIT, to even start this project. So that was the most challenging part.
And we are in good shape. I mean, we've been doing this for 3 years. We feel very confident about the team, very confident about the software. There is a long road ahead. But as long as we're excited and enthusiastic, I think the end result is going to reward everybody.
[00:55:38] Unknown:
And TileDB is an ambitious project, and you have a potentially huge scope of work to be done in terms of the core capabilities of the storage format, the cloud platform that you're building around it, and all the different integrations for all of the runtimes and compute interfaces. I'm curious, what are some of the features or capabilities that you're consciously deciding not to implement, or that you're deferring to other people to build out as part of your surrounding ecosystem?
[00:56:11] Unknown:
Great question. All the computational parts, as we explicitly state on the website as well. Right? We go for pluggable compute. But let me elaborate a little bit. The first thing that we don't wanna do is create another language to access the data. Right? We believe that would be catastrophic. People like to access data in so many different ways. They wanna access data directly through language APIs. They wanna access the data through already popular tools. So it would not be wise to just create our own thing and try to convince people to completely change the way they work every day.
So that's the first thing that I left out since day 1. And with that comes a query parser and all the technology that comes along with defining a new language. So definitely not a new language. The second thing is we're not building a new SQL engine. There are so many wonderful SQL engines out there. Our strategy is to partner with all those brilliant people that are building those engines. We can alleviate a lot of the storage problems that they're probably not interested in solving if they really want to work, for example, on query optimization. So we let those guys work on query optimization.
So we're not interested in building a SQL engine from scratch. What we are interested in doing in that respect is pushing down some of the compute primitives that the SQL engine could use. First, because if you push it down as close to the data as possible, it's probably gonna be faster, because you're gonna avoid certain copies of the data. We're doing a good job internally in the core to do everything multi-threaded, vectorized, and so on and so forth. So we are equipped with the knowledge and skills to do this efficiently. But also, most importantly, certain computational primitives you find in SQL engines exist in exactly the same form in other engines as well. A filter is a filter, even in PDAL or GDAL or Pandas.
It's the same. So why don't we push it down, do it very, very efficiently there, and then have all the other pieces of software that are plugged in on top utilize it? The same goes for group bys and merges. But we're not gonna rebuild a whole SQL engine, because a SQL engine is not just a couple of primitives put together. There is a lot of intelligence, a lot of sophistication there, and we are not there yet. We don't wanna do that. As I said, the original motivation was linear algebra, and we're very much interested in building all those distributed algorithms on TileDB Cloud. We focused so far mostly on the infrastructure: how do we create a serverless infrastructure to be able to dispatch any kind of user-defined function or task graph, so that eventually we ourselves, as well as other users, can build distributed algorithms, with linear algebra algorithms being part of those, on this infrastructure. So, again, we kind of delegated building those distributed algorithms to anybody who is equipped and capable and willing to build those algorithms.
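As one concrete instance of pushing a filter down into the storage engine rather than materializing everything and filtering in a client tool, here is a small sketch using TileDB-Py query conditions. The array layout and attribute name are made up, and the `cond` string syntax shown is an assumption that has varied across TileDB-Py releases, so consider it illustrative only.

```python
import numpy as np
import tiledb

uri = "filter_pushdown_example"  # hypothetical URI

# A 1-D sparse array with a single integer attribute "a".
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 999), tile=100, dtype=np.int64))
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[tiledb.Attr(name="a", dtype=np.int64)],
)
tiledb.Array.create(uri, schema)

# Populate the array with some values.
with tiledb.open(uri, mode="w") as A:
    coords = np.arange(1000, dtype=np.int64)
    A[coords] = {"a": coords % 7}

# The predicate "a > 4" is evaluated inside the engine, close to the data,
# so only the matching cells are returned to the caller rather than the
# full attribute.
with tiledb.open(uri, mode="r") as A:
    result = A.query(cond="a > 4")[:]
    print(result["a"][:10])
```

The same kind of pushdown is what an integrated SQL engine or a tool like Pandas can lean on, instead of re-implementing the filter on fully materialized data.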
[00:59:16] Unknown:
TileDB is definitely an interesting project and has a broad range of applications. But what are the cases when it's the wrong choice?
[00:59:25] Unknown:
Yeah. That's another great question. So TileDB is not a transactional database. Don't use TileDB as a transactional database. Theoretically, you can do transactions through MariaDB and its integration with TileDB, but that's not our thing. It's MariaDB's. The credit, if you do transactions, is gonna go to MariaDB. It's not going to us. We act as a data connector to that. If you want, for example, some of the ACID guarantees that you must have in order to be transactional through direct access from Python, you're not gonna get those today.
You can get some of those guarantees, and they can get you a long way for certain applications. But if you are a core transactional application, that's not something that you would use TileDB for, for sure. Another thing that you would not use TileDB for, or at least you wouldn't switch to TileDB for, is this: if you're using a data warehouse, if you're happy, if you're doing only SQL, if you don't care about interoperability, if you don't care that much about cloud storage and separating storage from compute, then you should probably stick to the data warehousing solution that you have, because these are not competitors to us. You would use TileDB even for data frames if you want to separate storage from compute, if you want to do user-defined functions in other languages, in any language, actually, because that's what we're trying to do. But not if you're sticking only to SQL. If you want SQL plus more, then TileDB is a great solution for that. And finally, we have not tested and we have not optimized for streaming scenarios, again, only because we didn't have use cases that demanded that. So you cannot consider us a core streaming solution. So for transactional and streaming use cases, I wouldn't consider TileDB.
[01:01:13] Unknown:
You've mentioned a few different things that you have planned for the future roadmap of TileDB. Are there any other aspects of the work that you're doing that you have planned for the upcoming releases that you wanna discuss, or any other aspects of TileDB and multidimensional storage that we didn't discuss that you'd like to cover before we close out the show?
[01:01:39] Unknown:
Yes. Absolutely. So the TileDB embedded engine we will always evolve. There are so many issues, even publicly on GitHub, that we're working hectically to get done, always on performance, always on added features. So TileDB embedded will always evolve, and we will always have several people working on TileDB embedded full time. But the biggest bet that we have, and the biggest investment of our time, is gonna go to cloud. So TileDB Cloud, again, allows you to share data and code with anybody, and it allows you to do everything serverlessly. And that's exactly what we want to focus our efforts on, because once you solve the storage issues, which we believe we did to a great extent, especially for the use cases that we work on, the next step is: how do we alleviate all the engineering hassles? Because, again, data scientists want to do scalable analysis. They want to get to insights very quickly, which can lead to scientific discoveries. That's what a scientist wants to do. Right? They don't want to spin up clusters. They don't wanna monitor clusters. They don't wanna debug clusters. So TileDB Cloud has the goal of alleviating all of this burden from the scientists that want to work with data at scale very, very easily.
So the plans for the future: double down on cloud. Tons of cool stuff are coming up. So stay tuned, and you're gonna see them in releases very, very soon.
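To give a feel for the serverless model being described, here is a rough sketch using the TileDB Cloud Python client. The token and the user-defined function are placeholders, and the exact entry points (`tiledb.cloud.login`, `tiledb.cloud.udf.exec`) are assumptions based on the client library, so check the current API before relying on them.

```python
import numpy as np
import tiledb.cloud

# Authenticate against TileDB Cloud (the token value is a placeholder).
tiledb.cloud.login(token="MY_API_TOKEN")

def column_mean(a: np.ndarray) -> float:
    # An arbitrary user-defined function. It executes server-side, so the
    # caller never provisions, monitors, or debugs a cluster.
    return float(np.mean(a))

# Dispatch the UDF serverlessly with its arguments; the result is returned
# to the client when the remote execution completes.
result = tiledb.cloud.udf.exec(column_mean, np.arange(100, dtype=np.float64))
print(result)
```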
[01:03:03] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:19] Unknown:
Yeah. There is a lot of brilliance and sophistication in data management today. That's not the problem that we saw at all. The biggest problem that we saw was that any data management solution out there, especially the very sophisticated ones, was architected around a single data type, for example tables, and a single query engine, for example SQL. If you use tables and SQL, there are tons of great solutions out there. But that was problematic, as I mentioned before, for other verticals. Right? So that was the biggest gap. The biggest gap was that there has not existed, so far, a system that can work on any data seamlessly, in a unified way; build all the data management features like access control and logging and updates and data versioning on a universal storage format that can capture all the data; and then interoperate with all the languages and all the tools out there, to give you the flexibility to operate on the same data without converting from one format to another. That system has never existed, and this is why we built TileDB as the universal data engine.
[01:04:32] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you're doing with TileDB. As I said, it's definitely a very interesting project and very forward-looking, and I'm interested to see where it goes in the future and some of the ways that the ecosystem grows around it. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day. Thank you very much. It's been a pleasure. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Project Overview
Interview with Stavros Papadopoulos
Stavros' Background and Involvement in Data Management
Overview of TileDB
Inspiration and Design of TileDB
Motivation Behind TileDB Cloud
Challenges in API Design for TileDB
On-Disk Format and Data Modeling in TileDB
Workflow and Benefits of TileDB
TileDB's Architecture and Data Management
Evolution and Assumptions of TileDB
Data Versioning in TileDB
Concurrency and Conflict Resolution
Open Source Governance and Business Sustainability
Innovative Uses of TileDB
Challenges in Building TileDB
Future Roadmap and Features
Closing Thoughts and Contact Information