Bring Vector Search And Storage To The Data Lake With Lance

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies

and anomalies in real time right at the source.

Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly.

Want to stop issues before they hit production?

Learn more at data engineering podcast.com/datafold

today. Your host is Tobias Macy, and today I'm interviewing Weston Pace about the Lance file and table format for column oriented vector storage. So, Weston, can you start by introducing yourself?

Hi. Nice to meet you. Yeah. So I'm Weston Pace. I'm a software engineer at Lance DB. I'm on the PMC for Arrow and Substrate. I've been doing data engineering and open source for a little while now, but only about the last 5 years.

And do you remember how you first got started working in data?

Yeah. So,

it's kind of a funny story that when I was in college, we had to pick what courses we wanted for our senior

year, and there's just this big syllabus of all these interesting,

computer science courses. And

I remember saying very distinctly at the time, it's like, I don't know what I will take. The only thing I know is I'm not taking databases.

Data seems like such a solved problem. This was back in 2010.

And,

so, after many years of app developments and and kind of unrelated things, I found myself working at a company that did test and measurement,

and they built these really expensive precise measurement devices.

And then they output a CSV file and, told the customers kind of have fun from there. And so we were working to build a a much better, like, data solution, and so then I ended up having to figure out, okay. Well, what is a data solution? What is a data lake? What are all these data things people are talking about? Started working with Py Aero a lot and then ended up working for Ursa Computing, kind of doing PyArrow and Arrow c plus plus maintenance full time. It's interesting

that you issued the database courses in your academic career, and now here you are writing the lowest level aspect of a database.

Yeah. Yeah. It's like not only am I working on a data lake or something like that, I'm working on data storage and file formats. And and,

again,

something that I thought was surely solved by this point. But

It it it's solved

to a certain extent.

There are always other solutions to it just like everything in software engineering. The answer is always it depends.

Yeah. That's that's a really good good way to look at it. It's like there are great solutions out there, and then there's always just a further niche, a further corner to explore. And,

as solutions get more and more advanced, you have to start looking into those things. Absolutely.

And so to that end, can you start by giving an overview about what Lance is and some of the story behind how it got started and how you got involved with it?

Sure. So Lei and Chung are cofounders.

Started it, this is before about 6 months to a year before

I'd really gotten involved. Chung was working at Tubi, and Lee was working at Cruise at the time, so they're both working with a lot of kind of multimedia data. They need to solve these vector search problems and just other kinds of search problems.

And and even outside of search, just working with image data and video data, there were not a lot of great data solutions.

And they were kind of building

things internally,

but also realizing, like, there's just a big hole here in the

in the database world of a good solid solution for

working with this data. And so they decided to start a company. And so now Lance is,

we've kind of been focused on

you know, we we do a lot of vector search type tasks. That is where we got our start, and now there's a big focus on just multimedia data management in general. So

kind of, a database that can handle vector embeddings. It can handle

compressed images. It can handle raw images or video. It can handle all your basic scalar TPCH style data as well. So

trying to find something that that handles the problems that a lot of AI engineers are kind of running headfirst into now. And as you were mentioning,

vector search, vector storage,

those are, again, solved problems

if you look at it in a certain way where there are vector databases,

there are databases that have vector indexes,

some of them have support for multimedia

storage, others are focused on being able to store embedding matrices

and text documents.

And

I'm curious, what are the core problems that Lance is designed to solve that aren't addressed by all of those other solutions?

Yeah. That's a good question. So there are,

as you pointed out, sort of special purpose

tools and libraries

for a lot of the different problems that need to be solved. And so

vector index is

a good place to start. So, like, vector search, we have

Face is the vector search library, essentially, and then Pinecone.

And a number of others have built on top of that or built their own sort of specialized vector search. And, initially, they were all in memory, very much just focused on just being this thin index

that you can

attach to your data engineering solution. The problem is,

it's not data engineers that are running into these problems. It's machine

kind of ML engineers, ML scientists, people who are trying to build models. They don't want

to build a giant data engineering

customized

solution. They're looking for a database.

And so when you have

something like,

an in memory vector index that maybe

they threw in oh, you can also attach

a metadata

object to each vector. It's very different than having a database where one of your columns is a vector embedding. You can create an index on that, or you can do all your normal database stuff on your other columns.

You can, you know, create an index on your

genres column and use that for filtering. Or a very common

solution that, I think people run into

is things like, okay. I wanna load all the images that match this filter because I need to do some kind of

training on that small dataset. And then further, maybe I only want 5% of those

matches,

and I want them in random order.

Or it'll be, okay. I start with a bunch of images, and now I'm gonna go and I'm gonna calculate

some embedding. Oh, wait. Now I want a slightly different embedding. Oh, wait. Now I want maybe a different embedding. But go ahead and store all of these because I'm not sure which one I wanna use. Maybe I won't actually know until query time. Maybe I wanna use all 3 and kind of pick the best. And so

these are all things that kind of having an all in one database that handles all of these different

use cases

is more convenient than having

a fairly complex custom built data engineering solution that's using a bunch of different tools.

And it's probably also worth

calling out the differentiation

between Lance DB, which is the I just have a in process or a serverless database engine for doing all of this vector storage from the Lance file and table format, which is the underlying

piece that powers LANDS DB, but is also usable

in independently

of that LANDS DB

interface. And I'm curious if you can maybe

give some thought as to what are what what is in scope for the LANDS file and table format, and what are the pieces that you are explicitly keeping out of scope for that because that is the domain of the Lance DB or whatever other client is interfacing with the file store?

Yeah. That is a good question. And,

one, we are we're still kind of clarifying and and better explaining because we do have these different layers, and we wanna make sure the distinction of what belongs in each one. So we have a file format at the lowest level, which is a storage format like parquet or kinda like the error IPC format or or just some way you can store tabular data.

And then we have a Lance table format, which again is gonna be very similar to some of the other table formats. So it's, like iceberg or Delta Lake. It's solving the same problems there. And then on top of that, Lance DB

is more of the application view. It provides the user with an interface that they're kind of expecting from a database. So you connect to it. You,

you know, insert things, you perform searches,

and that's about it. It hides a lot of the complexity

of really the lower level table format and file format.

So Lance,

I think the project in Python is PyLance,

is gonna be much lower level. It's tends to be more flexible. You can do more things with it. And so when you're building

certain integrations,

sometimes you have to drop into those lower level libraries. Like, if you wanna do a complex

multiprocess

bulk ingestion or something like that. Whereas, LanceDB is more of a, okay. Here's a straightforward API that doesn't take long to learn. Ingest a bunch of data, do some vector searches on it. You mentioned

Parquet as the analogous

file format, which has been very popular for data engineering,

especially for data lakes and lakehouse architectures.

In the read me, it mentions that it's very straightforward to convert from parquet to Lance and that there are some fairly substantial

speed benefits that you get. I'm wondering if you can talk to some of the motivation

behind that

compatibility support

and some of the cases where maybe you want to keep your data in parquet

for whatever use case and despite the fact that you're going to get speed improvements from using LA.

Yeah. I like Parquet. Parquet has done

a surprising number of things right, and so it's a great library, actually. That's kind of how I met Cheng and Lei. So

And I said, well, yes.

And they said, can I make it faster? And it's sort of like, well, there are quite a few things we can do

in the library to make this faster. It it hasn't been a priority yet.

At some point, you're gonna run into some limitations within the file format itself. And,

and I think we've learned since then

that both those things

continue to be true, which is people that build parquet libraries aren't often thinking about random access.

And so even in cases where the format could handle random access better,

the library

doesn't have the right APIs. It's not

being as efficient as it could be. And then when you get down to the format with Parquet, there's certain things that just make working with

this data in Parquet

tricky and and not quite as optimal

as it could be. And I talk about random access

a lot when I'm talking about Lance, and that's because

2 of our main

use cases in the multimedia

AI data world boil down to random access. So any kind of search problem is usually you're searching against something that

you can't turn into a nice primary clustered

time stamp style index where you can just sort of charge your data files

and sort by that that column. So, like, searching vector data,

there is no nice

sort algorithm

for vector data. So you can't

you can do, like, z order or Hilbert curves or something like that, but it's not really gonna solve the problem. And so when you do a vector search, you search this very small index file, and it tells you which rows you want. And now you have to go get those rows, and that's random access at that point more or less.

And, the other is when people are doing training, they want their data in random order, and so they need to have some kind of efficient random access. And and with Parquet,

you can really cheat. And I'm I'm rambling a little bit here, but, if you'll allow me. Especially with

cloud storage and with Parquet,

you can get away without having random access a lot of times when you're dealing with

TPCH style scalar data, some numbers, and small strings.

Because if you have,

you know, maybe a 1000000000 rows,

then you

have, like, if it's a 64 bit integer, less than 8 gigabytes of data. And cloud storage is gonna give you, you know, 4 or 5 gigabytes per second easily. And so you can search that entire column and just do random access within 2 seconds by loading the whole column and throwing away what you don't need. And and you can do much better than that with parquet. You don't have to pull down the whole column. You pull down the pages you need. The read amplification doesn't have to be too bad.

But once those columns are

vector indices, you hit this interesting

sort of sweet spot where a vector index, like a sample one, might be 4 kilobytes per

vector. Now suddenly, if you wanna pull down a 1,000,000,000 vectors, you're talking about 4 terabytes of data. That's not something you wanna search through exhaustively. And even when you talk about, like, page level random access, if you're saying, well, I can get you to within

10,000 rows.

We're still talking about now 10 megabytes

versus 4 kilobytes.

That's way too much. We need a much finer resolution to random access.

But at the same time, vectors are not nearly so big that you can get away

with just using, like, a blob storage or something. Just saying, grab this vector as a blob. Every single one gets its own file, or I'm gonna use some special format, and I'm gonna perform an IO operation per vector

for all of my columnar tasks. That doesn't work either. So you need columnar storage, but you also need good support for random access. And so that that's sort of the origins of,

the lance file format and and how we started moving away from parquet. In that random access question as well,

as you noted,

column oriented

formats are best for aggregates or scans of a column. But if you're doing random access of a vector, the chance is that you also wanna get one of the attributes of that same row in order to be able to understand what is this vector even telling me because it's just a bunch of floating point numbers in a matrix, but what I actually want is the text that it represents

or the image that it represents. And so from that perspective, I'm wondering,

what are some of the ways that you're addressing that row

attribute access while still being able to take advantage of the columnar compression that you get? That's a good question. And what we've got today, and that works pretty well, is we stick with columnar. But for each column, we can grab the values in really, like, one IO operation. If it's like a string or some kind of variable size column, you need to. And

so then the math becomes pretty straightforward. If you say you're taking 5 columns and you need 10 random rows, you're gonna end up with, like, 50 to a 100 IO operations.

And you said, well, if I store those in row based storage, I could drop that down, you know, 5 times less than what I had otherwise.

And that's

true, but we haven't usually needed to go that far.

And

in the cases where we do need to go that far, something we're working on right now

is the idea of it's called basically the packed struct,

encoding. And it's saying,

take these 3 columns or maybe these 4 columns,

and I always need to access

all of these or none of these. And so we'll go ahead and just pack them into 1 column. And so you end up with kind of a dial that you can turn you can turn it all the way to 0, which means every field is its own column. And you can turn it all the way to 1, which means that you now have a row major format, and you can turn it anywhere in between. So you can kind of fine tune to what your patterns are. So, like, we have, you know, some customers we found

where they have a very, very consistent query pattern. I'm always going to want these 5 columns, and I'm gonna be querying thousands of times per second.

Well, then it makes sense to go ahead and pack those columns together. But most of our customers are in sort of the okay. I don't have 1,000

per second queries. I really need to be able to do a lot of data exploration and filtering and things like that. So there are times where I just want individual column access.

I just need random access to not be too slow

when it comes time for it. And, what we found is columnar is not too bad for that. It works in most of those cases. And when it's not, you just turn it down.

As far as data bottling,

the

most notable addition that Lance adds over and above parquet or other file formats is a native vector representation

and vector indexing.

I'm wondering if there are any other features or constraints that engineers need to be thinking about when they're modeling

their data embeddings or modeling their overarching data

where maybe they use LAANC for all of the vectors and

the mapping of what the vectors are presenting, and maybe they use parquet for all of their standard

tabular SQL data or, you know, any

variation along that axis?

Well, so

luckily, the answer is

things work together well, these days.

So I've really loved the work of, the Arrow project. I worked with Arrow for quite a while. And one of the one of our other kind of core

file table developers will

is also comes from having worked on Arrow for a while. Lance is

fully interoperable with the Arrow

type system. So

all arrow types can be stored in lance and retrieved that way, which means we more or less have full interoperability with parquet. So even our vectors, if you wanted to convert them into arrow and then to parquet somewhere else, You can do that with the fixed size list data type. One thing we found as we started using this fixed size list data type a lot more is that, there are a number of libraries that had just gaps for fixed size list because they they added it, but it was kind of an afterthought. And,

so we've been not so much anymore, but just like a year ago, we had to fix a number of small things where fixed size list just wasn't supported. There was no equality function for it, or there was no hash function for it, or you couldn't use it in this spot. And so we helped to round out that

support.

So

for the most part, you don't have to worry. It's

pyrotypes.

Everything kind of interoperates.

There are, like, minor thoughts when data modeling. You know, it's nice if your vector embedding is a multiple of

16, because then that makes the SIMD algorithms that we run on it work nicely, but that generally hasn't been a problem. Every one that people have come up with, works well. Yeah. So we try to be

flexible

and

robust so that your data modeling decisions

shouldn't have major impacts on your performance. One of the key considerations around data modeling in the context of vectors that I've seen for some of the other vector engines is that depending on how they approach the implementation, there may be a hard limit on the maximum number of dimensions that any one vector can have for various reasons. I either because of purpose of space constraint or

because of the computational complexity of maintaining the indices, etcetera. And I'm wondering

where do you fall on that side both for the case of Lance, the file and table format, and Lance, the database interface?

It's a trade off. Smaller is always better for performance.

So if you're able to do like, SIFT vectors are, like, a 128

8 bit vectors.

Great for performance. You can train a index on SIFT vectors

very quickly and easily. We see a lot more often now. It's like a 1024

or 4 byte vectors. Those are just going to be a little slower because there's more data to process, and it's a trade off. You get better accuracy versus better performance.

As far as limits go, once you hit 1 megabyte per vector, I would say that things are just going to be really slow, but go for it. We do we don't actually have a hard cap beyond cannot

go larger than 2 gigabytes per vector.

I I think you're solving a different problem by the time you get there, and that's you might actually be able to get away with it. Now there's probably I think it's a 32 bit size

in arrow for the dimension of the vector.

But, otherwise,

I don't think we have any caps in there.

We

Lance

as

a file format and table format is,

it doesn't stop you from shooting yourself in the foot, maybe as often as it could. NinesDB will try a little harder, but I don't think we've run into needing to add a a vector cap. Yep. Yeah. It it the vector index itself is interesting too because it's basically a lossy compressed version of the vectors that you pass in. And so you end up with just this hierarchy of, compression,

which is like your your original document or image or whatever.

Conceivably, if you're able to compare those, you would get the best accuracy. But it's just too much data to compare in a meaningful way. So you have these embeddings, which are effectively

compression, lossy compression of the images. And then you put them in a vector index, which just kinda compresses them even further.

And then you get to play this game of

how much compression is too much compression before my recall starts to fall off a cliff. For children of the nineties, how many times can you copy that CD before you stop being able to hear it anymore?

Yeah. Exactly.

And in terms

of

the modeling from the file size and trunking perspective,

what are the

considerations

that people need to be aware of when they're thinking about, I'm going to be using this in some sort of cloud store. I'm going to be doing

scans and searches, whether it's from Lance DB or I know that the Trino engine is working on adding support for Lance

and just some of the ways to think about the balance of IO performance against

the, you know, the the network retrieval times

and the

searchability

of those chunks where you don't want to have too small too many small files, but you also don't wanna have everything in one big file or just some of the rules of thumb or the heuristics to go with for figuring out how to actually segment those files?

So I used to work with Para datasets and Parquet, and there would be a lot of

those sorts of

configuration things.

One of our goals with Lance

was really to get rid of those as much as possible because,

we found

when you

add these,

wide data types, it either becomes hard hard to figure out what the right setting is or just downright impossible. There is no good setting. So with Lance, we got rid of row groups,

which I told my friend at the time. If getting rid of row groups is, like, the only thing I do in my life, then I will consider it a success. So, hopefully, that can eventually propagate back to parquet and other file formats. And

so what's left is we need page size, you know, how much,

data do you buffer before you write to the disk. But 8 megabytes is, like works well for everything. That's big enough

that you're gonna have good s 3 performance,

and it's

not

so big that you use up too much memory. So, like, 8 megabytes per column is usually a reasonable amount of memory for a writer to use. So I think the only one that we might play with still is how many rows you have per file. And I think the minimum is what we use today, which is a million. You really shouldn't be going below a million because then some of your columns are gonna be just too short. There might be some cases where you wanna go above a1000000,

but we found

for

for pretty much all of our customers and use cases, the defaults work fine.

So that I was very happy to be able to achieve. And now digging into the table format, you mentioned that it's analogous to the iceberg table format that's become very popular for these cloud data lake houses.

And I'm curious if you can talk to some of the design considerations

around the LAANC table format,

juxtapose it with maybe iceberg or hoodie or delta, whichever table format you're most well versed in, and some of the ways that they are disjoint and the ways that they can interoperate.

So those formats, we would like to be able

to support there's sort of 2 questions. There's, like, one is, being able at some point to do vector search on those existing formats.

Another is being able to get those formats to use Lance files. So doing vector search on those formats, there's been some there was some interesting research done recently by a friend of mine, Tony. He kind of, built this system called Rocknest, where he was able to build vector indexes against parquet and search iceberg,

data lakes. And there's a few challenges

that you run into. 1 is that parquet doesn't do great on that random access problem. 2 is that the table formats don't really have a standard for where you put those index files, and the readers aren't designed to look for them. And and then 3 is

one of the challenges that

we inspired us to do our own table format is this idea of the row ID

and being able to because in your index, what you essentially have

is a usually a column or columns of searchable stuff, and then you have a column of row IDs. And so you search the small searchable stuff to find your matches, and it tells you what your row IDs are. The problem is you can use

what we call row addresses, which is like

the 5th file

and the 17th row. That is something you can use with iceberg,

but

that address is going to change when you run compaction. It's going to change when you delete rows or update a row,

and you need to then be able to reflect that change somehow in your index.

So either you're updating your index when you're doing compaction, which is how the Lance table format works. Although we're trying to move away to something a little bit more sophisticated where you have kind of this primary index structure, which allows you to resolve those addresses

from IDs.

And, so that's that's one of the challenges.

And then

as far as using Lance within Iceberg goes, it's something we would definitely

encourage. It's not something we're

actively

working on right now, but I don't know any reason that wouldn't work. So I think Iceberg does have some placeholders in there for different storage formats.

And, again, you would run into, but I think you could solve some of these challenges where, like, libraries just aren't written with random access in mind. Like, when we built our table format, we thought, oh, we do have this random access. So then we looked at an operation like an up cert operation. And what you're doing there is you're taking a

table of new data and all your existing data, and you're kind of doing this join to figure out which rows you're replacing

and which rows you're inserting new rows. And if that join is happening on a column that you have an index for and you have random access, then there's actually ways you can do that join more efficiently. So those types of tricks aren't necessarily gonna be built in yet, but I think they could eventually

reach it and take advantage of those as well. So I I think we would like to see more integration there someday. I don't know that we're 100%

ready yet or have a ton of time to work on it yet, but that's something that we always kind of keep in the back of our mind of making sure that we don't wanna

completely alienate those formats and that we're able to integrate with them. And, also, one thing we wanna do a lot with the LAANC file format and LAS table format both is put these out there as examples of saying, hey. Here are ways that you can support these

operations.

You know, if you're gonna do an update to some of these existing formats, take a look at what we've done and see if you can integrate some of those features because that would make it much easier for us to support those kinds of data in the future. On the point of indexing,

I know that in the context

of AI engineering where you're building some sort of a rag system and you have an embedding model. Anytime you change that embedding model, you have to completely rebuild the entire dataset because the embeddings are no longer compatible.

I also know that there are other cases where you can't necessarily incrementally update an index and instead need to completely rebuild it. And I'm curious how you're starting to think about that problem space in the context of Lance as far as how to allow for

evolution of schema, maybe have some means of compatibility

or updating of existing embeddings so that you don't have to rewrite all data everywhere,

just some of the challenges of in that space? Yeah. I'm glad you asked,

because

I was thinking about Iceberg and Delta Lake. One of the biggest differences with the LanceDB format is we absolutely needed to be able to add a column

without rewriting

existing data. So if you think

a lot of data lake models, you'll have this

feature pipeline, which is sort of you start with some ingestion of raw data. You run through your pipeline. You add some different columns to it, and it writes out your dataset. And then in that sort of environment, you add a new column. A lot of times, what will happen is they'll just rerun the whole pipeline, rewrite the whole dataset. It's 100 gigabytes data. Who cares? When you have already an embedding vector and maybe a 1000000000 rows, you've got in that embedding column now,

4 gigabytes of data. Or no. Sorry. Four terabytes of data. And so if you wanna add what we'll see a lot of people wanna add a second embedding. They don't wanna drop the first embedding because they wanna be able to do compare and contrast and that sort of thing. Well, if adding a column requires rewriting 4 terabytes of data in addition to the 4 terabytes of data you already have to write, then it's just too expensive. So the Lance table format, our our schema evolution, we essentially, what we do is we support multiple files

per fragment where each file's

adding different columns, which is something that Iceberg and Delta don't have yet. And that lets us to do our our we can add columns essentially for for free. You have to write the new column, but you don't have to rewrite any existing data. I can definitely see how that would be very valuable, like you said, for compare and contrast and experimentation, but also for purposes

of when you have a situation where you say this embedding model is better than the one I had before, and I need to go and regenerate all of these embeddings. But I already have the actual

core data that I'm representing

in

the Lance files. So I don't have to rewrite all of those. I just have to update the embeddings. So that's definitely a very useful approach to that, and I can also see it being useful just from a generalized schema evolution perspective of, hey. I wanna add this new column, and I need to backfill

the default values, which is something that you might do in a typical

application database, but that's because you're only dealing with

100, maybe 1,000 or low millions of rows with where you're dealing with, you know, megabytes to low gigabytes?

Yeah. Yeah. And like you said, we've had customers that have come to us and said, I don't care about vector search, but the fact that you can add new columns into that schema evolution cheaply and the fact that I can do my training,

against this data, that's all I need. So that's that's a main one of the main use cases that we wanna keep supporting. And, also, I can imagine that even if you're not going to completely replace an embedding model, maybe you have a different vector representation that you need for a different use case. Maybe it's a lower dimensionality or a higher dimensionality vector where you need either greater accuracy or greater speed, but you also still wanna have the default embedding for a different application.

It supports those use cases as well. Yep. There's a few use cases. 1, like you said, where you're making a trade off between 2 embedding columns.

There's also

like, you will have sometimes

when we're dealing with string, for example, we might create an embedding,

which is like a semantic embedding to allow semantic search. But we also want the original string column itself because we wanna be able to do full text search, and then we need to write that full text index, which has its own

copy of the data. And and so having those all there so that you can use sometimes, you even wanna use both of them, and that can be true even for semantic search. Sometimes you'll have 2 semantic models, and you can search with both semantic models and then rerank the results using some of these cool re ranking algorithms.

So,

yeah, we've had one of our

our biggest,

sort of customers in terms of, like, how fast they hit the system has several different embedding models, and and that was one of the key features they liked. And in that context of building

AI applications or retrieval augmented generation systems, it also brings up the question of latency because the LLMs

on their own add enough latency to the system, so you wanna try and speed up all the other pieces so that the end user experiences

as fast as possible. And I'm curious

as to some of the ways that you're thinking about the speed of

Lance as a file format and table format and some of the different

operational environments

that will benefit that latency trade off where maybe you don't want it all running in s 3 because then everything is a network hop. Maybe you want it all running on disk somewhere or just some of the different deployment considerations

about how and where to store that data particularly,

and then also being able to age out data where this data is old. I still want access to it, but that goes in s 3. This other data here is running on SSDs so I can get faster latency.

Yeah. One of the

biggest,

challenges that we faced initially, and it took us a while to gain confidence that we were overcoming this and we're we're kind of happy we are now, is we came out with a vector search product that was built against

storage.

And at the time, all of the vector search products were built against memory. And so we had to prove, you know, can I still get latency

that matches

Pinecone running against memory or my custom face thing that's running against memory? And

we're not magical.

So against pure s 3, you're gonna have limitations.

But what we're seeing now is you have s 3 Express popping up. You have I'm Juice and Weka, these sort of s three caching layers that are popping up. And then in our enterprise deployments, we actually have our own sort of,

caching solution that'll do NVMe cache on top of s 3. And once you get there, then 2 things happen. Is 1, you're

absolutely fast enough you can kind of match those latencies.

Your your IO

is now

kind of on par with your compute. You know, vector index searches are actually pretty compute expensive as well. And so we reached the point where, you know, IO is not necessarily the bottleneck for latency anymore. And 2 is you hit the point where,

again, as I've been kinda saying, a lot of those assumptions that were baked into parquet and cloud storage of, well, it's cheaper to pull down the whole column than it is to do some random access and grab the bits I need. They start to fall apart as well once you have these caching layers and these really fast IO layers in place. So, yeah, latency is

critical in

a search application. But by choosing just

storage in general, we haven't we haven't really,

written it off. And what we've seen now

that we really like to see is,

Pinecone has said, oh, you know what? Maybe serverless and running against, storage as the final back end is not such a bad idea, and so they're trying to adopt,

our model as well.

You have turbo popper popping up, which is another vector search company that's also built against, cloud storage with caching layers.

And so we're we're kinda feeling pretty confident that that is the way things are moving going forwards,

which is you can

build your product against cloud storage,

and you're not gonna get great latency, but make everything work. And then these caching layers can add the latency and get you competitive with something that's running just in memory. Another interesting aspect of the table specification

is that it has built in specification for feature flags to enable

various elements of experimentation.

And I know that you also said that you have some extensibility

mechanisms in the file format as well. I'm wondering if you can talk to some of the decision process for adding that feature flagging, some of the use cases that it enables, and how it has aided in enabling some of that experimentation

and rapid development for a format that is targeting such a fast moving ecosystem.

And we have had

to make sure that we can evolve quickly. And,

at the moment, we are open source. We have the benefit of you know, we're not not multi vendor open source at the moment, so

we can

kind of move quickly.

And so that's been helpful. We've we've evolved quite a bit since our original implementation. So we were able to prototype

with the, feature flags some things like, a stable row ID. I kind of alluded to that before, but that was something that we were able to change. And just having a spot to mark, oh, I'm doing something a little differently here, make sure that you can support it, has been great. Another one that I I think I have to give Will credit for this one is one of the first things that you do when you connect to a LAN's dataset is you wanna get the latest version. And so you have to if it's stored in cloud storage, you have to do, basically a list bucket command to get all the versions, and then you find the latest one. Well, turns out all those buckets return data

in reverse alphabetical order. And so one of our flags that we just recently added is, hey. I'm gonna

start naming my versions at 999-9999-9999

and working downward instead of starting at 0 and working upward. And as a result, the latest version is always in the first.

We don't have to page through the list bucket command, and we can find our latest version much quicker that way.

So

but I I think extensibility needs to be baked into everything. Like, with the file format

we

put

in our pages, you have to say, well, what encoding am I using to encode the data for this page, and what compression am I using? And I said, just make it a protobuf any object.

And,

at the time, people said, well, that's

now you're adding this extra layer of, like, decode, and then you're putting in if checks, and it's gonna slow things down because you have it seems like the standard at the time was, oh, here's an.

And if you want a new encoding, you change the spec to add support for new encoding, and you reserve yourself a new slot on the enum.

It's like, well, one of the nice things that we did in the file format is pages are always 8 megabytes large, which is, in the grand scheme of things, huge when you're looking at, like, rows and and how much data you have. So there aren't that many pages in the lance file, and that lets us be very flexible

with our encoding. So now you can add we're still trying to get the SDK out there and things like that for it, but we've got support for basically being able to add a new encoding without making any change to the code at all. You can just take a plug in, throw that on there, and now you have a new compression technique. And and because what we're seeing is compression techniques, again, something that seemed relatively solved, well, now people are talking about fast lanes and FSST and ALP and all these interesting compression techniques. And I think we start to see people looking at these LLMs, and now you suddenly start to have new domains

and new compression techniques. Like, people will say, oh, every row is a source code document because I'm building the next Copilot.

And it's like, alright. Well, maybe if we take advantage of that fact, you know, we can build, like, a column wide

dictionary that knows all of the tokens in your source code. So, like, if and and and while and switch,

and maybe that can help compression. I'm not actually, you know, an expert in designing these. Maybe this is way off base, but we wanted to be flexible for that kind of thing. In terms of the

engineering

effort to build this format,

build the table specification,

and some of the design considerations

that you've gone through as you have

developed and evolved the formats.

I'm curious if you can talk to some of the most challenging

or most impactful decisions that you've made. Yeah. So I talked a little bit a while ago about removing row groups. And for me, I think that was the most it was impactful. I

I felt that it it could be done,

but I also needed to be proved, and I needed to come up with a clever way of

doing scheduling so that you can still get parallelism

even if you don't have row groups. And so

the first solution I came up with was really bad. In terms of it, it was just way too complex, and it's actually still the one that's in use today.

We're working on 2 dot 1, which will kind of make it simpler.

But,

there was, like, a good month, I think, where it was kind of like, can I really get this? You have these concepts of running out of memory because I read in too much IO. I didn't have enough back pressure. And then there's also deadlock of, I read in the wrong IO, and I can't use it.

And I can't read in more IO because my my buffer is full.

And then performance of making sure all of this can run quickly enough.

And so that was

the biggest just engineering design challenge

that I've had to go through in the last, I think, 6 months. Fortunately,

it all worked out.

So we have this nice 2 gigabyte IO buffer,

and

we apply back pressure when it fills up, and we were able to get the priority scheduling in there. And it will be all much, much simpler in a in a future version.

And as far as the access interface, we've touched on that a little bit where it has support in Py Aero. You can use DuckDb for querying. I'm curious if you can talk to some of the other work that's happening in the ecosystem that you're building of

different

engines or interfaces or access methods for integrating lands, particularly

as people are investing more in this vector store and vector index space because of the rapid growth of AI and AI based applications?

Yeah. We love kind of the

the decentralized database

and the, integrations that are popping up around that. Query engines, we use Data Fusion

pretty extensively

internally, and then we also use DuckDb.

And so the the way that we are able to get DuckDb to work is, I think,

a,

a a good testament to the whole decentralized database concept is,

DuckDb is able to query pyro datasets.

And so LAANC is able to look like a pyro dataset. And so then we can feed that to DuckDb. So DuckDb never actually wrote a LAANC integration. We didn't have to write a DuckDb integration. But because we look like a pyro dataset, duckdb can query us. Although that does mean that duckdb sends the push down filters in as pyro compute expressions. Well, pyro

is able to serialize its compute expressions to substrate, and data fusion is able to parse

substrate now into filter expressions.

So we

have duckdb

querying

LAANC because it thinks it's py arrow,

writing push down filters, which get turned into substrate, which get turned into data fusion. And then we use data fusion to actually

well, then we search the LAANC files. We give the data to Data Fusion, which actually applies the filters, and then we give it all back to DuckDV, and, it all works. You can query LAANC like an SQL database with that. It's very good for just, like, data exploration and playing around. Those integrations are fun, and and having worked on the decentralized database concepts for a while now, it is sort of fun to see these moments where it all comes together.

And now we're also working with AI, and we're starting to see all these integrations that I,

having been in data engineering, was just not really aware of. So you have things like Ray Data

and Torch Data Loaders, which

any AI person will tell you, oh, of course, I use those all the time. And and, you know, coming from Pyropandas world, I just wasn't all that aware of those concepts. So we have those integrations as well. Ray and Torch are popular. We've

had a lot of, community support actually around building up our

Java,

integration so that we can have Spark against Lance. And then there's these various just very AI specific integrations like Ollama and things like that where, from their perspective, they're they're not aware of anything data engineering related. Theirs is just a vector database, and Lance is the one that wants you. So

As far as the applications

of the Lance file and table format

and some of the ways that you're seeing it applied, I'm wondering if you can talk to some of the most interesting or innovative or unexpected applications.

I think it's fun to see ways

that vector storage can be useful. So, like, we've had people that use vector

search for,

deduplication,

duplicate elimination, which, again, for maybe an ML person, it's kind of obvious that those two things are related. But for me, it's like, oh, those are the same problem. Oh, okay. I guess they are. I think most unique

I won't get

political here and take sides on this, but someone did share

that they used,

Lance to index

the entire

project 2025

documentation

so that it would become searchable for anyone that wanted to learn more about it. They could, just do semantic search against it. And,

that was not something I expected to hear that we were getting used for.

So it's definitely fit into the unique category. And then,

the other, I think, is just seeing people that are

using this for,

well,

so one of the issues that we're working on right now is being able to store,

these really big blob type columns. And when I talk about blob here, I'm talking, like, 4 to 8 megabytes or more per value. So these people are putting in videos or something like that. And so we say, well, we really need these columns to be in their

own category because we can we gotta run compaction on them differently. Because if we

if we take 25 100000 row files and come back them into a 1,000,000 row file

and one of those rows is a 100 megabyte videos, then then you're talking about something huge. And so the reason I bring this up as an interesting, use is because the way we solve this is

we're having

lance dataset for your all your data. And then if you have blob columns, there's a second lance dataset inside there that's just configured differently with different compaction thresholds. That's our blob data storage. So lance itself uses lance.

For lance. That's funny. Turtles all the way down.

Yeah. Going back to the extensibility

point, both the table and file format because of the fact that you're dealing with vectors and there's so much investment in that space. I know that HNSW

is one of the more popular index formats for vectors, and I'm wondering what are some of the ways that you're thinking about the future expansion

or support for

new and I don't necessarily

say better, but different

index types for different use cases and some of the ways that,

the work that you're doing enables that and maybe some of the ways that people can experiment with those different index types without necessarily having to round trip it through getting it upstreamed into Lance?

We are

not 100% of the way there that you can experiment without upstreaming,

like, yet without diving into the source code. But it's sort of like, we are on, I think, our 3rd pass now through indexing,

like what abstractions we wanna use internally to represent these vector indexes.

And in each pass, we're sort of narrowing down what those traits look like, what what we need. And the reason we've done so many passes is because internally, we've been having to experiment and play around with these different,

solutions. We had an HNSW solution. I think we still do. We have IVFPQ.

So we're trying to come up with some good abstractions to to finalize that. So I I think we will get to a point where those abstractions

are solid enough that you can turn it into a plug in at that point, an extension point. Today, if you wanted to, you'd have to kinda dive in and get a little bit more

involved. But it's a fairly isolated part of the code base. So if someone wanted to do that and they shot us a message, we could get them pointed in the right direction. Scalar indices are actually a little bit easier. That abstraction's been pretty solid for a while. So we have been talking with some of the GeoJSON

people, and I think at some point, we might be able to do some

integration there to see if we can build, like, a, I don't know what you call them, but, like, a quadtree geojson type, index to search geometry data, which would be pretty cool. And so there, like, for the scalar index, we have

2 traits, really. 1 is, here's a data fusion expression.

Do you think this is something that you can search your index with? So you kinda have to parse that query.

And then 2 is, okay.

You parse the query into x. Now go tell me which rows match that. And in your experience

of working on lands, working in this space of dealing with vectors and matrices as a storage layer, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? Yeah. I think cloud storage is always a lesson that I have never stopped learning. I joke with people that working with cloud storage, I very much feel less like

a engineer and more like a physicist, where I have this unknown system, and I just run experiments against it, and I start to build up this model. And every now and then, I'll run an experiment, and it'll just shatter my model, and I'll have to readjust and come up with a new hypothesis.

And, but I have a I have a pretty good working view at this point of,

how cloud storage works, but it is constantly being just sort of, like, reinvented. It has gotten to the point now where I I will,

find talks from, like, s three people that went and talked at at different conventions and try and glean information out of some of the random things that they say about, like, how they, scale up their random access support. And for them, it's really IOPS per second support. So that that's always a challenge.

I've learned a lot about Parquet. I'm pretty sure I could write a Parquet implementation now, so that's been fun. I I think this is more for me personally working at Lance. We've been able to work a lot with customers who are in this AI space, building these cool new AI things. And just it's I guess this is US for challenges. This isn't really a challenge as much as it's just a fun thing. Is it working with these customers and getting to see what they're building and and turn that into realistic data engineering requirements has been a lot of fun. And for people who are in the position of saying, I have a bunch of vectors that I need to store and work with, what are the cases where Lance is the wrong choice? So I'll translate wrong choice into, like, we're not there yet, and we don't plan to get there soon.

No.

So definitely,

if you're doing, like, OLTP,

Lance is not your application transactional database.

We don't have support for multi document transactions, and we probably won't, soon. The other is,

if you're starting to do, like, TPCH,

TPCDS style queries against trillions of rows where you really want, like, a distributed query engine,

we're probably not gonna be tackling the distributed query engine, TPH

CH scale factor 10,000 or whatever anytime soon.

So we're more focused on the scale you can reach when you have images, which is surprising.

But, again, I you know, 50,000,000,000, a 100,000,000,000, those are some of, like, our biggest

targets there. And just, like, getting specific, like, if you're working with the LAANC file format or something versus Parquet, one thing Parquet does really well that we just don't have the resources to tackle yet is handling untrusted input. So you could be pretty confident if you someone gives you a random file and you open it with parqcpp,

then your worst thing's gonna happen is you'll get an exception. Or, you know, if you open that with Lance, then you might have a panic and crash your whole process. So we haven't you know, if you're dealing with untrusted input or if you're dealing with,

yeah, just files that you could have random garbage introduced, we haven't gotten there yet either. So I don't think we'll necessarily tackle some of these things. But if someone really was passionate about LAANC and wanted to tackle one of these things, we would definitely help them go there. And as you continue to build and iterate on and evolve the LAANC format for file and table store of vectors, what are some of the things you have planned for the near to medium term or any particular problems or problem areas you're excited to explore? Yeah. We're working on a 2 dot one lance file format, which is not dramatically different, but,

it solves another problem that I really wanted to get to at some point, which is being able to do random access in either 1 or 2 IOPS no matter what data type you have. So you could have 10 levels of list and struct and direction, and it still boils down to 1 or 2 IOPS. So that's that's fine. So better random access in general, we're always working on. Another kind of big new area that we're looking at is these training workflows. I mentioned when people need to access

in random order some filtered set of their data so that they can feed, their GPUs to train a new model. That's,

something that a lot of users we've run into have encountered. And it's something we do okay, but there's definitely room for improvement in in how we solve that. It's sort of a perfect storm of terrible access patterns.

And I think we've come up with some pretty good ideas for how we can do that efficiently. So that's coming down the hood. The other thing we've been working on is looking at just kind of database management, making it easier to scale up and grow your database,

taking some of the tasks that you have to do manually today, like running compaction and re updating your indices,

and starting to make those things automatic and a little bit more foolproof.

Are there any other aspects of the Lance

project,

the format that you're building, or the applications of Lance

to AI engineering or generalized

vector and matrix search that we didn't discuss yet that you'd like to cover before we close out the show? No. I think we covered things pretty well. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the other folks at Lance DB are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

So something I would love to, you know, if all priorities went away and love to stand and build someday is I've had a lot of fun with the substrate project. I think just like a lightweight, call it kind of vendor agnostic

library for

expressions,

function expressions.

It's like x is greater than 7 and and 10 is greater than 5. You have, like, pair of compute expressions. Polars has its own library for this. Data Fusion has its own library for this. Having something that's based just on substrate and not really taking sides between the different implementations, but like a Python library, would be nice and helpful.

Hey. A fun project that I think,

would be tons of work, but if someone wants to build, like, a, open source cost based optimization query planner, then, yeah, go have at it. I think, again, we have these different query planners. Like, within DuckDV, they have a really solid one, but it's,

not really exposed. Data Fusion has one that you can extend, but you gotta wanna know about it. So it's,

it's one of those things that I think would would help to have out there too. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Lance. It's definitely a very interesting project. It's great to see all of the work that's been going into it and the capabilities that it unlocks. So I appreciate all of the time and energy that you and the rest of the Lance team are putting into that, and I hope you enjoy the rest of your day. Thanks. Have fun talking.

Thank you for listening, and don't forget to check out our other shows. Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at data engineering podcast.com

with your story.

Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and

coworkers.