Summary
Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
- Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Seebs about Pilosa, an open source, distributed bitmap index
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Pilosa is and how the project got started?
- Where does Pilosa fit into the overall data ecosystem and how does it integrate into an existing stack?
- What types of use cases is Pilosa uniquely well suited for?
- The Pilosa data model is fairly unique. Can you talk through how it is represented and implemented?
- What are some approaches to modeling data that might be coming from a relational database or some structured flat files?
- How do you handle highly dimensional data?
- What are some of the decisions that need to be made early in the modeling process which could have ramifications later on in the lifecycle of the project?
- What are the scaling factors of Pilosa?
- What are some of the most interesting/challenging/unexpected lessons that you have learned in the process of building Pilosa?
- What is in store for the future of Pilosa?
Contact Info
- Pilosa
- Website
- @slothware on Twitter
- Seebs
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute.
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio, that's A-L-L-U-X-I-O, today to learn more and to thank them for their support.
And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1,000,000 in free software from marketing and analytics companies like AWS, Google, and Intercom.
On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Seebs about Pilosa, an open source distributed bitmap index. So, Seebs, could you start by introducing yourself?
[00:03:11] Unknown:
Yeah, I'm Seebs. I'm a programmer type. I have a broad and weird background, what we call a generalist, but I'm particularly interested in performance stuff, and I think that is how I got to know the Pilosa people. So I've been doing mostly sort of back end data internals. And do you remember how you first got involved in the area of data management? Mostly through Pilosa. I mean, I've done some database things on and off, but when I started working with Pilosa, I discovered that there was all this data stuff going on and I hadn't thought about it all that much before. That was what got me involved. And so can you give a bit of an overview about what the Pilosa project is
[00:03:55] Unknown:
and any context that you have about the history of the project and how it's gotten to where it is today? I think Pilosa
[00:04:02] Unknown:
my description of it would be that essentially it is the index part of a database without the rest of the data storage. It's used primarily as sort of a dedicated, specialized index. I believe it was originally spun off from a company called Umbel that was, I think, sports marketing related. I mean, you might not expect that when you find a genuinely novel idea in database design that its origin would be, you know, a sports company. But I think that sometimes the outside perspective is where you get someone saying, well, it'd be really nice if we could do this, and going and building it, because they didn't come from a background where they'd all been told that that isn't what we do.
That's why it's a slightly surprising and different project. So it's basically a dedicated index. The idea is that in a traditional database, we often have bitmap indexes, and if you have a query that's hitting two of those indexes, we grab bitmaps from them and intersect the bitmaps and so on. And what Pilosa basically does is store those bitmaps, have them all ready to go, and have some optimizations for making those operations faster, which helps a lot once you start having very large cardinality in your data sets. In a lot of cases, with Postgres for instance, I had a query once which took about 47 seconds to run, and I added an index and it went down to about 10 milliseconds.
And I thought, well, there's another part of this query, and I added an index for that and it went back up to about 500 milliseconds, because the bitmap-combining stage was inefficient. And the idea is basically we're doing that part of things and storing the partial results, and that allows fairly fast operations.
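As a rough illustration of the operation being described, intersecting two bitmap indexes boils down to AND-ing packed words of bits and counting the survivors. This is a minimal sketch of that idea, not Pilosa's actual internals (which use a compressed roaring-style format discussed later in the episode); the names and bitmap contents are invented.

```go
package main

import (
	"fmt"
	"math/bits"
)

// intersect ANDs two word-packed bitmaps; each uint64 holds 64 column bits.
func intersect(a, b []uint64) []uint64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	out := make([]uint64, n)
	for i := 0; i < n; i++ {
		out[i] = a[i] & b[i]
	}
	return out
}

// count returns how many bits are set, i.e. how many entities matched.
func count(bm []uint64) int {
	total := 0
	for _, w := range bm {
		total += bits.OnesCount64(w)
	}
	return total
}

func main() {
	hasActiveAccount := []uint64{0b1011} // columns 0, 1, 3
	purchasedProduct := []uint64{0b0011} // columns 0, 1
	both := intersect(hasActiveAccount, purchasedProduct)
	fmt.Println(count(both)) // 2
}
```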
[00:05:58] Unknown:
You know, as I was reading through the data model, I was starting to get a little confused about how you really represent the data in here, and then started to realize what the actual use case was. As you mentioned, being able to operate on high cardinality data where you're working with information that might be, you know, multidimensional, and you want to be able to just do fast aggregates on the data without necessarily pulling out individual values like you would with a traditional database. And so I'm wondering if you can talk a bit more about where Pilosa fits in the overall data ecosystem, and how it might integrate into an existing data stack that somebody is using if they're running on something like Hadoop, or using an S3 data lake, or something along those lines?
[00:06:40] Unknown:
So I'm not totally sure what some of the real world cases are. I mean, I've seen some of them, but my understanding is that what we tend to see is people have an existing database and they have some specific kind of query that's being a problem, that's not performing well, which can be reduced in some way to some number of yes or no questions about entries in their data. And you move that into Pilosa and sort of flip which are the rows and which are the columns, which is, I think, one of the surprising parts, and which definitely confused me when I was first reading the documentation, because rows and columns seem to be backwards.
So the idea is, in your normal database, your, you know, standard relational database, each row is one of your entities and each column is a fact about that entity. And in our system, each column is an entity, and each row is a fact about that entity. And you might say, well, why not just call them columns and rows in the other order then? But what columns and rows actually refer to is the physical structure of the data, how things are grouped at some level. In a traditional SQL database, you will typically have all the data for a row bundled up, and then each row will be bundled up, and when you do a query on a column, you are selecting part of each of those rows.
And when we switch things to having the rows be the facts about a thing, what that means is that we have a bitmap of yes or no answers to some question about the data, and that one bitmap is that question and not any of the other questions, which allows us to very rapidly combine them. So you have questions about people like, you know, is an active member or something like that, and we have a row which is just the zeros and ones, where a one means, you know, has an active account, for instance. And then we can mask that, we can do unions or intersections or whatever, to get answers to compound questions about people very quickly, and that's what it's useful for. I think we have a blog post example of using this to do genome comparisons, which are a good example, I think. And in the documentation
[00:09:16] Unknown:
about the data model, it also mentions that you're able to represent a set number of different data types as well. And so I'm curious how that manifests in terms of the overall storage system of Pilosa, and how that maps into that bitmap index of being able to go back and forth of trying to aggregate on a particular set of attributes, but also being able to identify what those attributes are beyond just an n-dimensional matrix.
[00:09:42] Unknown:
So we have a couple of data types. The default one is what we call a set, which is just, there's rows and columns and you can have any number of bits set. Then there's a mutex, which is similar to a set, but we only allow one row to be set at a time for a given column, and that's not actually a different data representation, that's just different logic: when we set a bit, we look for other bits and clear them. There are also some caches that get put on top of this, like a cache of which columns have the largest number of rows set, which was one of the original query terms that made this seem like a useful feature.
And the other representation we have is a fairly weird one, which is that we represent numbers, and this is going to sound really strange, as a series of bits. And that actually doesn't sound nearly as revolutionary when you describe it like that, but the idea is we have all these rows, and if you want, say, a number with a range of 0 to 16, we have 4 rows, and the bottom row is the 1s and the next row is the 2s. And this is not especially efficient, because we do need to then read all those rows, but we can sort of do this in parallel, and this is used in cases where you really need a way to represent something like a range. It's not as blindingly fast as the rest of the things we can do, but it's still pretty fast compared to a more traditional structure, frequently.
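Here is a minimal sketch of that bit-sliced representation, assuming a 4-row field holding values 0 through 15. The type and method names are invented for illustration and are not Pilosa's API.

```go
package main

import "fmt"

// bsiField stores one bitmap row per bit of the value: rows[0] is the 1s
// bit, rows[1] the 2s bit, and so on. Each map key is a column (entity) ID.
type bsiField struct {
	rows [4]map[uint64]bool
}

func newBSIField() *bsiField {
	f := &bsiField{}
	for i := range f.rows {
		f.rows[i] = map[uint64]bool{}
	}
	return f
}

// Set records a value 0..15 for a column by setting one bit per row.
func (f *bsiField) Set(col uint64, value uint8) {
	for i := 0; i < 4; i++ {
		if value&(1<<i) != 0 {
			f.rows[i][col] = true
		}
	}
}

// Get reassembles the value by reading the column's bit from every row.
func (f *bsiField) Get(col uint64) uint8 {
	var v uint8
	for i := 0; i < 4; i++ {
		if f.rows[i][col] {
			v |= 1 << i
		}
	}
	return v
}

func main() {
	f := newBSIField()
	f.Set(42, 9)           // binary 1001: sets the 1s row and the 8s row for column 42
	fmt.Println(f.Get(42)) // 9
}
```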
[00:11:31] Unknown:
And in terms of the way that you would query a Pilosa database, it's not necessarily the same way that you would think about it with SQL, or even being able to retrieve a record from a document database. So what does the query language look like, and what are some of the types of use cases that Pilosa is uniquely well suited for being able to solve, given the way that the data is represented and stored?
[00:11:57] Unknown:
So the query language right now, we have a language called PQL, which I think is just Pilosa Query Language, which is a very simple query language. Typical examples would be something like, I've forgotten the exact spelling, but you write things like, you know, row equals 5, and that selects everything that has row 5 set. And then we have unions and intersects and exclusive-or, and common interactions like that are available. And we've got some work going on for mapping some SQL queries into corresponding queries, just because there's a lot of SQL around.
The kind of thing that I think this tends to get used for is cases where you have fairly large volumes of data and you know in advance what questions you're likely to ask. So, you know, things like has an active account, or, you know, signed up since 1996 or signed up in a given year, or has purchased this particular product, and you want to do combinations of these. And if you look at something like the has-purchased-this-particular-product case, in a typical database that's probably joins, and you're probably looking things up in 2 or 3 tables. The way you might approach that with Pilosa is you call product IDs rows, so, you know, the user ID is the column and the product ID is the row, and you store that bit. And then if you want to check users who have purchased this product, you just ask about that row and you get back all the columns that match it. And of course this is a very sparse representation typically, because you might have many rows in which only a few bits are set, but we aren't storing the zeros, effectively, for most of that. We're only storing a very small number of bits, so we can actually do that reasonably efficiently.
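To make the purchase example concrete, here is a hedged sketch of setting and querying bits by posting PQL strings to a Pilosa server's HTTP query endpoint. The index and field names are invented, and the port, path, and content type follow what Pilosa's documentation describes as defaults, so treat the details as illustrative rather than authoritative.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// query posts a PQL string to a Pilosa server's query endpoint and returns
// the raw response body. The port, path, and content type are assumptions.
func query(index, pql string) (string, error) {
	url := fmt.Sprintf("http://localhost:10101/index/%s/query", index)
	resp, err := http.Post(url, "text/plain", strings.NewReader(pql))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Columns are user IDs; rows are product IDs in a hypothetical
	// "purchases" field, plus a boolean-style "active" field.
	_, _ = query("users", `Set(1001, purchases=5)`)
	_, _ = query("users", `Set(1001, active=1)`)

	// "Which active users have purchased product 5?" is an intersection
	// of two row bitmaps; matching column IDs come back in the response.
	out, err := query("users", `Intersect(Row(purchases=5), Row(active=1))`)
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Println(out)
}
```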
[00:14:18] Unknown:
And one of the other contexts where I've often heard the term bitmap used is with images. So I'm wondering whether it's possible to load image data into Pilosa to be able to run some sort of image analysis algorithm on the bitmap as it's represented in Pilosa, translated from one of those static images?
[00:14:44] Unknown:
I'm not sure. I don't think that would be a great fit, because although we talk about bitmap images, they're almost always multiple bits per pixel. Although there is an interesting historical reference there: there's a computer called the Amiga, back in the 90s, that actually used a similar representation in memory, where if you had a 4-bit image, you didn't have 4 consecutive bits for a pixel. You had 4 bit planes, and each bit plane was just one of those 4 bits for all the pixels at once. And this was a similar structure, and it had some performance advantages and some performance disadvantages.
Overall, I think it is not great for visual data, just because you are much more likely to want to know what the value of a single pixel is than to look at the high red bit of all the pixels. I guess it might be nice for steganography.
[00:15:51] Unknown:
And as far as the actual data storage in Pilosa, I know that in the documentation it mentions that it's not necessarily built to be a primary source of record, and that you would usually load data into Pilosa from either another data source in bulk, or consume from a streaming engine such as Kafka, where you have the data being split into two streams, where one is going to your primary storage layer and one is going into Pilosa for further analysis. So I'm wondering if you can just talk a bit further through what a typical workflow is for being able to obtain and analyze data in Pilosa, and what the overall life cycle of that information would tend to be. Yeah. I think those are the basic forms, and, obviously, the other really common one is that you combine those and you already have some data.
[00:16:37] Unknown:
So you want to read in all that data and add new data as it comes in. So for imports, we have somewhat different logic, because when you're streaming in new bits, you generally want to have, you know, nice high-reliability data writes. You want logs of every bit as it's written so that you don't get out of sync. Whereas when you're importing a few billion bits, if you do the full flush to disk for every bit, you will not succeed in a reasonable amount of time. And we're always looking at that for possible performance enhancements, because that is absolutely one of the slowest parts of the process of getting spun up, I think: if you have many gigabytes of data, getting it migrated can be slowish.
This led to one of my favorite small side projects here, which is I built a tool called imagine, which is used to create fake databases. The etymology of the name is, you know, imagine you have a database with a billion users. So I wanted to make something where I can write up a description of a database with a billion users and a third of the bits set in these rows or whatever, and point it at a local server and have a database. This is somewhat useful for benchmarking the ingest process. It's also useful for setting up demos.
[00:18:08] Unknown:
And as far as being able to model the data as it's coming from either a relational source or some sort of structured flat files, what does the interface look like? And how would you go about structuring the data model in Pilosa to ensure that you're able to perform the types of analysis that you're looking to do on the data as it's coming in?
[00:18:31] Unknown:
So the major thing is figuring out what your rows should be and what your columns should be. You know, it's easy to say, well, columns would be users or something like that, but when you're recording data, Pilosa tends to favor things that can be well expressed as a yes or no question. The more something is like a range of values, the less likely it is that this will work as well. So we do support ranges for the cases where they're necessary, but as one example, I think we had a case where the likely default format that just came to mind ended up having a field where the only row that would be set would be the same as the column value. So it's, you know, sort of the diagonal line of ones, and that was fairly inefficient, because, you know, for a billion records, that means you now have a billion separate storage files, each with one bit, and that's not the most efficient usage.
In that case, it's possible to just not store the data. So the kind of thing you want to look for is: what are the actual questions we need to ask? Because, say you have, you know, you're looking at a package database and you're looking for the number of packages that import this package. And that's a number, so you might record it as a range and store this range of values. And the high end might be a few thousand, so you need, you know, 10 or 20 bits of storage per package to represent the number of packages that refer to it. Well, let's say you look at your actual workflow and the only thing you're ever checking is whether the number of packages is greater than 0 or not. Well, you don't need to store the number, you need to store the "is it greater than 0". If you do that, then you're in one of the cases where Pilosa will perform really well, because that's the kind of yes/no question that the bitmap index is really good for.
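A hedged sketch of the two modelings just described, written as PQL strings inside a small Go program. The field names are invented, and the range-query syntax for integer fields is an assumption based on Pilosa's documented query language rather than something stated in the episode.

```go
package main

import "fmt"

func main() {
	pkg := 7 // a package's column ID (hypothetical)

	// Option 1: store the actual importer count in an integer (BSI) field,
	// then answer "any importers at all?" with a range query over the
	// bit-sliced rows.
	fmt.Printf("Set(%d, importer_count=1342)\n", pkg)
	fmt.Println("Row(importer_count > 0)")

	// Option 2: if "greater than zero" is the only question ever asked,
	// store a single boolean row and ask the cheap yes/no question instead.
	fmt.Printf("Set(%d, has_importers=1)\n", pkg)
	fmt.Println("Row(has_importers=1)")
}
```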
[00:20:41] Unknown:
And I know that when I was looking at the documentation, it mentioned that you need to construct the index sets upfront and that they can't be modified after the fact. So I'm wondering how that would play into your decision making of doing the upfront modeling, and if there are any other types of decisions that need to be made early on in the process that could have ramifications later on as far as what types of analysis you can perform? I feel like you can, of course, change it. It's just that changing it may require you to do a lot of re-ingesting or
[00:21:10] Unknown:
modification. One thing I would tend to recommend is, you know, maybe build a sample set with a subset of the data, so it doesn't take too long, and build some of the sample queries, just walking through the process and seeing what happens and whether you are able to express the right queries. Because the data structure is unfamiliar and isn't quite like the way relational databases tend to work, it's easy to be unsure of what will work out well or what you're going to be gaining from it. I think there's a tendency for people coming from structured databases to think more in terms of named columns, and, even if you switch the columns and rows, to think in terms of named things and values for them rather than yes or no questions about them.
Of course, there is the danger that if you go too far with that and you just encode the question you have right now, if you have a new question later, you may not have a way to answer it. So think about why you want to store the data and what you need to know about it, which is always a thing with databases, but possibly more so when you're reducing things to yes and no questions.
[00:22:27] Unknown:
And are there any additional complexities or considerations that need to go into handling highly dimensional data and how that's represented and stored, or any difficulties of finding appropriately high dimensional data in structured files sort of in the wild? I imagine that things like HDF5 would lead to some of these highly dimensional data types. Yeah. I mean, that is definitely going to be hard.
[00:22:53] Unknown:
We've got essentially the ability to have fields with rows, and they can have a lot of rows. I would say in some cases it might make sense to just convert higher dimensions to column numbers or row numbers, you know, treat them as, say, if you have a 1000 by 1000 by 1000 thing, just have the first layer use columns 0 through 1000 and rows 0 through 1000, and then the next slice would get rows 1,001 through 2,000, and so on. But that will obviously not scale up to very large dimensions, at which point I think you want to try to find something underlying this that describes what you're looking at in different terms. Like, you know, we've got the genome example. That's not highly dimensional, but it's a case where you're replacing these very large strings of, you know, ACTG with a yes-or-no "is this gene present" question.
So your rows or columns, depending on the kind of question you're asking, would be numbered
[00:24:14] Unknown:
known genes that we are tracking the presence or absence of. And is it possible for a bitmap in Pilosa to have a reference to another bitmap, to be able to possibly construct some of these dimensional matrices?
[00:24:29] Unknown:
Sort of. There is not currently very direct support for it. You can have arbitrary values in the BSI field, and they could, for instance, be column numbers from another thing. Currently we don't have very good tools for directly making queries like that, but if you get back data from one query, you can use it to build another, and that is a thing we are looking at. We have some experimental things that we've done in the tree as an experiment, and not yet merged or committed because it's not quite what we want, but the idea is to make it easier to do queries that do that kind of thing. And yeah, that is a way to approach the dimensionality, and also to simplify some queries that currently would require a fair amount of back-and-forth traffic.
[00:25:23] Unknown:
And as far as your experience of working on the Pilosa project, what are some of the most interesting or challenging aspects or lessons learned that you have encountered in the process? Well, I think my favorite is probably going to have to
[00:25:39] Unknown:
be the time that I looked at some code and I said, no, I bet I could make this faster. I managed to get, I think, a factor of 20 speedup in the code. I was really pleased with that, until I realized that what I had done was suppress the operation log write. So basically I was being faster because I wasn't actually writing the data to disk. So that was a good reminder of how easy it is to be overconfident in performance tuning. In general, I've been doing a lot of focus on benchmarking and performance tuning, because that's a personal interest, and it's been very interesting, because there are frequently very unexpected opportunities for performance improvements that don't always seem like they're going to be significant. And it's a great example of the general performance rule that you really need to profile things to know what you're doing, or what you want to do, and where to focus your time.
It's been very interesting code to work on, and, you know, as with any high performance code, there's a lot of interesting special cases. For instance, if you're comparing the contents of two arrays, and they're both sorted arrays, and you just want to see how many items they have in common.
[00:27:03] Unknown:
It turns out that it matters quite a bit whether you are iterating over the longer array and then the shorter one, or the shorter one and then the longer one. And what are some of the other overall strategies that Pilosa uses to be able to achieve the types of performance that it is aiming for? And what are some of the current bottlenecks that you're trying to work through?
[00:27:24] Unknown:
Well, I think a large part of it is we are pretty focused on what you can actually have in memory. So once the system is up, it will have the data in memory. It doesn't currently support picking things off the disk, and that requires a fair amount of work on making the in-memory representation efficient. We use a modified variant of the roaring bitmap format, the distinction being that ours handles 64-bit ranges instead of just 32-bit ranges, but that lets us fit many, many gigabytes, well, theoretical gigabytes, of zeros and ones into much smaller amounts of memory. So the biggest issue is trying to make sure the data fits, and that's one of the reasons that we have support for sharding and clustering and so on, because at some point you just plain have more data than you have memory. And the other bottlenecks, I think, are mostly at the level of just performance of specific cases where access patterns are inefficient, and, you know, when you're ingesting values, the access patterns can make a factor of 10 difference in speed.
So being able to arrange to produce the data in a sorted order, for instance, can make a huge difference in how quickly it gets written, and we're working on some improvements to that, because we found some places where I think there's some good opportunities for making it faster. And we were having an issue where, if you had very sparse data but you had a lot of it, we were having memory issues, and we've been working on reducing memory usage in that case pretty significantly.
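Picking up the earlier aside about comparing two sorted arrays, here is an illustrative sketch of why iteration order matters: a straight merge walk costs roughly the sum of the lengths, while iterating the shorter array and binary-searching the longer one can win when the sizes are lopsided. This is not Pilosa's container code, just a sketch of the trade-off.

```go
package main

import (
	"fmt"
	"sort"
)

// intersectMerge walks both sorted slices in step: O(len(a)+len(b)).
func intersectMerge(a, b []uint16) int {
	count, i, j := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			count++
			i++
			j++
		}
	}
	return count
}

// intersectSearch iterates the shorter slice and binary-searches the longer
// one: O(len(short) * log(len(long))), often better when one side is tiny.
func intersectSearch(short, long []uint16) int {
	count := 0
	for _, v := range short {
		i := sort.Search(len(long), func(i int) bool { return long[i] >= v })
		if i < len(long) && long[i] == v {
			count++
		}
	}
	return count
}

func main() {
	a := []uint16{3, 9, 12}
	b := []uint16{1, 3, 5, 7, 9, 11, 13, 15, 17, 19}
	fmt.Println(intersectMerge(a, b))  // 2
	fmt.Println(intersectSearch(a, b)) // 2
}
```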
[00:29:13] Unknown:
And I know that the primary language, or actually I think the only language, of implementation for Pilosa is Go. So I don't know if you have any thoughts as to the benefits and trade-offs of having that be the implementation target. And I know that you have said that you're still fairly new to the project, but given the context that you do have, are there any architectural decisions that you think you would make differently if you were to start the whole project over today?
[00:29:40] Unknown:
That's a good question. Go is a reasonably good choice, I think. I'm sure we could get several percent faster, maybe a fair bit faster, working in C, but I also don't think it'd be done yet. I'm a reasonably experienced C programmer, and I find it useful sometimes to not have to do quite as much of that, but we are definitely seeing some cost to the garbage collector and the allocation. And so a lot of the performance optimization opportunities are basically finding the cases where it really is worth the extra time to outsmart the garbage collector a bit and bypass some of what it would otherwise do for allocation, and that can make a very large difference in performance.
Overall, it's been a fairly good fit. It's expressive, but it does allow you to get down to the bit level and write code where you have a pretty good idea of exactly what will happen. We don't have any inline assembly code in there yet, but we might someday for a few of the particularly expensive loops or whatever, but most of the time the primary expense is just the sheer amount of data there is to work on. And at that point, it's not a bad fit at all as a language, I think. Architecturally, I think I'm pretty happy with it. It basically makes sense.
[00:31:13] Unknown:
And as far as any experience that you have of working with end users of Pilosa, what have you found to be some of the common points of confusion or difficulty that they encounter when trying to get something up and running and start using it for their own purposes? Or, also, any sort of common feedback that you hear on the open source repository as far as issues that people encounter with either trying to use or contribute to the source?
[00:31:40] Unknown:
I would say that probably the columns versus rows thing is the issue. I am not sure anything else even comes close. I feel like if we had a chart of, you know, how many of them we get, that might well be over 50% of all the questions. It certainly was a point of confusion for me. The next most common thing I think I see people asking about is ingest performance, and that's just because the first thing you do with the system is try to get all your data into it. And then you have to experiment with the ingest options. And there are relatively straightforward things that work by, you know, parsing CSV files or whatever, but which don't always get the best performance. And depending on how much data you have, that can be worth a bit of time, because if your initial data ingest is going to take 6 hours to run, and you can spend 3 hours making it faster, that may be a good use of your time. And as far as that overall ingestion workflow, do you have any sense as to the
[00:32:48] Unknown:
source on-disk size versus the representation in Pilosa after it's been converted to that bitmap format? That will depend a lot on the specific data,
[00:32:58] Unknown:
just because it depends on how much data you're compressing down to a bit. If you're starting with something where the source has a column that contains the first paragraph of people's favorite novel, and the representation in Pilosa is going to have a row set for "is their favorite novel Moby Dick", you're going to be saving a lot of space. Actually, possibly not, I think the first sentence of Moby Dick is really short. But if you've got data that's basically bit-like already and you're converting it into Pilosa, you're generally going to see some effective compression, because the roaring bitmap format is very efficient for a whole lot of the likely use cases.
I don't have exact numbers, but I know that when I was doing test databases, I was putting gigabytes and gigabytes of data into tables, and it was not taking up gigabytes of space on disk. It's quite efficient for a lot of cases. It's not quite accurate, but you can sort of approximate by pretending that it's just storing the ones. And as far as that ingestion,
[00:34:09] Unknown:
does Pilosa actually store the ingested data on disk, or does it just parse it while it's flowing in and then send the rest of it to /dev/null after it's translated the representation into the bitmap format that you're storing in Pilosa?
[00:34:23] Unknown:
Yeah, we're just storing the bitmap form of whatever we're asked to store. So if there's other data going into determining it, we mostly don't see that. So for instance, if you've got, let's say, the fairly typical row and column case, where each bit you're setting gets a row and a column that tell you where to put the one bit, we get in a stream of pairs of 64-bit numbers. And if we're producing (0, 0), (1, 0), (2, 0), etc., up through (65535, 0), you send in that 2 to the 16th set of pairs, and what we actually store on disk is a run container holding the values from 0 to 65535, and
[00:35:17] Unknown:
that's 32 bits of data plus a little overhead. So it's a lot smaller. And what are some of the types of use cases that Pilosa is not well suited for, where you would recommend an alternative tool or architecture?
[00:35:33] Unknown:
Oh, strings. Strings would not be a strong point. I think strings are not a strong point, and heavily relational, you know, join-heavy things in a typical relational database are likely to be a poor fit, although in some cases it's good. Pilosa will be good at a case where you're looking up a fact about one table to pick something out of another, because that's something we can easily represent as a bit. If you want to actually combine the data from the two tables and build the results, that is something where Pilosa doesn't even really have a starting point for it, and that's probably a task that you want to use your other database for.
Pilosa's strength in that case would be you use it to get a list of the IDs in your first table that you are going to be doing all these queries on. And it can be really good for that, but then for the actual relational database workload, it's not really very useful. It's
[00:36:39] Unknown:
it is just the index, basically. And is there a sort of general guideline as far as the relative scale of data that you would want to be at before you would bother looking to Pilosa for accelerating some of your analysis on it? Or do you find that it's even useful at the, you know, tens or dozens of gigabytes scale?
[00:37:01] Unknown:
I would say, in terms of timing, if the determination of what data you want to look at is taking more than a few milliseconds, it is possible that it starts being useful to have a specialized index. So I'm not sure how much data that is. It does depend on what your existing data is like and what your existing database engine is doing. If you haven't put indexes on your regular database first, at least try their indexes, just in case that already solves the problem. But if those indexes are not fast enough, or if you need to do things like combining those indexes and that's being slow, that's the point where we start having a real utility to offer in making those queries faster.
And also some kinds of aggregation of data, or, you know, as I said, things like which entries in this have the most bits or whatever. Like with the, you know, projects that import other projects: which projects are imported by the most other projects is something that something like Pilosa will handle very well. In general, for a lot of databases, to do that, they're going to be doing, you know, count and group-by and all these aggregate operations, and they're actually going to be reading every row in the table or hitting an index 50 times or something, and Pilosa will probably just look it up and respond immediately.
[00:38:36] Unknown:
So that's definitely a strength. And before I ask the last question, I'm curious if you know where the name choice came from.
[00:38:44] Unknown:
Oh, yeah. So, well, as you know, sloths are famous for being fast, and Pilosa is just the scientific name for sloth. That's why our Twitter handle is sloth. And if you look, the logo is actually a stylized sloth. It took me a while to spot that. I was like, what's that weird swirly thing? That's a sloth.
[00:39:05] Unknown:
When I was googling about it a little bit, I came to the Wikipedia entry about the scientific term for Pilosa, and I was rather amused.
[00:39:14] Unknown:
We try to have a sense of humor about things, and I just really like the sloth, because, you know, as everyone knows, sloths are... They're actually quite good swimmers, surprisingly. Yeah, I did not know that.
[00:39:25] Unknown:
And so what do you have planned for the future of Pilosa in terms of performance improvements or feature additions, or maybe any types of planned integration with other storage systems, to maybe automatically be able to create these indices as the data is being ingested into the primary storage layer? Alright. Well, performance improvements,
[00:39:45] Unknown:
the most immediate thing is I have some really cool ideas about the way we do ingest of the value range things, you know, representing integers as a series of bitmaps. I have some ideas on that. We just did a performance improvement that reduced memory usage in cases with sparse data. I think for one of the workloads we're looking at, we went from about 70 gigabytes of memory to about 32 gigabytes of memory, or thereabouts, which I was pretty pleased with. For the ingest thing, I mean, we've definitely encountered that this is a difficulty people can run into: how do you figure out what to ingest? How do you integrate with other systems? And the plan, as I understand it, is to start working on building a managed service for that kind of thing, because it's something where the expertise you develop from working on solving that problem for one case really translates well to the next case.
So if everyone who wants to do it has to do all that learning from scratch, and then the next person comes along and they have to do it all again, that's a lot of people spending a lot of time studying this, and it might be more efficient to have some
[00:41:06] Unknown:
expertise being shared, and we're sort of building a front end on that to help achieve something. And are there any other aspects of Pilosa or bitmap indices or the types of analyses that you support that we didn't discuss yet that you think we should cover before we close out the show?
[00:41:23] Unknown:
I can't think of any immediately, but they've got some really cool blog posts with interesting pictures and, you know, graphs of things that we've worked on that I think are really interesting to look at. My favorite is between Gmail and one, probably.
[00:41:39] Unknown:
Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing at Pilosa, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. That is a really hard question. I think there's so much data, and there are so many ways to store data,
[00:42:04] Unknown:
and, you know, we all have databases, but I think just about every programmer I know has at least once implemented a flat file data store because the overhead of learning to use SQL was high and they were in a hurry. And I really feel like we need to be better at telling people that data storage is a thing and that we have good tools for this. I meet so many developers who don't know about database indexes, and I've seen people develop, you know, two caching layers on top of something because they didn't know they could put an index on a SQL database. And I feel like there's a lot of opportunity for education here, because it turns out all computers do is process data, and knowing what you can do with data, and that you can do things at all with data, would, I think, help all of us a lot. Yeah, I can definitely second that sentiment of wanting to ensure that developers have a good handle on what's available to them for being able to maintain the data that they're working with in their applications.
[00:43:10] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Pilosa. It's definitely a very interesting project and a little bit of a different mental model for being able to think about storing and analyzing data. So I appreciate your insights, and I appreciate the work that you're doing on Pilosa. And I hope you enjoy the rest of your day. Alright. Thank you. It was very interesting.
Introduction to the Guest and Pilosa
Overview and History of Pilosa
Pilosa's Role in the Data Ecosystem
Query Language and Use Cases
Data Ingestion and Workflow
Modeling Data in Pilosa
Implementation and Architectural Decisions
Common User Challenges and Feedback
Use Cases and Performance Considerations
Future Plans for Pilosa
Final Thoughts and Contact Information