Summary
Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
- Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Seebs about Pilosa, an open source, distributed bitmap index
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Pilosa is and how the project got started?
- Where does Pilosa fit into the overall data ecosystem and how does it integrate into an existing stack?
- What types of use cases is Pilosa uniquely well suited for?
- The Pilosa data model is fairly unique. Can you talk through how it is represented and implemented?
- What are some approaches to modeling data that might be coming from a relational database or some structured flat files?
- How do you handle highly dimensional data?
- What are some of the decisions that need to be made early in the modeling process which could have ramifications later on in the lifecycle of the project?
- What are the scaling factors of Pilosa?
- What are some of the most interesting/challenging/unexpected lessons that you have learned in the process of building Pilosa?
- What is in store for the future of Pilosa?
Contact Info
- Pilosa
- Website
- @slothware on Twitter
- Seebs
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute.
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio, that's A-L-L-U-X-I-O, today to learn more and to thank them for their support.
And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1,000,000 in free software from marketing and analytics companies like AWS, Google, and Intercom.
On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Seebs about Pilosa, an open source distributed bitmap index. So, Seebs, could you start by introducing yourself?
[00:03:11] Unknown:
Yeah, I'm Seebs. I'm a programmer type. I have a broad and weird background, what we call a generalist, but I'm particularly interested in performance stuff, and I think that is how I got to know the Pilosa people. So I've been doing mostly sort of back end data internals. And do you remember how you first got involved in the area of data management? Mostly through Pilosa. I mean, I've done some database things on and off, but when I started working with Pilosa, I discovered that there was all this data stuff going on and I hadn't thought about it all that much before. That was what got me involved. And so can you give a bit of an overview about what the Pilosa project is
[00:03:55] Unknown:
and any context that you have about the history of the project and how it's gotten to where it is today? I think Pilosa
[00:04:02] Unknown:
my description of it would be that essentially it is the index part of a database without the rest of the data storage. It's used primarily as sort of a dedicated, specialized index. I believe it was originally spun off from a company called Umbel that was, I think, sports marketing related. I mean, you might not expect that when you find a genuinely novel idea in database design that its origin would be, you know, a sports company. But I think that sometimes the outside perspective is where you get someone saying, well, it'd be really nice if we could do this, and going and building it, because they didn't come from a background where they'd all been told that that isn't what we do.
That's why it's a slightly surprising and different project. So it's basically a dedicated index. The idea is that in a traditional database, we often have bitmap indexes, and if you have a query that's hitting two of those indexes, we grab bitmaps from them and intersect the bitmaps and so on. And what Pilosa basically does is store those bitmaps, have them all ready to go, and have some optimizations for making those operations faster, which helps a lot once you start having very large cardinality in your data sets. In a lot of cases, with Postgres for instance, I had a query once which took about 47 seconds to run, and I added an index and it went down to about 10 milliseconds.
And I thought, well, there's another part of this query, and I added an index for that and it went back up to about 500 milliseconds, because the bitmap-combining stage was inefficient. And the idea is basically we're doing that part of things and storing the partial results, and that allows fairly fast operations.
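As a rough illustration of the operation being described, intersecting two bitmap indexes boils down to AND-ing packed words of bits and counting the survivors. This is a minimal sketch of that idea, not Pilosa's actual internals (which use a compressed roaring-style format discussed later in the episode); the names and bitmap contents are invented.

```go
package main

import (
	"fmt"
	"math/bits"
)

// intersect ANDs two word-packed bitmaps; each uint64 holds 64 column bits.
func intersect(a, b []uint64) []uint64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	out := make([]uint64, n)
	for i := 0; i < n; i++ {
		out[i] = a[i] & b[i]
	}
	return out
}

// count returns how many bits are set, i.e. how many entities matched.
func count(bm []uint64) int {
	total := 0
	for _, w := range bm {
		total += bits.OnesCount64(w)
	}
	return total
}

func main() {
	hasActiveAccount := []uint64{0b1011} // columns 0, 1, 3
	purchasedProduct := []uint64{0b0011} // columns 0, 1
	both := intersect(hasActiveAccount, purchasedProduct)
	fmt.Println(count(both)) // 2
}
```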
[00:05:58] Unknown:
You know, as I was reading through the data model, I was starting to get a little confused about how you really represent the data in here, and then started to realize what the actual use case was. As you mentioned, being able to operate on high cardinality data where you're working with information that might be, you know, multidimensional, and you want to be able to just do fast aggregates on the data without necessarily pulling out individual values like you would with a traditional database. And so I'm wondering if you can talk a bit more about where Pilosa fits in the overall data ecosystem, and how it might integrate into an existing data stack that somebody is using if they're running on something like Hadoop, or using an S3 data lake, or something along those lines?
[00:06:40] Unknown:
So I'm not totally sure what some of the real world cases are. I mean, I've seen some of them, but my understanding is that what we tend to see is people have an existing database and they have some specific kind of query that's being a problem, that's not performing well, which can be reduced in some way to some number of yes or no questions about entries in their data. And you move that into Pilosa and sort of flip which are the rows and which are the columns, which is, I think, one of the surprising parts, and which definitely confused me when I was first reading the documentation, because rows and columns seem to be backwards.
So the idea is, in your normal database, your, you know, standard relational database, each row is one of your entities and each column is a fact about that entity. And in our system, each column is an entity, and each row is a fact about that entity. And you might say, well, why not just call them columns and rows in the other order then? But what columns and rows actually refer to is the physical structure of the data, how things are grouped at some level. In a traditional SQL database, you will typically have all the data for a row bundled up, and then each row will be bundled up, and when you do a query on a column, you are selecting part of each of those rows.
And when we switch things to having the rows be the facts about a thing, what that means is that we have a bitmap of yes or no answers to some question about the data, and that one bitmap is that question and not any of the other questions, which allows us to very rapidly combine them. So you have questions about people like, you know, is an active member or something like that, and we have a row which is just the zeros and ones, where a one means, you know, has an active account, for instance. And then we can mask that, we can do unions or intersections or whatever, to get answers to compound questions about people very quickly, and that's what it's useful for. I think we have a blog post example of using this to do genome comparisons, which are a good example, I think. And in the documentation
[00:09:16] Unknown:
about the data model, it also mentions that you're able to represent a set number of different data types as well. And so I'm curious how that manifests in terms of the overall storage system of Pilosa, and how that maps into that bitmap index of being able to go back and forth of trying to aggregate on a particular set of attributes, but also being able to identify what those attributes are beyond just an n-dimensional matrix.
[00:09:42] Unknown:
So we have a couple of data types. The default one is what we call a set, which is just, there's rows and columns and you can have any number of bits set. Then there's a mutex, which is similar to a set, but we only allow one row to be set at a time for a given column, and that's not actually a different data representation, that's just different logic: when we set a bit, we look for other bits and clear them. There are also some caches that get put on top of this, like a cache of which columns have the largest number of rows set, which was one of the original query terms that made this seem like a useful feature.
And the other representation we have is a fairly weird one, which is that we represent numbers, and this is going to sound really strange, as a series of bits. And that actually doesn't sound nearly as revolutionary when you describe it like that, but the idea is we have all these rows, and if you want, say, a number with a range of 0 to 16, we have 4 rows, and the bottom row is the 1s and the next row is the 2s. And this is not especially efficient, because we do need to then read all those rows, but we can sort of do this in parallel, and this is used in cases where you really need a way to represent something like a range. It's not as blindingly fast as the rest of the things we can do, but it's still pretty fast compared to a more traditional structure, frequently.
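Here is a minimal sketch of that bit-sliced representation, assuming a 4-row field holding values 0 through 15. The type and method names are invented for illustration and are not Pilosa's API.

```go
package main

import "fmt"

// bsiField stores one bitmap row per bit of the value: rows[0] is the 1s
// bit, rows[1] the 2s bit, and so on. Each map key is a column (entity) ID.
type bsiField struct {
	rows [4]map[uint64]bool
}

func newBSIField() *bsiField {
	f := &bsiField{}
	for i := range f.rows {
		f.rows[i] = map[uint64]bool{}
	}
	return f
}

// Set records a value 0..15 for a column by setting one bit per row.
func (f *bsiField) Set(col uint64, value uint8) {
	for i := 0; i < 4; i++ {
		if value&(1<<i) != 0 {
			f.rows[i][col] = true
		}
	}
}

// Get reassembles the value by reading the column's bit from every row.
func (f *bsiField) Get(col uint64) uint8 {
	var v uint8
	for i := 0; i < 4; i++ {
		if f.rows[i][col] {
			v |= 1 << i
		}
	}
	return v
}

func main() {
	f := newBSIField()
	f.Set(42, 9)           // binary 1001: sets the 1s row and the 8s row for column 42
	fmt.Println(f.Get(42)) // 9
}
```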
[00:11:31] Unknown:
And in terms of the way that you would query a Pilosa database, it's not necessarily the same way that you would think about it with SQL, or even being able to retrieve a record from a document database. So what does the query language look like, and what are some of the types of use cases that Pilosa is uniquely well suited for being able to solve, given the way that the data is represented and stored?
[00:11:57] Unknown:
So the query language right now, we have a language called PQL, which I think is just Pilosa Query Language, which is a very simple query language. Typical examples would be something like, I've forgotten the exact spelling, but you write things like, you know, row equals 5, and that selects everything that has row 5 set. And then we have unions and intersects and exclusive-or, and common interactions like that are available. And we've got some work going on for mapping some SQL queries into corresponding queries, just because there's a lot of SQL around.
The kind of thing that I think this tends to get used for is cases where you have fairly large volumes of data and you know in advance what questions you're likely to ask. So, you know, things like has an active account, or, you know, signed up since 1996 or signed up in a given year, or has purchased this particular product, and you want to do combinations of these. And if you look at something like the has-purchased-this-particular-product case, in a typical database that's probably joins, and you're probably looking things up in 2 or 3 tables. The way you might approach that with Pilosa is you call product IDs rows, so, you know, the user ID is the column and the product ID is the row, and you store that bit. And then if you want to check users who have purchased this product, you just ask about that row and you get back all the columns that match it. And of course this is a very sparse representation typically, because you might have many rows in which only a few bits are set, but we aren't storing the zeros, effectively, for most of that. We're only storing a very small number of bits, so we can actually do that reasonably efficiently.
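To make the purchase example concrete, here is a hedged sketch of setting and querying bits by posting PQL strings to a Pilosa server's HTTP query endpoint. The index and field names are invented, and the port, path, and content type follow what Pilosa's documentation describes as defaults, so treat the details as illustrative rather than authoritative.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// query posts a PQL string to a Pilosa server's query endpoint and returns
// the raw response body. The port, path, and content type are assumptions.
func query(index, pql string) (string, error) {
	url := fmt.Sprintf("http://localhost:10101/index/%s/query", index)
	resp, err := http.Post(url, "text/plain", strings.NewReader(pql))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Columns are user IDs; rows are product IDs in a hypothetical
	// "purchases" field, plus a boolean-style "active" field.
	_, _ = query("users", `Set(1001, purchases=5)`)
	_, _ = query("users", `Set(1001, active=1)`)

	// "Which active users have purchased product 5?" is an intersection
	// of two row bitmaps; matching column IDs come back in the response.
	out, err := query("users", `Intersect(Row(purchases=5), Row(active=1))`)
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Println(out)
}
```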
[00:14:18] Unknown:
And one of the other contexts where I've often heard the term bitmap used is with images. So I'm wondering whether it's possible to load image data into Pilosa to be able to run some sort of image analysis algorithm on the bitmap as it's represented in Pilosa, translated from one of those static images?
[00:14:44] Unknown:
I'm not sure. I don't think that would be a great fit, because although we talk about bitmap images, they're almost always multiple bits per pixel. Although there is an interesting historical reference there: there's a computer called the Amiga, back in the 90s, that actually used a similar representation in memory, where if you had a 4-bit image, you didn't have 4 consecutive bits for a pixel. You had 4 bit planes, and each bit plane was just one of those 4 bits for all the pixels at once. And this was a similar structure, and it had some performance advantages and some performance disadvantages.
Overall, I think it is not great for visual data, just because you are much more likely to want to know what the value of a single pixel is than to look at the high red bit of all the pixels. I guess it might be nice for steganography.
[00:15:51] Unknown:
And as far as the actual data storage in Pilosa, I know that in the documentation it mentions that it's not necessarily built to be a primary source of record, and that you would usually load data into Pilosa from either another data source in bulk, or consume from a streaming engine such as Kafka, where you have the data being split into two streams, where one is going to your primary storage layer and one is going into Pilosa for further analysis. So I'm wondering if you can just talk a bit further through what a typical workflow is for being able to obtain and analyze data in Pilosa, and what the overall life cycle of that information would tend to be. Yeah. I think those are the basic forms, and, obviously, the other really common one is that you combine those and you already have some data.
[00:16:37] Unknown:
So you want to read in all that data and add new data as it comes in. So for imports, we have somewhat different logic, because when you're streaming in new bits, you generally want to have, you know, nice high-reliability data writes. You want logs of every bit as it's written so that you don't get out of sync. Whereas when you're importing a few billion bits, if you do the full flush to disk for every bit, you will not succeed in a reasonable amount of time. And we're always looking at that for possible performance enhancements, because that is absolutely one of the slowest parts of the process of getting spun up, I think: if you have many gigabytes of data, getting it migrated can be slowish.
This led to one of my favorite small side projects here, which is I built a tool called imagine, which is used to create fake databases. The etymology of the name is, you know, imagine you have a database with a billion users. So I wanted to make something where I can write up a description of a database with a billion users and a third of the bits set in these rows or whatever, and point it at a local server and have a database. This is somewhat useful for benchmarking the ingest process. It's also useful for setting up demos.
[00:18:08] Unknown:
And as far as being able to model the data as it's coming from either a relational source or some sort of structured flat files, what does the interface look like? And how would you go about structuring the data model in Pilosa to ensure that you're able to perform the types of analysis that you're looking to do on the data as it's coming in?
[00:18:31] Unknown:
So the major thing is figuring out what your rows should be and what your columns should be. You know, it's easy to say, well, columns would be users or something like that, but when you're recording data, Pilosa tends to favor things that can be well expressed as a yes or no question. The more something is like a range of values, the less likely it is that this will work as well. So we do support ranges for the cases where they're necessary, but as one example, I think we had a case where the likely default format that just came to mind ended up having a field where the only row that would be set would be the same as the column value. So it's, you know, sort of the diagonal line of ones, and that was fairly inefficient, because, you know, for a billion records, that means you now have a billion separate storage files, each with one bit, and that's not the most efficient usage.
In that case, it's possible to just not store the data. So the kind of thing you want to look for is: what are the actual questions we need to ask? Because, say you have, you know, you're looking at a package database and you're looking for the number of packages that import this package. And that's a number, so you might record it as a range and store this range of values. And the high end might be a few thousand, so you need, you know, 10 or 20 bits of storage per package to represent the number of packages that refer to it. Well, let's say you look at your actual workflow and the only thing you're ever checking is whether the number of packages is greater than 0 or not. Well, you don't need to store the number, you need to store the "is it greater than 0". If you do that, then you're in one of the cases where Pilosa will perform really well, because that's the kind of yes/no question that the bitmap index is really good for.
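A hedged sketch of the two modelings just described, written as PQL strings inside a small Go program. The field names are invented, and the range-query syntax for integer fields is an assumption based on Pilosa's documented query language rather than something stated in the episode.

```go
package main

import "fmt"

func main() {
	pkg := 7 // a package's column ID (hypothetical)

	// Option 1: store the actual importer count in an integer (BSI) field,
	// then answer "any importers at all?" with a range query over the
	// bit-sliced rows.
	fmt.Printf("Set(%d, importer_count=1342)\n", pkg)
	fmt.Println("Row(importer_count > 0)")

	// Option 2: if "greater than zero" is the only question ever asked,
	// store a single boolean row and ask the cheap yes/no question instead.
	fmt.Printf("Set(%d, has_importers=1)\n", pkg)
	fmt.Println("Row(has_importers=1)")
}
```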
[00:20:41] Unknown:
And I know that when I was looking at the documentation, it mentioned that you need to construct the index sets upfront and that they can't be modified after the fact. So I'm wondering how that would play into your decision making of doing the upfront modeling, and if there are any other types of decisions that need to be made early on in the process that could have ramifications later on as far as what types of analysis you can perform? I feel like you can, of course, change it. It's just that changing it may require you to do a lot of re-ingesting or
[00:21:10] Unknown:
modification. One thing I would tend to recommend is, you know, maybe build a sample set with a subset of the data, so it doesn't take too long, and build some of the sample queries, just walking through the process and seeing what happens and whether you are able to express the right queries. Because the data structure is unfamiliar and isn't quite like the way relational databases tend to work, it's easy to be unsure of what will work out well or what you're going to be gaining from it. I think there's a tendency for people coming from structured databases to think more in terms of named columns, and, even if you switch the columns and rows, to think in terms of named things and values for them rather than yes or no questions about them.
Of course, there is the danger that if you go too far with that and you just encode the question you have right now, if you have a new question later, you may not have a way to answer it. So think about why you want to store the data and what you need to know about it, which is always a thing with databases, but possibly more so when you're reducing things to yes and no questions.
[00:22:27] Unknown:
And are there any additional complexities or considerations that need to go into handling highly dimensional data and how that's represented and stored, or any difficulties of finding appropriately high dimensional data in structured files sort of in the wild? I imagine that things like HDF5 would lead to some of these highly dimensional data types. Yeah. I mean, that is definitely going to be hard.
[00:22:53] Unknown:
We've got essentially the ability to have fields with rows, and they can have a lot of rows. I would say in some cases it might make sense to just convert higher dimensions to column numbers or row numbers, you know, treat them as, say, if you have a 1000 by 1000 by 1000 thing, just have the first layer use columns 0 through 1000 and rows 0 through 1000, and then the next slice would get rows 1,001 through 2,000, and so on. But that will obviously not scale up to very large dimensions, at which point I think you want to try to find something underlying this that describes what you're looking at in different terms. Like, you know, we've got the genome example. That's not highly dimensional, but it's a case where you're replacing these very large strings of, you know, ACTG with a yes-or-no "is this gene present" question.
So your rows or columns, depending on the kind of question you're asking, would be numbered
[00:24:14] Unknown:
known genes that we are tracking the presence or absence of. And is it possible for a bitmap in Pilosa to have a reference to another bitmap, to be able to possibly construct some of these dimensional matrices?
[00:24:29] Unknown:
Sort of. There is not currently very direct support for it. You can have arbitrary values in the BSI field, and they could, for instance, be column numbers from another thing. Currently we don't have very good tools for directly making queries like that, but if you get back data from one query, you can use it to build another, and that is a thing we are looking at. We have some experimental things that we've done in the tree as an experiment, and not yet merged or committed because it's not quite what we want, but the idea is to make it easier to do queries that do that kind of thing. And yeah, that is a way to approach the dimensionality, and also to simplify some queries that currently would require a fair amount of back-and-forth traffic.
[00:25:23] Unknown:
And as far as your experience of working on the Pilosa project, what are some of the most interesting or challenging aspects or lessons learned that you have encountered in the process? Well, I think my favorite is probably going to have to
[00:25:39] Unknown:
be the time that I looked at some code and I said, no, I bet I could make this faster. I managed to get, I think, a factor of 20 speedup in the code. I was really pleased with that, until I realized that what I had done was suppress the operation log write. So basically I was being faster because I wasn't actually writing the data to disk. So that was a good reminder of how easy it is to be overconfident in performance tuning. In general, I've been doing a lot of focus on benchmarking and performance tuning, because that's a personal interest, and it's been very interesting, because there are frequently very unexpected opportunities for performance improvements that don't always seem like they're going to be significant. And it's a great example of the general performance rule that you really need to profile things to know what you're doing, or what you want to do, and where to focus your time.
It's been very interesting code to work on, and, you know, as with any high performance code, there's a lot of interesting special cases. For instance, if you're comparing the contents of two arrays, and they're both sorted arrays, and you just want to see how many items they have in common.
[00:27:03] Unknown:
It turns out that it matters quite a bit whether you are iterating over the longer array and then the shorter one, or the shorter one and then the longer one. And what are some of the other overall strategies that Pilosa uses to be able to achieve the types of performance that it is aiming for? And what are some of the current bottlenecks that you're trying to work through?
[00:27:24] Unknown:
Well, I think a large part of it is we are pretty focused on what you can actually have in memory. So once the system is up, it will have the data in memory. It doesn't currently support picking things off the disk, and that requires a fair amount of work on making the in-memory representation efficient. We use a modified variant of the roaring bitmap format, the distinction being that ours handles 64-bit ranges instead of just 32-bit ranges, but that lets us fit many, many gigabytes, well, theoretical gigabytes, of zeros and ones into much smaller amounts of memory. So the biggest issue is trying to make sure the data fits, and that's one of the reasons that we have support for sharding and clustering and so on, because at some point you just plain have more data than you have memory. And the other bottlenecks, I think, are mostly at the level of just performance of specific cases where access patterns are inefficient, and, you know, when you're ingesting values, the access patterns can make a factor of 10 difference in speed.
So being able to arrange to produce the data in a sorted order, for instance, can make a huge difference in how quickly it gets written, and we're working on some improvements to that, because we found some places where I think there's some good opportunities for making it faster. And we were having an issue where, if you had very sparse data but you had a lot of it, we were having memory issues, and we've been working on reducing memory usage in that case pretty significantly.
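Picking up the earlier aside about comparing two sorted arrays, here is an illustrative sketch of why iteration order matters: a straight merge walk costs roughly the sum of the lengths, while iterating the shorter array and binary-searching the longer one can win when the sizes are lopsided. This is not Pilosa's container code, just a sketch of the trade-off.

```go
package main

import (
	"fmt"
	"sort"
)

// intersectMerge walks both sorted slices in step: O(len(a)+len(b)).
func intersectMerge(a, b []uint16) int {
	count, i, j := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			count++
			i++
			j++
		}
	}
	return count
}

// intersectSearch iterates the shorter slice and binary-searches the longer
// one: O(len(short) * log(len(long))), often better when one side is tiny.
func intersectSearch(short, long []uint16) int {
	count := 0
	for _, v := range short {
		i := sort.Search(len(long), func(i int) bool { return long[i] >= v })
		if i < len(long) && long[i] == v {
			count++
		}
	}
	return count
}

func main() {
	a := []uint16{3, 9, 12}
	b := []uint16{1, 3, 5, 7, 9, 11, 13, 15, 17, 19}
	fmt.Println(intersectMerge(a, b))  // 2
	fmt.Println(intersectSearch(a, b)) // 2
}
```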
[00:29:13] Unknown:
And I know that the primary language, or actually I think the only language, of implementation for Pilosa is Go. So I don't know if you have any thoughts as to the benefits and trade-offs of having that be the implementation target. And I know that you have said that you're still fairly new to the project, but given the context that you do have, are there any architectural decisions that you think you would make differently if you were to start the whole project over today?
[00:29:40] Unknown:
That's a good question. Go is a reasonably good choice, I think. I'm sure we could get several percent faster, maybe a fair bit faster, working in C, but I also don't think it'd be done yet. I'm a reasonably experienced C programmer, and I find it useful sometimes to not have to do quite as much of that, but we are definitely seeing some cost to the garbage collector and the allocation. And so a lot of the performance optimization opportunities are basically finding the cases where it really is worth the extra time to outsmart the garbage collector a bit and bypass some of what it would otherwise do for allocation, and that can make a very large difference in performance.
Overall, it's been a fairly good fit. It's expressive, but it does allow you to get down to the bit level and write code where you have a pretty good idea of exactly what will happen. We don't have any inline assembly code in there yet, but we might someday for a few of the particularly expensive loops or whatever, but most of the time the primary expense is just the sheer amount of data there is to work on. And at that point, it's not a bad fit at all as a language, I think. Architecturally, I think I'm pretty happy with it. It basically makes sense.
[00:31:13] Unknown:
And as far as any experience that you have of working with end users of Pilosa, what have you found to be some of the common points of confusion or difficulty that they encounter when trying to get something up and running and start using it for their own purposes? Or, also, any sort of common feedback that you hear on the open source repository as far as issues that people encounter with either trying to use or contribute to the source?
[00:31:40] Unknown:
I would say that probably the columns versus rows thing is the issue. I am not sure anything else even comes close. I feel like if we had a chart of, you know, how many of them we get, that might well be over 50% of all the questions. It certainly was a point of confusion for me. The next most common thing I think I see people asking about is ingest performance, and that's just because the first thing you do with the system is try to get all your data into it. And then you have to experiment with the ingest options. And there are relatively straightforward things that work by, you know, parsing CSV files or whatever, but which don't always get the best performance. And depending on how much data you have, that can be worth a bit of time, because if your initial data ingest is going to take 6 hours to run, and you can spend 3 hours making it faster, that may be a good use of your time. And as far as that overall ingestion workflow, do you have any sense as to the
[00:32:48] Unknown:
source on-disk size versus the representation in Pilosa after it's been converted to that bitmap format? That will depend a lot on the specific data,
[00:32:58] Unknown:
just because it depends on how much data you're compressing down to a bit. If you're starting with something where the source has a column that contains the first paragraph of people's favorite novel, and the representation in Pilosa is going to have a row set for "is their favorite novel Moby Dick", you're going to be saving a lot of space. Actually, possibly not, I think the first sentence of Moby Dick is really short. But if you've got data that's basically bit-like already and you're converting it into Pilosa, you're generally going to see some effective compression, because the roaring bitmap format is very efficient for a whole lot of the likely use cases.
I don't have exact numbers, but I know that when I was doing test databases, I was putting gigabytes and gigabytes of data into tables, and it was not taking up gigabytes of space on disk. It's quite efficient for a lot of cases. It's not quite accurate, but you can sort of approximate by pretending that it's just storing the ones. And as far as that ingestion,
[00:34:09] Unknown:
does Pilosa actually store the ingested data on disk, or does it just parse it while it's flowing in and then send the rest of it to /dev/null after it's translated the representation into the bitmap format that you're storing in Pilosa?
[00:34:23] Unknown:
Yeah, we're just storing the bitmap form of whatever we're asked to store. So if there's other data going into determining it, we mostly don't see that. So for instance, if you've got, let's say, the fairly typical row and column case, where each bit you're setting gets a row and a column that tell you where to put the one bit, we get in a stream of pairs of 64-bit numbers. And if we're producing (0, 0), (1, 0), (2, 0), etc., up through (65535, 0), you send in that 2 to the 16th set of pairs, and what we actually store on disk is a run container holding the values from 0 to 65535, and
[00:35:17] Unknown:
that's 32 bits of data plus a little overhead. So it's a lot smaller. And what are some of the types of use cases that Pilosa is not well suited for, where you would recommend an alternative tool or architecture?
[00:35:33] Unknown:
Oh, strings. Strings would not be a strong point. I think strings are not a strong point, and heavily relational, you know, join-heavy things in a typical relational database are likely to be a poor fit, although in some cases it's good. Pilosa will be good at a case where you're looking up a fact about one table to pick something out of another, because that's something we can easily represent as a bit. If you want to actually combine the data from the two tables and build the results, that is something where Pilosa doesn't even really have a starting point for it, and that's probably a task that you want to use your other database for.
Pilosa's strength in that case would be you use it to get a list of the IDs in your first table that you are going to be doing all these queries on. And it can be really good for that, but then for the actual relational database workload, it's not really very useful. It's
[00:36:39] Unknown:
it is just the index, basically. And is there a sort of general guideline as far as the relative scale of data that you would want to be at before you would bother looking to Pilosa for accelerating some of your analysis on it? Or do you find that it's even useful at the, you know, tens or dozens of gigabytes scale?
[00:37:01] Unknown:
I would say, in terms of timing, if the determination of what data you want to look at is taking more than a few milliseconds, it is possible that it starts being useful to have a specialized index. So I'm not sure how much data that is. It does depend on what your existing data is like and what your existing database engine is doing. If you haven't put indexes on your regular database first, at least try their indexes, just in case that already solves the problem. But if those indexes are not fast enough, or if you need to do things like combining those indexes and that's being slow, that's the point where we start having a real utility to offer in making those queries faster.
And also some kinds of aggregation of data, or, you know, as I said, things like which entries in this have the most bits or whatever. Like with the, you know, projects that import other projects: which projects are imported by the most other projects is something that something like Pilosa will handle very well. In general, for a lot of databases, to do that, they're going to be doing, you know, count and group-by and all these aggregate operations, and they're actually going to be reading every row in the table or hitting an index 50 times or something, and Pilosa will probably just look it up and respond immediately.
[00:38:36] Unknown:
So that's definitely a strength. And before I ask the last question, I'm curious if you know where the name choice came from.
[00:38:44] Unknown:
Oh, yeah. So, well, as you know, sloths are famous for being fast, and Pilosa is just the scientific name for sloth. That's why our Twitter handle is sloth. And if you look, the logo is actually a stylized sloth. It took me a while to spot that. I was like, what's that weird swirly thing? That's a sloth.
[00:39:05] Unknown:
When I was googling about it a little bit, I came to the Wikipedia entry about the scientific term for Pilosa, and I was rather amused.
[00:39:14] Unknown:
We try to have a sense of humor about things, and I just really like the sloth, because, you know, as everyone knows, sloths are... They're actually quite good swimmers, surprisingly. Yeah, I did not know that.
[00:39:25] Unknown:
And so what do you have planned for the future of Pilosa in terms of performance improvements or feature additions, or maybe any types of planned integration with other storage systems, to maybe automatically be able to create these indices as the data is being ingested into the primary storage layer? Alright. Well, performance improvements,
[00:39:45] Unknown:
the most immediate thing is I have some really cool ideas about the way we do ingest of the value range things, you know, representing integers as a series of bitmaps. I have some ideas on that. We just did a performance improvement that reduced memory usage in cases with sparse data. I think for one of the workloads we're looking at, we went from about 70 gigabytes of memory to about 32 gigabytes of memory, or thereabouts, which I was pretty pleased with. For the ingest thing, I mean, we've definitely encountered that this is a difficulty people can run into: how do you figure out what to ingest? How do you integrate with other systems? And the plan, as I understand it, is to start working on building a managed service for that kind of thing, because it's something where the expertise you develop from working on solving that problem for one case really translates well to the next case.
So if everyone who wants to do it has to do all that learning from scratch, and then the next person comes along and they have to do it all again, that's a lot of people spending a lot of time studying this, and it might be more efficient to have some
[00:41:06] Unknown:
expertise being shared, and we're sort of building a front end on that to help achieve something. And are there any other aspects of Pilosa or bitmap indices or the types of analyses that you support that we didn't discuss yet that you think we should cover before we close out the show?
[00:41:23] Unknown:
I can't think of any immediately, but they've got some really cool blog posts with interesting pictures and, you know, graphs of things that we've worked on that I think are really interesting to look at. My favorite is between Gmail and one, probably.
[00:41:39] Unknown:
Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing at Pilosa, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. That is a really hard question. I think there's so much data, and there are so many ways to store data,
[00:42:04] Unknown:
and, you know, we all have databases, but I think just about every programmer I know has at least once implemented a flat file data store because the overhead of learning to use SQL was high and they were in a hurry. And I really feel like we need to be better at telling people that data storage is a thing and that we have good tools for this. I meet so many developers who don't know about database indexes, and I've seen people develop, you know, two caching layers on top of something because they didn't know they could put an index on a SQL database. And I feel like there's a lot of opportunity for education here, because it turns out all computers do is process data, and knowing what you can do with data, and that you can do things at all with data, would, I think, help all of us a lot. Yeah, I can definitely second that sentiment of wanting to ensure that developers have a good handle on what's available to them for being able to maintain the data that they're working with in their applications.
[00:43:10] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Pilosa. It's definitely a very interesting project and a little bit of a different mental model for being able to think about storing and analyzing data. So I appreciate your insights, and I appreciate the work that you're doing on Pilosa. And I hope you enjoy the rest of your day. Alright. Thank you. It was very interesting.
Introduction to the Guest and Pilosa
Overview and History of Pilosa
Pilosa's Role in the Data Ecosystem
Query Language and Use Cases
Data Ingestion and Workflow
Modeling Data in Pilosa
Implementation and Architectural Decisions
Common User Challenges and Feedback
Use Cases and Performance Considerations
Future Plans for Pilosa
Final Thoughts and Contact Information