Summary
Processing high-velocity time-series data in real time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what PipelineDB is and the motivation for creating it?
- What are the major use cases that it enables?
- What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
- What are the major concepts and components that users of PipelineDB should be familiar with?
- Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
- What are some of the common patterns for populating data streams?
- What are the options for scaling PipelineDB systems, both vertically and horizontally?
- How much elasticity does the system support in terms of changing volumes of inbound data?
- What are some of the limitations or edge cases that users should be aware of?
- Given that inbound data is not persisted to disk, how do you guard against data loss?
- Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
- Can a separate table be used as an input stream?
- Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
- What are some of the features that you have found to be the most useful which users might initially overlook?
- What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
- What are some of the most challenging aspects of building continuous aggregates on unbounded data?
- What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
- What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
- When is PipelineDB the wrong choice?
- What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- PipelineDB
- Stride
- PostgreSQL
- AdRoll
- Probabilistic Data Structures
- TimescaleDB
- Podcast Episode
- Hive
- Redshift
- Kafka
- Kinesis
- ZeroMQ
- Nanomsg
- HyperLogLog
- Bloom Filter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today, I'm interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL.
So, Usman, could you start by introducing yourself?
[00:01:09] Unknown:
Yeah. I'm Usman. I work as an engineer at Stripe. Previously, I was involved with PipelineDB as the chief architect and worked for a couple of years on the project.
[00:01:19] Unknown:
And, Derek, how about yourself?
[00:01:21] Unknown:
So I'm the founder and CEO of PipelineDB.
[00:01:27] Unknown:
And going back to you, Usman, do you remember how you first got involved in the area of data management?
[00:01:33] Unknown:
Yeah. I mean, I think for me, I kinda just randomly stumbled into it. In school, I focused a lot on systems-y stuff and then built a few data processing pipelines at my first job. And then when I was looking for a new job, I saw PipelineDB as a potential opportunity. I'd always wanted to work on an open source project or something where I could kinda give back to the community. It seemed like a cool project. So I talked to Derek and, you know, things kind of went from there.
[00:02:03] Unknown:
And, Derek, how did you first get involved in the area of data management?
[00:02:06] Unknown:
So it all started at the last company I worked at before I started this company. It was an advertising technology company called AdRoll. I started there when it was a very small company and it ended up growing really big. I was on the engineering team and I took over the responsibility for basically all of their data management infrastructure. And as the company grew and the scale grew explosively, we had to build a lot of things that didn't exist yet. A lot of new, interesting technology to handle just the huge scale involved in running an advertising company. And after doing that for a few years, I had learned a lot about products that I wish had existed, because we had to build them ourselves at AdRoll. And when I was ready to move on and do the next thing, I started PipelineDB based on my experience at AdRoll.
[00:02:57] Unknown:
And so can you give an overview of what Pipeline DB is and some of the motivation and story behind creating it?
[00:03:05] Unknown:
Yeah. Sure. So it really begins with the experience I had at AdRoll. As I mentioned, we were processing huge amounts of data. By the time I left, we were reaching more than half the Internet per day with ads. And we were analyzing all of that data in real time, which is a pretty hard problem to solve at that scale. But the key observation was that most of what we wanted to know about this data we were processing, most of the computations required to learn those things, were fairly straightforward. We were counting a lot of things, running aggregation queries, and things like that. And we built a lot of complex, custom application code that we used to achieve those things.
And there was really no off-the-shelf product that allowed you to say, okay, take this huge, massive, continuous stream of data and just run these simple aggregations on it as efficiently as possible. And that's essentially the problem that PipelineDB solves. Instead of requiring people to build their own custom application infrastructure or use some of the more complex stream processing infrastructure out there, it allows them to declare the aggregation queries they wanna run on these massive flows of data. And it then runs those calculations continuously in real time as data flows into the system. But the key innovation is that PipelineDB only stores the output of these aggregation queries. So another key observation we had at AdRoll was that the input dataset was much, much larger, by multiple orders of magnitude, than the output, because we were doing aggregations.
It starts out as a huge, voluminous stream of messy log data. You know, one event per ad served or one event per click served or whatever. And by the time it gets processed, it's much, much smaller than that. It's just aggregated down into a relatively small number of rows, you know, bucketed by hour or day or customer or user. And so when we looked at that huge differential between the size of the input and the size of the output, we saw an opportunity to build a system that just never had to store the raw, granular data.
And because the bottleneck in most database technologies is generally the disk, that allowed us to avoid an enormous amount of disk IO by just aggregating all this data in memory and only writing out the compact output to disk. And that makes PipelineDB extremely efficient, even running it on a single server deployment.
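To make the model concrete, here is a minimal sketch of that declare-in-advance pattern, written in the pre-1.0 syntax from the PipelineDB documentation (the 1.0 extension expresses the same thing with CREATE FOREIGN TABLE ... SERVER pipelinedb and CREATE VIEW ... WITH (action=materialize)); the stream and column names are hypothetical:

```sql
-- Hypothetical event stream; only the aggregate below is ever written to disk.
CREATE STREAM page_views (url text, user_id bigint, viewed_at timestamptz);

-- Declared up front, computed incrementally as events arrive.
CREATE CONTINUOUS VIEW views_per_hour AS
  SELECT url,
         date_trunc('hour', viewed_at) AS hour,
         count(*) AS view_count
  FROM page_views
  GROUP BY url, hour;

-- Writes are ordinary INSERTs (or COPY, or the Kafka/Kinesis connectors).
INSERT INTO page_views (url, user_id, viewed_at) VALUES ('/pricing', 42, now());

-- Reads are just normal queries against the always-up-to-date output.
SELECT * FROM views_per_hour WHERE url = '/pricing' ORDER BY hour DESC;
```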
[00:05:56] Unknown:
And later on, I definitely wanna get into some of the scaling aspects in terms of the system architecture that would be most beneficial for this deployment, and some of the discussion around discarding the raw data and some of the options there. But first, I'd like to cover what types of use cases this enables that would otherwise be impractical or intractable, and some of the main types of industries that would benefit most from the PipelineDB technology?
[00:06:26] Unknown:
So the primary area of use cases is anywhere there's, like, a real-time analytics dashboard being used at a company. PipelineDB is gonna be a really good fit for powering that, because the main drawback of using this technology, or trade-off rather, is that you have to know these queries in advance. You have to declare them in advance, and then as the data runs through them, the results are incrementally computed. So anywhere where you know the queries that you wanna run on these large streams of data in advance, PipelineDB is going to be a really good fit. And in our experience, we've seen that mostly in the area where there are reporting dashboards, real-time analytics dashboards, things like that. People are using PipelineDB to power those things on basically really small amounts of hardware.
And also with large amounts of input data, which before PipelineDB was around was a much harder problem to solve, for the reasons that I mentioned before: you had to build larger scale, usually custom, application infrastructure to do these aggregations in real time. PipelineDB just makes it really, really simple. So in terms of the things it enables companies to do that they wouldn't otherwise be able to do, it allows them to analyze more data faster, essentially. They're able to, in many cases, analyze data sets that they wouldn't be able to analyze at all before, or in real time. Software companies with deep technical expertise can build this infrastructure, and they do, themselves. So PipelineDB makes a lot of sense for companies that are producing a lot of data but are not necessarily software companies. They don't necessarily have the experience or the staff to build complex stream processing infrastructure. So it makes it much easier for them. And in terms of industries where we see PipelineDB being used heavily, it's a pretty broad cross section, honestly, because you can think of pretty much any company in the world nowadays.
Companies are more and more becoming data-driven companies, and so pretty much every company out there will have some sort of software infrastructure that's ultimately powering real-time analytics dashboards that other people in the company are using. So the set of companies using PipelineDB is extremely broad. We do see increased usage in a few different verticals. Ad tech is one of them, which isn't surprising because this was essentially conceived out of experience from ad tech. Telecom is another vertical where we see a lot of usage. And then beyond those two, it's a pretty broad cross section.
[00:09:19] Unknown:
And what are some of the major architectural concepts and design components that users of pipeline DB should be familiar with to be able to use it effectively?
[00:09:31] Unknown:
I think the first — and Usman, feel free to add to this if I miss anything — the first thing is to have a proficiency in SQL, I think, at a high level, because everything that you do with PipelineDB, you essentially do with SQL. So you'd wanna have some proficiency with SQL and know how that works. You would also want to have a good understanding of how PipelineDB scales across cores on a server. We built it to be highly, highly parallel, because stream processing is very CPU intensive: you're running all these computations on huge flows of data coming in. So you'd wanna have an understanding of how all the work required to run these continuous queries scales out over the CPUs on the system that you're running PipelineDB on, so that you can properly take advantage of all the hardware capacity available, I would say. Is there anything else I missed, Usman?
[00:10:35] Unknown:
Yeah. I think that covers most of it. I think the kind of high-level concept here — which Derek has mentioned before as well — is understanding whether your data can be efficiently summarized. And sometimes, in doing that, you might lose some level of accuracy. Like, for example, if I'm trying to count the number of distinct elements in a stream, if I'm trying to do it 100% accurately, I kinda have to store on the order of the unique elements; I have to store that information somewhere. But if I'm happy with a little bit of error, if I don't think that's that big of a deal, then you can use some kind of probabilistic data structures underneath the hood to approximate those answers. So PipelineDB internally will leverage a bunch of these probabilistic data structures automatically for you, so you don't even have to think about what's happening. But you still need to have the understanding that when you do summarize data and you're trying to do it efficiently, you might lose some level of accuracy. For the most part, what I've seen is that when you are dealing with such large volumes of data and you're trying to get aggregate information out of it, you don't really care if, like, you saw a million elements or a million and two. Like, that level of accuracy is not that important. You're not making, like, extremely hardcore business decisions. You're mostly just getting charts and trying to see directionally how things look.
[00:11:51] Unknown:
Yeah. That's a really good point on this question. As Usman said, a lot of the infrastructure that allows us to even run these continuous aggregations on essentially infinite streams of data uses probabilistic data structures. For example, an extremely common aggregate operation that users use PipelineDB for is count distinct, which means counting the number of distinct input elements. So this might have a use case in, you know, website analytics, if you wanna know for each hour how many distinct users viewed this URL. And that uses count distinct in PipelineDB. But to do that with perfect accuracy, as Usman said, you have to store all of the unique elements you've seen. And then when you get a new element coming in, you say, have I already added this? If not, then increase the distinct count. If so, don't do anything because it's not unique. And we use HyperLogLog for count distinct, which is a very compact data structure that allows you to estimate the cardinality of the input set with a very small memory footprint. And the trade-off is that it has a small margin of error. So most of the probabilistic data structures in PipelineDB, to give you a sense, have an error margin of usually less than 1%. Like, HyperLogLog is, I think, by default 99.2% accurate. And as Usman said, when you're dealing with huge, huge numbers, you don't really care if it's a million or a million and two. So that's the kind of error margin that users are generally okay with accepting with PipelineDB.
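For instance, the distinct-users-per-hour case Derek describes maps onto a single continuous view; here is a hedged sketch against the hypothetical page_views stream from the earlier example, with PipelineDB transparently backing count(DISTINCT ...) with a HyperLogLog per group:

```sql
-- Only a small HLL summary per (url, hour) group is stored, not every user_id,
-- so the result is an estimate with a sub-1% error margin.
CREATE CONTINUOUS VIEW uniques_per_hour AS
  SELECT url,
         date_trunc('hour', viewed_at) AS hour,
         count(DISTINCT user_id) AS unique_visitors
  FROM page_views
  GROUP BY url, hour;
```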
[00:13:41] Unknown:
So given the fact that PipelineDB is implemented as a plugin for PostgreSQL, and there are a number of other very interesting and useful plugins within that ecosystem, particularly some recent entries like Timescale and Citus, I'm wondering what level of compatibility exists between PipelineDB and some of those other useful plugins?
[00:14:09] Unknown:
So with the two that you mentioned, we get asked about those pretty often. And currently, there's no native integration with any other extension. So any kind of interoperability you would have between PipelineDB and another extension would be done through the primitives exposed by Postgres. Since they're all Postgres extensions, you could integrate them using whatever interfaces Postgres provides you, which are things like triggers, logical decoding, streaming replication, things like that. It's possible to integrate these extensions with those standard primitives.
And in the future, I think we'll almost certainly have a native integration with TimescaleDB, just because the use cases for TimescaleDB and for PipelineDB have a lot of overlap. They're both essentially time series processing engines. TimescaleDB is a bit different from ours in that it stores granular data and allows you to query it very efficiently, whereas we don't store granular data. But the use cases are very similar. So we basically solve a lot of those same problems with two fundamentally different approaches. So it makes a lot of sense for these two extensions to be integrated in the future. So I think we'll almost certainly have something in that area within the coming months. And in terms of Citus, there's not as much overlap in use cases as there is between us and TimescaleDB. So we don't have anything in the works in terms of integration with them, but it is something that we get asked about relatively frequently.
[00:15:52] Unknown:
Yeah. As I was looking through the documentation for PipelineDB, I was envisioning using Timescale for storing the high-granularity data after it's been pre-aggregated by PipelineDB, so that you can sort of have the best of both worlds, where you have a constant view of what the current value is, but you still have the ability to go back and do ad hoc analyses after the fact, rather than discarding the data or having to pipe it twice into the same database or use them as separate databases?
[00:16:22] Unknown:
Right. Yeah. Fundamentally, that's pretty much what everybody wants to do with the two systems together, as far as I understand it. And in a lot of cases, I think that would work really well. For a lot of our users, storing the granular data is just not even an option. It's not that they don't wanna do it. It's that they can't, because they're just dealing with too much data. And so a lot of our biggest users, they'll keep it somewhere. I mean, you're not gonna throw away the granular data, but they'll store it somewhere cheap like S3, for example. And then when they wanna run ad hoc queries on it, they'll use something like Hive or Redshift to do batch queries, because those don't have to have real-time latency. You know, generally speaking, if you don't know the query you wanna run in advance, then it probably doesn't have a real-time requirement anyways. So our biggest users do things that way. But for a lot of use cases, storing all the granular data probably would be feasible. And I think in those cases, it would be really useful to have both of these extensions work together.
[00:17:24] Unknown:
And in terms of the ingestion capabilities for PipelineDB, I'm curious what some of the common patterns are for being able to create and populate the data streams that PipelineDB is processing?
[00:17:37] Unknown:
Yeah. So in terms of creating the streams, that's a super simple SQL statement. Basically, creating a stream is essentially the same as creating a table. So you just create it with some columns and then the stream exists. In terms of getting data into the system, that's also pretty simple. You just use regular SQL insert statements or copy. The bigger deployments are almost always using some sort of message bus like Kafka or AWS Kinesis, and we have connectors for both of those messaging buses that connect directly to PipelineDB.
So for example, with the Kafka connector, you basically point it at your Kafka cluster and map a Kafka topic to a PipelineDB stream. And then you just tell it to go, and it will start consuming data into PipelineDB from Kafka. Another important observation in terms of how people are ingesting data into PipelineDB, I would say, is that because Postgres has such rich support and functionality for JSON, a lot of users are actually creating these streams with only one column, and it's just a JSON type. So each stream event is just, like, a JSON dictionary, for example, that gets written to that stream, and then all of their continuous queries unpack that schemaless event using Postgres' rich and mature JSON support for things like pulling out object keys and unpacking arrays and things like that. So in general, that's how most people are getting data into the system.
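A hedged sketch of that single-JSON-column pattern, again in the pre-1.0 syntax and assuming a jsonb column as the JSON discussion suggests; the event shape and names are made up for illustration:

```sql
-- One schemaless jsonb column per event.
CREATE STREAM events (payload jsonb);

-- Unpack the JSON with standard Postgres operators inside the continuous query.
CREATE CONTINUOUS VIEW clicks_per_minute AS
  SELECT payload->>'button' AS button,
         date_trunc('minute', (payload->>'ts')::timestamptz) AS minute,
         count(*) AS clicks
  FROM events
  WHERE payload->>'type' = 'click'
  GROUP BY button, minute;

INSERT INTO events (payload)
  VALUES ('{"type": "click", "button": "signup", "ts": "2018-11-01T12:00:03Z"}');
```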
[00:19:20] Unknown:
And so I think the in-memory stream table is definitely a very interesting abstraction, and I'm wondering if you can dig a bit deeper into how PipelineDB overall is architected and implemented, and some of the evolution of the system since you first began working on it.
[00:19:39] Unknown:
Yeah. Definitely. And I'm sure Usman could add to whatever I have here, because he was a big part of how this part of the system developed. Let's see, in the beginning we had what we called a stream buffer, which was a chunk of shared memory — and shared memory is just a region of memory on a system that multiple separate processes can interact with. So as new data came into the system, the process writing that event to the system would write to this stream buffer, and then other processes would essentially read from it. And as all of the events were read from the stream buffer, they would get marked as read by all the continuous queries that needed to read them, and then that space would be reclaimed by new incoming events. So it was more of a circular sort of queue that would just constantly get filled up and emptied as new data came into the system and as continuous queries read those events. So that was, like, the very, very early implementation of the in-memory representation of streams. And then it got much, much better. And I think, Usman, you might wanna explain this; you'd probably do a better job than me. We moved to using an external dependency. We went for as long as possible without depending on any external libraries in our code, so that it was as easy to install as possible and you wouldn't have to install anything else. But this was such a critical part of how the system works that we decided to go with an external messaging library called ZeroMQ.
[00:21:25] Unknown:
Yeah. We actually played around with this other library called nanomsg before — remember that? Yeah. So as Derek was saying, this in-memory stream buffer abstraction was kinda pretty important early on. But as we added more parallelism, or more concurrency, to the system, we started realizing it was actually pretty difficult to maintain a high-performing and bug-free implementation of this in-memory stream buffer. And so then it was like, okay, maybe we should investigate other messaging libraries. Because essentially, at the heart of it, if you think about it, you need some sort of messaging abstraction. You have incoming data coming in from potentially multiple processes, because depending on how many insert threads or how many insert processes you have, that's kind of the number of concurrent writers you have.
And then within PipelineDB, we have two kind of layers of processes called workers and combiners, where workers do some initial work on the incoming streaming data, and then they pass the output to another set of processes called combiners. And those combiners do some more processing and then eventually persist it to disk. And there's crosstalk between all these processes. So any insert process could potentially write to any worker, and then any worker can potentially write to any combiner. So there's a lot of crosstalk happening. And so, eventually — we started off with a stream buffer, then we made, like, a sharded stream buffer, and kinda eventually just gave up on the idea and started using nanomsg, which is a messaging library made by the same guy who made ZeroMQ, I believe. We stuck with that for a while, but eventually gave that up for ZeroMQ when we started working on PipelineDB Enterprise, because — I forget exactly why, but the nice thing that ZeroMQ gave us was that you could talk to a local process or a remote process with the same abstraction. And so we could potentially talk to a combiner on a completely different node from a worker process as if we were talking to a combiner on the local machine. So it kinda gave us this idea that even if you have 10 nodes with PipelineDB running on them, you could kind of think of it as a single giant node with all those combiner processes running on it.
[00:23:43] Unknown:
And what are the capacities and options for being able to scale a PipelineDB system, both vertically by adding more system capacity and horizontally by adding new instances to the cluster?
[00:23:59] Unknown:
Yeah. So the open source PipelineDB core is, and always has been, designed to run on a single server. And that doesn't necessarily mean that the open source core is like a toy or a trial version or anything like that. It's absolutely not. And the reason is that we have this system that allows you to perform all of this computation without ever storing any raw data at all. So that allows you to achieve a level of performance and scalability on a single server that you would never even be able to come close to with a system that has to store all of the granular data.
So the open source core runs on a single server. You can add multiple read slaves for scaling out reads by using Postgres replication, for example. So you can have multiple nodes: all the data is coming into one node, and then all the changes being made as a result of that data are replicated out to all these other slaves, and you can fan your read workload out to all of those. In terms of scaling writes, that's where our business model comes in. We have a commercial extension to the open source PipelineDB core that adds horizontal scaling support to the system. So you can create PipelineDB clusters with this extension that shard the write workload, and the read workload actually, out to multiple nodes, and the whole thing still just behaves as one large logical PipelineDB deployment. So there's nothing new about the protocol or the interface or anything like that. It's still just an endpoint that your client connects to, except the workload and the read and write requests that are happening get sharded out to multiple nodes, which makes the system much more scalable. And that's how we make money. Primarily we sell licenses to the clustering engine to companies that really need just a huge level of scale that they haven't been able to achieve with one instance.
And the commercial extension also adds high availability support and failover. So it has things like replication: you can lose multiple nodes and the cluster will continue operating without any loss of service or interruption. So that would be the primary way that people could scale.
[00:26:20] Unknown:
And even when it comes to vertical scaling, PipelineDB actually can scale vertically tremendously, just because of the system architecture. A lot of the parallelism happens in a manner where — like, typically, when you're trying to scale up any system, if there's a lot of contention on the data that processes are trying to write to at the same time, that's kind of the main bottleneck. Otherwise, the CPU cores can work independently and perform at the max speed possible. The way PipelineDB is architected, as I previously mentioned, there's two groups of processes that are important: one is called the workers, and one is called the combiners. And the workers don't share anything between each other, so they can literally go as fast as possible. And the combiners are the ones which may be bottlenecked on writes, but the way it's architected is that two combiners will never write to the same row, and so that ensures that we don't have a lot of write contention either. So if you just get, like, a node with 64 cores or something, PipelineDB will be very efficient in making sure that it utilizes those 64 cores, whereas a lot of systems traditionally may or may not be able to do that. Yeah. That's another really good point.
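Sizing the worker and combiner process pools is the main knob for that kind of vertical scaling. A hedged configuration sketch, using the GUC names as I recall them (continuous_query_num_workers/num_combiners pre-1.0, pipelinedb.num_workers/num_combiners in the 1.0 extension) — verify against your version's docs, and note a restart is typically required:

```sql
-- Spread continuous query work across more cores (1.0-style names assumed here).
ALTER SYSTEM SET pipelinedb.num_workers = 8;    -- parallel worker processes per database
ALTER SYSTEM SET pipelinedb.num_combiners = 8;  -- parallel combiner processes per database
-- Then restart PostgreSQL for the new process counts to take effect.
```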
[00:27:33] Unknown:
Given that the inbound data is not persisted to disk until after it's gone through the workers and combiners, I'm wondering what capacity there is for guarding against data loss in the event of a power failure or system reboot or something like that, where you don't necessarily have the advantage of the write-ahead log to be able to repopulate that stream, and some of the strategies for being able to mitigate that loss?
[00:28:00] Unknown:
Yeah. That's a really good question. So all of the interaction with disk by the combiner processes is still fully transactional. So if a combiner has updated the row or rows that it needs to and that succeeds, the result is durable. So it will survive a power failure just like any other row in a regular Postgres table. The area for potential data loss is the short period of time where a set of rows written to a stream are currently being processed by the worker. They're basically in between the worker and the combiner's final stage of writing to disk. And the number of events that can be in that batch is a tunable size, and that's basically the largest amount of data that you could potentially lose, because if there's a power failure between the time a batch of events is being processed by the worker and the time it's been persisted to disk, those events will be lost.
But you can tune the size of that batch depending on your requirements and how much data you're potentially willing to lose. And then after that point, writes will just fail, because the system would be down. So clients would never be able to write to the system at that point and think that they succeeded. They would be notified that they failed. So, basically, one batch is the amount of data that you can potentially lose.
[00:29:27] Unknown:
And so is the information in the stream tables maintained with a write ahead log, or is that purely in memory and that would be lost as well?
[00:29:37] Unknown:
So, yeah, all writes to streams are not persisted to a write-ahead log anywhere. They're just never written to disk. The only thing that's persisted to the write-ahead log is the changes being made to the on-disk aggregate output maintained by the combiners. But, yes, stream events by design are never written to the write-ahead log.
[00:30:02] Unknown:
Yeah. PipelineDB, however, does offer, like, a few different modes of inserting into streams, and Derek can probably speak more about it because I'm sure there's more work that has happened on this. But when you insert into a stream — similar to how there are consistency levels in databases, there are also kind of ack levels that we have in PipelineDB, where you can insert into a stream and be like, okay, just insert and forget about it, or insert and wait till a worker reads it, or insert and wait till a combiner has actually processed it and synced and, like, written it to disk. And so there are a few tuning parameters there where, depending on how much data loss you're comfortable with, you can tune that kind of consistency level up or down. Obviously, the more you push it up, the lower your performance will get. That's just kinda an artifact of how the system works.
But there are tuning parameters in place for users to be able to kind of determine ahead of time, hey, this is the data loss that I'm comfortable with versus not.
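Those ack levels correspond to a single setting; a hedged sketch, assuming the stream_insert_level GUC and its values (async, sync_receive, sync_commit) from the pre-1.0 docs (prefixed with pipelinedb. in the 1.0 extension) — check your version's docs for the exact names:

```sql
-- async        : fire and forget (fastest; an in-flight batch can be lost on a crash)
-- sync_receive : wait until a worker has received the events (the default, as I recall)
-- sync_commit  : wait until a combiner has committed the result to disk (strongest, slowest)
SET stream_insert_level = 'sync_commit';
INSERT INTO page_views (url, user_id, viewed_at) VALUES ('/pricing', 42, now());
```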
[00:31:04] Unknown:
Yeah. That's a good point. So there are configuration parameters you can use to ensure that all writes to all streams are fully transactional. They either fail or succeed completely, and there'll never be any silent data loss. Most of our users, as far as I know, actually elect not to use that level of stream insertion consistency because it just lowers performance. And they just — users generally don't mind losing a small batch of events. And this is a pretty rare occurrence anyways. You know, we have huge users who've been running PipelineDB for months or years with no crashes or failures or anything. So the idea of losing a batch of events due to a power failure or crash is actually pretty rare. So our users are generally fine with taking that very, very tiny level of risk, which is one small batch of events. And also because they're writing so many events to PipelineDB that losing one batch is not even a big deal. It's like a rounding error. You know, a lot of users are writing tens or hundreds of thousands, some even millions, of events per second.
And by default, one micro-batch is 10,000 events. So it's relatively negligible.
[00:32:21] Unknown:
And for the input stream table spaces, is it possible for any other processes to read from those or is that purely the domain of the workers? And, also, is it possible to use another table that's already existing on disk as an input to a stream?
[00:32:39] Unknown:
So I'll answer the second part of your question first. To use a regular table as a stream — this kind of goes back to the previous discussion about integration between extensions — you would use the primitives given to you by Postgres. So you could have a logical decoding process, for example, that listens to all of the changes being made to that table and then writes that result out to the stream, which is actually pretty easy and works fairly well. In terms of the first part of your question, what else can read from streams: the main abstraction we've talked about so far, these kind of continuous aggregations, that's done by what we call a continuous view.
And a continuous view basically always aggregates data and then outputs the result to disk. The other fundamental abstraction that PipelineDB provides is called a continuous transform. A transform is different from a continuous view in that it doesn't persist anything to disk. All it does is read from a stream, apply whatever query you've used to define it to each event in the stream, and then output that transformation to another stream. So you can also read from streams with those, and then you can chain all of these things together into arbitrary topologies of continuous computation. You could think of this topology as like a directed acyclic graph of continuous SQL queries, because continuous views actually output the incremental changes being made to the tables that back them. Those changes are output to another stream.
So you could have arbitrary combinations of continuous views and transforms, all outputting stuff to different streams, connecting together, and performing pretty complex operations. And people have done some really interesting things and come up with different ways to combine these abstractions in pretty clever ways, which is pretty interesting. But the important thing is that these abstractions provide enough flexibility to do pretty much anything you want.
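A hedged sketch of that chaining, in the pre-1.0 syntax: a continuous transform that persists nothing, forwarding matching events into a second stream via the bundled pipeline_stream_insert trigger function (renamed in the 1.0 extension, if I recall correctly), with a continuous view aggregating the derived stream; all names are illustrative. A regular table could similarly feed the first stream through an ordinary trigger or logical decoding, as Derek notes.

```sql
CREATE STREAM api_requests (payload jsonb);
CREATE STREAM slow_requests (endpoint text, latency_ms int, ts timestamptz);

-- Transform: no state persisted, output forwarded into another stream.
CREATE CONTINUOUS TRANSFORM flag_slow AS
  SELECT payload->>'endpoint' AS endpoint,
         (payload->>'latency_ms')::int AS latency_ms,
         (payload->>'ts')::timestamptz AS ts
  FROM api_requests
  WHERE (payload->>'latency_ms')::int > 500
  THEN EXECUTE PROCEDURE pipeline_stream_insert('slow_requests');

-- View: one link further down the DAG, aggregating the derived stream.
CREATE CONTINUOUS VIEW slow_by_endpoint AS
  SELECT endpoint, date_trunc('minute', ts) AS minute, count(*) AS slow_count
  FROM slow_requests
  GROUP BY endpoint, minute;
```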
[00:34:53] Unknown:
And also just to clarify, I believe that you alluded to the fact earlier that it's possible to have multiple continuous queries running against a single stream so that you can have different aggregates computed on the same source
[00:35:07] Unknown:
data? Yes. Absolutely. So that's the norm in terms of how people use the system. There typically will be a small number of streams and a large number of continuous views and transforms. Each event written to a stream has a small bitmap associated with it, and each slot in the bitmap corresponds to each of the continuous queries that need to read it. And as soon as each continuous query has read this event, it flips its bit. And once all those bits are set to zero, then the event is discarded. But, yeah, absolutely, the norm is to have a small number of streams fanning out to a large number of continuous queries.
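That fan-out is just multiple continuous views declared against the same stream; a minimal sketch reusing the hypothetical page_views stream from earlier:

```sql
-- Two independent aggregates maintained from one stream; each event is dropped
-- only after every registered continuous query has consumed it.
CREATE CONTINUOUS VIEW views_by_url AS
  SELECT url, count(*) AS views FROM page_views GROUP BY url;

CREATE CONTINUOUS VIEW views_by_user AS
  SELECT user_id, count(*) AS views FROM page_views GROUP BY user_id;
```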
[00:35:51] Unknown:
And since the data being processed by the continuous queries is potentially unbounded as the new data continues to stream through, I'm curious what your approach is as far as checkpointing or windowing the data to determine when it should be persisted to disk.
[00:36:10] Unknown:
Yeah. So this is determined by a couple of different tunable parameters — and Usman, let me know if I miss anything. So the first one, I think I've already mentioned: that's the batch size. The worker processes that are reading from streams will read that batch size of events, run the continuous aggregation on it, and then output that to the combiner process. So that's the first parameter that's relevant here. If you have a smaller batch size, you're gonna be executing these worker plans more frequently, and vice versa.
And the second configuration parameter is how often the combiner process will commit to disk. So you can tell the combiner process to keep taking these batches from the worker and continue aggregating them in memory, and then every so often commit that to disk. And that defaults to something like 50 milliseconds, I think, which is pretty often, obviously. And then a lot of larger users will widen that commit interval to something on the order of seconds, so that the combiner just keeps aggregating in memory for as long as possible and then only commits the results to disk every second or so.
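A hedged sketch of those two knobs, using the GUC names as I recall them from the 1.0 extension (pipelinedb.batch_size and pipelinedb.commit_interval; older releases used continuous_query_* equivalents) — treat the exact names and units as assumptions to verify against your version's docs:

```sql
ALTER SYSTEM SET pipelinedb.batch_size = 50000;      -- max events per worker micro-batch
ALTER SYSTEM SET pipelinedb.commit_interval = 1000;  -- combiner commit interval in ms (default ~50)
SELECT pg_reload_conf();
```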
[00:37:27] Unknown:
And with the extent of the capabilities that PipelineDB offers, I'm sure that there are some either niche features or some very high level features that people might not necessarily go to immediately. So I'm curious if there are any of those that you have found to be particularly useful that are often overlooked initially.
[00:37:48] Unknown:
Yeah. I would say what immediately comes to mind is some of the more advanced aggregates that we provide that use fairly interesting probabilistic data structures, like Bloom filters, filtered space-saving, and things like that. And in the previous versions of PipelineDB, before we hit 1.0, they were kind of named after the actual data structures that powered them. So filtered space-saving, for example, is used to compute top-k, like heavy hitters. That's something where you wanna say, what are my top traffic sources? Just keep the top 10 in memory or something like that. But that one was called something like fss_agg, for example. And I think a lot of people weren't particularly familiar with what these data structures were and how they could actually be helpful to them. And so I think we probably alienated some users by not giving those things more intuitive names. So we actually renamed them in the recent 1.0 release. We renamed the aggregates and functions that use these data structures based on what they're actually for. So fss_agg is now called topk_agg. And there are several others that follow the same example. So I think a lot of people probably had no idea that these things could be extremely useful for them, and we're seeing more people use these aggregates now that they're named something more intuitive.
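A hedged sketch of the renamed top-k aggregate, using the 1.0 extension syntax since that's where the rename landed; topk_agg is the name Derek gives, while the unpacking function shown here (topk) is from memory and should be checked against the docs for your version:

```sql
-- Keep only the 10 heaviest-hit URLs per day; the underlying filtered-space-saving
-- summary stays small no matter how many distinct URLs flow through the stream.
CREATE VIEW top_urls_daily WITH (action=materialize) AS
  SELECT date_trunc('day', arrival_timestamp) AS day,
         topk_agg(url, 10) AS top10
  FROM page_views
  GROUP BY day;

-- Read side: unpack the summary into value/frequency pairs.
SELECT day, topk(top10) FROM top_urls_daily;
```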
I would say the other thing that is very powerful that users may not know about — and this really comes from Postgres more than PipelineDB — is the capability of implementing user-defined functions to include in your continuous views and transforms that can do whatever you want, basically. If there's some computation that you need to run on your streams that you can't accomplish with the out-of-the-box functionality that PipelineDB gives you, you can fairly easily just write functions in any programming language you want and then start using those in your continuous queries. And it's actually super easy. I think a lot of people are probably intimidated by that, even though it's actually really easy to do with Postgres. Postgres has done a fantastic job of making that interface really usable for people. But I think a lot of users don't really consider taking that approach. So, Usman, did I miss anything there? Yeah. I think one more thing which is, like, a pretty
[00:40:24] Unknown:
cool abstraction — and usually, if you try to do it with any other system, it's actually a very complex problem — is the ability to form, like, DAGs or nested computations. So Derek briefly mentioned that there are two different types of primitives that PipelineDB offers. One is a continuous transform, which doesn't really persist anything to disk. It's kinda just this thing where a stream comes in and a stream comes out. And then there's a view, where a stream comes in and it persists some state to the database. So you can think of it as, like, a sink.
But there's another cool thing that these views also output, called — I think we used to call it an output stream or a delta stream; I'm not sure what the exact nomenclature now is. But the idea is that every single time something changes in a continuous view — so for example, if you are doing something like, hey, I wanna count the number of times this button is clicked every minute — every single time that count changes, say it changed from 5 to 10, a new event will be emitted in the output stream of that view, which will say the count changed from 5 to 10. And then that output stream can further be consumed by more transforms and views. So you can kinda make this giant tree of computation. And I think that's something that a lot of people, when they're thinking about how to design their queries, don't think about in that form. They'll just kinda have, like, a flat topology, where they'll just create, like, 10 different views, where you could potentially make it a little bit more efficient by creating maybe two views, and then have two views which are kinda nested underneath those, and then maybe a final view which actually reads their output, something like that. So I think that's just a little bit more of a complex way of thinking about things, but it does make the system a lot more efficient, because you are reducing the amount of compute that you have to do. Yeah. That's a really good point. I'm glad you raised delta streams. I'll just make a quick note on delta streams.
[00:42:25] Unknown:
These things are very cool. As Usman said, each change made to a continuous view is essentially emitted into a new stream. So you can think of the stream like, okay, plus 2, plus 1 — we just changed the count from 3 to 4, from 4 to 5 — and you can consume that stream of changes to ultimately change the aggregation level of an upstream continuous view. This is a very, very common pattern with PipelineDB that people use. So a lot of users will say, okay, start aggregating into one-minute buckets, so for each minute I have a row with all the aggregate data. But then that table could get pretty big, obviously, over the course of, you know, weeks or months. So they'll say, okay, set the time to live on that upstream continuous view to a week. Any data older than a week will be automatically purged in the background, and then you could consume its delta stream to aggregate that data into a coarser aggregation level like day or month, for example, which would allow you to keep basically the same information over a longer term without letting the initial upstream continuous view grow without bound.
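A hedged sketch of that minute-to-coarser rollup in the pre-1.0 syntax, assuming the ttl/ttl_column options and the output_of() stream source work as I recall from the docs; the old/new row images in the output stream let the downstream view sum just the increments:

```sql
-- Minutely page view counts, kept for one week via TTL.
CREATE CONTINUOUS VIEW views_minutely WITH (ttl = '1 week', ttl_column = 'minute') AS
  SELECT date_trunc('minute', arrival_timestamp) AS minute, count(*) AS views
  FROM page_views
  GROUP BY minute;

-- Hourly rollup driven by the minutely view's output (delta) stream: summing
-- new minus old reproduces the count at the coarser granularity, and this view
-- can be kept indefinitely.
CREATE CONTINUOUS VIEW views_hourly AS
  SELECT date_trunc('hour', (new).minute) AS hour,
         sum((new).views - coalesce((old).views, 0)) AS views
  FROM output_of('views_minutely')
  GROUP BY hour;
```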
[00:43:40] Unknown:
And that seems like it would be very useful for people who are leveraging PipelineDB for a systems monitoring type of use case, and it also brings some possibilities in terms of alerting capability, where you're using the continuous view to keep the current value updated, but then you can also use the delta stream. For instance, if you're computing it every minute, compute another continuous aggregate over the delta stream for the past 10 minutes or something, so that you can do something like a standard deviation, where if one of the values in that delta stream is outside of a normal bound, you use that to, you know, trigger some function which might generate an alert.
[00:44:19] Unknown:
Right. Yeah. Exactly. So I would say that probably 75 or 80% of the PipelineDB use cases are in the areas that I mentioned before, powering real-time analytics dashboards, things like that. And then the remaining portion of the use cases are pretty much exactly what you just outlined, doing things like real-time alerting and monitoring and things like that. So, yeah, that's absolutely something that users do.
[00:44:46] Unknown:
And in terms of consuming the output of the continuous views in real time, I'm wondering if there are any built-in functions of PipelineDB for being able to push that information to the consumers, or if it's something where you would just poll that table on whatever interval?
[00:45:09] Unknown:
Like, let's think about a dashboard, for example. The user sets a date range, and then that would issue a query to PipelineDB, and then the result would just be rendered. So it's more on demand. I wouldn't necessarily say that people, like, continuously poll PipelineDB. It's more so that whenever they wanna know the most up-to-date result, they'll just issue a query to the system, whether that's a dashboard or a person or whatever, because all of the aggregate data is just always up to date. So every time you query it, you'll get the most up-to-date result. And it's usually very fast and low latency, because you're not scanning and aggregating all this granular data over and over again on demand. So that's the primary way that people get data out of the system.
And then there are ways to push data out to consumers also. This kinda goes back to the user-defined functions topic that I briefly talked about. You know, people have written their own functions that push data out to a Redis pub/sub channel, for example, that some other application is connected to, to subscribe to the most recent results and things like that. And then, for example, our Kafka connector that people use to natively connect PipelineDB to Kafka — it's primarily used for consuming data from Kafka into a PipelineDB stream, but it also has the capability to push data back into Kafka. So, you know, you could achieve an alert by monitoring the output of a continuous view in the way that you just described, and then when the value hits some certain threshold, push an event out to Kafka, which would be consumed by some other client somewhere.
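One hedged way to wire up that kind of push-style alert without any external component is a continuous transform over a view's output stream whose trigger function calls plain PostgreSQL NOTIFY; the threshold, names, and the reuse of the views_minutely sketch from earlier are all illustrative:

```sql
CREATE OR REPLACE FUNCTION alert_on_spike() RETURNS trigger AS $$
BEGIN
  -- NEW holds the transform's output row; any client LISTENing on 'alerts' sees it.
  PERFORM pg_notify('alerts',
                    format('views spiked to %s at %s', NEW.views, NEW.minute));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE CONTINUOUS TRANSFORM view_spike_alert AS
  SELECT (new).minute AS minute, (new).views AS views
  FROM output_of('views_minutely')
  WHERE (new).views > 10000
  THEN EXECUTE PROCEDURE alert_on_spike();
```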
[00:46:56] Unknown:
And in the process of building and maintaining and growing the pipeline DB project and business, I'm wondering what you found to be some of the most challenging aspects and some of the most interesting or unexpected lessons learned in the process.
[00:47:11] Unknown:
I mean, I think building this kind of infrastructure is inherently very, very challenging, which makes hiring extremely challenging as well. So that's a huge challenge. And then, specifically in terms of the type of thing that PipelineDB is, this continuous processing system, that's an extremely difficult problem to solve, because the system is always running and it just can't fail. So anytime there have been, like, bugs or issues in the past with a big user, it becomes immediately clear, because it kind of cascades across the whole system, because it's just constantly running every millisecond of every second of every minute of every hour of every day. PipelineDB is running as fast as possible. And so it has to work really well and it has to be really efficient because of that.
Because it's not like more traditional systems that solve this problem, in that they're only really doing complex work when someone issues a query to them. So it's constantly, perpetually running, and that's a really, really hard problem to solve. And in terms of the surprising ways that people have used our technology, I think a lot of the usage has been pretty much exactly what we thought it would be, because we're solving a pretty generic problem in terms of aggregating data. That's just a widespread, ubiquitous problem across all organizations.
So a lot of it, I would say, hasn't been particularly surprising, but there are use cases here and there that do surprise us. I would say, like, one in recent memory is an embedded use case. So companies have some sort of product that they may deploy geographically, like a CDN, for example, and they wanna do some computation or aggregation on data being produced at the edges so that they don't have to send a bunch of granular data back to some central repository. And we're definitely starting to see use cases like that, which are actually fairly surprising, I would say.
Is there anything else I missed, Usman?
[00:49:23] Unknown:
I think one of the things, maybe, at least in the beginning — when we started building PipelineDB, we didn't focus that much on the time series aspect of things. It was kinda, like, an overall generic stream aggregator. But I think over time it became clear — which, I mean, just kinda makes sense — that a lot of the streaming data just ends up being time series data, because it's some sort of event stream, and so it does have a timestamp associated with it. So a lot of the kind of future work that happened in PipelineDB, like time to live, which says, hey, after a certain point just drop the data that's older than this time, or, like, sliding window queries — I think a lot of work went into that, whereas maybe we hadn't focused on it from the very get-go. But, like, in retrospect, it seems pretty obvious that that is what we should have expected.
[00:50:20] Unknown:
Yeah. That's actually a really good point. I'm glad you brought that part up. So I would say, overall, at this point I wouldn't call PipelineDB a stream processing engine or a streaming analytics engine or anything like that. It's absolutely a time series analytics engine. And the reason is that when you talk about things like stream processing, there's a particular capability in that discipline that I would call, you know, event-by-event ordered processing, where users wanna say, if this event happens, then this one, and then this one, do this. And most traditional stream processing systems will have that capability as a first-class feature. We never added anything to facilitate ordered event-by-event processing. We never wanted to. And so when you look at, like, a stream processing engine minus event-by-event ordered processing, what you end up with is time series analytics.
So that's definitely what PipelineDB has come to be: not a stream processing engine, but a time series aggregation engine.
[00:51:29] Unknown:
Yeah. And that brings up one other question too, in terms of the challenges of being able to process those unbounded, time-oriented streams of data: things along the lines of processing out-of-order events and figuring that out. But I imagine that's definitely something that's more in terms of the specific use cases of the user.
[00:51:52] Unknown:
Right. Yeah. What we found with PipelineDB is that there aren't many requirements to perform any sort of ordered event-by-event processing. Usually, if people have that requirement, we'll tell them that PipelineDB probably isn't the tool that they're looking for. Because if you're doing a lot of aggregation, you almost by definition don't care about ordered event processing. So given that all of our users and customers use PipelineDB for very high-throughput aggregation, there's never really a need for ordered event processing.
[00:52:30] Unknown:
And you already answered the question a little bit, but are there any other cases where pipeline DB would be the wrong choice and you would actively discourage someone from using it for a given use case?
[00:52:41] Unknown:
Yeah. Yeah. As as I said, definitely, strongly ordered event by event stream processing. Pipeline DB is not gonna be, the right system for the job in those cases. And then there there are a couple more, I would say. Obviously, if you're if you're not using SQL or if you're the continuous queries that you're running on this data can't be expressed within the abstractions given to you by SQL, then pipeline v is definitely not going to be a fit because it's all SQL based. And I would say if if you're doing if you're doing a lot of ad hoc querying, pipeline DB is not gonna be a fit either because as I've mentioned, you all of these aggregation queries, you have to continue you have to know them in advance, declare them in advance, and then the results just start updating from that point forward. So if you're running a lot of, ad hoc queries, then a system like that isn't gonna it's gonna be too restrictive for you. And so we would actively discourage people from using pipeline DB if they had a lot of ad hoc queries they wanted to run. And then by a lot, I mean, the vast majority of the queries running against the system would have to be ad hoc. You know, there's like an an analyst somewhere connected to the database, and every single query that she runs is different from the 1 before, then pipeline to view would be a little bit too restrictive.
I should also mention, though, that the output of continuous views, as we've discussed, is stored in regular SQL tables. So you can further query those however you want. You can run further aggregation queries on the output of a continuous view, or join the results with another table or continuous view. So people do a lot of further, final processing on the continuously updating output of continuous views. But if you really need to run a lot of unique ad hoc queries
[00:54:41] Unknown:
on all of your granular data, then PipelineDB is not going to be the right tool for the job. Yeah, and I think the other thing to mention here is that even when you're doing stream aggregation, it only makes sense if you're aggregating your data down by a lot. Say I have a stream where each aggregate row only summarizes on the order of one or two events; at that point the aggregation is basically not reducing the input size much. If I have a billion events coming in and I end up storing 900 million rows, it's not really doing any large aggregation. PipelineDB shines the most when the input size versus the output size differs by one or two orders of magnitude. I think that's when PipelineDB is extremely efficient, and that's really the crux use case for it.
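For readers who want to see what that kind of rollup looks like in practice, here is a minimal sketch using PipelineDB 1.0's extension-style syntax. The stream, view, and column names are hypothetical, and the exact DDL may differ between releases (earlier versions used CREATE STREAM and CREATE CONTINUOUS VIEW), so treat this as an illustration rather than copy-paste reference.

```sql
-- Hypothetical stream of page-view events. In PipelineDB 1.0, streams are
-- declared as foreign tables backed by the pipelinedb server.
CREATE FOREIGN TABLE page_views (
    ts      timestamptz,
    url     text,
    user_id bigint
) SERVER pipelinedb;

-- Continuous view that rolls raw events up to one row per URL per minute.
-- A very high-volume stream collapses into a comparatively small aggregate
-- table, which is the order-of-magnitude reduction described above.
CREATE VIEW page_views_per_minute WITH (action = materialize) AS
    SELECT url,
           date_trunc('minute', ts) AS minute,
           count(*) AS views,
           count(DISTINCT user_id) AS unique_users
    FROM page_views
    GROUP BY url, date_trunc('minute', ts);

-- The output is an ordinary table, so it can be queried or further
-- aggregated like any other relation.
SELECT url, sum(views) AS views_last_hour
FROM page_views_per_minute
WHERE minute >= now() - interval '1 hour'
GROUP BY url;
```

The point of the sketch is the shape of the workload: a flood of raw events flows in, while only one row per URL per minute is ever stored.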
[00:55:31] Unknown:
Derek, your comment reminded me of one of the other things that we didn't discuss earlier, which we don't necessarily need to dig into, but because the aggregations are being computed within the boundaries of your Postgres database, it also enables you to join that information to existing tables, so that you can enrich the events as they're coming in with additional facts and metadata that you already have stored from other tables and other applications. That's one of the other ways that it stands out from a generic stream processing engine.
[00:56:08] Unknown:
Right. Yeah. As you mentioned, because everything's happening on top of Postgres, you can include other relations, tables and views, as part of a continuous view's definition. So you can join a stream onto a table and then run your query on that joined result. The example you mentioned is spot on. What a lot of people will do is that their incoming events may have, say, a user ID, just a number, but they want to run queries on more than just the user ID. They might want to attach the username, the user's account number, or the user's geographical location, whatever the case may be, and they can use stream-table joins to achieve that.
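As a rough illustration of such a stream-table join, again with hypothetical names and assuming the same 1.0-style syntax as the earlier sketch, the enrichment described above might look like this:

```sql
-- Ordinary Postgres table holding user metadata that already lives in the database.
CREATE TABLE users (
    id       bigint PRIMARY KEY,
    username text,
    country  text
);

-- Hypothetical stream of raw click events that carry only a user id.
CREATE FOREIGN TABLE clicks (
    ts      timestamptz,
    user_id bigint
) SERVER pipelinedb;

-- Continuous view that enriches each incoming event by joining the stream onto
-- the users table, then aggregates clicks per country per hour.
CREATE VIEW clicks_by_country WITH (action = materialize) AS
    SELECT u.country,
           date_trunc('hour', c.ts) AS hour,
           count(*) AS clicks
    FROM clicks c
    JOIN users u ON u.id = c.user_id
    GROUP BY u.country, date_trunc('hour', c.ts);
```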
[00:56:57] Unknown:
And so now that you've hit the 1.0 milestone, I'm curious what you have planned for the future of PipelineDB.
[00:57:05] Unknown:
Yeah. We've been really excited to get the 1.0 release out. It's been less than 2 months since we released it, and 1.0 is already the most popular version of PipelineDB that people are running, which is crazy, because the other releases have been out for multiple years now. So I think packaging it up as an extension really just made things a lot easier for people. In terms of looking forward, the next major piece of functionality that we're excited about adding to PipelineDB is partitioning automation.
What that means is that currently, a continuous view's data is stored as just a single table, and when it gets really big, query performance can degrade over time as the continuous view continues to grow and grow. For ones that are really big, that can be problematic; nobody wants to see their query latency slowly increase over time. So what partitioning automation is going to do is structure continuous views as multiple separate tables. They'll still behave as one logical table, but they'll be physically stored on disk as many separate partitions of the same data. What this does is bound query performance by whatever the partition size is; query latency will never get worse than it is for a full partition. So users will be able to say, make the partition width one day, for example, and then as the continuous view fills up, as soon as the current partition has one full day of data, a new partition will be automatically created and populated moving forward. And when you query them, they'll be combined implicitly by the query engine. So, as I mentioned, it still behaves as one giant table, but it'll be physically partitioned on disk, which will be a huge performance win.
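For context, the kind of layout that automation would manage is what stock Postgres already exposes as declarative range partitioning. A hand-written sketch with illustrative names looks like the following; PipelineDB's eventual implementation details may of course differ.

```sql
-- One logical table, physically split into day-sized partitions.
CREATE TABLE metrics_by_minute (
    minute timestamptz NOT NULL,
    metric text,
    value  double precision
) PARTITION BY RANGE (minute);

-- One partition per day. Queries that filter on "minute" only touch the relevant
-- partitions, so latency is bounded by a single day's data rather than the full history.
CREATE TABLE metrics_20190101 PARTITION OF metrics_by_minute
    FOR VALUES FROM ('2019-01-01') TO ('2019-01-02');
CREATE TABLE metrics_20190102 PARTITION OF metrics_by_minute
    FOR VALUES FROM ('2019-01-02') TO ('2019-01-03');

-- The planner combines partitions implicitly, so it still queries as one table.
SELECT metric, avg(value)
FROM metrics_by_minute
WHERE minute >= '2019-01-02' AND minute < '2019-01-03'
GROUP BY metric;
```

The planned automation would create and attach partitions like these for continuous views automatically as data arrives, rather than requiring them to be declared by hand.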
And then the other thing I'll mention is integration with something like TimescaleDB. That's definitely in the works, and a lot of people have been asking for it. So I think we'll be really happy to get something out in that area moving forward as well.
[00:59:35] Unknown:
Are there any other aspects of PipelineDB, the work that you're doing, the business aspects of it, or anything along those lines that we didn't discuss yet which you think we should cover before we close out the show?
[00:59:47] Unknown:
I think things have been pretty well covered. Usman, can you think of anything?
[00:59:53] Unknown:
Well, for anyone who wants to get in touch with the both of you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:00:11] Unknown:
I can start off on that, I guess. So for me, one of the things about PipelineDB that I always thought was super exciting was that, in general, when we think about data management or data processing tools, the thing that's typically discussed is performance. Everyone's excited about performance: how fast is it? How many rows does it do per second? How much can it store? Does it scale? And the thing that's almost never mentioned is the usability of the tool. How easy is it for me to install? How easy is it for me to actually write a query? A lot of times, I'll see that even to deploy a new query, you have to write some code, some Java or Scala, deploy it to a server, do a release, and it ends up taking half a day. And I think that's the biggest thing for me. We've seen some trend recently with things like CockroachDB or PipelineDB, and I think RethinkDB was one of them also, and a few other new databases coming out, which have focused not just on the performance but also on the usability aspect of things: let's not just make it fast or give it a lot of features, but let's also make sure that it truly makes the lives of developers easy. They don't have to spend a lot of time deploying it, adding more functionality to it, adding a new query to it, or adding a new node to it. Let's just try to make it as self-serve as possible. We've seen a recent trend towards that, which is pretty good. But traditionally, that's probably the thing that's been the most annoying about this entire space: it's just very difficult for anyone to get started.
[01:01:54] Unknown:
Yeah. I can't think of anything I would add beyond that. I would agree that that's probably the biggest, most widespread gap in tools in the space, because otherwise, honestly, my take is that the state of the technology available in the space of data management is actually really good. There's been a lot of innovation over the years, particularly in recent years, and you can pretty much, at this point, find a tool for processing and managing data that's actually pretty good.
Most of them are highly specialized, because generally all these different systems are focused on doing one thing relatively well, and that's usually because in order to get scalability and performance, you typically have to sacrifice flexibility. The result is that there are tons and tons of very specialized tools out there. But honestly, I think the state of the art in this space has never been better, and there's something pretty good out there, probably many tools that are pretty good, for pretty much anything you want to do. And, yeah, as Usman mentioned, usability is the only real widespread deficiency that I would say exists.
[01:03:20] Unknown:
Well, thank you both for taking the time today to join me and discuss the work that you've done with PipelineDB. It's definitely a very interesting project, and one that fills a very useful space, particularly for people processing large volumes of data. So I appreciate you taking the time, I appreciate your efforts on the project, and I hope you enjoy the rest of your day. Thanks for having us. Likewise. Thanks for having us, Tobias.
Chapters
- Introduction to PipelineDB
- Motivation and Story Behind PipelineDB
- Use Cases and Industry Applications
- Architectural Concepts and Design Components
- Ingestion Capabilities and Data Streams
- Evolution and Implementation of PipelineDB
- Scaling PipelineDB
- Data Loss and Mitigation Strategies
- Reading and Transforming Data Streams
- Checkpointing and Windowing Data
- Challenges and Lessons Learned
- Future Plans for PipelineDB
- Biggest Gaps in Data Management Tools