Summary
Processing high-velocity time-series data in real time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what PipelineDB is and the motivation for creating it?
- What are the major use cases that it enables?
- What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
- What are the major concepts and components that users of PipelineDB should be familiar with?
- Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
- What are some of the common patterns for populating data streams?
- What are the options for scaling PipelineDB systems, both vertically and horizontally?
- How much elasticity does the system support in terms of changing volumes of inbound data?
- What are some of the limitations or edge cases that users should be aware of?
- Given that inbound data is not persisted to disk, how do you guard against data loss?
- Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
- Can a separate table be used as an input stream?
- Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
- What are some of the features that you have found to be the most useful which users might initially overlook?
- What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
- What are some of the most challenging aspects of building continuous aggregates on unbounded data?
- What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
- What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
- When is PipelineDB the wrong choice?
- What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- PipelineDB
- Stride
- PostgreSQL
- AdRoll
- Probabilistic Data Structures
- TimescaleDB
- Podcast Episode
- Hive
- Redshift
- Kafka
- Kinesis
- ZeroMQ
- Nanomsg
- HyperLogLog
- Bloom Filter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today, I'm interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL.
So, Usman, could you start by introducing yourself?
[00:01:09] Unknown:
Yeah. I'm Usman. I work as an engineer at Stripe. Previously, I was involved with PipelineDB as the chief architect and worked for a couple of years on the project.
[00:01:19] Unknown:
And, Derek, how about yourself?
[00:01:21] Unknown:
So I'm the founder and CEO of PipelineDB.
[00:01:27] Unknown:
And going back to you, Usman, do you remember how you first got involved in the area of data management?
[00:01:33] Unknown:
Yeah. I mean, I think for me, I kinda just randomly stumbled into it. In school, I focused a lot on systems-y stuff and then built a few data processing pipelines at my first job. And then when I was looking for a new job, I saw PipelineDB as a potential opportunity. I'd always wanted to work on an open source project or something where I could kinda give back to the community. It seemed like a cool project. So I talked to Derek and, you know, things kind of went from there.
[00:02:03] Unknown:
And, Derek, how did you first get involved in the area of data management?
[00:02:06] Unknown:
So it all started at the last company I worked at before I started this company. It was an advertising technology company called AdRoll. I started there when it was a very small company and it ended up growing really big. I was on the engineering team and I took over the responsibility for basically all of their data management infrastructure. And as the company grew and the scale grew explosively, we had to build a lot of things that didn't exist yet. A lot of new, interesting technology to handle just the huge scale involved in running an advertising company. And after doing that for a few years, I had learned a lot about products that I wish had existed, because we had to build them ourselves at AdRoll. And when I was ready to move on and do the next thing, I started PipelineDB based on my experience at AdRoll.
[00:02:57] Unknown:
And so can you give an overview of what Pipeline DB is and some of the motivation and story behind creating it?
[00:03:05] Unknown:
Yeah. Sure. So it really begins with the experience I had at AdRoll. As I mentioned, we were processing huge amounts of data. By the time I left, we were reaching more than half the Internet per day with ads. And we were analyzing all of that data in real time, which is a pretty hard problem to solve at that scale. But the key observation was that most of what we wanted to know about this data we were processing, most of the computations required to learn those things, were fairly straightforward. We were counting a lot of things, running aggregation queries, and things like that. And we built a lot of complex, custom application code that we used to achieve those things.
And there was really no off-the-shelf product that allowed you to say, okay, take this huge, massive, continuous stream of data and just run these simple aggregations on it as efficiently as possible. And that's essentially the problem that PipelineDB solves. Instead of requiring people to build their own custom application infrastructure or use some of the more complex stream processing infrastructure out there, it allows them to declare the aggregation queries they wanna run on these massive flows of data. And it then runs those calculations continuously in real time as data flows into the system. But the key innovation is that PipelineDB only stores the output of these aggregation queries. So another key observation we had at AdRoll was that the input dataset was much, much larger, by multiple orders of magnitude, than the output, because we were doing aggregations.
It starts out as a huge, voluminous stream of messy log data. You know, one event per ad served or one event per click served or whatever. And by the time it gets processed, it's much, much smaller than that. It's just aggregated down into a relatively small number of rows, you know, bucketed by hour or day or customer or user. And so when we looked at that huge differential between the size of the input and the size of the output, we saw an opportunity to build a system that just never had to store the raw, granular data.
And because the bottleneck in most database technologies is generally the disk, that allowed us to avoid an enormous amount of disk IO by just aggregating all this data in memory and only writing out the compact output to disk. And that makes PipelineDB extremely efficient, even running it on a single server deployment.
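To make the model concrete, here is a minimal sketch of that declare-in-advance pattern, written in the pre-1.0 syntax from the PipelineDB documentation (the 1.0 extension expresses the same thing with CREATE FOREIGN TABLE ... SERVER pipelinedb and CREATE VIEW ... WITH (action=materialize)); the stream and column names are hypothetical:

```sql
-- Hypothetical event stream; only the aggregate below is ever written to disk.
CREATE STREAM page_views (url text, user_id bigint, viewed_at timestamptz);

-- Declared up front, computed incrementally as events arrive.
CREATE CONTINUOUS VIEW views_per_hour AS
  SELECT url,
         date_trunc('hour', viewed_at) AS hour,
         count(*) AS view_count
  FROM page_views
  GROUP BY url, hour;

-- Writes are ordinary INSERTs (or COPY, or the Kafka/Kinesis connectors).
INSERT INTO page_views (url, user_id, viewed_at) VALUES ('/pricing', 42, now());

-- Reads are just normal queries against the always-up-to-date output.
SELECT * FROM views_per_hour WHERE url = '/pricing' ORDER BY hour DESC;
```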
[00:05:56] Unknown:
And later on, I definitely wanna get into some of the scaling aspects in terms of the system architecture that would be most beneficial for this deployment, and some of the discussion around discarding the raw data and some of the options there. But first, I'd like to cover what types of use cases this enables that would otherwise be impractical or intractable, and some of the main types of industries that would benefit most from the PipelineDB technology?
[00:06:26] Unknown:
So the primary area of use cases is anywhere there's, like, a real-time analytics dashboard being used at a company. PipelineDB is gonna be a really good fit for powering that, because the main drawback of using this technology, or trade-off rather, is that you have to know these queries in advance. You have to declare them in advance, and then as the data runs through them, the results are incrementally computed. So anywhere where you know the queries that you wanna run on these large streams of data in advance, PipelineDB is going to be a really good fit. And in our experience, we've seen that mostly in the area where there are reporting dashboards, real-time analytics dashboards, things like that. People are using PipelineDB to power those things on basically really small amounts of hardware.
And also with large amounts of input data, which before PipelineDB was around was a much harder problem to solve, for the reasons that I mentioned before: you had to build larger scale, usually custom, application infrastructure to do these aggregations in real time. PipelineDB just makes it really, really simple. So in terms of the things it enables companies to do that they wouldn't otherwise be able to do, it allows them to analyze more data faster, essentially. They're able to, in many cases, analyze data sets that they wouldn't be able to analyze at all before, or in real time. Software companies with deep technical expertise can build this infrastructure, and they do, themselves. So PipelineDB makes a lot of sense for companies that are producing a lot of data but are not necessarily software companies. They don't necessarily have the experience or the staff to build complex stream processing infrastructure. So it makes it much easier for them. And in terms of industries where we see PipelineDB being used heavily, it's a pretty broad cross section, honestly, because you can think of pretty much any company in the world nowadays.
Companies are more and more becoming data-driven companies, and so pretty much every company out there will have some sort of software infrastructure that's ultimately powering real-time analytics dashboards that other people in the company are using. So the set of companies using PipelineDB is extremely broad. We do see increased usage in a few different verticals. Ad tech is one of them, which isn't surprising because this was essentially conceived out of experience from ad tech. Telecom is another vertical where we see a lot of usage. And then beyond those two, it's a pretty broad cross section.
[00:09:19] Unknown:
And what are some of the major architectural concepts and design components that users of pipeline DB should be familiar with to be able to use it effectively?
[00:09:31] Unknown:
I think the first — and Usman, feel free to add to this if I miss anything — the first thing is to have a proficiency in SQL, I think, at a high level, because everything that you do with PipelineDB, you essentially do with SQL. So you'd wanna have some proficiency with SQL and know how that works. You would also want to have a good understanding of how PipelineDB scales across cores on a server. We built it to be highly, highly parallel, because stream processing is very CPU intensive: you're running all these computations on huge flows of data coming in. So you'd wanna have an understanding of how all the work required to run these continuous queries scales out over the CPUs on the system that you're running PipelineDB on, so that you can properly take advantage of all the hardware capacity available, I would say. Is there anything else I missed, Usman?
[00:10:35] Unknown:
Yeah. I think that covers most of it. I think the kind of high-level concept here — which Derek has mentioned before as well — is understanding whether your data can be efficiently summarized. And sometimes, in doing that, you might lose some level of accuracy. Like, for example, if I'm trying to count the number of distinct elements in a stream, if I'm trying to do it 100% accurately, I kinda have to store on the order of the unique elements; I have to store that information somewhere. But if I'm happy with a little bit of error, if I don't think that's that big of a deal, then you can use some kind of probabilistic data structures underneath the hood to approximate those answers. So PipelineDB internally will leverage a bunch of these probabilistic data structures automatically for you, so you don't even have to think about what's happening. But you still need to have the understanding that when you do summarize data and you're trying to do it efficiently, you might lose some level of accuracy. For the most part, what I've seen is that when you are dealing with such large volumes of data and you're trying to get aggregate information out of it, you don't really care if, like, you saw a million elements or a million and two. Like, that level of accuracy is not that important. You're not making, like, extremely hardcore business decisions. You're mostly just getting charts and trying to see directionally how things look.
[00:11:51] Unknown:
Yeah. That's a really good point on this question. As Usman said, a lot of the infrastructure that allows us to even run these continuous aggregations on essentially infinite streams of data uses probabilistic data structures. For example, an extremely common aggregate operation that users use PipelineDB for is count distinct, which means counting the number of distinct input elements. So this might have a use case in, you know, website analytics, if you wanna know for each hour how many distinct users viewed this URL. And that uses count distinct in PipelineDB. But to do that with perfect accuracy, as Usman said, you have to store all of the unique elements you've seen. And then when you get a new element coming in, you say, have I already added this? If not, then increase the distinct count. If so, don't do anything because it's not unique. And we use HyperLogLog for count distinct, which is a very compact data structure that allows you to estimate the cardinality of the input set with a very small memory footprint. And the trade-off is that it has a small margin of error. So most of the probabilistic data structures in PipelineDB, to give you a sense, have an error margin of usually less than 1%. Like, HyperLogLog is, I think, by default 99.2% accurate. And as Usman said, when you're dealing with huge, huge numbers, you don't really care if it's a million or a million and two. So that's the kind of error margin that users are generally okay with accepting with PipelineDB.
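For instance, the distinct-users-per-hour case Derek describes maps onto a single continuous view; here is a hedged sketch against the hypothetical page_views stream from the earlier example, with PipelineDB transparently backing count(DISTINCT ...) with a HyperLogLog per group:

```sql
-- Only a small HLL summary per (url, hour) group is stored, not every user_id,
-- so the result is an estimate with a sub-1% error margin.
CREATE CONTINUOUS VIEW uniques_per_hour AS
  SELECT url,
         date_trunc('hour', viewed_at) AS hour,
         count(DISTINCT user_id) AS unique_visitors
  FROM page_views
  GROUP BY url, hour;
```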
[00:13:41] Unknown:
So given the fact that PipelineDB is implemented as a plugin for PostgreSQL, and there are a number of other very interesting and useful plugins within that ecosystem, particularly some recent entries like Timescale and Citus, I'm wondering what level of compatibility exists between PipelineDB and some of those other useful plugins?
[00:14:09] Unknown:
So with the two that you mentioned, we get asked about those pretty often. And currently, there's no native integration with any other extension. So any kind of interoperability you would have between PipelineDB and another extension would be done through the primitives exposed by Postgres. Since they're all Postgres extensions, you could integrate them using whatever interfaces Postgres provides you, which are things like triggers, logical decoding, streaming replication, things like that. It's possible to integrate these extensions with those standard primitives.
And in the future, I think we'll almost certainly have a native integration with TimescaleDB, just because the use cases for TimescaleDB and for PipelineDB have a lot of overlap. They're both essentially time series processing engines. TimescaleDB is a bit different from ours in that it stores granular data and allows you to query it very efficiently, whereas we don't store granular data. But the use cases are very similar. So we basically solve a lot of those same problems with two fundamentally different approaches. So it makes a lot of sense for these two extensions to be integrated in the future. So I think we'll almost certainly have something in that area within the coming months. And in terms of Citus, there's not as much overlap in use cases as there is between us and TimescaleDB. So we don't have anything in the works in terms of integration with them, but it is something that we get asked about relatively frequently.
[00:15:52] Unknown:
Yeah. As I was looking through the documentation for PipelineDB, I was envisioning using Timescale for storing the high-granularity data after it's been pre-aggregated by PipelineDB, so that you can sort of have the best of both worlds, where you have a constant view of what the current value is, but you still have the ability to go back and do ad hoc analyses after the fact, rather than discarding the data or having to pipe it twice into the same database or use them as separate databases?
[00:16:22] Unknown:
Right. Yeah. Fundamentally, that's pretty much what everybody wants to do with the two systems together, as far as I understand it. And in a lot of cases, I think that would work really well. For a lot of our users, storing the granular data is just not even an option. It's not that they don't wanna do it. It's that they can't, because they're just dealing with too much data. And so a lot of our biggest users, they'll keep it somewhere. I mean, you're not gonna throw away the granular data, but they'll store it somewhere cheap like S3, for example. And then when they wanna run ad hoc queries on it, they'll use something like Hive or Redshift to do batch queries, because those don't have to have real-time latency. You know, generally speaking, if you don't know the query you wanna run in advance, then it probably doesn't have a real-time requirement anyways. So our biggest users do things that way. But for a lot of use cases, storing all the granular data probably would be feasible. And I think in those cases, it would be really useful to have both of these extensions work together.
[00:17:24] Unknown:
And in terms of the ingestion capabilities for PipelineDB, I'm curious what some of the common patterns are for being able to create and populate the data streams that PipelineDB is processing?
[00:17:37] Unknown:
Yeah. So in terms of creating the streams, that's a super simple SQL statement. Basically, creating a stream is essentially the same as creating a table. So you just create it with some columns and then the stream exists. In terms of getting data into the system, that's also pretty simple. You just use regular SQL insert statements or copy. The bigger deployments are almost always using some sort of message bus like Kafka or AWS Kinesis, and we have connectors for both of those messaging buses that connect directly to PipelineDB.
So for example, with the Kafka connector, you basically point it at your Kafka cluster and map a Kafka topic to a PipelineDB stream. And then you just tell it to go, and it will start consuming data into PipelineDB from Kafka. Another important observation in terms of how people are ingesting data into PipelineDB, I would say, is that because Postgres has such rich support and functionality for JSON, a lot of users are actually creating these streams with only one column, and it's just a JSON type. So each stream event is just, like, a JSON dictionary, for example, that gets written to that stream, and then all of their continuous queries unpack that schemaless event using Postgres' rich and mature JSON support for things like pulling out object keys and unpacking arrays and things like that. So in general, that's how most people are getting data into the system.
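A hedged sketch of that single-JSON-column pattern, again in the pre-1.0 syntax and assuming a jsonb column as the JSON discussion suggests; the event shape and names are made up for illustration:

```sql
-- One schemaless jsonb column per event.
CREATE STREAM events (payload jsonb);

-- Unpack the JSON with standard Postgres operators inside the continuous query.
CREATE CONTINUOUS VIEW clicks_per_minute AS
  SELECT payload->>'button' AS button,
         date_trunc('minute', (payload->>'ts')::timestamptz) AS minute,
         count(*) AS clicks
  FROM events
  WHERE payload->>'type' = 'click'
  GROUP BY button, minute;

INSERT INTO events (payload)
  VALUES ('{"type": "click", "button": "signup", "ts": "2018-11-01T12:00:03Z"}');
```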
[00:19:20] Unknown:
And so I think the in-memory stream table is definitely a very interesting abstraction, and I'm wondering if you can dig a bit deeper into how PipelineDB overall is architected and implemented, and some of the evolution of the system since you first began working on it.
[00:19:39] Unknown:
Yeah. Definitely. And I'm sure Usman could add to whatever I have here, because he was a big part of how this part of the system developed. Let's see, in the beginning we had what we called a stream buffer, which was a chunk of shared memory — and shared memory is just a region of memory on a system that multiple separate processes can interact with. So as new data came into the system, the process writing that event to the system would write to this stream buffer, and then other processes would essentially read from it. And as all of the events were read from the stream buffer, they would get marked as read by all the continuous queries that needed to read them, and then that space would be reclaimed by new incoming events. So it was more of a circular sort of queue that would just constantly get filled up and emptied as new data came into the system and as continuous queries read those events. So that was, like, the very, very early implementation of the in-memory representation of streams. And then it got much, much better. And I think, Usman, you might wanna explain this; you'd probably do a better job than me. We moved to using an external dependency. We went for as long as possible without depending on any external libraries in our code, so that it was as easy to install as possible and you wouldn't have to install anything else. But this was such a critical part of how the system works that we decided to go with an external messaging library called ZeroMQ.
[00:21:25] Unknown:
Yeah. We actually played around with this other library called nanomsg before — remember that? Yeah. So as Derek was saying, this in-memory stream buffer abstraction was kinda pretty important early on. But as we added more parallelism, or more concurrency, to the system, we started realizing it was actually pretty difficult to maintain a high-performing and bug-free implementation of this in-memory stream buffer. And so then it was like, okay, maybe we should investigate other messaging libraries. Because essentially, at the heart of it, if you think about it, you need some sort of messaging abstraction. You have incoming data coming in from potentially multiple processes, because depending on how many insert threads or how many insert processes you have, that's kind of the number of concurrent writers you have.
And then within PipelineDB, we have two kind of layers of processes called workers and combiners, where workers do some initial work on the incoming streaming data, and then they pass the output to another set of processes called combiners. And those combiners do some more processing and then eventually persist it to disk. And there's crosstalk between all these processes. So any insert process could potentially write to any worker, and then any worker can potentially write to any combiner. So there's a lot of crosstalk happening. And so, eventually — we started off with a stream buffer, then we made, like, a sharded stream buffer, and kinda eventually just gave up on the idea and started using nanomsg, which is a messaging library made by the same guy who made ZeroMQ, I believe. We stuck with that for a while, but eventually gave that up for ZeroMQ when we started working on PipelineDB Enterprise, because — I forget exactly why, but the nice thing that ZeroMQ gave us was that you could talk to a local process or a remote process with the same abstraction. And so we could potentially talk to a combiner on a completely different node from a worker process as if we were talking to a combiner on the local machine. So it kinda gave us this idea that even if you have 10 nodes with PipelineDB running on them, you could kind of think of it as a single giant node with all those combiner processes running on it.
[00:23:43] Unknown:
And what are the capacities and options for being able to scale a PipelineDB system, both vertically by adding more system capacity and horizontally by adding new instances to the cluster?
[00:23:59] Unknown:
Yeah. So the open source PipelineDB core is, and always has been, designed to run on a single server. And that doesn't necessarily mean that the open source core is like a toy or a trial version or anything like that. It's absolutely not. And the reason is that we have this system that allows you to perform all of this computation without ever storing any raw data at all. So that allows you to achieve a level of performance and scalability on a single server that you would never even be able to come close to with a system that has to store all of the granular data.
So the open source core runs on a single server. You can add multiple read slaves for scaling out reads by using Postgres replication, for example. So you can have multiple nodes: all the data is coming into one node, and then all the changes being made as a result of that data are replicated out to all these other slaves, and you can fan your read workload out to all of those. In terms of scaling writes, that's where our business model comes in. We have a commercial extension to the open source PipelineDB core that adds horizontal scaling support to the system. So you can create PipelineDB clusters with this extension that shard the write workload, and the read workload actually, out to multiple nodes, and the whole thing still just behaves as one large logical PipelineDB deployment. So there's nothing new about the protocol or the interface or anything like that. It's still just an endpoint that your client connects to, except the workload and the read and write requests that are happening get sharded out to multiple nodes, which makes the system much more scalable. And that's how we make money. Primarily we sell licenses to the clustering engine to companies that really need just a huge level of scale that they haven't been able to achieve with one instance.
And the commercial extension also adds high availability support and failover. So it has things like replication: you can lose multiple nodes and the cluster will continue operating without any loss of service or interruption. So that would be the primary way that people could scale.
[00:26:20] Unknown:
And even when it comes to vertical scaling, PipelineDB actually can scale vertically tremendously, just because of the system architecture. A lot of the parallelism happens in a manner where — like, typically, when you're trying to scale up any system, if there's a lot of contention on the data that processes are trying to write to at the same time, that's kind of the main bottleneck. Otherwise, the CPU cores can work independently and perform at the max speed possible. The way PipelineDB is architected, as I previously mentioned, there's two groups of processes that are important: one is called the workers, and one is called the combiners. And the workers don't share anything between each other, so they can literally go as fast as possible. And the combiners are the ones which may be bottlenecked on writes, but the way it's architected is that two combiners will never write to the same row, and so that ensures that we don't have a lot of write contention either. So if you just get, like, a node with 64 cores or something, PipelineDB will be very efficient in making sure that it utilizes those 64 cores, whereas a lot of systems traditionally may or may not be able to do that. Yeah. That's another really good point.
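Sizing the worker and combiner process pools is the main knob for that kind of vertical scaling. A hedged configuration sketch, using the GUC names as I recall them (continuous_query_num_workers/num_combiners pre-1.0, pipelinedb.num_workers/num_combiners in the 1.0 extension) — verify against your version's docs, and note a restart is typically required:

```sql
-- Spread continuous query work across more cores (1.0-style names assumed here).
ALTER SYSTEM SET pipelinedb.num_workers = 8;    -- parallel worker processes per database
ALTER SYSTEM SET pipelinedb.num_combiners = 8;  -- parallel combiner processes per database
-- Then restart PostgreSQL for the new process counts to take effect.
```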
[00:27:33] Unknown:
Given that the inbound data is not persisted to disk until after it's gone through the workers and combiners, I'm wondering what capacity there is for guarding against data loss in the event of a power failure or system reboot or something like that, where you don't necessarily have the advantage of the write-ahead log to be able to repopulate that stream, and some of the strategies for being able to mitigate that loss?
[00:28:00] Unknown:
Yeah. That's a really good question. So all of the interaction with disk by the combiner processes is still fully transactional. So if a combiner has updated the row or rows that it needs to and that succeeds, the result is durable. So it will survive a power failure just like any other row in a regular Postgres table. The area for potential data loss is the short period of time where a set of rows written to a stream are currently being processed by the worker. They're basically in between the worker and the combiner's final stage of writing to disk. And the number of events that can be in that batch is a tunable size, and that's basically the largest amount of data that you could potentially lose, because if there's a power failure between the time a batch of events is being processed by the worker and the time it's been persisted to disk, those events will be lost.
But you can tune the size of that batch depending on your requirements and how much data you're potentially willing to lose. And then after that point, writes will just fail, because the system would be down. So clients would never be able to write to the system at that point and think that they succeeded. They would be notified that they failed. So, basically, one batch is the amount of data that you can potentially lose.
[00:29:27] Unknown:
And so is the information in the stream tables maintained with a write ahead log, or is that purely in memory and that would be lost as well?
[00:29:37] Unknown:
So, yeah, all writes to streams are not persisted to a write-ahead log anywhere. They're just never written to disk. The only thing that's persisted to the write-ahead log is the changes being made to the on-disk aggregate output maintained by the combiners. But, yes, stream events by design are never written to the write-ahead log.
[00:30:02] Unknown:
Yeah. PipelineDB, however, does offer, like, a few different modes of inserting into streams, and Derek can probably speak more about it because I'm sure there's more work that has happened on this. But when you insert into a stream — similar to how there are consistency levels in databases, there are also kind of ack levels that we have in PipelineDB, where you can insert into a stream and be like, okay, just insert and forget about it, or insert and wait till a worker reads it, or insert and wait till a combiner has actually processed it and synced and, like, written it to disk. And so there are a few tuning parameters there where, depending on how much data loss you're comfortable with, you can tune that kind of consistency level up or down. Obviously, the more you push it up, the lower your performance will get. That's just kinda an artifact of how the system works.
But there are tuning parameters in place for users to be able to kind of determine ahead of time, hey, this is the data loss that I'm comfortable with versus not.
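Those ack levels correspond to a single setting; a hedged sketch, assuming the stream_insert_level GUC and its values (async, sync_receive, sync_commit) from the pre-1.0 docs (prefixed with pipelinedb. in the 1.0 extension) — check your version's docs for the exact names:

```sql
-- async        : fire and forget (fastest; an in-flight batch can be lost on a crash)
-- sync_receive : wait until a worker has received the events (the default, as I recall)
-- sync_commit  : wait until a combiner has committed the result to disk (strongest, slowest)
SET stream_insert_level = 'sync_commit';
INSERT INTO page_views (url, user_id, viewed_at) VALUES ('/pricing', 42, now());
```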
[00:31:04] Unknown:
Yeah. That's a good point. So there are configuration parameters you can use to ensure that all writes to all streams are fully transactional. They either fail or succeed completely, and there'll never be any silent data loss. Most of our users, as far as I know, actually elect not to use that level of stream insertion consistency because it just lowers performance. And they just — users generally don't mind losing a small batch of events. And this is a pretty rare occurrence anyways. You know, we have huge users who've been running PipelineDB for months or years with no crashes or failures or anything. So the idea of losing a batch of events due to a power failure or crash is actually pretty rare. So our users are generally fine with taking that very, very tiny level of risk, which is one small batch of events. And also because they're writing so many events to PipelineDB that losing one batch is not even a big deal. It's like a rounding error. You know, a lot of users are writing tens or hundreds of thousands, some even millions, of events per second.
And by default, one micro-batch is 10,000 events. So it's relatively negligible.
[00:32:21] Unknown:
And for the input stream table spaces, is it possible for any other processes to read from those or is that purely the domain of the workers? And, also, is it possible to use another table that's already existing on disk as an input to a stream?
[00:32:39] Unknown:
So I'll answer the second part of your question first. To use a regular table as a stream — this kind of goes back to the previous discussion about integration between extensions — you would use the primitives given to you by Postgres. So you could have a logical decoding process, for example, that listens to all of the changes being made to that table and then writes that result out to the stream, which is actually pretty easy and works fairly well. In terms of the first part of your question, what else can read from streams: the main abstraction we've talked about so far, these kind of continuous aggregations, that's done by what we call a continuous view.
And a continuous view basically always aggregates data and then outputs the result to disk. The other fundamental abstraction that PipelineDB provides is called a continuous transform. A transform is different from a continuous view in that it doesn't persist anything to disk. All it does is read from a stream, apply whatever query you've used to define it to each event in the stream, and then output that transformation to another stream. So you can also read from streams with those, and then you can chain all of these things together into arbitrary topologies of continuous computation. You could think of this topology as like a directed acyclic graph of continuous SQL queries, because continuous views actually output the incremental changes being made to the tables that back them. Those changes are output to another stream.
So you could have arbitrary combinations of continuous views and transforms, all outputting stuff to different streams, connecting together, and performing pretty complex operations. And people have done some really interesting things and come up with different ways to combine these abstractions in pretty clever ways, which is pretty interesting. But the important thing is that these abstractions provide enough flexibility to do pretty much anything you want.
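A hedged sketch of that chaining, in the pre-1.0 syntax: a continuous transform that persists nothing, forwarding matching events into a second stream via the bundled pipeline_stream_insert trigger function (renamed in the 1.0 extension, if I recall correctly), with a continuous view aggregating the derived stream; all names are illustrative. A regular table could similarly feed the first stream through an ordinary trigger or logical decoding, as Derek notes.

```sql
CREATE STREAM api_requests (payload jsonb);
CREATE STREAM slow_requests (endpoint text, latency_ms int, ts timestamptz);

-- Transform: no state persisted, output forwarded into another stream.
CREATE CONTINUOUS TRANSFORM flag_slow AS
  SELECT payload->>'endpoint' AS endpoint,
         (payload->>'latency_ms')::int AS latency_ms,
         (payload->>'ts')::timestamptz AS ts
  FROM api_requests
  WHERE (payload->>'latency_ms')::int > 500
  THEN EXECUTE PROCEDURE pipeline_stream_insert('slow_requests');

-- View: one link further down the DAG, aggregating the derived stream.
CREATE CONTINUOUS VIEW slow_by_endpoint AS
  SELECT endpoint, date_trunc('minute', ts) AS minute, count(*) AS slow_count
  FROM slow_requests
  GROUP BY endpoint, minute;
```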
[00:34:53] Unknown:
And also just to clarify, I believe that you alluded to the fact earlier that it's possible to have multiple continuous queries running against a single stream so that you can have different aggregates computed on the same source
[00:35:07] Unknown:
data? Yes. Absolutely. So that's the norm in terms of how people use the system. There typically will be a small number of streams and a large number of continuous views and transforms. Each event written to a stream has a small bitmap associated with it, and each slot in the bitmap corresponds to each of the continuous queries that need to read it. And as soon as each continuous query has read this event, it flips its bit. And once all those bits are set to zero, then the event is discarded. But, yeah, absolutely, the norm is to have a small number of streams fanning out to a large number of continuous queries.
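That fan-out is just multiple continuous views declared against the same stream; a minimal sketch reusing the hypothetical page_views stream from earlier:

```sql
-- Two independent aggregates maintained from one stream; each event is dropped
-- only after every registered continuous query has consumed it.
CREATE CONTINUOUS VIEW views_by_url AS
  SELECT url, count(*) AS views FROM page_views GROUP BY url;

CREATE CONTINUOUS VIEW views_by_user AS
  SELECT user_id, count(*) AS views FROM page_views GROUP BY user_id;
```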
[00:35:51] Unknown:
And since the data being processed by the continuous queries is potentially unbounded as the new data continues to stream through, I'm curious what your approach is as far as checkpointing or windowing the data to determine when it should be persisted to disk.
[00:36:10] Unknown:
Yeah. So this is determined by a couple of different tunable parameters — and Usman, let me know if I miss anything. So the first one, I think I've already mentioned: that's the batch size. The worker processes that are reading from streams will read that batch size of events, run the continuous aggregation on it, and then output that to the combiner process. So that's the first parameter that's relevant here. If you have a smaller batch size, you're gonna be executing these worker plans more frequently, and vice versa.
And the second configuration parameter is how often the combiner process will commit to disk. So you can tell the combiner process to keep taking these batches from the worker and continue aggregating them in memory, and then every so often commit that to disk. And that defaults to something like 50 milliseconds, I think, which is pretty often, obviously. And then a lot of larger users will widen that commit interval to something on the order of seconds, so that the combiner just keeps aggregating in memory for as long as possible and then only commits the results to disk every second or so.
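A hedged sketch of those two knobs, using the GUC names as I recall them from the 1.0 extension (pipelinedb.batch_size and pipelinedb.commit_interval; older releases used continuous_query_* equivalents) — treat the exact names and units as assumptions to verify against your version's docs:

```sql
ALTER SYSTEM SET pipelinedb.batch_size = 50000;      -- max events per worker micro-batch
ALTER SYSTEM SET pipelinedb.commit_interval = 1000;  -- combiner commit interval in ms (default ~50)
SELECT pg_reload_conf();
```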
[00:37:27] Unknown:
And with the extent of the capabilities that PipelineDB offers, I'm sure that there are some either niche features or some very high level features that people might not necessarily go to immediately. So I'm curious if there are any of those that you have found to be particularly useful that are often overlooked initially.
[00:37:48] Unknown:
Yeah. I would say what immediately comes to mind is some of the more advanced aggregates that we provide that use fairly interesting probabilistic data structures, like Bloom filters, filtered space-saving, and things like that. And in the previous versions of PipelineDB, before we hit 1.0, they were kind of named after the actual data structures that powered them. So filtered space-saving, for example, is used to compute top-k, like heavy hitters. That's something where you wanna say, what are my top traffic sources? Just keep the top 10 in memory or something like that. But that one was called something like fss_agg, for example. And I think a lot of people weren't particularly familiar with what these data structures were and how they could actually be helpful to them. And so I think we probably alienated some users by not giving those things more intuitive names. So we actually renamed them in the recent 1.0 release. We renamed the aggregates and functions that use these data structures based on what they're actually for. So fss_agg is now called topk_agg. And there are several others that follow the same example. So I think a lot of people probably had no idea that these things could be extremely useful for them, and we're seeing more people use these aggregates now that they're named something more intuitive.
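A hedged sketch of the renamed top-k aggregate, using the 1.0 extension syntax since that's where the rename landed; topk_agg is the name Derek gives, while the unpacking function shown here (topk) is from memory and should be checked against the docs for your version:

```sql
-- Keep only the 10 heaviest-hit URLs per day; the underlying filtered-space-saving
-- summary stays small no matter how many distinct URLs flow through the stream.
CREATE VIEW top_urls_daily WITH (action=materialize) AS
  SELECT date_trunc('day', arrival_timestamp) AS day,
         topk_agg(url, 10) AS top10
  FROM page_views
  GROUP BY day;

-- Read side: unpack the summary into value/frequency pairs.
SELECT day, topk(top10) FROM top_urls_daily;
```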
I would say the other thing that is very powerful that users may not know about — and this really comes from Postgres more than PipelineDB — is the capability of implementing user-defined functions to include in your continuous views and transforms that can do whatever you want, basically. If there's some computation that you need to run on your streams that you can't accomplish with the out-of-the-box functionality that PipelineDB gives you, you can fairly easily just write functions in any programming language you want and then start using those in your continuous queries. And it's actually super easy. I think a lot of people are probably intimidated by that, even though it's actually really easy to do with Postgres. Postgres has done a fantastic job of making that interface really usable for people. But I think a lot of users don't really consider taking that approach. So, Usman, did I miss anything there? Yeah. I think one more thing which is, like, a pretty
[00:40:24] Unknown:
cool abstraction — and usually, if you try to do it with any other system, it's actually a very complex problem — is the ability to form, like, DAGs or nested computations. So Derek briefly mentioned that there are two different types of primitives that PipelineDB offers. One is a continuous transform, which doesn't really persist anything to disk. It's kinda just this thing where a stream comes in and a stream comes out. And then there's a view, where a stream comes in and it persists some state to the database. So you can think of it as, like, a sink.
But there's another cool thing that these views also output, called — I think we used to call it an output stream or a delta stream; I'm not sure what the exact nomenclature now is. But the idea is that every single time something changes in a continuous view — so for example, if you are doing something like, hey, I wanna count the number of times this button is clicked every minute — every single time that count changes, say it changed from 5 to 10, a new event will be emitted in the output stream of that view, which will say the count changed from 5 to 10. And then that output stream can further be consumed by more transforms and views. So you can kinda make this giant tree of computation. And I think that's something that a lot of people, when they're thinking about how to design their queries, don't think about in that form. They'll just kinda have, like, a flat topology, where they'll just create, like, 10 different views, where you could potentially make it a little bit more efficient by creating maybe two views, and then have two views which are kinda nested underneath those, and then maybe a final view which actually reads their output, something like that. So I think that's just a little bit more of a complex way of thinking about things, but it does make the system a lot more efficient, because you are reducing the amount of compute that you have to do. Yeah. That's a really good point. I'm glad you raised delta streams. I'll just make a quick note on delta streams.
[00:42:25] Unknown:
These things are very cool. As Usman said, each change made to a continuous view is essentially emitted into a new stream. So you can think of the stream like, okay, plus 2, plus 1 — we just changed the count from 3 to 4, from 4 to 5 — and you can consume that stream of changes to ultimately change the aggregation level of an upstream continuous view. This is a very, very common pattern with PipelineDB that people use. So a lot of users will say, okay, start aggregating into one-minute buckets, so for each minute I have a row with all the aggregate data. But then that table could get pretty big, obviously, over the course of, you know, weeks or months. So they'll say, okay, set the time to live on that upstream continuous view to a week. Any data older than a week will be automatically purged in the background, and then you could consume its delta stream to aggregate that data into a coarser aggregation level like day or month, for example, which would allow you to keep basically the same information over a longer term without letting the initial upstream continuous view grow without bound.
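A hedged sketch of that minute-to-coarser rollup in the pre-1.0 syntax, assuming the ttl/ttl_column options and the output_of() stream source work as I recall from the docs; the old/new row images in the output stream let the downstream view sum just the increments:

```sql
-- Minutely page view counts, kept for one week via TTL.
CREATE CONTINUOUS VIEW views_minutely WITH (ttl = '1 week', ttl_column = 'minute') AS
  SELECT date_trunc('minute', arrival_timestamp) AS minute, count(*) AS views
  FROM page_views
  GROUP BY minute;

-- Hourly rollup driven by the minutely view's output (delta) stream: summing
-- new minus old reproduces the count at the coarser granularity, and this view
-- can be kept indefinitely.
CREATE CONTINUOUS VIEW views_hourly AS
  SELECT date_trunc('hour', (new).minute) AS hour,
         sum((new).views - coalesce((old).views, 0)) AS views
  FROM output_of('views_minutely')
  GROUP BY hour;
```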
[00:43:40] Unknown:
And that seems like it would be very useful for people who are leveraging PipelineDB for a systems monitoring type of use case, and it also brings some possibilities in terms of alerting capability, where you're using the continuous view to keep the current value updated, but then you can also use the delta stream. For instance, if you're computing it every minute, compute another continuous aggregate over the delta stream for the past 10 minutes or something, so that you can do something like a standard deviation, where if one of the values in that delta stream is outside of a normal bound, you use that to, you know, trigger some function which might generate an alert.
[00:44:19] Unknown:
Right. Yeah. Exactly. So I would say that probably 75 or 80% of the PipelineDB use cases are in the areas that I mentioned before, powering real-time analytics dashboards, things like that. And then the remaining portion of the use cases are pretty much exactly what you just outlined, doing things like real-time alerting and monitoring and things like that. So, yeah, that's absolutely something that users do.
[00:44:46] Unknown:
And in terms of consuming the output of the continuous views in real time, I'm wondering if there are any built-in functions of PipelineDB for being able to push that information to the consumers, or if it's something where you would just poll that table on whatever interval?
[00:45:09] Unknown:
Like, let's think about a dashboard, for example. The user sets a date range, and then that would issue a query to PipelineDB, and then the result would just be rendered. So it's more on demand. I wouldn't necessarily say that people, like, continuously poll PipelineDB. It's more so that whenever they wanna know the most up-to-date result, they'll just issue a query to the system, whether that's a dashboard or a person or whatever, because all of the aggregate data is just always up to date. So every time you query it, you'll get the most up-to-date result. And it's usually very fast and low latency, because you're not scanning and aggregating all this granular data over and over again on demand. So that's the primary way that people get data out of the system.
And then there are ways to push data out to consumers also. This kinda goes back to the user-defined functions topic that I briefly talked about. You know, people have written their own functions that push data out to a Redis pub/sub channel, for example, that some other application is connected to, to subscribe to the most recent results and things like that. And then, for example, our Kafka connector that people use to natively connect PipelineDB to Kafka — it's primarily used for consuming data from Kafka into a PipelineDB stream, but it also has the capability to push data back into Kafka. So, you know, you could achieve an alert by monitoring the output of a continuous view in the way that you just described, and then when the value hits some certain threshold, push an event out to Kafka, which would be consumed by some other client somewhere.
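One hedged way to wire up that kind of push-style alert without any external component is a continuous transform over a view's output stream whose trigger function calls plain PostgreSQL NOTIFY; the threshold, names, and the reuse of the views_minutely sketch from earlier are all illustrative:

```sql
CREATE OR REPLACE FUNCTION alert_on_spike() RETURNS trigger AS $$
BEGIN
  -- NEW holds the transform's output row; any client LISTENing on 'alerts' sees it.
  PERFORM pg_notify('alerts',
                    format('views spiked to %s at %s', NEW.views, NEW.minute));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE CONTINUOUS TRANSFORM view_spike_alert AS
  SELECT (new).minute AS minute, (new).views AS views
  FROM output_of('views_minutely')
  WHERE (new).views > 10000
  THEN EXECUTE PROCEDURE alert_on_spike();
```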
[00:46:56] Unknown:
And in the process of building and maintaining and growing the pipeline DB project and business, I'm wondering what you found to be some of the most challenging aspects and some of the most interesting or unexpected lessons learned in the process.
[00:47:11] Unknown:
I mean, I think building this kind of infrastructure is inherently very, very challenging, which makes hiring extremely challenging as well. So that's a huge challenge. And then, specifically in terms of the type of thing that PipelineDB is, this continuous processing system, that's an extremely difficult problem to solve, because the system is always running and it just can't fail. So anytime there have been, like, bugs or issues in the past with a big user, it becomes immediately clear, because it kind of cascades across the whole system, because it's just constantly running every millisecond of every second of every minute of every hour of every day. PipelineDB is running as fast as possible. And so it has to work really well and it has to be really efficient because of that.
Because it's not like more traditional systems that solve this problem, in that they're only really doing complex work when someone issues a query to them. So it's constantly, perpetually running, and that's a really, really hard problem to solve. And in terms of the surprising ways that people have used our technology, I think a lot of the usage has been pretty much exactly what we thought it would be, because we're solving a pretty generic problem in terms of aggregating data. That's just a widespread, ubiquitous problem across all organizations.
So a lot of it, I would say, hasn't been particularly surprising, but there are use cases here and there that do surprise us. I would say, like, one in recent memory is an embedded use case. So companies have some sort of product that they may deploy geographically, like a CDN, for example, and they wanna do some computation or aggregation on data being produced at the edges so that they don't have to send a bunch of granular data back to some central repository. And we're definitely starting to see use cases like that, which are actually fairly surprising, I would say.
Is there anything else I missed, Usman?
[00:49:23] Unknown:
I think one of the things, maybe, at least in the beginning — when we started building PipelineDB, we didn't focus that much on the time series aspect of things. It was kinda, like, an overall generic stream aggregator. But I think over time it became clear — which, I mean, just kinda makes sense — that a lot of the streaming data just ends up being time series data, because it's some sort of event stream, and so it does have a timestamp associated with it. So a lot of the kind of future work that happened in PipelineDB, like time to live, which says, hey, after a certain point just drop the data that's older than this time, or, like, sliding window queries — I think a lot of work went into that, whereas maybe we hadn't focused on it from the very get-go. But, like, in retrospect, it seems pretty obvious that that is what we should have expected.
[00:50:20] Unknown:
Yeah. That's actually a really good point. I'm glad you brought that part up. So I would say, overall, at this point I wouldn't call PipelineDB a stream processing engine or a streaming analytics engine or anything like that. It's absolutely a time series analytics engine. And the reason is that when you talk about things like stream processing, there's a particular capability in that discipline that I would call, you know, event-by-event ordered processing, where users wanna say, if this event happens, then this one, and then this one, do this. And most traditional stream processing systems will have that capability as a first-class feature. We never added anything to facilitate ordered event-by-event processing. We never wanted to. And so when you look at, like, a stream processing engine minus event-by-event ordered processing, what you end up with is time series analytics.
So that's definitely what PipelineDB has come to be: not a stream processing engine, but a time series aggregation engine.
[00:51:29] Unknown:
Yeah. And that brings up one other question too, in terms of the challenges of being able to process those unbounded, time-oriented streams of data: things along the lines of processing out-of-order events and figuring that out. But I imagine that's definitely something that's more in terms of the specific use cases of the user.
[00:51:52] Unknown:
Right. Yeah. What we found with PipelineDB is that there aren't many requirements to perform any sort of ordered event-by-event processing. Usually, if people have that requirement, we'll tell them that PipelineDB probably isn't the tool that they're looking for. Because if you're doing a lot of aggregation, you almost by definition don't care about ordered event processing. So given that all of our users and customers use PipelineDB for very high-throughput aggregation, there's never really a need for ordered event processing.
[00:52:30] Unknown:
And you already answered the question a little bit, but are there any other cases where pipeline DB would be the wrong choice and you would actively discourage someone from using it for a given use case?
[00:52:41] Unknown:
Yeah. Yeah. As as I said, definitely, strongly ordered event by event stream processing. Pipeline DB is not gonna be, the right system for the job in those cases. And then there there are a couple more, I would say. Obviously, if you're if you're not using SQL or if you're the continuous queries that you're running on this data can't be expressed within the abstractions given to you by SQL, then pipeline v is definitely not going to be a fit because it's all SQL based. And I would say if if you're doing if you're doing a lot of ad hoc querying, pipeline DB is not gonna be a fit either because as I've mentioned, you all of these aggregation queries, you have to continue you have to know them in advance, declare them in advance, and then the results just start updating from that point forward. So if you're running a lot of, ad hoc queries, then a system like that isn't gonna it's gonna be too restrictive for you. And so we would actively discourage people from using pipeline DB if they had a lot of ad hoc queries they wanted to run. And then by a lot, I mean, the vast majority of the queries running against the system would have to be ad hoc. You know, there's like an an analyst somewhere connected to the database, and every single query that she runs is different from the 1 before, then pipeline to view would be a little bit too restrictive.
I should also mention, though, that the output of continuous views, as we've discussed, is stored in regular SQL tables. So you can further query those however you want. You can run further aggregation queries on the output of a continuous view, or join the results with another table or continuous view. So people do a lot of further, final processing on the continuously updating output of continuous views. But if you really need to run a lot of unique ad hoc queries
[00:54:41] Unknown:
on all of your granular data, then PipelineDB is not going to be the right tool for the job. Yeah, and I think the other thing to mention here is that even when you're doing stream aggregation, it only makes sense if you're aggregating your data down by a lot. Say I have a stream where each aggregate row only summarizes on the order of one or two events; at that point the aggregation is basically not reducing the input size much. If I have a billion events coming in and I end up storing 900 million rows, it's not really doing any large aggregation. PipelineDB shines the most when the input size versus the output size differs by one or two orders of magnitude. I think that's when PipelineDB is extremely efficient, and that's really the crux use case for it.
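For readers who want to see what that kind of rollup looks like in practice, here is a minimal sketch using PipelineDB 1.0's extension-style syntax. The stream, view, and column names are hypothetical, and the exact DDL may differ between releases (earlier versions used CREATE STREAM and CREATE CONTINUOUS VIEW), so treat this as an illustration rather than copy-paste reference.

```sql
-- Hypothetical stream of page-view events. In PipelineDB 1.0, streams are
-- declared as foreign tables backed by the pipelinedb server.
CREATE FOREIGN TABLE page_views (
    ts      timestamptz,
    url     text,
    user_id bigint
) SERVER pipelinedb;

-- Continuous view that rolls raw events up to one row per URL per minute.
-- A very high-volume stream collapses into a comparatively small aggregate
-- table, which is the order-of-magnitude reduction described above.
CREATE VIEW page_views_per_minute WITH (action = materialize) AS
    SELECT url,
           date_trunc('minute', ts) AS minute,
           count(*) AS views,
           count(DISTINCT user_id) AS unique_users
    FROM page_views
    GROUP BY url, date_trunc('minute', ts);

-- The output is an ordinary table, so it can be queried or further
-- aggregated like any other relation.
SELECT url, sum(views) AS views_last_hour
FROM page_views_per_minute
WHERE minute >= now() - interval '1 hour'
GROUP BY url;
```

The point of the sketch is the shape of the workload: a flood of raw events flows in, while only one row per URL per minute is ever stored.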
[00:55:31] Unknown:
Derek, your comment reminded me of one of the other things that we didn't discuss earlier, which we don't necessarily need to dig into, but because the aggregations are being computed within the boundaries of your Postgres database, it also enables you to join that information to existing tables, so that you can enrich the events as they're coming in with additional facts and metadata that you already have stored from other tables and other applications. That's one of the other ways that it stands out from a generic stream processing engine.
[00:56:08] Unknown:
Right. Yeah. As you mentioned, because everything's happening on top of Postgres, you can include other relations, tables and views, as part of a continuous view's definition. So you can join a stream onto a table and then run your query on that joined result. The example you mentioned is spot on. What a lot of people will do is that their incoming events may have, say, a user ID, just a number, but they want to run queries on more than just the user ID. They might want to attach the username, the user's account number, or the user's geographical location, whatever the case may be, and they can use stream-table joins to achieve that.
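As a rough illustration of such a stream-table join, again with hypothetical names and assuming the same 1.0-style syntax as the earlier sketch, the enrichment described above might look like this:

```sql
-- Ordinary Postgres table holding user metadata that already lives in the database.
CREATE TABLE users (
    id       bigint PRIMARY KEY,
    username text,
    country  text
);

-- Hypothetical stream of raw click events that carry only a user id.
CREATE FOREIGN TABLE clicks (
    ts      timestamptz,
    user_id bigint
) SERVER pipelinedb;

-- Continuous view that enriches each incoming event by joining the stream onto
-- the users table, then aggregates clicks per country per hour.
CREATE VIEW clicks_by_country WITH (action = materialize) AS
    SELECT u.country,
           date_trunc('hour', c.ts) AS hour,
           count(*) AS clicks
    FROM clicks c
    JOIN users u ON u.id = c.user_id
    GROUP BY u.country, date_trunc('hour', c.ts);
```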
[00:56:57] Unknown:
And so now that you've hit the 1.0 milestone, I'm curious what you have planned for the future of PipelineDB.
[00:57:05] Unknown:
Yeah. We've been really excited to get the 1.0 release out. It's been less than 2 months since we released it, and 1.0 is already the most popular version of PipelineDB that people are running, which is crazy, because the other releases have been out for multiple years now. So I think packaging it up as an extension really just made things a lot easier for people. In terms of looking forward, the next major piece of functionality that we're excited about adding to PipelineDB is partitioning automation.
What that means is that currently, a continuous view's data is stored as just a single table, and when it gets really big, query performance can degrade over time as the continuous view continues to grow and grow. For ones that are really big, that can be problematic; nobody wants to see their query latency slowly increase over time. So what partitioning automation is going to do is structure continuous views as multiple separate tables. They'll still behave as one logical table, but they'll be physically stored on disk as many separate partitions of the same data. What this does is bound query performance by whatever the partition size is; query latency will never get worse than it is for a full partition. So users will be able to say, make the partition width one day, for example, and then as the continuous view fills up, as soon as the current partition has one full day of data, a new partition will be automatically created and populated moving forward. And when you query them, they'll be combined implicitly by the query engine. So, as I mentioned, it still behaves as one giant table, but it'll be physically partitioned on disk, which will be a huge performance win.
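For context, the kind of layout that automation would manage is what stock Postgres already exposes as declarative range partitioning. A hand-written sketch with illustrative names looks like the following; PipelineDB's eventual implementation details may of course differ.

```sql
-- One logical table, physically split into day-sized partitions.
CREATE TABLE metrics_by_minute (
    minute timestamptz NOT NULL,
    metric text,
    value  double precision
) PARTITION BY RANGE (minute);

-- One partition per day. Queries that filter on "minute" only touch the relevant
-- partitions, so latency is bounded by a single day's data rather than the full history.
CREATE TABLE metrics_20190101 PARTITION OF metrics_by_minute
    FOR VALUES FROM ('2019-01-01') TO ('2019-01-02');
CREATE TABLE metrics_20190102 PARTITION OF metrics_by_minute
    FOR VALUES FROM ('2019-01-02') TO ('2019-01-03');

-- The planner combines partitions implicitly, so it still queries as one table.
SELECT metric, avg(value)
FROM metrics_by_minute
WHERE minute >= '2019-01-02' AND minute < '2019-01-03'
GROUP BY metric;
```

The planned automation would create and attach partitions like these for continuous views automatically as data arrives, rather than requiring them to be declared by hand.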
And then the other thing I'll mention is integration with something like TimescaleDB. That's definitely in the works, and a lot of people have been asking for it. So I think we'll be really happy to get something out in that area moving forward as well.
[00:59:35] Unknown:
Are there any other aspects of PipelineDB, the work that you're doing, the business aspects of it, or anything along those lines that we didn't discuss yet which you think we should cover before we close out the show?
[00:59:47] Unknown:
I think things have been pretty well covered. Usman, can you think of anything?
[00:59:53] Unknown:
Well, for anyone who wants to get in touch with the both of you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:00:11] Unknown:
I can start off on that, I guess. So for me, one of the things about PipelineDB that I always thought was super exciting was that, in general, when we think about data management or data processing tools, the thing that's typically discussed is performance. Everyone's excited about performance: how fast is it? How many rows does it do per second? How much can it store? Does it scale? And the thing that's almost never mentioned is the usability of the tool. How easy is it for me to install? How easy is it for me to actually write a query? A lot of times, I'll see that even to deploy a new query, you have to write some code, some Java or Scala, deploy it to a server, do a release, and it ends up taking half a day. And I think that's the biggest thing for me. We've seen some trend recently with things like CockroachDB or PipelineDB, and I think RethinkDB was one of them also, and a few other new databases coming out, which have focused not just on the performance but also on the usability aspect of things: let's not just make it fast or give it a lot of features, but let's also make sure that it truly makes the lives of developers easy. They don't have to spend a lot of time deploying it, adding more functionality to it, adding a new query to it, or adding a new node to it. Let's just try to make it as self-serve as possible. We've seen a recent trend towards that, which is pretty good. But traditionally, that's probably the thing that's been the most annoying about this entire space: it's just very difficult for anyone to get started.
[01:01:54] Unknown:
Yeah. I can't think of anything I would add beyond that. I would agree that that's probably the biggest, most widespread gap in tools in the space, because otherwise, honestly, my take is that the state of the technology available in the space of data management is actually really good. There's been a lot of innovation over the years, particularly in recent years, and you can pretty much, at this point, find a tool for processing and managing data that's actually pretty good.
Most of them are highly specialized, because generally all these different systems are focused on doing one thing relatively well, and that's usually because in order to get scalability and performance, you typically have to sacrifice flexibility. The result is that there are tons and tons of very specialized tools out there. But honestly, I think the state of the art in this space has never been better, and there's something pretty good out there, probably many tools that are pretty good, for pretty much anything you want to do. And, yeah, as Usman mentioned, usability is the only real widespread deficiency that I would say exists.
[01:03:20] Unknown:
Well, thank you both for taking the time today to join me and discuss the work that you've done with PipelineDB. It's definitely a very interesting project, and one that fills a very useful space, particularly for people processing large volumes of data. So I appreciate you taking the time, I appreciate your efforts on the project, and I hope you enjoy the rest of your day. Thanks for having us. Likewise. Thanks for having us, Tobias.
Chapters
- Introduction to PipelineDB
- Motivation and Story Behind PipelineDB
- Use Cases and Industry Applications
- Architectural Concepts and Design Components
- Ingestion Capabilities and Data Streams
- Evolution and Implementation of PipelineDB
- Scaling PipelineDB
- Data Loss and Mitigation Strategies
- Reading and Transforming Data Streams
- Checkpointing and Windowing Data
- Challenges and Lessons Learned
- Future Plans for PipelineDB
- Biggest Gaps in Data Management Tools