Summary
As more companies and organizations work to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch-oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the areas of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the host's mind is blown by the possibilities of treating everything, including schema information, as a stream.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Pravega is and the story behind it?
- What are the use cases for Pravega and how does it fit into the data ecosystem?
- How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?
- How do you represent a stream on-disk?
- What are the benefits of using this format for persisted streams?
- One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?
- I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster?
- For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?
- What are some of the unique system design patterns that are made possible by Pravega?
- How is Pravega architected internally?
- What is involved in integrating engines such as Spark, Flink, or Storm with Pravega?
- A common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?
- Does it have any special capabilities for simplifying processing of out-of-order events?
- For someone planning a deployment of Pravega, what is involved in building and scaling a cluster?
- What are some of the operational edge cases that users should be aware of?
- What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?
- What are some cases where you would recommend against using Pravega?
- What is in store for the future of Pravega?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Pravega
- Amazon SQS (Simple Queue Service)
- Amazon Simple Workflow Service (SWF)
- Azure
- EMC
- Zookeeper
- Bookkeeper
- Kafka
- Pulsar
- RocksDB
- Flink
- Spark
- Heron
- Lambda Architecture
- Kappa Architecture
- Erasure Code
- Flink Forward Conference
- CAP Theorem
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat
[00:00:56] Unknown:
to join the community and keep the conversation going. Your host is Tobias Macey, and today I'm interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams.
[00:01:07] Unknown:
Tom, could you start by introducing yourself? Yeah. So I've been working on streaming systems for basically my entire career. I sort of inadvertently started out in this by joining Amazon as an intern, and subsequently worked there for 7 years. I started on a team that was called Distributed Systems Engineering at the time, but eventually became known as Amazon Web Services. And I did the back end for the second version of SQS and the Simple Workflow Service, which ended up not being quite as popular externally, but is very widely used within a lot of the other services as an internal infrastructure component. And I ended up under a manager there who then went to Microsoft and was working at Azure. He called me several years after he left, when I was working at Google at the time.
And he pitched, you know, we need to create this product. He described the need for a log storage system and why it was essential. Essentially, he had joined EMC at the time and convinced them to spin up a team to build this sort of infrastructure. And he brought me aboard, as well as Flavio Junqueira, who's famous for ZooKeeper and BookKeeper. And we, you know, put together a team and built it from there. So that's sort of how we got into it. There's a lot of details behind it, but the core advantage of Pravega is that it treats data as a stream and is itself a streaming platform.
[00:02:40] Unknown:
So that's sort of where everything else follows from, that 1 idea. Yeah. And if you just glance at it very quickly, in some ways it looks similar to platforms such as Kafka or Pulsar that are a sort of persistent log system. But as you get a little closer and look through the other capabilities, it becomes obvious that there are a lot more capabilities and options and use cases that Pravega enables. So I'm wondering if you can discuss some of the main types of workloads that it was originally designed for and some of the primary ways that people are using it?
[00:03:19] Unknown:
So the biggest thing that is distinctive about Pravega, as opposed to, you know, Kafka and Pulsar, is that it's managing your archival storage natively. So Pravega divides things into tier 1 and tier 2, and this is a common pattern that you see with people using Kafka, Pulsar, etcetera, where you'll stream some data in and it'll be on that system for a little bit. And then a retention period will kick in and it will drop it, and, you know, you'll need to have some process that's spooling that data and dumping it into HDFS or something like that, or, you know, S3 or whatever. So Pravega is managing that internally. And the reason it's doing that is because that's actually part of the architecture. It's not some secondary thing that's bolted on and transfers the data out. It actually is designed to hold a sort of, quote, unquote, infinite storage volume; provided, you know, you have disk capacity, it can just continuously stream. It's not separating out your low latency stuff from your high latency stuff. It gives you a common interface for both of those. So that's the first thing that most people see. But the other thing that it does is Pravega gives you strong consistency. And the reason it does that is because it actually is a stream under the covers. Right? A lot of times we talk about streams and we say, here's a bunch of messages, and those constitute a stream. But they don't really. They're just a set of discrete messages, and when you start getting down into it, they can be reordered and they can be duplicated and so forth and so on. On the Pravega server, all data is treated as a contiguous stream of bytes.
It doesn't actually know where 1 event begins and the next ends. So there's no such thing as duplication or reordering, because from Pravega's point of view, those are tantamount to corruption. It can't tell where 1 begins and another ends, so it can't possibly duplicate. The other thing that comes directly from having strong consistency is that you can build up more complicated primitives. The simplest you can do is a compare-and-set operator. You can append data but have that append be conditional: I wanna append some data to the stream, but conditional on this metadata value being x, or the length of the stream being y.
So you can do these operations conditionally, and we can build a bunch of APIs on top of that. And 1 of them that we've built, probably the most widely used version of that, is transactions. So what you can do is you can write a bunch of data to a stream and say, I want all of this to go in or none of it. I want it to be atomic. And that is unbounded in the amount of data you can put down there. Because what's actually happening under the covers is that data is being written to a segment. We divide streams into segments. It's written to a segment, and that segment is, when you call commit, conditionally merged via a metadata operation into the stream that it's a part of. So you can perform this atomic operation over large volumes of data with very little overhead.
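To make that concrete, here is a minimal sketch of the transactional write pattern using the Pravega Java client. The builder and class names are from memory of the client API and worth verifying against the current docs; the controller URI, scope, stream, and routing key names are invented for illustration.

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.Transaction;
import io.pravega.client.stream.TransactionalEventStreamWriter;
import io.pravega.client.stream.impl.JavaSerializer;

import java.net.URI;

public class TxnExample {
    public static void main(String[] args) throws Exception {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")) // hypothetical controller endpoint
                .build();
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", config);
             TransactionalEventStreamWriter<String> writer =
                     factory.createTransactionalEventWriter(
                             "sensor-readings",               // hypothetical stream name
                             new JavaSerializer<>(),
                             EventWriterConfig.builder().build())) {
            // Begin a transaction: events written here go into a side segment.
            Transaction<String> txn = writer.beginTxn();
            txn.writeEvent("sensor-42", "reading-1");
            txn.writeEvent("sensor-42", "reading-2");
            // Commit merges the side segment into the stream via a metadata
            // operation, so both events become visible atomically or not at all.
            txn.commit();
        }
    }
}
```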
[00:06:33] Unknown:
And so you mentioned that the representation of the stream on disk is just a sequence of bytes, and it also has this concept of segments for being able to adapt to different usage patterns or volumes of data going through the system. So I'm wondering if you can talk a bit more about the actual on-disk representation and some of the benefits that it provides in terms of persisting the stream, and how that manifests
[00:07:01] Unknown:
in some of these higher order capabilities, such as the transactions or exactly once delivery and things like that? Okay. Yeah. So streams are broken into segments, and the reason for that is so that they can scale up or down as needed. We have multiple interfaces, but when you're dealing with events, like you would if you're connecting with Flink or Spark or something, you have a bunch of discrete events, and you will send those events to different segments that are all part of the same logical stream. Certain events will need to go to 1 segment, and certain events will need to go to another. And the way we manage that is we define a sort of key space. You can imagine, like, we take a string and we hash it to a floating point between 0 and 1, and then we can divide up that space and say, okay, events that have their keys associated with 0.5 will go to this segment, and 0.75 will go to that one. That also gives us continuity.
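As a rough illustration of that key-space mapping, here is a toy version in Java. Pravega's actual hash function and segment metadata are different; this only shows the shape of the idea.

```java
import java.util.List;

public class KeySpaceSketch {
    // A segment owns a half-open slice [rangeLow, rangeHigh) of the key space.
    record Segment(String name, double rangeLow, double rangeHigh) {}

    // Hash a routing key to a double in [0.0, 1.0). Not Pravega's real hash;
    // any stable hash into the unit interval demonstrates the mapping.
    static double hashToUnitInterval(String routingKey) {
        long h = routingKey.hashCode() & 0xffffffffL;
        return h / 4294967296.0; // divide by 2^32
    }

    static Segment segmentFor(String routingKey, List<Segment> segments) {
        double point = hashToUnitInterval(routingKey);
        return segments.stream()
                .filter(s -> point >= s.rangeLow() && point < s.rangeHigh())
                .findFirst()
                .orElseThrow();
    }

    public static void main(String[] args) {
        // A stream currently split into two segments at 0.5.
        List<Segment> segments = List.of(
                new Segment("segment-0", 0.0, 0.5),
                new Segment("segment-1", 0.5, 1.0));
        System.out.println(segmentFor("sensor-42", segments).name());
    }
}
```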
So across time, we can change the number of segments, but you can imagine that for any given key, there's a continuous order of what segments it should go to. And so as long as you read them in the same order that they were written in, order is maintained even though the number of actual segments is changing out from underneath you. So a segment might be on 1 machine, but then it might scale up and split into 2 and be split across 2 machines, etcetera, etcetera. So you can dynamically adjust how many resources a stream has access to by scaling the number of segments. In terms of how the data is laid out within a segment, it's actually very simple. When it ends up in tier 2, aside from a separate metadata file that's maintained for other reasons, the data is maintained as files that just contain the raw data that was written for that stream. There's no framing or overhead or anything. Depending upon your tier 2 implementation, say you're in HDFS, for example, it might be split up into multiple files, but they're just contiguous and 1 follows right after the other. And in terms of the way that the segmentation happens for adapting to the
[00:09:14] Unknown:
usage patterns. I know that looking through the documentation, it mentions that it will use different sorts of allocations and split 1 stream into multiple in the event that there are more consumers than can be supported with just 1 segment, or, if you have more writers and fewer consumers, it will recombine the different segments. So how does that logic function in terms of determining what the transition points are, and what tuning capabilities are available to operators or application developers? For each stream, there's a configuration, and that configuration
[00:09:49] Unknown:
specifies what we call a scaling policy. The scaling policy contains, like, a target rate that you wanna have on a per-segment basis, and that can be defined in terms of, you know, kilobytes per second, or it can be defined in terms of discrete events if your stream is defined in terms of that.
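In the Java client, that policy is attached to the stream configuration when the stream is created. Here is a minimal sketch; the method names such as `ScalingPolicy.byEventRate` match my memory of the client API but should be checked, and the scope and stream names are invented.

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

import java.net.URI;

public class CreateScalingStream {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();
        try (StreamManager manager = StreamManager.create(config)) {
            StreamConfiguration streamConfig = StreamConfiguration.builder()
                    // Scale when a segment sustains more than ~100 events/sec;
                    // split busy segments in 2, never drop below 1 segment.
                    .scalingPolicy(ScalingPolicy.byEventRate(100, 2, 1))
                    .build();
            manager.createScope("examples");
            manager.createStream("examples", "sensor-readings", streamConfig);
        }
    }
}
```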
[00:10:10] Unknown:
And for the automatic tiering of the storage subsystem, that's 1 of the other things that seems very compelling about Pravega, because you have that single API for data that is both on cluster and, you know, lifecycled out to some other storage mechanism. I'm wondering if you can talk through some of the challenges of providing that unification, any routines that you use to try and hide some of the latency of fetching that data back, and any of the types of lifecycle policies that are available to operators?
[00:10:44] Unknown:
When data comes into Pravega, it's written to what we call tier 1, which is a high-speed, low-append-latency storage system. It's notable that it's never read from there, and if it was, that would incur very significant costs. So it's very key to the architecture that we don't read from the tier 1 storage unless we're doing recovery following a failure. So data is written to tier 1. Once it has been written and enough data has accumulated for a given segment, that data is then written as a large block to tier 2. And usually that takes the form of an append, where it's appending onto an existing file. Now this means that when a read comes in, the data may or may not be in tier 2. If it's not in tier 2, it has to be served from Pravega's local storage. That is a RocksDB that is configured to prefer to stay in RAM, but if it has to, it will spill to disk. And notably, it has its journaling turned off. The reason for that is because we're using the tier 1 storage for durability. So we can read data out of Pravega's cache that's in this RocksDB and serve read requests there. But obviously, the majority of read requests won't be able to be served out of the cache, because the cache is inherently small. What we can do is have a read request go to tier 2, pull in a large amount of data, more than the user needs, and then return the extra data or hold it in memory waiting for the next call. The advantage of this in a streaming architecture is that the reads are incredibly predictable. By doing this simple read-ahead optimization, we can effectively reduce the latency of the reader to 0, because we always have the data in RAM, ready to go, by the time the next request is issued.
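The read-ahead trick itself is generic. Here is a toy version of the idea in Java, purely illustrative and not Pravega code: each cache miss pays 1 expensive tier 2 round trip, and the sequential reads that follow are served from memory.

```java
import java.io.IOException;
import java.io.InputStream;

// Toy read-ahead wrapper: each miss pulls a large block from slow storage
// (tier 2), and subsequent sequential reads are served from RAM.
public class ReadAhead {
    private final InputStream tier2;   // slow, high-latency source
    private final byte[] buffer = new byte[8 * 1024 * 1024]; // 8 MB prefetch
    private int filled = 0;            // bytes valid in the buffer
    private int pos = 0;               // next byte to hand out

    public ReadAhead(InputStream tier2) {
        this.tier2 = tier2;
    }

    public int read() throws IOException {
        if (pos == filled) {
            // One expensive tier 2 round trip fills the whole buffer...
            filled = tier2.read(buffer, 0, buffer.length);
            pos = 0;
            if (filled <= 0) return -1; // end of segment
        }
        // ...and every following sequential read is effectively free.
        return buffer[pos++] & 0xff;
    }
}
```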
[00:12:28] Unknown:
And when you're performing requests to the tier 2 storage to retrieve a particular block, and you mentioned that when it is persisted to that tier 2, it's often written as an append operation, I'm wondering if there's any complexity or resource limitation that is imposed by having to perform seeks from the beginning of a given file if you have large chunks, or how that is managed in terms of being able to fetch the different segments. I was assuming that when it's persisted to tier 2, since you're creating an append operation, I'm envisioning it as being all in a single file rather than maybe a series of contiguous files. So I'm just wondering, in terms of the actual representation within that storage medium, if there's a need to open a given file and then seek to a particular
[00:13:32] Unknown:
segment. So you don't need to move around to look at more than 1 segment. You can just read contiguously, and all the data in the file corresponds to the same segment and is contiguous. So there's no processing that's needed. But some tier 2 implementations don't support append operations, or an append operation involves reading the old data, concatenating it with the new data, and writing it back. For those tier 2 implementations, we simply don't do an append. We create a new file, write the new data there, and then name the files in order, such that if you read 1 and then the next and the next, it gives you all of the data in order. And another question that I have that's not necessarily specifically relevant to the storage tiering, but more in terms of the representation of the stream: if you're trying to seek to a particular point in time, for instance if you're retrieving discrete events, is there any sort of metadata that gets stored for being able to identify at what point in a stream, or where in a segment, you would need to move a pointer to be able to read a specified set of events or a specified time frame? Yeah. That's actually a good question. So as I said, the server doesn't understand where event boundaries are. So it can't help you locate the beginning of an event, for instance, and it can't help you seek randomly. So the way that's implemented is we have this concept called stream cuts. At any time you can call an API to grab a stream cut, and you can do this from the writer or from the reader. And that will represent the current position in the stream. On the writer's side, it's where data would be written right now if you were gonna write data. From the reader's side, it's where your readers are at the current moment. And then you can, in the future, jump to a particular stream cut. So you can start a group of readers and say, we're gonna start reading at this stream cut, and provide it. So that lets you define things in terms of boundaries that are meaningful to the application.
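As a sketch of that from the Java client, grabbing the cuts from a running reader group and later starting a fresh group at one of them might look like this. The class and method names are from memory of the client API and worth verifying; the scope, stream, and reader group names are invented.

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.ReaderGroup;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;

import java.net.URI;
import java.util.Map;

public class StreamCutExample {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();
        try (ReaderGroupManager manager =
                     ReaderGroupManager.withScope("examples", config)) {
            // Ask an existing reader group where its readers currently are.
            ReaderGroup group = manager.getReaderGroup("analytics-readers");
            Map<Stream, StreamCut> cuts = group.getStreamCuts();
            StreamCut checkpoint = cuts.get(Stream.of("examples", "sensor-readings"));

            // Later (or elsewhere), start a fresh reader group from that cut.
            manager.createReaderGroup("replay-readers",
                    ReaderGroupConfig.builder()
                            .stream(Stream.of("examples", "sensor-readings"),
                                    checkpoint) // begin reading here
                            .build());
        }
    }
}
```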
[00:15:30] Unknown:
And so for somebody who is building an application on top of Pravega, I'm wondering if you can talk through a bit more of the interfaces that are available and some of the architectural or design patterns that would work well for building something on top of Pravega as its own streaming engine? Yes. So we have several top-level interfaces.
[00:15:50] Unknown:
The 1 we've sort of been talking most about is the event oriented 1. We have an event reader and event writer that let you stream a sequence of events, and you can define their format and so on, but they're in discrete units and scale up and down. There are others. The simplest is the byte stream writer, which gives you a sort of Java input stream and Java output stream, and you can treat it as a contiguous stream of bytes. Obviously, being a byte output stream, having multiple writers at the same time doesn't make sense, because how would the data be interleaved? Like, if you randomly shuffled the bytes into each other, it wouldn't make sense. But you can use it in that way if you have a single endpoint on each side. So that's useful for some applications. Another API we have is called the revisioned stream client, and this is a very low level API that probably most applications won't use, but it ends up getting used in a lot of infrastructure components that wanna build more sophisticated APIs on top of it. The way it works is it takes in, you know, blobs that can be written, and you write any number of them, and at any time you can conditionally add a new 1 based on a version number. So essentially, there's a version number that gets incremented every time you add 1 of these things, and you can add another 1 conditionally. And the main use case for that API is actually a higher level API that we surface called state synchronizer.
And state synchronizer has an interesting model. If you have an application, and that application has an in-memory object, and you want that in-memory object to be, you know, replicated identically across your fleet, but also make updates at the same time, you would use state synchronizer. The way that works is you define a class for each type of update that can be applied to your state object, then define a serialization function for each of those update objects. So you define your serialization function, and all the updates that you apply are applied conditionally using the revisioned stream client API, so that they're conditional upon you having received all of the previous updates. And once those new updates go in, you get an acknowledgment back and you apply them locally. So this guarantees that all the instances of the application that are using the same state synchronizer will receive all the same updates in the same order. And, assuming your update function is deterministic, that will give you the same in-memory object. So you can use this to, for instance, synchronize config files or, you know, schemas or things like that that have a relatively low update frequency, but where it's very important that they be consistent across your fleet.
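Underneath state synchronizer, that conditional append is a compare-and-set loop. Here is a sketch using the revisioned stream client; the method names (`fetchLatestRevision`, `writeConditionally`, `readFrom`) match my memory of that API, but treat this as a sketch of the pattern rather than the exact contract, and the scope and stream names are invented.

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.SynchronizerClientFactory;
import io.pravega.client.state.Revision;
import io.pravega.client.state.RevisionedStreamClient;
import io.pravega.client.state.SynchronizerConfig;
import io.pravega.client.stream.impl.JavaSerializer;

import java.net.URI;
import java.util.Iterator;
import java.util.Map.Entry;

public class ConditionalAppend {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();
        try (SynchronizerClientFactory factory =
                     SynchronizerClientFactory.withScope("examples", config);
             RevisionedStreamClient<String> client =
                     factory.createRevisionedStreamClient("config-updates",
                             new JavaSerializer<>(),
                             SynchronizerConfig.builder().build())) {
            // Optimistic concurrency: the write only succeeds if nothing has
            // been appended since the revision we observed.
            while (true) {
                Revision latest = client.fetchLatestRevision();
                if (client.writeConditionally(latest, "new-update") != null) {
                    break; // our update landed after everything we had seen
                }
                // Someone else appended first: read what we missed, apply it
                // locally, then retry on top of the newer state.
                Iterator<Entry<Revision, String>> missed = client.readFrom(latest);
                missed.forEachRemaining(e -> { /* apply e.getValue() locally */ });
            }
        }
    }
}
```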
[00:18:18] Unknown:
And so I know that when I was looking through the documentation, it mentions that Pravega is useful for things like leader election and, as you mentioned, configuration updates. But I'm interested in what you were just saying about being able to use it for storing schema records, particularly given that there are systems such as the, I forget the specific name for it, but the schema service built on top of Kafka by the Confluent folks, and I believe that Pulsar has a similar capability. So is that something that you could use in Pravega as well, as far as using the state synchronization for updating maybe an Avro schema
[00:18:53] Unknown:
to ensure that it is conformant with any events that are being written to a given stream? Yeah. So the pattern there would be you have some stream and you have a schema associated with it, and that schema is gonna change over time. What you would do is you would take a stream cut and say, this is a point in time that I'm defining. And then you could associate that stream cut and say, you know, before this stream cut, the schema was x, and after the stream cut, the schema is y. And for all the hosts that need to know that information, there's many of them and they're distributed, a good way to do that would be to put that information in a state synchronizer, and then everyone will have access to it. And any updates that you make to that mapping will be observed by all of the hosts.
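Purely as an illustration of that lookup, the schema history is just an ordered map from position to schema. In practice the keys would be Pravega stream cuts rather than the plain longs used here, and the map itself would live in a state synchronizer so every host sees the same history.

```java
import java.util.TreeMap;

// Illustrative only: a schema history keyed by position in the stream.
public class SchemaHistory {
    private final TreeMap<Long, String> schemaByPosition = new TreeMap<>();

    public void recordSchemaChange(long streamCutPosition, String schema) {
        schemaByPosition.put(streamCutPosition, schema);
    }

    // The schema in effect at a given point in the stream is the entry
    // at or immediately before that position.
    public String schemaAt(long position) {
        var entry = schemaByPosition.floorEntry(position);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        SchemaHistory history = new SchemaHistory();
        history.recordSchemaChange(0L, "schema-v1");
        history.recordSchemaChange(1_000_000L, "schema-v2");
        System.out.println(history.schemaAt(500_000L)); // prints schema-v1
    }
}
```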
[00:19:37] Unknown:
That right there, just thinking through it, sounds like 1 of the most amazing things I've heard in a long time as far as how to handle a system's schema representation as it evolves over time.
[00:19:49] Unknown:
Yeah.
[00:19:51] Unknown:
Because I know that for various systems there are different schema representations, but particularly for things like Avro or Parquet, where the schema is inherent to the record, it can change as you're going through time, and clients need to be able to intuit that fact and process the events, particularly if you're trying to do a full replay and rebuild the data warehouse from, you know, time 0. Having that schema directly associated with the transition points at which it changes and evolves would, I imagine, greatly simplify the overall logic of the processing system. Yeah. Yeah. And so with all of that, I'm wondering if you can dig a bit more into the overall system architecture of how Pravega is built, and particularly, when you're deploying it in a production environment with multiple instances in the cluster, how that cluster manifests and how the architecture
[00:20:45] Unknown:
plays out within the fleet of servers? So Pravega's architecture is a bit interesting in that its client is thicker than people are used to. Most services have a very thin client and are easy to, you know, churn out new client APIs for in new languages, which is definitely an advantage. Pravega does not have that advantage. And the reason is because the server does not have a deep understanding of the data. The server is dealing with continuous streams of bytes. So it's up to the client to impose everything on top of that: if you want to deal with versions, if you want to deal with events and serialization and formats and, you know, scaling, all of that is done in the client. So the client's actually fairly sophisticated. The way that the client works is there's a process we call the controller, which is essentially a metadata and discovery service that tells the client, for any given stream and segment of that stream, where it needs to connect. And the servers coordinate to manage ownership of segments, so there's only 1 server that has ownership of any given segment at a time. The client then directly connects to the appropriate server and starts sending data. So then if you want to, for example, scale a stream, what ends up happening is that the controller is monitoring the rate of traffic on the stream. And if it decides that, you know, more segments are needed, it will send a notification to the servers to create new segments and, as we call it, seal the old segments. So new segments will be created, and anyone attempting to write to any of the old segments will get an error. The client will interpret that error and say, hey, that means a scaling event has occurred. And it will ask the controller what the new segments are, and then locate them and connect to the appropriate server. So which servers are involved in a stream can change dynamically over time, and the client is connecting to them directly, without going through some sort of proxy that's hiding the fact that there are multiple servers there. This is somewhat of a challenge in certain deployment environments, because a lot of times people want to deploy a service and put it behind a VIP and sort of say, okay, well, there's only 1 endpoint. But that doesn't work with Pravega. The client needs to be able to uniquely address each of the servers. And so in terms of scaling the number of servers up or down, would the client then need to be updated with the configuration for those new instances, or is there some sort of gossip protocol that can allow it to pick up that new information from the cluster itself? It would find that out from the controller. So what would happen is you could add new hosts to the system, and initially they wouldn't have any traffic going to them, because nothing was assigned. But then the controller might take some action to, you know, reassign some data from the existing hosts to the new host. And when that happens, any client that was connecting to the established host that lost the segment that it was attempting to read or write would receive an error, and the client would go to the controller and say, where is the segment located now?
And it would receive the IP of the new server and connect to it. So it would happen automatically. The client does not need to be aware of any of the servers. As long as the client is aware of at least 1 controller, it should be able to discover everything else.
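That client-side behavior amounts to a retry loop, sketched here in Java. All of these types are hypothetical; they just name the steps of the protocol that the real Pravega client implements internally.

```java
// Hypothetical model of the client's segment rediscovery. None of these
// types are the real client's; they only name the steps in the protocol.
public class SegmentRetryLoop {
    interface Controller {
        // Metadata/discovery service: which server owns this segment now?
        String lookupOwner(String segment);
        // After a scale event: which segments replaced the sealed one?
        java.util.List<String> successorsOf(String sealedSegment);
    }

    interface SegmentConnection {
        void append(byte[] data) throws SegmentSealedException;
    }

    static class SegmentSealedException extends Exception {}

    static SegmentConnection connect(String serverAddress, String segment) {
        throw new UnsupportedOperationException("transport not shown");
    }

    static void write(Controller controller, String segment, byte[] data) {
        while (true) {
            // Ask the controller where the segment lives, connect directly.
            String owner = controller.lookupOwner(segment);
            try {
                connect(owner, segment).append(data);
                return;
            } catch (SegmentSealedException e) {
                // A scale event occurred: the old segment was sealed. Ask the
                // controller for its successors and retry against the new one
                // (a real client would re-hash the routing key among them).
                segment = controller.successorsOf(segment).get(0);
            }
        }
    }
}
```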
[00:24:10] Unknown:
And in terms of the resources for the servers themselves, particularly the storage layer, since the tier 1 is essentially a temporary store that's primarily just for being able to rebuild state, does that mean that there's not really any need to linearly grow the amount of storage as the cluster scales? And 1 of the other things I was seeing in the documentation is for using it in conjunction with other streaming engines, such as Flink, and I believe that there might be some plans for other engines such as Spark or Heron, things like that. So I'm curious,
[00:24:50] Unknown:
1, about the use cases where you would want to use an additional streaming engine on top of Pravega, and what's involved in adding support for Pravega to those different engines. Yeah. So we definitely work very closely with Flink. They're sort of the stream processing platform we've spent the most time with and coordinated with very closely. And the reason why you'd wanna use Pravega with something like Flink is that, you know, Pravega stores your data, and it can scale up and down and do all of those sorts of things, but it can't provide you any sort of data processing capability. It doesn't give you the ability to query things or the ability to aggregate data and perform computation on it, which Flink is ideally suited for. And Flink actually doesn't natively have any storage capability. So it's sort of a perfect match. Right? If you combine Flink and Pravega, you get all the advantages of Pravega's storage, but then you have Flink's ability to perform things like SQL queries on top of it. And you don't have to think about how you would engineer that, you know, on a Pravega stream. The story is similar for Spark or Heron. Certainly, Flink is the most mature connector for Pravega. It's hosted in our repo and has by far the most usage.
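As an example of what that integration looks like from the Flink side, here is a minimal job skeleton using the Pravega Flink connector. The builder method names are from memory of the connector API and worth verifying; the scope and stream names are invented, and the map step is just a stand-in for real processing.

```java
import io.pravega.connectors.flink.FlinkPravegaReader;
import io.pravega.connectors.flink.PravegaConfig;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.net.URI;

public class FlinkPravegaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withControllerURI(URI.create("tcp://localhost:9090"))
                .withDefaultScope("examples");

        // Pravega supplies the storage; Flink supplies the processing.
        FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("sensor-readings") // hypothetical stream name
                .withDeserializationSchema(new SimpleStringSchema())
                .build();

        env.addSource(source)
           .map(String::toUpperCase) // stand-in for real computation
           .print();

        env.execute("pravega-flink-example");
    }
}
```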
[00:26:02] Unknown:
And then in terms of the actual semantics of the stream itself, I know that exactly once is sort of the holy grail in these types of systems and can often be difficult to achieve, and Pravega advertises support for exactly once semantics. And you also mentioned earlier the transactional capabilities. So can you talk a bit more about how the exactly once semantics are manifested in terms of the way that the streams are represented, and some of the ways that transactions can be beneficial when combined with something like Flink for doing the stream analysis?
[00:26:41] Unknown:
Yeah. Sure. So Pravega actually offers a stronger guarantee in terms of exactly once than almost anything out there. And the reason is precisely because of the transactional support. A lot of systems are able to do exactly once processing assuming you get the data into their system exactly once. And then, you know, once that occurs, it will be processed once, assuming that the processor follows the appropriate protocol. Pravega can actually do better, in that it can guarantee that you can get the data in exactly once, because it has transactions. So what this means is you can write data in a transaction, call commit, and then all that data is made available to the data processor.
The data processor can process it, and if it fails, it needs to resume from the last point where it produced output. Now in a conventional system, that's very hard to determine. But if the point at which it produced output is determined by a transaction, because, for example, you're writing your data out to Pravega, then you know the point at which the transaction committed. And so what you can do is restart from exactly that point where your output has been persisted. And the way that you do that is, when you are reading data, it hands you a marker that we call a position, and that position object corresponds to the location in the stream. So when a host dies, you can reinstantiate it and say, okay, it's going to resume from this position, and provide the position object. So this allows you to do something where you have, let's say, a Flink job that is processing data, doing some complicated aggregation, sending it into a Pravega stream, having it go to a different Flink cluster and run a different job on it, and have that all the way, end to end, exactly once. In the absence of a transaction, there would be no way to do that, because the data would be partially emitted by the first Flink instance, which would then crash. And so Flink needs to resume from its previous checkpoint, but it already emitted some of the data. So what you would need to do is undo those sends. But the only way to do that is to have not really sent them: to have a transaction and not have committed the transaction yet. So by having transactions, we can do end to end guarantees across multiple processing steps.
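With the Flink connector, that end-to-end pattern is what the transactional writer mode is for. Here is a sketch from memory of the connector's writer builder, worth verifying against the current connector docs; the stream name is invented.

```java
import io.pravega.connectors.flink.FlinkPravegaWriter;
import io.pravega.connectors.flink.PravegaConfig;
import io.pravega.connectors.flink.PravegaWriterMode;
import org.apache.flink.api.common.serialization.SimpleStringSchema;

public class ExactlyOnceSink {
    public static FlinkPravegaWriter<String> build(PravegaConfig config) {
        return FlinkPravegaWriter.<String>builder()
                .withPravegaConfig(config)
                .forStream("enriched-readings") // hypothetical output stream
                .withSerializationSchema(new SimpleStringSchema())
                .withEventRouter(event -> event) // routing key derived per event
                // Output goes into a Pravega transaction that is committed when
                // the Flink checkpoint completes, so a crash rolls back any
                // partially emitted output along with the job state.
                .withWriterMode(PravegaWriterMode.EXACTLY_ONCE)
                .build();
    }
}
```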
[00:29:01] Unknown:
And in terms of handling out of order events, which is another big challenge in systems that are processing unbounded streams of data, would the transactional capabilities and the segmenting of the streams
[00:29:16] Unknown:
give you some avenue for being able to handle some of that out-of-order capacity within certain time boundaries? So the way Pravega handles order is that the server itself is always going to guarantee order. The server itself doesn't understand what events are, so everything that goes in is gonna come out in the exact same order. But that's only true on a per segment basis. As soon as you talk about having multiple segments, and things scaling up and down, and the number of segments changing, the story gets more complicated. And the way that we allow you to have order guarantees when you have many segments, and the number of segments is scaling dynamically, is that when you write any given event, it has associated with it what we call a routing key, and the number of routing keys you have determines how much parallelism the system can have.
So a given routing key cannot be split across multiple readers in a reader group. You can have many routing keys and have the different routing keys end up on different readers, but a given routing key will not have its traffic subdivided. So that gives us a clear way to define a notion of ordering, which is that we say all events for a given routing key are ordered with respect to other events with the same routing key. And we can scale up and down, but for any given routing key, we hash that routing key into a key space, and across time that point in the key space maps to a contiguous sequence of segments that are responsible for it. And so as long as you read those segments in the same order, you get that guarantee.
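In the Java client, the routing key is simply the first argument to every write. A minimal sketch, with invented scope, stream, and key names; the class and method names match my memory of the client API:

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.JavaSerializer;

import java.net.URI;

public class RoutingKeyOrdering {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "sensor-readings", new JavaSerializer<>(),
                     EventWriterConfig.builder().build())) {
            // Events sharing a routing key are read back in this order, even
            // if the stream scales while they are in flight.
            writer.writeEvent("sensor-42", "t=1 temp=20.1");
            writer.writeEvent("sensor-42", "t=2 temp=20.3");
            // A different key may land on a different segment and reader; no
            // ordering is promised between "sensor-42" and "sensor-7".
            writer.writeEvent("sensor-7", "t=1 temp=18.9");
        }
    }
}
```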
[00:30:52] Unknown:
And going back to the operational characteristics, and also in terms of application design or system design, what are some of the edge cases that users and engineers should be aware of?
[00:31:04] Unknown:
So there's 2. 1 is the 1 I sort of already mentioned, which is that the client is thick and needs to be able to address each of the regular servers directly. So the client needs to be able to directly connect to each of them, and there can't be some sort of firewall in between the client and the server that's only allowing, like, 1 endpoint. The other thing that is a little bit tricky is that a lot of the containerization systems very naively don't expect that you really store things on disk, or assume that if you do, the data is not important.
But since Pravega is storing data until the data ends up in tier 2, that data is stored on disks that are necessary for replication. You need to make sure that you don't configure a policy whereby the framework will simply restart or wipe multiple of the storage servers at the same time. So any sort of upgrade that's going to replace disks needs to be done on a very granular basis, whereby re-replication has time to occur in between.
[00:32:10] Unknown:
And in terms of your overall experience
[00:32:14] Unknown:
of building and working on Pravega, what have you found to be some of the most interesting or useful or challenging lessons that you've learned in the process? I've learned things are harder than I thought. Probably the thing that complicated things most in terms of Pravega's architecture is scaling. It just touches everything. You can't do any additional feature without interacting with scaling, like, immediately. So you're always sort of thinking about that. The other thing, which was honestly sort of an experiment for us, was that very early on in the project I came up with a set of thread safety guidelines that said we're going to follow these everywhere.
And, of course, Pravega being very multi-threaded code, we wanted to make sure we didn't have race conditions. So we hammered out a set of guidelines that said, you know, this is how multi-threading should be done, and was very explicit about it. Those had to be modified a little bit along the way, but they really worked out, and they prevented a tremendous amount of bugs. So I'm very pleased with the way that turned out. And what are some of the cases where you would recommend against somebody using Pravega for a stream oriented system? Anytime you're going to need to perform a lot of, like, random-seek sorts of reads. Pravega has very good latency on write, because it's writing to an uncontended disk, and it has very good latency on read, because you're reading from prefetched data.
But the latency of a random read is not 1 that can be mitigated by prefetching, because it's not sequential. It's going to be essentially whatever your tier 2's read latency is. And for most implementations, such as HDFS and S3, that's quite significant. So if your architecture requires that you do any significant number of reads where you jump around, or you reread a particular event at some random time after you initially read it, that doesn't make sense. That should be offloaded to a different system. And 1 other topic that we didn't touch on yet is
[00:34:20] Unknown:
because of the fact that you have this capacity for unbounded storage of your stream, for the entire history, because of that automatic tiering, it can in some cases remove the necessity of having multiple processing systems to do both short term and long term analysis of the data, as is commonly built with something like the Lambda architecture. So I don't know if you want to talk a bit about that and some of the design patterns that Pravega enables to replace those system characteristics.
[00:34:54] Unknown:
Yes. That's exactly right. So from our perspective, Lambda architecture is sort of a dirty word. We prefer what's becoming known as the Kappa architecture, where it's not that you're not using archival storage, you are. You're still, you know, erasure coding and using cold storage and so on. But you're using the same interface to access the data that's in that system as you are for the new data. So you don't have to write your processing logic twice, and you don't have to worry about trying to sync them up. I've gone to multiple Flink Forward conferences, and every year there's somebody who describes just how difficult of a problem it was for them to coordinate a backfill job, where they transition from working off an archive to working off live streaming data, and managing that hand off between the 2 systems and getting it right so that events aren't missed or skipped in between. That can be very difficult to engineer.
So it's better to simply avoid having to write your logic twice and manage that hand off, by having a common interface to both your cold and your hot data. Then you can just use a common set of processing logic and work continuously from the beginning of time until the present. What do you have planned for the future of Pravega? 1 feature we're working on that would probably make a big difference to a lot of customers is watermarking. We're basing our design for watermarking on Flink's model, where the writers supply a particular time mark when they are writing the data initially.
And the readers can see these marks when they're reading the data, and in particular, they can get an upper bound. They can get the information that all writers, when they were at this point in the stream, produced a timestamp that was at least x. So you can guarantee that you won't see events from below x in the future. And are there any other aspects of Pravega or stream architectures
[00:37:02] Unknown:
that we didn't discuss yet which you think we should cover? So Pravega has
[00:37:07] Unknown:
a capability in its wire protocol that really reduces latency on append. What it does is it sends a header, effectively, that says, okay, the next following bytes are going to be an append to such and such segment, and then the client can start appending bytes. But it actually sends the header before it has all of the data, or has a large amount of data. So what this allows the client to do is to just immediately start writing data, because it knows that it's a streaming system. Right? There's more data coming. It's a stream of data, so of course there'll be more. So it just goes ahead and writes the header and starts writing data. And the data's already over the wire and down to the storage layer by the time the end of the data comes, and the client says, okay, that's the last of it, and it closes out that block, as it were. So that lowers latency, because the data's already transferred before the final bits are sent. The way that works is, essentially, the client tells the server in advance, here's how much data I'm going to write, and then following that, there'll be another header saying, oh, and here was the actual length of the data, and there might be additional bytes.
And the server will take all of that data and append it atomically, because all writes, big or small, are atomic. Alright.
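The framing he describes can be pictured like this. This is a made-up encoding that only shows the shape of the exchange (announce first, stream bytes, state the final length at the end), not Pravega's actual wire protocol.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Made-up framing to illustrate the idea: announce the append before the
// data exists, ship bytes as they arrive, close the block with the length.
public class EagerAppendFraming {
    static byte[] frame(String segment, byte[][] chunksAsTheyArrive) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(wire);

        // 1. Header goes out before the data is known: "an append to
        //    segment X follows". No waiting to learn the total size.
        out.writeUTF("APPEND " + segment);

        // 2. Bytes are shipped the moment the application hands them over,
        //    so they are already in flight while more data is produced.
        int total = 0;
        for (byte[] chunk : chunksAsTheyArrive) {
            out.write(chunk);
            total += chunk.length;
        }

        // 3. Trailer closes the block with the actual length; the server
        //    then applies the whole block atomically.
        out.writeUTF("END length=" + total);
        return wire.toByteArray();
    }
}
```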
[00:38:34] Unknown:
Is there anything else that we should touch on before we close out the show? You mentioned leader election. I could talk about that if you want. Yeah. Sure. Let's talk about the leader election capabilities and how they compare to some of the synchronization and locking primitives that are available in things like ZooKeeper, which I know you also build on top of. So ZooKeeper
[00:38:55] Unknown:
has sort of inherent scaling limitations, right, both in terms of the amount of data stored and in terms of rates. But the thing that it provides that's very powerful is consistency. So BookKeeper in turn uses ZooKeeper's consistency to provide what they call write fencing, where if 1 server is writing and another server wants to take over, the second simply writes some additional data there, and then the first server's writes will fail. Pravega follows a similar model, where the controller manages which server is responsible for which data, and that mapping is written in ZooKeeper. But in the actual hand off, the consistency is maintained by fencing.
And the way that works is the new server that's taking over a particular segment will simply close the record in tier 1, and then it will create a new 1. So the first server can't succeed in doing any writes. What this allows us to do is have consistency all the way up at the application level, where we can make sure appends are atomic and make sure that, even in the event of, you know, network partitions with multiple hosts coming in and out, we will have a consistent set of values that are written to a stream. And state synchronizer obviously uses this to achieve its consistency, but it can also be used to do other things. And 1 of the things that we have, a sample that we put up on our GitHub, is leader election.
That is a leader election algorithm. So a bunch of hosts will all, you know, go to Pravega and write some data, and 1 of them will win, because you can do compare-and-set operations. Right? And then that 1 can become the leader, and at that point you can have things like heartbeat mechanisms, which write data regularly. And if a heartbeat does not occur, some other host can write that it's declaring the first host dead and is taking over for itself, for instance. But the first host can't succeed in heartbeating if that data's been written, because the heartbeats can be conditional upon it having seen all of the data in the stream. And so it will obviously not attempt a heartbeat if it has seen that it has been declared dead. So you can use that mechanism to maintain a leader election capability.
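Here is a toy model of that pattern in Java, with the stream's revision stood in for by an atomic counter. In Pravega the conditional appends would go through the revisioned stream client rather than this in-memory stand-in, and the appended records would carry the claims and heartbeats.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model: leadership claims and heartbeats are conditional appends. The
// atomic counter stands in for the stream revision the appends hinge on.
public class LeaderElectionSketch {
    private final AtomicLong revision = new AtomicLong();
    private volatile String leader;

    // Append "I am leader" conditional on the revision we last observed.
    boolean tryClaimLeadership(String host, long observedRevision) {
        if (revision.compareAndSet(observedRevision, observedRevision + 1)) {
            leader = host;
            return true; // our claim landed before anyone else's
        }
        return false; // someone else appended first: re-read and back off
    }

    // A heartbeat is also conditional: it fails if anything (such as a
    // "declare dead" record from another host) was appended since we last
    // read the stream, which is how a deposed leader finds out.
    boolean heartbeat(String host, long observedRevision) {
        return host.equals(leader)
                && revision.compareAndSet(observedRevision, observedRevision + 1);
    }

    public static void main(String[] args) {
        LeaderElectionSketch election = new LeaderElectionSketch();
        long seen = election.revision.get();
        System.out.println(election.tryClaimLeadership("host-a", seen)); // true
        System.out.println(election.tryClaimLeadership("host-b", seen)); // false: stale view
    }
}
```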
[00:41:18] Unknown:
That's an interesting approach to that problem, because most of the systems that I've seen that have some form of leader election require a quorum vote to establish which instance is the leader, and to ensure that there are enough instances available to perform that election. And so that also raises the question of how that manifests as far as both Pravega and systems built on top of it being able to make certain trade offs along the axes of the CAP theorem. Right. So in terms of the CAP theorem, Pravega chooses
[00:41:49] Unknown:
consistency. And that doesn't mean that availability is necessarily bad. It's the same as it would be in most other systems, where if you're on the majority side of the network partition, then you can continue working, and the minority can't. Pravega is somewhat similar, in that if you're reading and writing from a particular segment, and you're on the side of the network partition that owns that segment, then you can continue working. And if you're not, then you can't. And the ownership of the segment is similarly maintained by the controller, and ultimately that map is in ZooKeeper.
So
[00:42:31] Unknown:
you know, that map might be updated, but it can only be updated on the majority side. That's interesting. Yeah. It's funny how often people consider the CAP theorem as being very binary, or trinary, in terms of which axes it supports, but it's much more of a gradient along the different axes of consistency, availability, and partition tolerance. It's always interesting to discuss the ways that systems make those trade offs. Yeah. Definitely.
[00:43:02] Unknown:
And I think, you know, there's a lot you can do there. A lot of people sort of give up and say, oh, we're choosing consistency, or we're choosing availability, and completely give up on the other. But even if you're not willing to make any sacrifices on 1, you can usually make a fair amount of headway on the other. Alright. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Interesting. Yeah. I think the biggest gap is actually on ingest. So there's very good stream processing capabilities with things like Flink. I think Pravega is showing that there's very good storage capabilities, and certainly persistent systems like object stores are very mature in themselves.
But ingest is actually very immature and very fragmented. There are many different systems that all work very differently, and they haven't even yet gotten to the point where there are consistent patterns that are sort of universally applied. So I think that's
[00:44:10] Unknown:
the area that is lacking. Alright. Well, thank you very much for taking the time today to join me and discuss the work you've been doing on Pravega. It's definitely a very interesting project, and 1 that I am excited to see how it progresses into the future. And as new streaming engines add support for it, I'm sure it'll have some transformative effects. So thank you for that, and I hope you enjoy the rest of your day. Thank you very
[00:44:41] Unknown:
much.
Introduction and Sponsor Message
Interview with Tom Kaitchuck: Introduction and Background
Pravega: Core Advantages and Comparison with Other Platforms
On-Disk Representation and Segmentation
Automatic Tiering and Storage Subsystem
Interfaces and Design Patterns for Building on Pravega
System Architecture and Deployment
Integration with Other Streaming Engines
Exactly Once Semantics and Transactional Capabilities
Handling Out of Order Events
Operational Characteristics and Edge Cases
Replacing Lambda Architecture with Kappa Architecture
Future Plans for Pravega
Latency Optimization and Append Operations
Leader Election and Consistency
CAP Theorem and Trade-offs
Biggest Gaps in Data Management Tooling
Closing Remarks