Summary
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Flink is and how the project got started?
- What are some of the primary ways that Flink is used?
- How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
- What are some use cases that Flink is uniquely qualified to handle?
- Where does Flink fit into the current data landscape?
- How is Flink architected?
- How has that architecture evolved?
- Are there any aspects of the current design that you would do differently if you started over today?
- How does scaling work in a Flink deployment?
- What are the scaling limits?
- What are some of the failure modes that users should be aware of?
- How is the statefulness of a cluster managed?
- What are the mechanisms for managing conflicts?
- What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
- Can state be shared across processes or tasks within a Flink cluster?
- What are the comparative challenges of working with bounded vs unbounded streams of data?
- How do you handle out of order events in Flink, especially as the delay for a given event increases?
- For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
- What are some of the most challenging or complicated aspects of building and maintaining Flink?
- What are some of the most interesting or unexpected ways that you have seen Flink used?
- What are some of the improvements or new features that are planned for the future of Flink?
- What are some features or use cases that you are explicitly not planning to support?
- For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
- What do they find most interesting or exciting?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Flink
- Data Artisans
- IBM
- DB2
- Technische Universität Berlin
- Hadoop
- Relational Database
- Google Cloud Dataflow
- Spark
- Cascading
- Java
- RocksDB
- Flink Checkpoints
- Flink Savepoints
- Kafka
- Pulsar
- Storm
- Scala
- LINQ (Language INtegrated Query)
- SQL
- Backpressure
- Watermarks
- HDFS
- S3
- Avro
- JSON
- Hive Metastore
- Dell EMC
- Pravega
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today, I'm interviewing Fabian Hueske, co-author of the upcoming O'Reilly book Stream Processing with Apache Flink, about his work on Apache Flink, the stateful streaming engine. So, Fabian, could you start by introducing yourself?
[00:01:11] Unknown:
Yeah. First of all, thanks for having me. My name is Fabian. I'm an Apache committer and PMC member of Apache Flink. I'm also a co-founder of Data Artisans, which is a company that was founded by the original creators of Apache Flink. And what we're doing is basically spending a lot of our time on developing and fostering Flink and its community.
[00:01:37] Unknown:
And do you remember how you first got involved in the area of data management?
[00:01:41] Unknown:
Yeah. So that's actually a while back. When I started my career in IT, I was taking part in an education program with IBM Germany. Half of the time there I was studying at a university, and the other half I was doing internships at different departments of IBM. For one of these internships, I had the chance to visit the IBM Almaden Research Center in San Jose, California. And there I was working with researchers in the data management department. Back then, that was DB2, the relational database.
[00:02:19] Unknown:
And so can you give a bit of an overview about what the Flink project is and how it got started?
[00:02:26] Unknown:
Yeah. Sure. So that's actually a bit of a longer story. The roots of Flink date back to 2008, 2009. When I was about to finish my studies, my former adviser from IBM Almaden called me and asked whether I would be interested in joining him and becoming one of his PhD students. He was just about to become a professor at Technische Universität Berlin in Germany. I agreed to that and moved to Berlin. And in the first years, we got funding for a research project on data management in the cloud.
And, yeah, we started to build a system back then with the goal of combining the best properties of Apache Hadoop and relational database systems. Back then, there was a debate going on that Hadoop was a major step backwards because, from a relational database systems point of view, it was kind of dumb, however very scalable. So what we built was basically a system that had a data flow programming API, similar to what Spark, Cascading, and all the other APIs had, and also to what Flink is today. The system was a distributed, scalable system capable of processing data, basically much like Hadoop, with user-defined functions, like a map function that would do whatever you wanted to do. But it also featured an optimizer and an execution engine that were very similar to what a database system does internally, including the way that database systems do their memory management, and including algorithms that, for example, work in memory and then gracefully spill to disk in case memory is not sufficient.
Yeah. The system was called Stratosphere because it was data management on the clouds. And in 2014, we donated the code base to the Apache Software Foundation, and it became Apache Flink. Since then, Flink has changed a lot. Today, Flink is known more for its stream processing capabilities than for being a batch processor, which it still is; all these things still work. So what is Flink doing? Flink is a system to perform stateful computations over data streams, and the Flink project aims to address all types of stream processing use cases.
And we see streams in their broadest sense, which means that it includes bounded streams, unbounded streams, live data, recorded data. So, basically, live streams and also batch data. Flink scales to thousands of nodes, can maintain terabytes of state, and we also have users that process trillions of events per
[00:05:14] Unknown:
day. So there are a lot of things in there to unpack. To start with, one of the things that I'm curious about is the ways that state is managed in the cluster and within the computations, because particularly for unbounded streams of transactions, statefulness, I imagine, can get quite complicated when you're processing in a distributed fashion.
[00:05:36] Unknown:
Yeah. So Flink maintains all the state locally on the local machines; there's no central entity like a database that we would do distributed transactions against. All the computations are distributed. So you have, for instance, something similar to a map function running distributed in your cluster. On each of the nodes that runs the map function, there is something that we call a state backend. This is a pluggable component in Flink, so you can configure the state backend to manage the state on the heap of the JVM; Flink is implemented in Java. So you can have in-memory speed there, however being bound by the size of the JVM. Or you can plug in a state backend that is based on RocksDB, which is a key-value store that writes data to disk. So it can manage much more data by writing it to disk, of course with a little bit higher latency. So all the state is always locally managed, and the functions have local access to the state, which is very good for performance. But then again, this is of course a challenge when it comes to fault tolerance, because in distributed systems things go wrong all the time.
So your process dies or the node dies. And for that, Flink has a mechanism to take consistent checkpoints of the state and is able to reset an application, load the data from the checkpoint back into the application, and then continue as if nothing ever happened.
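To make the pluggable state backend idea concrete, here is a toy sketch in Python. This is not Flink's actual API and all names are invented: two interchangeable backends behind one interface, one keeping live objects in memory and one serializing values on every access, the way a RocksDB-style backend conceptually does.

```python
import pickle
from abc import ABC, abstractmethod

class StateBackend(ABC):
    """Minimal stand-in for the pluggable state-backend interface."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key, default=None): ...

class HeapBackend(StateBackend):
    """Keeps state as live objects in memory: fast, but bounded by heap size."""
    def __init__(self):
        self._state = {}
    def put(self, key, value):
        self._state[key] = value
    def get(self, key, default=None):
        return self._state.get(key, default)

class SerializingBackend(StateBackend):
    """Serializes values on every access, like a disk-based key-value backend:
    slower per access, but state need not live as objects on the heap."""
    def __init__(self):
        self._state = {}
    def put(self, key, value):
        self._state[key] = pickle.dumps(value)
    def get(self, key, default=None):
        raw = self._state.get(key)
        return default if raw is None else pickle.loads(raw)

def count_per_key(events, backend):
    """A stateful operator: count occurrences per key using whichever backend."""
    for key in events:
        backend.put(key, backend.get(key, 0) + 1)
    return backend
```

The point of the sketch is that the counting logic is identical regardless of which backend is plugged in, which mirrors how a Flink job can switch backends by configuration.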
[00:07:12] Unknown:
Flink, as you mentioned, has been around for a number of years now, and there have been a number of other streaming systems that have been built up in recent years by various companies trying to suit different use cases. So I'm curious where Flink fits both in the context of other streaming systems and also into the broader modern data landscape, particularly given the fact that things have changed quite a bit since you first began working on it. Yeah. Sure. So, yeah, as you said, there's a bunch of
[00:07:42] Unknown:
stream processing software systems out there. So first of all, what is Flink not? Flink is not an event log or a system to store or distribute a stream. So it's not similar to the Kafka event log or Apache Pulsar, for instance. It's not meant for storing and distributing streams. It is a processing engine, so it's much more similar to Spark and, to some extent, also Storm. It's similar to Spark because it also has batch processing capabilities. But the Flink project today is focusing much more on stream processing and is an event-at-a-time stream processor. So we're not doing micro-batches, as Apache Spark does, for instance, but we're processing events one by one without scheduling micro-batches.
Compared to Storm, which also has an event-at-a-time architecture, Flink added stream processing capabilities much later, and it provides event time processing, I guess we'll talk about that a bit later, and also better consistency guarantees, so exactly-once state consistency, and higher throughput. Storm is somewhat bound by its record acknowledgment mechanism when it comes to high throughput.
[00:09:06] Unknown:
And what are some of the primary use cases that end users of Flink are leveraging it for?
[00:09:13] Unknown:
Yeah. So as I said before, Flink is trying to capture the full area of stream processing. So what kinds of stream processing applications do we see people running Flink for? One of these use cases is data pipelining, or low-latency ETL, like moving data from one location to another with low latency. This is a very common use case, although probably not the most advanced, I'd say. Streaming analytics is, of course, another big one. And then there is also this area of event-driven applications, which goes somewhat in the direction of microservices. So you're building an application that receives events, usually from an event log such as Kafka or Pulsar.
And then this application runs some business logic that reacts to the individual events it receives, performs some kind of business logic, puts some of the information into its state, and forwards some information, maybe a notification or an event that is processed by another application a little bit downstream. So these event-driven applications are a little bit more like business applications that would traditionally have stored their data in a relational database. They have a little bit more of the transactional nature of stream processing, where analytics is, of course, more analytical.
[00:10:41] Unknown:
And in order to be able to support these low latency processes as well as scaling to large numbers of nodes and large volumes of data, I'm wondering if you can discuss a bit about the underlying architecture of Flink and how that has evolved over the years that you've been working on it.
[00:11:00] Unknown:
Yeah. So I mean, Flink is a distributed system, and it follows the classical master-worker pattern here. As I said before, it's mostly implemented in Java, some parts also in Scala, but those are mostly APIs. Flink processes data as data flows, so distributed parallel pipelines. The data is processed, distributed, by these individual operators, and whenever you need to organize the data by key, it is hash partitioned and then shuffled over the network to the corresponding node that processes that key. For the network shuffling, I said we're following this event-at-a-time processing, which means that we are, of course, not sending individual events over the network, but we're always collecting them in a network buffer that we then ship, for better throughput, of course. But the data is eagerly pushed forward. It's not like in a batch engine, which would first collect all the data and then tell the receiver to load some of the data. Instead, the data is eagerly pushed forward from the sender to the receiver.
On each receiver, and I already briefly talked about the state backends, the events are received from the network buffer, then unpacked, and then forwarded into the user functions. And state is usually also sharded per key in Flink. So the state of each key is only processed on a single machine, and the events are routed to the machine that is responsible for that key. It's then the state backend's responsibility to look up the right state for this key. As I said, we have the state backends for the JVM heap, where the state is accessed, read or written, by putting the Java objects directly on the heap, or the data is serialized and deserialized to get it out of or into RocksDB.
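The keyed shuffle described above can be sketched in a few lines of Python. This is purely illustrative, not Flink code: a stable hash assigns every event to the worker responsible for its key, so all state for one key lives on one machine.

```python
import zlib

def route_by_key(events, parallelism):
    """Hash-partition a stream of (key, value) events across workers,
    the way a keyed shuffle routes records to the node owning each key."""
    workers = [[] for _ in range(parallelism)]
    for key, value in events:
        # zlib.crc32 is a stable hash: the same key always maps to the
        # same worker, which is what makes local keyed state possible
        workers[zlib.crc32(key.encode()) % parallelism].append((key, value))
    return workers
```

Because the hash is deterministic, events for one key arrive at one worker in their sending order, which is also why per-key state access needs no distributed coordination.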
[00:13:09] Unknown:
And given that you have been working on and evolving this project for a number of years, Are there any aspects of the current implementation that you would do differently if you were to start over today and didn't have the weight of legacy behind the existing infrastructure and, APIs and design patterns?
[00:13:28] Unknown:
Yeah. Sure. So as I said, Flink started as a batch engine, and streaming was added later. Well, to be fair, most of the execution was always streaming from the start. As I said before, we designed Stratosphere back then, and later Flink, with database principles in mind. The research group that I was working in at university was a database research group, so we naturally had a database background there. And database systems also tend to have this pipelined processing model where the data is always pushed from the leaves of the execution plan towards the root node, so eagerly pushing forward. And we had the same design right from the start, so the network layer was always pipelined.
On top of that, we built basically two separate engines, for batch processing and streaming, and also two different APIs: in Flink, the DataSet API for batch processing and the DataStream API for streaming. So if I were to design things new, starting from a green field, I would try to have these things more tightly integrated. Flink is going in that direction; that's one of the points on the roadmap, to unify batch and stream processing. But, yeah, it's often a bit harder to change something that is already there than to start from a green field. This tighter integration of handling bounded streams and unbounded streams would be nice.
[00:15:07] Unknown:
And in terms of the bounded versus unbounded streams, are there differences in terms of how end users need to think about the way that they approach those problems for their own processing?
[00:15:20] Unknown:
Yeah. Well, yes and no. So besides these APIs that I mentioned, the DataSet and the DataStream API, Flink also has two relational APIs, which are SQL and a LINQ-style Table API. LINQ stands for Language INtegrated Query. This is basically a way to not define a query as a string in a programming language. When you write a SQL query, you typically compose it as a string. In a language-integrated query, you have the query operators embedded in the programming language: you have some kind of a table object, and then you call select or group by on it, with methods that are embedded in the programming language. So that's what LINQ is, and how the Table API looks. So we have these two relational APIs, and these two are unified APIs for batch and stream processing.
So depending on how the data that is processed looks, how the tables look, whether they are bounded batch tables or unbounded stream tables, the queries are compiled differently. In the batch case, a query is compiled into a DataSet program; in the streaming case, into a DataStream program. But the semantics of both programs are exactly the same, regardless of whether the data is read from some kind of persistent store or whether the data is arriving live. And the principle that these relational APIs follow for stream processing is basically the one of maintaining a materialized view.
So the query is a continuous query, and as more data arrives, the result is continuously updated. This is the part where users don't really have to worry that much about the difference between bounded and unbounded streams. But then again, there are the DataSet API and the DataStream API, which clearly separate batch and stream processing.
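The materialized view principle can be sketched in plain Python rather than the Table API (names invented for illustration): a continuous query keeps one result that is updated as each event arrives, and running the same logic over a bounded input while keeping only the final view gives the batch semantics.

```python
def continuous_count(stream):
    """Maintain a 'materialized view' of counts per key: after every
    arriving event the current result is updated, never recomputed.
    Returns the snapshot of the view after each event."""
    view = {}
    snapshots = []
    for key in stream:
        view[key] = view.get(key, 0) + 1
        snapshots.append(dict(view))  # copy of the continuously updated result
    return snapshots
```

For an unbounded stream, every intermediate snapshot is a valid answer so far; for a bounded input, the last snapshot is exactly what a batch GROUP BY count would produce, which is the sense in which the relational APIs unify both cases.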
[00:17:26] Unknown:
And for unbounded streams, there's the possibility for events to arrive out of order in terms of the time stamps associated with those events or for different machines that are originating those events to have different clock alignment, which can cause some difficulty in terms of event ordering. And you mentioned that Flink has native support for being able to support some of these out of order events or delayed events. So I'm curious how that functions and some of the ways that that simplifies the work of the end user who's trying to process these event streams.
[00:18:00] Unknown:
Yeah. So Flink supports event time processing. And what we basically mean by that is that data is not processed with respect to the local wall clock of each processing machine. So let's take, for instance, the example of doing a simple window count per 10 minutes. If you want to count how many events your application sees every 10 minutes, there are different ways to do that. The first question here is basically: how does the machine measure what 10 minutes are? The easiest way is, of course, to always look at the local machine clock on the data processor, and that will give us some time. And after 10 minutes, we can say, okay, now 10 minutes are over.
I can compute the count and just emit the number of records that I've seen during the last 10 minutes. Of course, this is not very precise and also non-deterministic, because it depends on the speed at which data is ingested, whether there is back pressure in the application, how fast the machine doing my computation is, and so on. So looking at the machine clock is sufficient for some applications, if you only want to have approximate counts. But if you need precise and deterministic results, then it is not sufficient. This mode is called processing time in Flink.
And then there is event time. In event time, each event has a timestamp associated with it. As you said, this timestamp usually encodes when the event happened or when the event was produced. So we now have timestamps on each record, so we basically know when each of these events happened, and having the time in the record is already good. But now the question is: how can we know when we can close a computation? How can we be sure that no more records are arriving with a timestamp that we would need for this computation? Flink uses the concept of watermarks here. This is very similar to what the Google Dataflow model is doing; basically, it's pretty much exactly the same. And what is a watermark? A watermark is some kind of metadata record that also has a timestamp.
And when the system receives this watermark record, the timestamp of the watermark basically tells the system that it does not have to expect any more data with a timestamp smaller than the watermark's timestamp. So this gives us a way to measure time progress in an operator. When an operator receives a watermark that says it is now 12 o'clock, then we know that we can perform all the computations that can be done up to 12 o'clock. So this is basically the way that Flink performs event time processing. And depending on how tightly you configure these watermarks to be produced, so how much slack you give for out-of-order records to arrive, there is a trade-off with the latency of results. The more slack you give, the longer Flink will wait for data to arrive and the higher the latency of the result will be, but also the more complete the result will be. So with watermarks, you have a very good way of trading off result latency and result completeness.
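As a rough illustration of this mechanism, here is a simplified Python sketch, not Flink's actual windowing API: events are assigned to event-time windows as they arrive, possibly out of order, and a window is closed and emitted only when a watermark guarantees that no earlier timestamps can still show up.

```python
def tumbling_event_time_counts(records, window_size):
    """Count events per tumbling event-time window. `records` is a sequence
    of ("event", timestamp) or ("watermark", timestamp) tuples, mimicking a
    stream in which watermarks flow alongside the data."""
    open_windows = {}  # window start time -> count so far
    emitted = {}       # window start time -> final count
    for kind, ts in records:
        if kind == "event":
            start = ts - ts % window_size
            open_windows[start] = open_windows.get(start, 0) + 1
        else:
            # watermark ts: no event with a smaller timestamp will arrive,
            # so every window that ends at or before ts can be finalized
            for start in sorted(w for w in open_windows if w + window_size <= ts):
                emitted[start] = open_windows.pop(start)
    return emitted
```

Note how the out-of-order event below still lands in the correct window, because the window is held open until the watermark passes its end.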
[00:21:40] Unknown:
Going back to the statefulness of the cluster, you mentioned that that's primarily maintained local to the individual processing nodes. So I'm curious how you handle cases where there might be potential conflicts in the computation across different nodes or also how you handle the exactly what semantics in a distributed fashion with that local state management?
[00:22:07] Unknown:
Yeah. Sure. So first, talking about the conflicts. In Flink, all state, or most of the state, is organized per key, and this is usually most of the relevant state. And for this keyed state, I said before that all the records that have the same key have to be routed to the same node. So they will basically arrive in some sequential order at that node, and they'll be processed sequentially, one after the other. So there's not really any write conflict there, because they're processed one by one. And if you, for instance, have an ordering constraint, well, it's a distributed system, so giving guarantees about the order is always difficult.
Data is arriving from different nodes. If you would like to enforce that the changes are applied in a certain sequence, then you can always accept the event, put it into state, and have a look at its timestamp; usually you want to perform the modifications based on the timestamp. And then, based on the watermark, you know when you can take an element out of the state and apply the change. So that would be the way to handle conflicts. To the question of how Flink gives exactly-once guarantees for state: this is done in Flink with its checkpointing mechanism.
And a checkpoint is a consistent snapshot of the state of the whole application. This means that when the system decides to perform a checkpoint, the master will send a notification to all the sources, so to the operators that ingest the data, and tell them: hey, I want you to perform a checkpoint. Then the source operator will record its current reading position. For instance, if you're reading from Kafka, this would be the reading offsets in all the partitions of the operator, and then it will inject a special marker into its output. This is again a metadata record, a little bit similar to a watermark.
And then this record flows downstream, staying at exactly the same position in the data stream. When it is received by the next operator, that operator will make a copy of its own state and forward the metadata record, this checkpoint barrier, to its downstream tasks. So with this checkpoint barrier mechanism, we can basically ensure that all the operators perform the checkpoint at the logically same time. So we have the reading positions at the sources, and we have the state of all of the operators, and we store all of that in a distributed file system like HDFS or S3, or any persistent data store, basically.
And if something goes wrong or a failure happens, Flink restarts the application and initializes the state of all the operators, the sources and all the processing operators, with the state that was previously checkpointed. So we basically go back to exactly the same position that we were at before and continue processing as if nothing happened. To some extent, that is similar to loading a saved state in a computer game. Before you enter a tricky level or whatever, you save your current game state, then you go into the level. If something goes wrong, you can always go back and try again. And that's basically exactly what Flink is doing here.
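The checkpoint-and-restore behavior can be illustrated with a toy Python class (invented names; no barriers or distribution here): the source's reading offset and the operator's state are snapshotted together, so restoring the snapshot after a failure replays the input from a consistent point without double counting.

```python
class CheckpointedCounter:
    """Toy source + operator pair over a replayable log, like a Kafka
    partition: checkpoint the reading offset together with the state."""
    def __init__(self, log):
        self.log = log      # replayable input
        self.offset = 0     # source reading position
        self.count = 0      # operator state (running sum)

    def checkpoint(self):
        # offset and state are captured at the logically same time
        return {"offset": self.offset, "count": self.count}

    def restore(self, snapshot):
        self.offset = snapshot["offset"]
        self.count = snapshot["count"]

    def process(self, n):
        """Consume up to n records from the current offset."""
        for _ in range(n):
            if self.offset >= len(self.log):
                break
            self.count += self.log[self.offset]
            self.offset += 1
```

Because offset and state are restored as one consistent pair, records processed after the snapshot are replayed exactly once, which is the exactly-once state guarantee in miniature.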
[00:25:53] Unknown:
And in terms of scaling the cluster of nodes, what are some of the limiting factors in terms of the number of nodes or, any sort of geographical distribution capabilities that Flink supports and some of the factors that operators should be thinking about as they're designing the network topology and deployment topology for their Flink cluster?
[00:26:17] Unknown:
Yeah. So there are some Flink users who run applications on 10,000 cores. The biggest users are probably Netflix, Alibaba, and Tencent. They're running applications on 10,000 cores, as I said, with multiple terabytes of state. And these applications process trillions of events per day, which is a couple of million every second. In order to get to these scaling numbers, what Flink users have to do is carefully configure their checkpointing mechanism: how often the checkpoints are performed, which state backend they want to choose.
There are certain features of the checkpointing mechanism, like asynchronous checkpointing, incremental checkpoints, and so on, that you can enable. Most applications don't run across clusters. Besides some demos, I haven't really seen geo-distributed streaming applications being run on Flink. So in terms of how to configure the cluster, that's a good question, actually. I don't really have a good answer to that one.
[00:27:33] Unknown:
That that's totally fair. And then in terms of cluster topologies, another thing that I'm curious about is particularly for events that you might want to have either multiple execution paths or multiple uses for those particular records. If there's any support for things such as fan out and fan in for the flow of the data through the Flink cluster.
[00:28:00] Unknown:
Yes. So the programming model of Flink, or the APIs, certainly allows you to branch and merge data flows. In the DataStream API, you have a data stream object, and you can just read from it multiple times. So if you apply multiple map functions to one data stream object, not sequentially but in parallel, then the API will take care of fanning out the stream and sending it to each of the individual map operators. And then there are also operators that can consume multiple streams.
So you can then also merge these streams together again, either implementing your own join logic or just using a union operator that unions all the events coming from either stream.
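A minimal sketch of fan-out and fan-in, again in plain Python rather than the DataStream API: several map functions are applied to the same stream (fan-out), and their outputs are unioned again (fan-in).

```python
def fan_out_fan_in(stream, branches):
    """Apply each function in `branches` to the same input stream
    (fan-out), then union all branch outputs into one stream (fan-in).
    In a real engine the branches run in parallel and the union's
    ordering across branches is not guaranteed."""
    outputs = [[fn(event) for event in stream] for fn in branches]
    return [event for branch in outputs for event in branch]
```

A union like this just interleaves events; a join, by contrast, would correlate events across the input streams by key or time.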
[00:28:56] Unknown:
And going back to the cluster level, I'm curious what are some of the failure modes that you have seen and some of the challenges that operators should be aware of in terms of being able to maintain the overall health of the cluster?
[00:29:11] Unknown:
I think Flink is similar to many of the other distributed systems in this respect. As I said, there are different kinds of failures. You can have a worker failure, where just the worker process goes down or the worker machine goes down. This is addressed by the regular checkpointing and recovery mechanism. So Flink will then bring up another worker process somewhere in the cluster, depending on what kind of resource manager you're running on, whether that's Kubernetes or, say, Mesos.
And, we'll, yeah, bring up the worker and restart the application, load the stage into the application. If you have, like, a a master failure, then the master holds the meta information for an application in its memory. So, and if you configure Flink to be resilient also for worker failures for master failures, then this data, this meta information is also check persisted during a checkpoint, also written into a dispute file system. And then Flink puts, puts a pointer to this data into Apache ZooKeeper. So, in order to have, like, a highly available cluster, Flink is currently relying on ZooKeeper.
So basically start pointing to the meta information in ZooKeeper, and then if the master node goes down, the new master only needs to know, like, an application ID and the information how to connect to Zookeeper. So it then connects to Zookeeper, looks up the pointers to to the data that needs to read from Zookeeper, fetches all the metadata, and the the workers basically are restarted as as in a regular failure mode. So these are, like, the roughly the failure scenarios that, that can happen. In terms of scaling, yeah, I said data is, shot up per key in in Flink applications typically.
So if you get into a situation where, you're running out of, like, disk or memory space depending on your state back end, you can always, scale out you you can scale out an application basically by restarting the application with a higher parallelism, and then Fling will distribute the state to or is able to distribute the state to a larger number of worker nodes. So the state is then, like, repartitioned and, sent to the to the different worker worker nodes. So if you're running out of, like, storage for your state, there's also ways to, to scale the application up or also down if you need fewer resources.
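The per-key sharding and rescaling behavior described here can be sketched in plain Java. This is a simplified model, not Flink's actual implementation (Flink hashes keys into a fixed number of key groups, bounded by the maximum parallelism, using a murmur hash; plain hashCode() is used below for illustration):

```java
public class KeyGroupSketch {
    // State is sharded by key into a fixed number of key groups
    // (maxParallelism), so a key's group never changes.
    public static int keyGroup(String key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Each key group is assigned to one operator instance. Restarting with a
    // different parallelism moves whole key groups to new workers, but no
    // individual key ever changes its group, so state can be repartitioned.
    public static int operatorIndex(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int kg = keyGroup("user-42", maxParallelism);
        System.out.println("key group " + kg
                + " -> worker " + operatorIndex(kg, maxParallelism, 2) + " at parallelism 2,"
                + " worker " + operatorIndex(kg, maxParallelism, 4) + " at parallelism 4");
    }
}
```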
[00:32:00] Unknown:
And so in terms of building and maintaining the Flink project and community, what have you found to be some of the most challenging or complicated aspects?
[00:32:10] Unknown:
Yeah. From a technical point of view, correct behavior in failure cases is always challenging: how to ensure that you get all the messages you need to get, all of these classical distributed systems problems. Serialization and backwards compatibility are also tricky, especially in stateful processing. The state of an application might change over time, but you often want to keep that state, and if you upgrade your Flink version, you'd still like to be able to read it. How the data is laid out, what the serialization format is, all of these things need to be carefully designed to have some sort of backwards compatibility. Those are the technical challenges. You also mentioned the community, which is in fact, I would say, maybe even harder. Scaling a community is really hard. We're currently at the point where Flink is gaining lots of traction, and we have lots of people who would like to contribute to Flink.
We're getting many contribution proposals for new features that need to be reviewed, and all of that eats up a lot of time. It's well-invested time, I think. But at the same time, we at DataArtisans also have some kind of agenda, or roadmap, for where we would like to bring Flink, and then it's difficult to decide where to invest the time. And with a growing community, you also need to put some processes in place that reduce the effort where the bottlenecks are. You have to find a way that keeps contributing to the project convenient, but at the same time streamlines the reviewing and merging processes for the committers.
[00:34:27] Unknown:
And in terms of the data that gets ingested and processed within Flink, is it primarily treated as just an arbitrary array of bytes, where it's up to the application logic to provide any sort of meaning or schema? Or does Flink have built-in support for storing and validating schemas and type information for the data objects that are flowing in? I'm also curious about the types of serialization or file formats that are best supported.
[00:35:01] Unknown:
Flink has support for JSON and Apache Avro, for instance, if we're looking at file formats. Let me put it this way: Flink has a rich connector ecosystem. We have connectors to read data from Apache Kafka and from Kinesis, we can talk to very different types of file systems, and there's a JDBC connector, Cassandra, and Elasticsearch. Some of these systems are agnostic to the storage format. In Kafka, you just get raw byte messages that you somehow need to know how to deserialize. Many people are using Apache Avro or JSON there, and Flink has support for both. Within a Flink application, the API is constructed in a way that you're always dealing with Java or Scala objects.
And Flink has its own type system and serializer framework to serialize this data. So if you have a POJO that you'd like to use in your application, Flink can either generate serializers and deserializers for that object, or you can ask Flink to treat it as an Avro object and use Avro to serialize and deserialize it.
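A minimal sketch of what such a type might look like (the class and field names are made up for illustration): a public no-arg constructor plus public fields, or getters and setters, is the convention that lets Flink generate an efficient serializer for a POJO instead of falling back to a generic one.

```java
public class SensorReading {
    // Hypothetical POJO following the conventions Flink's type system can
    // analyze: public fields and a public no-arg constructor.
    public String sensorId;
    public long timestamp;
    public double temperature;

    public SensorReading() {}  // no-arg constructor is required

    public SensorReading(String sensorId, long timestamp, double temperature) {
        this.sensorId = sensorId;
        this.timestamp = timestamp;
        this.temperature = temperature;
    }

    public static void main(String[] args) {
        SensorReading r = new SensorReading("sensor-1", 1545000000000L, 21.5);
        System.out.println(r.sensorId + " @ " + r.timestamp + ": " + r.temperature);
    }
}
```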
[00:36:27] Unknown:
For the interface for the end users and the application developers who are trying to deploy on top of Flink, what does that process look like in terms of the way that the artifact is deployed onto the cluster and some of the life cycle management that is available for those end users?
[00:36:48] Unknown:
Yes. Flink applications are bundled in JAR files. As I said, we have APIs for Java and Scala, so you implement your application and compile it into a JAR file. Then there are different ways to bring that JAR onto the cluster to execute it. We have a command line client that you can just use from the shell. There's a web interface that you can use to upload your JAR file to the master, or the job manager as it's called in Flink, and start the execution from there. But there's also a REST interface that can be used to manage the full life cycle of applications: you can upload a JAR file, start a job, cancel it, and basically do all the things you can do with the web interface, since the web interface is just using the REST interface.
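The REST-based lifecycle described here might look roughly like the following (a hedged sketch: host, port, jar ID, and job ID are placeholders, and the endpoint paths follow the Flink 1.x monitoring REST API):

```shell
# Upload the application JAR to the job manager
curl -X POST -F "jarfile=@my-job.jar" http://localhost:8081/jars/upload

# Start a job from the uploaded JAR
curl -X POST http://localhost:8081/jars/<jar-id>/run

# List running jobs
curl http://localhost:8081/jobs

# Cancel a job
curl -X PATCH "http://localhost:8081/jobs/<job-id>?mode=cancel"
```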
[00:37:44] Unknown:
Given that Flink has been around for a while and is being picked up by more and more companies, I'm curious what you have found to be some of the most interesting or unexpected ways that it's being leveraged?
[00:37:56] Unknown:
Yeah, there are actually a few fun stories to tell. We heard from one company, one of the Flink users, that they use Flink's checkpointing feature to run on AWS and always migrate the application to the AWS region with the cheapest spot instance prices. You can use a checkpoint to migrate an application from one cluster to another: you perform a manual checkpoint, which we call a savepoint in Flink, and you basically get the state of the full application. Then you can bring up a Flink cluster somewhere else and just load the state into that cluster. They use that to always chase the cheapest spot instance prices. Obviously, the application did not have super low latency requirements, because if you keep doing that, there's always some downtime involved in bringing up the application somewhere else.
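The savepoint-based migration described here would look roughly like this with Flink's command line client (a sketch: job IDs, paths, and the JAR name are placeholders):

```shell
# Trigger a manual savepoint and stop the job on cluster A
flink savepoint <job-id> s3://my-bucket/savepoints/
flink cancel <job-id>

# ...bring up a Flink cluster in the cheaper region, then
# resume the application from the savepoint:
flink run -s s3://my-bucket/savepoints/savepoint-<id> my-job.jar
```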
But that was one of the fun ways I've seen this checkpoint feature being used. Another cool story was when we learned from another user who had built the back end of a social network as a bunch of Flink applications connected by Kafka topics. Today, this architecture might not be that uncommon, having these event-driven applications, but that was two years ago or so, and it was kind of mind-blowing to see somebody use a stateful stream processor to build the back end of a social network.
[00:39:41] Unknown:
In terms of the road map that you have going forward, I'm curious what are some of the improvements or new features that you have planned that you're most excited by?
[00:39:52] Unknown:
The Flink community is always working on improving the performance of checkpointing and recovery, so that's a continuously ongoing effort. The SQL, or relational, APIs also receive lots of contributions these days. For example, we just recently received a proposal for a better integration with Apache Hive's metadata catalog, which will allow Flink to read data that is maintained or stored in the Hive catalog. An important point for Flink's long-term vision is, of course, the complete unification of batch and stream processing: joining the DataSet and DataStream APIs into a common API that can be used to process bounded and unbounded streams, and doing the same at the execution level.
This is something that I'm really looking forward to.
[00:40:50] Unknown:
And as you mentioned, with the growth of the community, there come people who are either trying to request or submit new features or capabilities or enable Flink to serve different use cases. So I'm curious if there are any of those that you've dealt with recently that you are explicitly not planning to support.
[00:41:12] Unknown:
Yeah, that's a really good question. I haven't really seen proposals yet that would not be in the interest of Flink. I mean, of course, there may have been some, but I can't really recall them. Most of the proposals are meaningful and come from users who are running Flink in their environments. The tricky thing is always to manage expectations correctly and not accept too many of these proposals, or, when you say yes, you can do that, to actually have the resources to review the changes later on.
In general, as I said, Flink is trying to cover the full spectrum of stream processing applications. Something that Flink is currently not the best choice for is, I would say, ad hoc queries and data science or machine learning use cases. However, I would not rule out that at some point the community might also invest in that area.
[00:42:25] Unknown:
At DataArtisans, one of the services that you provide to the Flink community is training. So I'm curious, for people who have gone through those training sessions, whether there are any common concepts or paradigms that they are particularly challenged by.
[00:42:43] Unknown:
In general, stream processing, or the APIs that Flink provides for stream processing, all these concepts of what is possible with state and time, often require a mind shift, so to say. When you're coming from a background where you've implemented somewhat stateless batch applications, and then you move on to working on streaming applications, the shift is that you don't have all the data available: you possibly need to wait for more data, or decide when you can perform a computation. I would not say people are struggling with it, but it's certainly, as I said, a mind shift.
How to deal with watermarks, how to decide when you can perform a computation, and so on.
[00:43:40] Unknown:
And are there any aspects of Flink and stream processing in general that they tend to find most interesting or exciting as they're going through those trainings?
[00:43:50] Unknown:
Many of the attendees of our trainings are very interested in SQL, and I guess that's because SQL makes many things a lot easier. When you're using SQL, you don't have to care about state, and you don't have to care that much about time. Of course, you need to keep in mind, or monitor, how much state your query is consuming, but you don't have to implement the handling of state yourself. So SQL, and the relational APIs in general, are certainly of very large interest to Flink users and to the people who attend the Flink trainings.
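As an illustration of why SQL hides the state handling, here is a hedged sketch of a windowed aggregation in Flink SQL (the table and column names are made up); Flink maintains the per-window state behind the scenes:

```sql
-- Average temperature per sensor over one-minute tumbling windows.
-- Flink tracks the open windows and their partial aggregates for you.
SELECT sensorId,
       TUMBLE_END(eventTime, INTERVAL '1' MINUTE) AS windowEnd,
       AVG(temperature) AS avgTemp
FROM SensorReadings
GROUP BY sensorId, TUMBLE(eventTime, INTERVAL '1' MINUTE)
```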
[00:44:35] Unknown:
And are there any other aspects of Flink or stream processing or the work that you're doing at DataArtisans that we didn't cover yet, which you think we should discuss before we close out the show?
[00:44:47] Unknown:
No, I think we covered most of it. Yeah.
[00:44:53] Unknown:
And so for anybody who wants to get in touch with you or follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today.
[00:45:11] Unknown:
Yeah. Something where I think there is still a bit of a gap is unified storage for streaming and recorded streaming data. The goal here is basically to store streams without retention, the full history of streams, in some kind of storage system that gives efficient read access to both the historic parts and the live parts of a stream, but is also able to make the switching point from historic to live data invisible to the system consuming the data. This is important because whenever you're dealing with a stateful streaming application, it's very often the case that you would like to bootstrap the state of your application before you move on to the live data, and this bootstrapping basically happens on the historic data. In Flink right now, this is a bit of a challenge, partly because batch and streaming are not fully unified yet, but also because of the lack of systems that are able to provide this unified interface to recorded and streaming data.
There are some projects under way to address these challenges, but from my perspective, they do not seem to have gained much popularity yet. One example is Pravega, an open source system developed by Dell EMC, which is doing exactly this. It's an event log, similar to Apache Kafka, also providing transaction guarantees, but it also persists the data into some kind of long-term storage system where the historic data can then be read. And Pravega has very good integration with Apache Flink. Having more systems like this available would, I think, allow for many interesting use cases.
[00:47:31] Unknown:
Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing with Apache Flink. It's definitely a very interesting system and project, so I appreciate the time and effort that you've put into it. Thank you again, and I hope you enjoy the rest of your day.
[00:47:50] Unknown:
Yeah. Thank you. Thanks for having me. It was fun.
Introduction and Sponsor Message
Interview with Fabian Hueske: Introduction and Background
Overview of Apache Flink
State Management in Flink
Flink in the Context of Other Streaming Systems
Primary Use Cases of Flink
Flink's Architecture and Evolution
Challenges and Improvements in Flink
Handling Event Time and Out-of-Order Events
Exactly-Once Semantics and State Management
Scaling and Failure Modes in Flink
Technical and Community Challenges
Data Ingestion and Serialization Formats
Deploying Flink Applications
Interesting Use Cases of Flink
Future Roadmap and Features
Community Contributions and Limitations
Training and Common Challenges
Conclusion and Contact Information