Summary
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Flink is and how the project got started?
- What are some of the primary ways that Flink is used?
- How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
- What are some use cases that Flink is uniquely qualified to handle?
- Where does Flink fit into the current data landscape?
- How is Flink architected?
- How has that architecture evolved?
- Are there any aspects of the current design that you would do differently if you started over today?
- How does scaling work in a Flink deployment?
- What are the scaling limits?
- What are some of the failure modes that users should be aware of?
- How is the statefulness of a cluster managed?
- What are the mechanisms for managing conflicts?
- What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
- Can state be shared across processes or tasks within a Flink cluster?
- What are the comparative challenges of working with bounded vs unbounded streams of data?
- How do you handle out of order events in Flink, especially as the delay for a given event increases?
- For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
- What are some of the most challenging or complicated aspects of building and maintaining Flink?
- What are some of the most interesting or unexpected ways that you have seen Flink used?
- What are some of the improvements or new features that are planned for the future of Flink?
- What are some features or use cases that you are explicitly not planning to support?
- For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
- What do they find most interesting or exciting?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Flink
- Data Artisans
- IBM
- DB2
- Technische Universität Berlin
- Hadoop
- Relational Database
- Google Cloud Dataflow
- Spark
- Cascading
- Java
- RocksDB
- Flink Checkpoints
- Flink Savepoints
- Kafka
- Pulsar
- Storm
- Scala
- LINQ (Language INtegrated Query)
- SQL
- Backpressure
- Watermarks
- HDFS
- S3
- Avro
- JSON
- Hive Metastore
- Dell EMC
- Pravega
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today, I'm interviewing Fabian Hueske, co-author of the upcoming O'Reilly book Stream Processing with Apache Flink, about his work on Apache Flink, the stateful streaming engine. So, Fabian, could you start by introducing yourself?
[00:01:11] Unknown:
Yeah. First of all, thanks for having me. My name is Fabian. I'm an Apache committer and PMC member of Apache Flink. I'm also a co-founder of Data Artisans, which is a company that was founded by the original creators of Apache Flink. And what we're doing is basically spending a lot of our time on developing and fostering Flink and its community.
[00:01:37] Unknown:
And do you remember how you first got involved in the area of data management?
[00:01:41] Unknown:
Yeah. So that's actually a while back. When I started my career in IT, I was taking part in an education program with IBM Germany. Half of the time there I was studying at a university, and the other half I was doing internships at different departments of IBM. For one of these internships, I had the chance to visit the IBM Almaden Research Center in San Jose, California. And there I was working with researchers in the data management department. Back then, that was DB2, the relational database.
[00:02:19] Unknown:
And so can you give a bit of an overview about what the Flink project is and how it got started?
[00:02:26] Unknown:
Yeah. Sure. So that's actually a bit of a longer story. The roots of Flink date back to 2008, 2009. When I was about to finish my studies, my former adviser from IBM Almaden called me and asked whether I would be interested in joining him and becoming one of his PhD students. He was just about to become a professor at Technische Universität Berlin in Germany. I agreed to that and moved to Berlin. And in the first years, we got funding for a research project on data management in the cloud.
And, yeah, we started to build a system back then with the goal of combining the best properties of Apache Hadoop and relational database systems. Back then, there was a debate going on that Hadoop was a major step backwards because, from a relational database systems point of view, it was kind of dumb, however very scalable. So what we built was basically a system that had a data flow programming API, similar to what Spark, Cascading, and all the other APIs had, and also to what Flink is today. The system was a distributed, scalable system capable of processing data, basically much like Hadoop, with user-defined functions, like a map function that would do whatever you wanted to do. But it also featured an optimizer and an execution engine that were very similar to what a database system does internally, including the way that database systems do their memory management, and including algorithms that, for example, work in memory and then gracefully spill to disk in case memory is not sufficient.
Yeah. The system was called Stratosphere because it was data management on the clouds. And in 2014, we donated the code base to the Apache Software Foundation, and it became Apache Flink. Since then, Flink has changed a lot. Today, Flink is known more for its stream processing capabilities than for being a batch processor, which it still is; all these things still work. So what is Flink doing? Flink is a system to perform stateful computations over data streams, and the Flink project aims to address all types of stream processing use cases.
And we see streams in their broadest sense, which means that it includes bounded streams, unbounded streams, live data, recorded data. So, basically, live streams and also batch data. Flink scales to thousands of nodes, can maintain terabytes of state, and we also have users that process trillions of events per
[00:05:14] Unknown:
day. So there are a lot of things in there to unpack. To start with, one of the things that I'm curious about is the ways that state is managed in the cluster and within the computations, because particularly for unbounded streams of transactions, statefulness, I imagine, can get quite complicated when you're processing in a distributed fashion.
[00:05:36] Unknown:
Yeah. So Flink maintains all the state locally on the local machines; there's no central entity like a database that we would do distributed transactions against. All the computations are distributed. So you have, for instance, something similar to a map function running distributed in your cluster. On each of the nodes that runs the map function, there is something that we call a state backend. This is a pluggable component in Flink, so you can configure the state backend to manage the state on the heap of the JVM; Flink is implemented in Java. So you can have in-memory speed there, however being bound by the size of the JVM. Or you can plug in a state backend that is based on RocksDB, which is a key-value store that writes data to disk. So it can manage much more data by writing it to disk, of course with a little bit higher latency. So all the state is always locally managed, and the functions have local access to the state, which is very good for performance. But then again, this is of course a challenge when it comes to fault tolerance, because in distributed systems things go wrong all the time.
So your process dies or the node dies. And for that, Flink has a mechanism to take consistent checkpoints of the state and is able to reset an application, load the data from the checkpoint back into the application, and then continue as if nothing ever happened.
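To make the pluggable state backend idea concrete, here is a toy sketch in Python. This is not Flink's actual API and all names are invented: two interchangeable backends behind one interface, one keeping live objects in memory and one serializing values on every access, the way a RocksDB-style backend conceptually does.

```python
import pickle
from abc import ABC, abstractmethod

class StateBackend(ABC):
    """Minimal stand-in for the pluggable state-backend interface."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key, default=None): ...

class HeapBackend(StateBackend):
    """Keeps state as live objects in memory: fast, but bounded by heap size."""
    def __init__(self):
        self._state = {}
    def put(self, key, value):
        self._state[key] = value
    def get(self, key, default=None):
        return self._state.get(key, default)

class SerializingBackend(StateBackend):
    """Serializes values on every access, like a disk-based key-value backend:
    slower per access, but state need not live as objects on the heap."""
    def __init__(self):
        self._state = {}
    def put(self, key, value):
        self._state[key] = pickle.dumps(value)
    def get(self, key, default=None):
        raw = self._state.get(key)
        return default if raw is None else pickle.loads(raw)

def count_per_key(events, backend):
    """A stateful operator: count occurrences per key using whichever backend."""
    for key in events:
        backend.put(key, backend.get(key, 0) + 1)
    return backend
```

The point of the sketch is that the counting logic is identical regardless of which backend is plugged in, which mirrors how a Flink job can switch backends by configuration.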
[00:07:12] Unknown:
Flink, as you mentioned, has been around for a number of years now, and there have been a number of other streaming systems that have been built up in recent years by various companies trying to suit different use cases. So I'm curious where Flink fits both in the context of other streaming systems and also into the broader modern data landscape, particularly given the fact that things have changed quite a bit since you first began working on it. Yeah. Sure. So, yeah, as you said, there's a bunch of
[00:07:42] Unknown:
stream processing software systems out there. So first of all, what is Flink not? Flink is not an event log or a system to store or distribute a stream. So it's not similar to the Kafka event log or Apache Pulsar, for instance. It's not meant for storing and distributing streams. It is a processing engine, so it's much more similar to Spark and, to some extent, also Storm. It's similar to Spark because it also has batch processing capabilities. But the Flink project today is focusing much more on stream processing and is an event-at-a-time stream processor. So we're not doing micro-batches, as Apache Spark does, for instance, but we're processing events one by one without scheduling micro-batches.
Compared to Storm, which also has an event-at-a-time architecture, Flink added stream processing capabilities much later, and it provides event time processing, I guess we'll talk about that a bit later, and also better consistency guarantees, so exactly-once state consistency, and higher throughput. Storm is somewhat bound by its record acknowledgment mechanism when it comes to high throughput.
[00:09:06] Unknown:
And what are some of the primary use cases that end users of Flink are leveraging it for?
[00:09:13] Unknown:
Yeah. So as I said before, Flink is trying to capture the full area of stream processing. So what kinds of stream processing applications do we see people running Flink for? One of these use cases is data pipelining, or low-latency ETL, like moving data from one location to another with low latency. This is a very common use case, although probably not the most advanced, I'd say. Streaming analytics is, of course, another big one. And then there is also this area of event-driven applications, which goes somewhat in the direction of microservices. So you're building an application that receives events, usually from an event log such as Kafka or Pulsar.
And then this application runs some business logic that reacts to the individual events it receives, performs some kind of business logic, puts some of the information into its state, and forwards some information, maybe a notification or an event that is processed by another application a little bit downstream. So these event-driven applications are a little bit more like business applications that would traditionally have stored their data in a relational database. They have a little bit more of the transactional nature of stream processing, where analytics is, of course, more analytical.
[00:10:41] Unknown:
And in order to be able to support these low latency processes as well as scaling to large numbers of nodes and large volumes of data, I'm wondering if you can discuss a bit about the underlying architecture of Flink and how that has evolved over the years that you've been working on it.
[00:11:00] Unknown:
Yeah. So I mean, Flink is a distributed system, and it follows the classical master-worker pattern here. As I said before, it's mostly implemented in Java, some parts also in Scala, but those are mostly APIs. Flink processes data as data flows, so distributed parallel pipelines. The data is processed, distributed, by these individual operators, and whenever you need to organize the data by key, it is hash partitioned and then shuffled over the network to the corresponding node that processes that key. For the network shuffling, I said we're following this event-at-a-time processing, which means that we are, of course, not sending individual events over the network, but we're always collecting them in a network buffer that we then ship, for better throughput, of course. But the data is eagerly pushed forward. It's not like in a batch engine, which would first collect all the data and then tell the receiver to load some of the data. Instead, the data is eagerly pushed forward from the sender to the receiver.
On each receiver, and I already briefly talked about the state backends, the events are received from the network buffer, then unpacked, and then forwarded into the user functions. And state is usually also sharded per key in Flink. So the state of each key is only processed on a single machine, and the events are routed to the machine that is responsible for that key. It's then the state backend's responsibility to look up the right state for this key. As I said, we have the state backends for the JVM heap, where the state is accessed, read or written, by putting the Java objects directly on the heap, or the data is serialized and deserialized to get it out of or into RocksDB.
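The keyed shuffle described above can be sketched in a few lines of Python. This is purely illustrative, not Flink code: a stable hash assigns every event to the worker responsible for its key, so all state for one key lives on one machine.

```python
import zlib

def route_by_key(events, parallelism):
    """Hash-partition a stream of (key, value) events across workers,
    the way a keyed shuffle routes records to the node owning each key."""
    workers = [[] for _ in range(parallelism)]
    for key, value in events:
        # zlib.crc32 is a stable hash: the same key always maps to the
        # same worker, which is what makes local keyed state possible
        workers[zlib.crc32(key.encode()) % parallelism].append((key, value))
    return workers
```

Because the hash is deterministic, events for one key arrive at one worker in their sending order, which is also why per-key state access needs no distributed coordination.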
[00:13:09] Unknown:
And given that you have been working on and evolving this project for a number of years, Are there any aspects of the current implementation that you would do differently if you were to start over today and didn't have the weight of legacy behind the existing infrastructure and, APIs and design patterns?
[00:13:28] Unknown:
Yeah. Sure. So as I said, Flink started as a batch engine, and streaming was added later. Well, to be fair, most of the execution was always streaming from the start. As I said before, we designed Stratosphere back then, and later Flink, with database principles in mind. The research group that I was working in at university was a database research group, so we naturally had a database background there. And database systems also tend to have this pipelined processing model where the data is always pushed from the leaves of the execution plan towards the root node, so eagerly pushing forward. And we had the same design right from the start, so the network layer was always pipelined.
On top of that, we built basically two separate engines, for batch processing and streaming, and also two different APIs: in Flink, the DataSet API for batch processing and the DataStream API for streaming. So if I were to design things new, starting from a green field, I would try to have these things more tightly integrated. Flink is going in that direction; that's one of the points on the roadmap, to unify batch and stream processing. But, yeah, it's often a bit harder to change something that is already there than to start from a green field. This tighter integration of handling bounded streams and unbounded streams would be nice.
[00:15:07] Unknown:
And in terms of the bounded versus unbounded streams, are there differences in terms of how end users need to think about the way that they approach those problems for their own processing?
[00:15:20] Unknown:
Yeah. Well, yes and no. So besides these APIs that I mentioned, the DataSet and the DataStream API, Flink also has two relational APIs, which are SQL and a LINQ-style Table API. LINQ stands for Language INtegrated Query. This is basically a way to not define a query as a string in a programming language. When you write a SQL query, you typically compose it as a string. In a language-integrated query, you have the query operators embedded in the programming language: you have some kind of a table object, and then you call select or group by on it, with methods that are embedded in the programming language. So that's what LINQ is, and how the Table API looks. So we have these two relational APIs, and these two are unified APIs for batch and stream processing.
So depending on how the data that is processed looks, how the tables look, whether they are bounded batch tables or unbounded stream tables, the queries are compiled differently. In the batch case, a query is compiled into a DataSet program; in the streaming case, into a DataStream program. But the semantics of both programs are exactly the same, regardless of whether the data is read from some kind of persistent store or whether the data is arriving live. And the principle that these relational APIs follow for stream processing is basically the one of maintaining a materialized view.
So the query is a continuous query, and as more data arrives, the result is continuously updated. This is the part where users don't really have to worry that much about the difference between bounded and unbounded streams. But then again, there are the DataSet API and the DataStream API, which clearly separate batch and stream processing.
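The materialized view principle can be sketched in plain Python rather than the Table API (names invented for illustration): a continuous query keeps one result that is updated as each event arrives, and running the same logic over a bounded input while keeping only the final view gives the batch semantics.

```python
def continuous_count(stream):
    """Maintain a 'materialized view' of counts per key: after every
    arriving event the current result is updated, never recomputed.
    Returns the snapshot of the view after each event."""
    view = {}
    snapshots = []
    for key in stream:
        view[key] = view.get(key, 0) + 1
        snapshots.append(dict(view))  # copy of the continuously updated result
    return snapshots
```

For an unbounded stream, every intermediate snapshot is a valid answer so far; for a bounded input, the last snapshot is exactly what a batch GROUP BY count would produce, which is the sense in which the relational APIs unify both cases.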
[00:17:26] Unknown:
And for unbounded streams, there's the possibility for events to arrive out of order in terms of the time stamps associated with those events or for different machines that are originating those events to have different clock alignment, which can cause some difficulty in terms of event ordering. And you mentioned that Flink has native support for being able to support some of these out of order events or delayed events. So I'm curious how that functions and some of the ways that that simplifies the work of the end user who's trying to process these event streams.
[00:18:00] Unknown:
Yeah. So Flink supports event time processing. And what we basically mean by that is that data is not processed with respect to the local wall clock of each processing machine. So let's take, for instance, the example of doing a simple window count per 10 minutes. If you want to count how many events your application sees every 10 minutes, there are different ways to do that. The first question here is basically: how does the machine measure what 10 minutes are? The easiest way is, of course, to always look at the local machine clock on the data processor, and that will give us some time. And after 10 minutes, we can say, okay, now 10 minutes are over.
I can compute the count and just emit the number of records that I've seen during the last 10 minutes. Of course, this is not very precise and also non-deterministic, because it depends on the speed at which data is ingested, whether there is back pressure in the application, how fast the machine doing my computation is, and so on. So looking at the machine clock is sufficient for some applications, if you only want to have approximate counts. But if you need precise and deterministic results, then it is not sufficient. This mode is called processing time in Flink.
And then there is event time. In event time, each event has a timestamp associated with it. As you said, this timestamp usually encodes when the event happened or when the event was produced. So we now have timestamps on each record, so we basically know when each of these events happened, and having the time in the record is already good. But now the question is: how can we know when we can close a computation? How can we be sure that no more records are arriving with a timestamp that we would need for this computation? Flink uses the concept of watermarks here. This is very similar to what the Google Dataflow model is doing; basically, it's pretty much exactly the same. And what is a watermark? A watermark is some kind of metadata record that also has a timestamp.
And when the system receives this watermark record, the timestamp of the watermark basically tells the system that it does not have to expect any more data with a timestamp smaller than the watermark's timestamp. So this gives us a way to measure time progress in an operator. When an operator receives a watermark that says it is now 12 o'clock, then we know that we can perform all the computations that can be done up to 12 o'clock. So this is basically the way that Flink performs event time processing. And depending on how tightly you configure these watermarks to be produced, so how much slack you give for out-of-order records to arrive, there is a trade-off with the latency of results. The more slack you give, the longer Flink will wait for data to arrive and the higher the latency of the result will be, but also the more complete the result will be. So with watermarks, you have a very good way of trading off result latency and result completeness.
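As a rough illustration of this mechanism, here is a simplified Python sketch, not Flink's actual windowing API: events are assigned to event-time windows as they arrive, possibly out of order, and a window is closed and emitted only when a watermark guarantees that no earlier timestamps can still show up.

```python
def tumbling_event_time_counts(records, window_size):
    """Count events per tumbling event-time window. `records` is a sequence
    of ("event", timestamp) or ("watermark", timestamp) tuples, mimicking a
    stream in which watermarks flow alongside the data."""
    open_windows = {}  # window start time -> count so far
    emitted = {}       # window start time -> final count
    for kind, ts in records:
        if kind == "event":
            start = ts - ts % window_size
            open_windows[start] = open_windows.get(start, 0) + 1
        else:
            # watermark ts: no event with a smaller timestamp will arrive,
            # so every window that ends at or before ts can be finalized
            for start in sorted(w for w in open_windows if w + window_size <= ts):
                emitted[start] = open_windows.pop(start)
    return emitted
```

Note how the out-of-order event below still lands in the correct window, because the window is held open until the watermark passes its end.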
[00:21:40] Unknown:
Going back to the statefulness of the cluster, you mentioned that that's primarily maintained local to the individual processing nodes. So I'm curious how you handle cases where there might be potential conflicts in the computation across different nodes or also how you handle the exactly what semantics in a distributed fashion with that local state management?
[00:22:07] Unknown:
Yeah. Sure. So first, talking about the conflicts. In Flink, all state, or most of the state, is organized per key, and this is usually most of the relevant state. And for this keyed state, I said before that all the records that have the same key have to be routed to the same node. So they will basically arrive in some sequential order at that node, and they'll be processed sequentially, one after the other. So there's not really any write conflict there, because they're processed one by one. And if you, for instance, have an ordering constraint, well, it's a distributed system, so giving guarantees about the order is always difficult.
Data is arriving from different nodes. If you would like to enforce that the changes are applied in a certain sequence, then you can always accept the event, put it into state, and have a look at its timestamp; usually you want to perform the modifications based on the timestamp. And then, based on the watermark, you know when you can take an element out of the state and apply the change. So that would be the way to handle conflicts. To the question of how Flink gives exactly-once guarantees for state: this is done in Flink with its checkpointing mechanism.
And a checkpoint is a consistent snapshot of the state of the whole application. This means that when the system decides to perform a checkpoint, the master will send a notification to all the sources, so to the operators that ingest the data, and tell them: hey, I want you to perform a checkpoint. Then the source operator will record its current reading position. For instance, if you're reading from Kafka, this would be the reading offsets in all the partitions of the operator, and then it will inject a special marker into its output. This is again a metadata record, a little bit similar to a watermark.
And then this record flows downstream, staying at exactly the same position in the data stream. When it is received by the next operator, that operator will make a copy of its own state and forward the metadata record, this checkpoint barrier, to its downstream tasks. So with this checkpoint barrier mechanism, we can basically ensure that all the operators perform the checkpoint at the logically same time. So we have the reading positions at the sources, and we have the state of all of the operators, and we store all of that in a distributed file system like HDFS or S3, or any persistent data store, basically.
And if something goes wrong or a failure happens, Flink restarts the application and initializes the state of all the operators, the sources and all the processing operators, with the state that was previously checkpointed. So we basically go back to exactly the same position that we were at before and continue processing as if nothing happened. To some extent, that is similar to loading a saved state in a computer game. Before you enter a tricky level or whatever, you save your current game state, then you go into the level. If something goes wrong, you can always go back and try again. And that's basically exactly what Flink is doing here.
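The checkpoint-and-restore behavior can be illustrated with a toy Python class (invented names; no barriers or distribution here): the source's reading offset and the operator's state are snapshotted together, so restoring the snapshot after a failure replays the input from a consistent point without double counting.

```python
class CheckpointedCounter:
    """Toy source + operator pair over a replayable log, like a Kafka
    partition: checkpoint the reading offset together with the state."""
    def __init__(self, log):
        self.log = log      # replayable input
        self.offset = 0     # source reading position
        self.count = 0      # operator state (running sum)

    def checkpoint(self):
        # offset and state are captured at the logically same time
        return {"offset": self.offset, "count": self.count}

    def restore(self, snapshot):
        self.offset = snapshot["offset"]
        self.count = snapshot["count"]

    def process(self, n):
        """Consume up to n records from the current offset."""
        for _ in range(n):
            if self.offset >= len(self.log):
                break
            self.count += self.log[self.offset]
            self.offset += 1
```

Because offset and state are restored as one consistent pair, records processed after the snapshot are replayed exactly once, which is the exactly-once state guarantee in miniature.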
[00:25:53] Unknown:
And in terms of scaling the cluster of nodes, what are some of the limiting factors in terms of the number of nodes or, any sort of geographical distribution capabilities that Flink supports and some of the factors that operators should be thinking about as they're designing the network topology and deployment topology for their Flink cluster?
[00:26:17] Unknown:
Yeah. So there are some Flink users who run applications on 10,000 cores. The biggest users are probably Netflix, Alibaba, and Tencent. They're running applications on 10,000 cores, as I said, with multiple terabytes of state. And these applications process trillions of events per day, which is a couple of million every second. In order to get to these scaling numbers, what Flink users have to do is carefully configure their checkpointing mechanism: how often the checkpoints are performed, which state backend they want to choose.
There are certain features of the checkpointing mechanism, like asynchronous checkpointing, incremental checkpoints, and so on, that you can enable. Most applications don't run across clusters. Besides some demos, I haven't really seen geo-distributed streaming applications being run on Flink. So in terms of how to configure the cluster, that's a good question, actually. I don't really have a good answer to that one.
[00:27:33] Unknown:
That that's totally fair. And then in terms of cluster topologies, another thing that I'm curious about is particularly for events that you might want to have either multiple execution paths or multiple uses for those particular records. If there's any support for things such as fan out and fan in for the flow of the data through the Flink cluster.
[00:28:00] Unknown:
Yes. So the programming model of Flink, or the APIs, certainly allows you to branch and merge data flows. In the DataStream API, you have a data stream object, and you can just read from it multiple times. So if you apply multiple map functions to one data stream object, not sequentially but in parallel, then the API will take care of fanning out the stream and sending it to each of the individual map operators. And then there are also operators that can consume multiple streams.
So you can then also merge these streams together again, either implementing your own join logic or just using a union operator that unions all the events coming from either stream.
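A minimal sketch of fan-out and fan-in, again in plain Python rather than the DataStream API: several map functions are applied to the same stream (fan-out), and their outputs are unioned again (fan-in).

```python
def fan_out_fan_in(stream, branches):
    """Apply each function in `branches` to the same input stream
    (fan-out), then union all branch outputs into one stream (fan-in).
    In a real engine the branches run in parallel and the union's
    ordering across branches is not guaranteed."""
    outputs = [[fn(event) for event in stream] for fn in branches]
    return [event for branch in outputs for event in branch]
```

A union like this just interleaves events; a join, by contrast, would correlate events across the input streams by key or time.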
[00:28:56] Unknown:
And going back to the cluster level, I'm curious what are some of the failure modes that you have seen and some of the challenges that operators should be aware of in terms of being able to maintain the overall health of the cluster?
[00:29:11] Unknown:
I think Flink is similar to many of the other distributed systems in this respect. As I said, there are different kinds of failures. You can have a worker failure, where just the worker process goes down or the worker machine goes down. This is addressed by the regular checkpointing and recovery mechanism. So Flink will then bring up another worker process somewhere in the cluster, depending on what kind of resource manager you're running on, whether that's Kubernetes or, say, Mesos.
And, we'll, yeah, bring up the worker and restart the application, load the stage into the application. If you have, like, a a master failure, then the master holds the meta information for an application in its memory. So, and if you configure Flink to be resilient also for worker failures for master failures, then this data, this meta information is also check persisted during a checkpoint, also written into a dispute file system. And then Flink puts, puts a pointer to this data into Apache ZooKeeper. So, in order to have, like, a highly available cluster, Flink is currently relying on ZooKeeper.
So basically start pointing to the meta information in ZooKeeper, and then if the master node goes down, the new master only needs to know, like, an application ID and the information how to connect to Zookeeper. So it then connects to Zookeeper, looks up the pointers to to the data that needs to read from Zookeeper, fetches all the metadata, and the the workers basically are restarted as as in a regular failure mode. So these are, like, the roughly the failure scenarios that, that can happen. In terms of scaling, yeah, I said data is, shot up per key in in Flink applications typically.
So if you get into a situation where, you're running out of, like, disk or memory space depending on your state back end, you can always, scale out you you can scale out an application basically by restarting the application with a higher parallelism, and then Fling will distribute the state to or is able to distribute the state to a larger number of worker nodes. So the state is then, like, repartitioned and, sent to the to the different worker worker nodes. So if you're running out of, like, storage for your state, there's also ways to, to scale the application up or also down if you need fewer resources.
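The per-key sharding and rescaling behavior described here can be sketched in plain Java. This is a simplified model, not Flink's actual implementation (Flink hashes keys into a fixed number of key groups, bounded by the maximum parallelism, using a murmur hash; plain hashCode() is used below for illustration):

```java
public class KeyGroupSketch {
    // State is sharded by key into a fixed number of key groups
    // (maxParallelism), so a key's group never changes.
    public static int keyGroup(String key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Each key group is assigned to one operator instance. Restarting with a
    // different parallelism moves whole key groups to new workers, but no
    // individual key ever changes its group, so state can be repartitioned.
    public static int operatorIndex(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int kg = keyGroup("user-42", maxParallelism);
        System.out.println("key group " + kg
                + " -> worker " + operatorIndex(kg, maxParallelism, 2) + " at parallelism 2,"
                + " worker " + operatorIndex(kg, maxParallelism, 4) + " at parallelism 4");
    }
}
```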
[00:32:00] Unknown:
And so in terms of building and maintaining the Flink project and community, what have you found to be some of the most challenging or complicated aspects?
[00:32:10] Unknown:
Yeah. From a technical point of view, correct behavior in failure cases is always challenging: how to ensure that you get all the messages you need to get, all of these classical distributed systems problems. Serialization and backwards compatibility are also tricky, especially in stateful processing. The state of an application might change over time, but you often want to keep that state, and if you upgrade your Flink version, you'd still like to be able to read it. How the data is laid out, what the serialization format is, all of these things need to be carefully designed to have some sort of backwards compatibility. Those are the technical challenges. You also mentioned the community, which is in fact, I would say, maybe even harder. Scaling a community is really hard. We're currently at the point where Flink is gaining lots of traction, and we have lots of people who would like to contribute to Flink.
We're getting many contribution proposals for new features that need to be reviewed, and all of that eats up a lot of time. It's well-invested time, I think. But at the same time, we at DataArtisans also have some kind of agenda, or roadmap, for where we would like to bring Flink, and then it's difficult to decide where to invest the time. And with a growing community, you also need to put some processes in place that reduce the effort where the bottlenecks are. You have to find a way that keeps contributing to the project convenient, but at the same time streamlines the reviewing and merging processes for the committers.
[00:34:27] Unknown:
And in terms of the data that gets ingested and processed within Flink, is it primarily treated as just an arbitrary array of bytes, where it's up to the application logic to provide any sort of meaning or schema? Or does Flink have built-in support for storing and validating schemas and type information for the data objects that are flowing in? I'm also curious about the types of serialization or file formats that are best supported.
[00:35:01] Unknown:
Flink has support for JSON and Apache Avro, for instance, if we're looking at file formats. Let me put it this way: Flink has a rich connector ecosystem. We have connectors to read data from Apache Kafka and from Kinesis, we can talk to very different types of file systems, and there's a JDBC connector, Cassandra, and Elasticsearch. Some of these systems are agnostic to the storage format. In Kafka, you just get raw byte messages that you somehow need to know how to deserialize. Many people are using Apache Avro or JSON there, and Flink has support for both. Within a Flink application, the API is constructed in a way that you're always dealing with Java or Scala objects.
And Flink has its own type system and serializer framework to serialize this data. So if you have a POJO that you'd like to use in your application, Flink can either generate serializers and deserializers for that object, or you can ask Flink to treat it as an Avro object and use Avro to serialize and deserialize it.
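A minimal sketch of what such a type might look like (the class and field names are made up for illustration): a public no-arg constructor plus public fields, or getters and setters, is the convention that lets Flink generate an efficient serializer for a POJO instead of falling back to a generic one.

```java
public class SensorReading {
    // Hypothetical POJO following the conventions Flink's type system can
    // analyze: public fields and a public no-arg constructor.
    public String sensorId;
    public long timestamp;
    public double temperature;

    public SensorReading() {}  // no-arg constructor is required

    public SensorReading(String sensorId, long timestamp, double temperature) {
        this.sensorId = sensorId;
        this.timestamp = timestamp;
        this.temperature = temperature;
    }

    public static void main(String[] args) {
        SensorReading r = new SensorReading("sensor-1", 1545000000000L, 21.5);
        System.out.println(r.sensorId + " @ " + r.timestamp + ": " + r.temperature);
    }
}
```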
[00:36:27] Unknown:
For the interface for the end users and the application developers who are trying to deploy on top of Flink, what does that process look like in terms of the way that the artifact is deployed onto the cluster and some of the life cycle management that is available for those end users?
[00:36:48] Unknown:
Yes. Flink applications are bundled in JAR files. As I said, we have APIs for Java and Scala, so you implement your application and compile it into a JAR file. Then there are different ways to bring that JAR onto the cluster to execute it. We have a command line client that you can just use from the shell. There's a web interface that you can use to upload your JAR file to the master, or the job manager as it's called in Flink, and start the execution from there. But there's also a REST interface that can be used to manage the full life cycle of applications: you can upload a JAR file, start a job, cancel it, and basically do all the things you can do with the web interface, since the web interface is just using the REST interface.
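The REST-based lifecycle described here might look roughly like the following (a hedged sketch: host, port, jar ID, and job ID are placeholders, and the endpoint paths follow the Flink 1.x monitoring REST API):

```shell
# Upload the application JAR to the job manager
curl -X POST -F "jarfile=@my-job.jar" http://localhost:8081/jars/upload

# Start a job from the uploaded JAR
curl -X POST http://localhost:8081/jars/<jar-id>/run

# List running jobs
curl http://localhost:8081/jobs

# Cancel a job
curl -X PATCH "http://localhost:8081/jobs/<job-id>?mode=cancel"
```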
[00:37:44] Unknown:
Given that Flink has been around for a while and is being picked up by more and more companies, I'm curious what you have found to be some of the most interesting or unexpected ways that it's being leveraged?
[00:37:56] Unknown:
Yeah, there are actually a few fun stories to tell. We heard from one company, one of the Flink users, that they use Flink's checkpointing feature to run on AWS and always migrate the application to the AWS region with the cheapest spot instance prices. You can use a checkpoint to migrate an application from one cluster to another: you perform a manual checkpoint, which we call a savepoint in Flink, and you basically get the state of the full application. Then you can bring up a Flink cluster somewhere else and just load the state into that cluster. They use that to always chase the cheapest spot instance prices. Obviously, the application did not have super low latency requirements, because if you keep doing that, there's always some downtime involved in bringing up the application somewhere else.
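The savepoint-based migration described here would look roughly like this with Flink's command line client (a sketch: job IDs, paths, and the JAR name are placeholders):

```shell
# Trigger a manual savepoint and stop the job on cluster A
flink savepoint <job-id> s3://my-bucket/savepoints/
flink cancel <job-id>

# ...bring up a Flink cluster in the cheaper region, then
# resume the application from the savepoint:
flink run -s s3://my-bucket/savepoints/savepoint-<id> my-job.jar
```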
But that was one of the fun ways I've seen this checkpoint feature being used. Another cool story was when we learned from another user who had built the back end of a social network as a bunch of Flink applications connected by Kafka topics. Today, this architecture might not be that uncommon, having these event-driven applications, but that was two years ago or so, and it was kind of mind-blowing to see somebody use a stateful stream processor to build the back end of a social network.
[00:39:41] Unknown:
In terms of the road map that you have going forward, I'm curious what are some of the improvements or new features that you have planned that you're most excited by?
[00:39:52] Unknown:
The Flink community is always working on improving the performance of checkpointing and recovery, so that's a continuously ongoing effort. The SQL, or relational, APIs also receive lots of contributions these days. For example, we just recently received a proposal for a better integration with Apache Hive's metadata catalog, which will allow Flink to read data that is maintained or stored in the Hive catalog. An important point for Flink's long-term vision is, of course, the complete unification of batch and stream processing: joining the DataSet and DataStream APIs into a common API that can be used to process bounded and unbounded streams, and doing the same at the execution level.
This is something that I'm really looking forward to.
[00:40:50] Unknown:
And as you mentioned, with the growth of the community, there come people who are either trying to request or submit new features or capabilities or enable Flink to serve different use cases. So I'm curious if there are any of those that you've dealt with recently that you are explicitly not planning to support.
[00:41:12] Unknown:
Yeah, that's a really good question. I haven't really seen proposals yet that would not be in the interest of Flink. I mean, of course, there may have been some, but I can't really recall them. Most of the proposals are meaningful and come from users who are running Flink in their environments. The tricky thing is always to manage expectations correctly and not accept too many of these proposals, or, when you say yes, you can do that, to actually have the resources to review the changes later on.
In general, as I said, Flink is trying to cover the full spectrum of stream processing applications. Something that Flink is currently not the best choice for is, I would say, ad hoc queries and data science or machine learning use cases. However, I would not rule out that at some point the community might also invest in that area.
[00:42:25] Unknown:
At DataArtisans, one of the services that you provide to the Flink community is training. So I'm curious, for people who have gone through those training sessions, whether there are any common concepts or paradigms that they are particularly challenged by.
[00:42:43] Unknown:
In general, stream processing, or the APIs that Flink provides for stream processing, all these concepts of what is possible with state and time, often require a mind shift, so to say. When you're coming from a background where you've implemented somewhat stateless batch applications, and then you move on to working on streaming applications, the shift is that you don't have all the data available: you possibly need to wait for more data, or decide when you can perform a computation. I would not say people are struggling with it, but it's certainly, as I said, a mind shift.
How to deal with watermarks, how to decide when you can perform a computation, and so on.
[00:43:40] Unknown:
And are there any aspects of Flink and stream processing in general that they tend to find most interesting or exciting as they're going through those trainings?
[00:43:50] Unknown:
Many of the attendees of our trainings are very interested in SQL, and I guess that's because SQL makes many things a lot easier. When you're using SQL, you don't have to care about state, and you don't have to care that much about time. Of course, you need to keep in mind, or monitor, how much state your query is consuming, but you don't have to implement the handling of state yourself. So SQL, and the relational APIs in general, are certainly of very large interest to Flink users and to the people who attend the Flink trainings.
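As an illustration of why SQL hides the state handling, here is a hedged sketch of a windowed aggregation in Flink SQL (the table and column names are made up); Flink maintains the per-window state behind the scenes:

```sql
-- Average temperature per sensor over one-minute tumbling windows.
-- Flink tracks the open windows and their partial aggregates for you.
SELECT sensorId,
       TUMBLE_END(eventTime, INTERVAL '1' MINUTE) AS windowEnd,
       AVG(temperature) AS avgTemp
FROM SensorReadings
GROUP BY sensorId, TUMBLE(eventTime, INTERVAL '1' MINUTE)
```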
[00:44:35] Unknown:
And are there any other aspects of Flink or stream processing or the work that you're doing at DataArtisans that we didn't cover yet, which you think we should discuss before we close out the show?
[00:44:47] Unknown:
No, I think we covered most of it. Yeah.
[00:44:53] Unknown:
And so for anybody who wants to get in touch with you or follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today.
[00:45:11] Unknown:
Yeah. Something where I think there is still a bit of a gap is unified storage for streaming and recorded streaming data. The goal here is basically to store streams without retention, the full history of streams, in some kind of storage system that gives efficient read access to both the historic parts and the live parts of a stream, but is also able to make the switching point from historic to live data invisible to the system consuming the data. This is important because whenever you're dealing with a stateful streaming application, it's very often the case that you would like to bootstrap the state of your application before you move on to the live data, and this bootstrapping basically happens on the historic data. In Flink right now, this is a bit of a challenge, partly because batch and streaming are not fully unified yet, but also because of the lack of systems that are able to provide this unified interface to recorded and streaming data.
There are some projects under way to address these challenges, but from my perspective, they do not seem to have gained much popularity yet. One example is Pravega, an open source system developed by Dell EMC, which is doing exactly this. It's an event log, similar to Apache Kafka, also providing transaction guarantees, but it also persists the data into some kind of long-term storage system where the historic data can then be read. And Pravega has very good integration with Apache Flink. Having more systems like this available would, I think, allow for many interesting use cases.
[00:47:31] Unknown:
Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing with Apache Flink. It's definitely a very interesting system and project, so I appreciate the time and effort that you've put into it. Thank you again, and I hope you enjoy the rest of your day.
[00:47:50] Unknown:
Yeah. Thank you. Thanks for having me. It was fun.
Introduction and Sponsor Message
Interview with Fabian Hueske: Introduction and Background
Overview of Apache Flink
State Management in Flink
Flink in the Context of Other Streaming Systems
Primary Use Cases of Flink
Flink's Architecture and Evolution
Challenges and Improvements in Flink
Handling Event Time and Out-of-Order Events
Exactly-Once Semantics and State Management
Scaling and Failure Modes in Flink
Technical and Community Challenges
Data Ingestion and Serialization Formats
Deploying Flink Applications
Interesting Use Cases of Flink
Future Roadmap and Features
Community Contributions and Limitations
Training and Common Challenges
Conclusion and Contact Information