Summary
Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what RisingWave is and the story behind it?
- There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?
- What are some of the platforms/architectures that teams are replacing with RisingWave?
- What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?
- Can you describe how RisingWave is architected and implemented?
- How have the design and goals/scope changed since you first started working on it?
- What are the core design philosophies that you rely on to prioritize the ongoing development of the project?
- What are the most complex engineering challenges that you have had to address in the creation of RisingWave?
- Can you describe a typical workflow for teams that are building on top of RisingWave?
- What are the user/developer experience elements that you have prioritized most highly?
- What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?
- What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?
- When is RisingWave the wrong choice?
- What do you have planned for the future of RisingWave?
Contact Info
- yingjunwu on GitHub
- Personal Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- RisingWave
- AWS Redshift
- Flink
- Clickhouse
- Druid
- Materialize
- Spark
- Trino
- Snowflake
- Kafka
- Iceberg
- Hudi
- Postgres
- Debezium
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development life cycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3. So, Yingjun, can you start by introducing yourself? Hello, everyone. Thanks for having me here. I'm Yingjun Wu, and I'm the founder and CEO of RisingWave Labs.
[00:01:57] Unknown:
We are building a distributed SQL streaming database, yeah, as you just mentioned, where the database is built on top of S3, so it's a kind of novel cloud native system. Right? Before this, I have been running the company for 3 years, and before that, I was at AWS, on the Redshift data warehouse team. And prior to that, I was at IBM Research Almaden, the place where the relational database was invented. I also obtained my PhD from the National University of Singapore, where my research was in database systems and stream processing. So, essentially, I have been working on stream processing and databases for over 10 or 15 years. So, yeah, it's great.
[00:02:39] Unknown:
And do you remember how you first got started working in data and what it is about the space that's been keeping you interested?
[00:02:45] Unknown:
Well, look. As I just mentioned, I actually did my PhD in database systems and stream processing, so it was kind of natural for me to join the industry, and I worked for IBM and then at AWS. But, initially, when I was just starting my PhD, I actually tried to explore some other spaces like machine learning, statistics, and many others. Well, I just found that I'm probably more interested in building things instead of doing algorithms. Because at that time there were no LLMs, right, and there was no such thing as TensorFlow. Machine learning was more about, okay, thinking of a better algorithm. Personally, I don't really think I'm a machine learning guy. I really like building systems, things that can work.
So, yeah, that's why I chose to focus on databases and stream processing during my PhD. And after that, well, as you just mentioned, I just continued my journey. I felt that I cannot just purely do research. I really wanted to apply what I learned in the industry, to build something that everyone can use. Yeah. That's my story.
[00:04:10] Unknown:
And that brings us now to what you're building at RisingWave. As we mentioned at the open, it's a database engine, it's built for streaming applications, and it's built on top of S3. But I'm wondering if you can give a bit more context and color as to what it is that you're building, some of the story behind how it came to be, and why you decided that this was a necessary contribution to the ecosystem.
[00:04:34] Unknown:
Well, RisingWave is a streaming database, a SQL streaming database. The main idea of RisingWave is to make stream processing much easier to use and much, much more cost efficient. In terms of ease of use, we think that conventional stream processing systems were about, okay, you have to write Java, and you probably have to understand a lot of technical details like checkpointing and some other things. But what I feel is that, okay, look, people should be able to use stream processing with a familiar experience, like the Postgres experience.
And in terms of cost efficiency, well, nowadays, because of all kinds of reasons, like cutting costs, right, people care a lot about cost efficiency. And I feel that, especially in stream processing, many people say they probably do not really want to have a stream processing system because of the cost. And I feel that, okay, look, in the cloud we can actually leverage the so-called cloud native architecture to reduce the cost. That's why we built RisingWave, and that's why we adopted the so-called decoupled compute and storage architecture to power the stream processing engine. So, yeah, that's the story of RisingWave. But why did I build RisingWave in the first place? Well, as I just mentioned, before that I was at AWS, at Redshift.
Redshift is a data warehouse, and many people know that they probably really want to use Redshift for, say, batch processing, or to store a large amount of log data. But when I was at AWS, I felt that, okay, actually more and more people are interested in real-time data. They probably just want to build some dashboards on top of the real-time data, or to gain some real-time insights from that data. And Redshift was not a system for that kind of stream processing, for streaming data. So that's why I felt that, okay, we probably need to build something new, and build it from scratch. And that's RisingWave.
[00:07:00] Unknown:
And as you mentioned, stream processing has been around for a while. Largely, that has been the domain of very programmatic and code heavy engines; in particular, things like Spark and Flink come to mind. There have also been a lot of entries into the market for near real time database engines; things like ClickHouse and Druid come to mind for that. Streaming SQL systems have been around for a while, and I think one of the most notable ones right now is Materialize. And I'm wondering if you can talk to the specific niche that RisingWave addresses, whether that is a subset or a superset, or where in the Venn diagram of all of those problems RisingWave is a particularly well suited option?
[00:07:44] Unknown:
Before I started building the company, I thought that, okay, a streaming database like RisingWave was probably a subset of stream processing systems like Flink or Spark. Right? Because Spark and Flink are pretty mature, and they actually provide a set of APIs: Java API, Scala API, Python API, SQL API. And people can probably do anything on top of these systems. But over the last 3 years, I have come to feel that, essentially, a streaming database is not a subset of a stream processing system. It's actually a superset.
The reason is that a streaming database can actually store data, which makes it quite different from Flink and Spark, because Spark and Flink are just computation engines, and they do not really store data. If you want to store data, you probably have to use something like S3, or store your data in some data lake format. Right? But since RisingWave has both the stream processing engine as well as the data storage, people can do more than just process their streaming data: they can actually store their streaming data and build real time dashboards directly on top of it.
So that's much more powerful than just a computation engine, because as long as you use a computation engine, you have to bring your own storage. Right? So you have to have your own storage, and that actually limits the usage of the computation engine. But I do feel that RisingWave is not trying to cover all the cases, all the functionality provided by Spark and Flink, because, well, as I just mentioned, Spark and Flink can provide a Java API, Scala API, Python API, but RisingWave doesn't really provide that. It only provides SQL. But people can use UDFs, Python UDFs and probably Java UDFs, scalar UDFs, just like how you use Snowflake and Redshift. And you also mentioned ClickHouse and Druid.
Well, I think these are definitely great systems. I really like these systems. But RisingWave addresses a little bit different problem, because, especially for ClickHouse, people use it as an OLAP database. Right? People probably store data, log data, and then do ad hoc queries using ClickHouse. But ClickHouse is not that great for stream processing. For example, if you want to extract real time insights from your data, ClickHouse probably can have materialized views, but, well, it's not that great for handling, let's say, streaming joins, streaming aggregations, and all those kinds of things. Right? And if we talk about real time ETL, I mean, ClickHouse also doesn't really send data out to some other systems. Right? But with RisingWave, you can do that, because it's a streaming engine.
So, yeah, that's quite different between RisingWave and ClickHouse. Definitely another key difference is that ClickHouse is more optimized for low latency ad hoc queries. It can definitely handle ad hoc queries efficiently, but RisingWave is more optimized for handling predefined queries, like for your monitoring applications or, let's say, your alerting applications. For those, you probably really want to have RisingWave. And I do see that there are some other so-called streaming SQL engines or streaming SQL databases, like Materialize, Arroyo, and Timeplus, and some others. But I think RisingWave is quite different from these systems, because RisingWave is a distributed, cloud native system, and we actually adopt the so-called S3 as primary storage architecture, also called the decoupled compute-storage architecture. And, also, we provide a full set of stream processing functionalities, like watermarks, time windowing, and many others, so that if people come from the Flink world, or probably from the Spark world, they can still use RisingWave without any headache.
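To make the watermark and time-windowing point concrete, here is a minimal sketch of what that Postgres-flavored SQL looks like in RisingWave. The source name, columns, and connector options are hypothetical, and the exact `FORMAT ... ENCODE ...` clause varies across RisingWave versions:

```sql
-- Hypothetical Kafka-backed source with a 5-second watermark on event_time.
CREATE SOURCE clicks (
    user_id INT,
    url VARCHAR,
    event_time TIMESTAMP,
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    connector = 'kafka',
    topic = 'clicks',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- Per-minute tumbling-window counts, maintained incrementally.
CREATE MATERIALIZED VIEW clicks_per_minute AS
SELECT window_start, COUNT(*) AS clicks
FROM TUMBLE(clicks, event_time, INTERVAL '1 MINUTE')
GROUP BY window_start;
```

The materialized view is updated continuously as events arrive, which is the "predefined query" pattern described above: the query is declared once, and the engine keeps the answer fresh.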
[00:12:21] Unknown:
And the point of S3 being the primary storage layer, and being able to decouple compute and storage, has been one of the selling points for a number of other database engines that have been focused largely on data warehouse use cases. I'm thinking in particular of things like Snowflake, but also distributed query engines such as Trino and Presto. But most of those aren't thought of in the same context as streaming applications. There are usually two separate environments, where your streaming engines are doing your real time computation, but then if you want to do longitudinal analysis and aggregate analysis over larger volumes of data, you go to your warehouse engine that has this decoupled storage and compute layer. And I'm curious, what are some of the architectural principles in RisingWave that allow it to do that separation of compute and storage at the same time as the real time stream processing, and what are some of the ways to think about RisingWave in that other context of the lakehouse or data warehouse ecosystem?
[00:13:27] Unknown:
Okay. I think these are two different questions. The first one is about the architecture, the so-called decoupled compute and storage, and the second one is how RisingWave can fit into, let's say, the data lakehouse ecosystem. Right? So for the first question, the key thing is that stream processing is continuous query processing. It means that we actually need to continuously maintain internal states, or intermediate results, because it's incremental computation, and you have to guarantee that your internal states or intermediate results never get lost. Because if the intermediate results get lost, you have to recompute from scratch, right, which can be super expensive.
Most of the existing stream processing systems, like Flink, actually adopt a so-called coupled compute and storage architecture. Basically, they maintain their internal states on their local machines. The advantage of this architecture is that these kinds of systems can be super fast. But the problem is that when it comes to, let's say, failure recovery and elastic scaling, these kinds of systems will probably have a hard time. The reason is that if the internal states get lost, they actually need to create another machine, and ask that machine to load the entire state from the last checkpoint and then get recovered, which can be super slow. We have heard from some customers migrating from Flink to RisingWave, and they complained that before, when using Flink, it would take probably 40 minutes or an hour to reload their internal states back from the checkpoint to their local machines. And 40 minutes or 1 hour is too long for some online applications.
But RisingWave actually stores the internal states on the remote side, basically the object store, right, S3. So, essentially, you can get instant failure recovery, because if the machine failed, then it failed. Not a big problem, because what it stored locally is only cache. We can just reboot a machine and ask that machine to directly access data from the remote object store. So that's about the failure recovery. And similarly for elastic scaling: we can achieve dynamic, second-level elastic scaling, which is super powerful for streaming workloads, because streaming workloads can fluctuate.
Probably in the morning we have heavier traffic, but at midnight we probably don't really need so many machines. Right? So that's why we use the so-called decoupled compute-storage architecture. And, definitely, this architecture is also much cheaper than just storing your data on your local machines, because S3 is much cheaper than EC2. And if you talk about the lakehouse, I think RisingWave can definitely fit into the lakehouse architecture.
The thing is that people nowadays care more and more about real time data, and they really want to ingest the data in real time into their lakehouse. If they have Kafka data, conventionally what people did is buffer the real time data in Kafka for, let's say, 7 days before syncing it into the data lake, into Iceberg or some other data format or framework. But nowadays people don't really want to do that, because they want to do analytics over both their real time data and their historical data. So they actually want to continuously sync data from, let's say, Kafka to their lakehouse.
So RisingWave can be the bridge, the tool that helps you continuously sync data from your messaging queues, or upstream systems, to downstream systems. And, actually, another interesting thing I probably want to mention is that RisingWave itself can be the lakehouse. The reason is that, again, RisingWave stores its own data in S3. So in the future we will probably consider using the Iceberg format to store our data in S3, which means that people can just use Presto or Trino, right, or some other execution engines, to query the data stored in RisingWave. So RisingWave will be the data lake on its own, and we'll have an open ecosystem if we store data in the Iceberg format.
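For the "continuously sync into the lakehouse" pattern described here, RisingWave documents an Iceberg sink connector. A rough sketch, reusing the hypothetical view from the earlier example; the warehouse path, table names, and omitted credential options are illustrative and version-dependent:

```sql
-- Hypothetical: stream a materialized view into an Iceberg table on S3,
-- so that Trino, Presto, or Spark can query it alongside historical data.
CREATE SINK clicks_to_iceberg FROM clicks_per_minute
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'window_start',
    warehouse.path = 's3a://my-bucket/warehouse',
    database.name = 'analytics',
    table.name = 'clicks_per_minute'
    -- s3.access.key / s3.secret.key / s3.endpoint omitted for brevity
);
```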
[00:18:40] Unknown:
That was going to be one of my other questions: what is the storage format, or what are some of the ways that you think about the storage layer in RisingWave, and how does that relate to the streaming and incremental capabilities that you've built into the engine?
[00:18:56] Unknown:
Well, right now, RisingWave has its own storage, which is a row store, not a column store. The reason is that for stream processing, in many cases we really want to do, let's say, joins and aggregations, and a row store is much more efficient for handling streaming joins and streaming aggregations. So that's why we use a row store. But, gradually, we find that more and more people use RisingWave to store data, because as the streaming data comes in and time evolves, this real time data becomes historical data. So more and more people leverage RisingWave to store their older data, their historical data. So, actually, we have the requirement.
Customers have the requirement to do more complicated analytics over their historical data using RisingWave. That's why we are thinking of introducing a column store. But we do not really want to build our own column store. Why? Because I am not a fan of always building something from scratch. Right? And nowadays there are so many cool products, cool data formats, and cool open source software. So I'm thinking of leveraging something that is open, which can be Iceberg. And there are definitely some other data formats, like Delta Lake. But right now we are currently working with the Iceberg community on storing data in the Iceberg format.
And once we have done that, then, definitely, people can just store historical data in RisingWave and do efficient query processing using their own system, or, let's say, Trino or Presto, or probably just using RisingWave, because RisingWave also has its own batch engine. And we also get the Iceberg ecosystem for free. That's our goal. Yeah. That brings to mind
[00:20:58] Unknown:
the ways that people, as you mentioned, are using things like Kafka for buffering a certain time delta of information and then flushing that out to a longer term storage format, whether that's in their Snowflake clusters or in Parquet files. And from what you're saying about using the row store to more easily do efficient joins and processing of the records as they're coming in, it makes me think that you could also pretty easily have some kind of checkpointing or snapshotting operation where it says: as long as data is older than x time delta, I'm going to push it out to some async process that converts it from the row store into the column store in the Iceberg format. That way you can do more longitudinal analysis of the data, but still get efficient query processing on new data as it arrives, with that rewriting process running in the background periodically to ensure that you're always getting the best layout of the data for the time delta that you're trying to operate over. Yes. Definitely. We are thinking of, periodically,
[00:22:06] Unknown:
compacting data from the row store to the column store, and probably compacting it into the Iceberg format or some other data formats. That's for sure. And, essentially, everyone is doing that. Right? Even for Kafka, they also do the same thing: they also want to compact their stored data into the Parquet format, or probably some other open data lake format.
[00:22:31] Unknown:
Digging more into RisingWave and some of the use cases and ecosystem that you're looking to enable: what are some of the other unique capabilities that the RisingWave engine provides, given the fact that it is natively using S3 as the storage layer, so you don't have that long time horizon of having to bring checkpointed data into a separate storage system to then do computation and push it back, constantly doing this back and forth? And what are some of the ways that the engine enables teams to change how they think about the types of operations they want to do on their data, and when they have to maybe span two systems versus being able to do it all in one context?
[00:23:12] Unknown:
Well, in terms of the unique capabilities: using S3 as the primary storage, the so-called decoupled compute-storage architecture, is more about the technology, the architecture. When people use RisingWave, or, let's say, Snowflake or Redshift, whatever, they actually do not need to know that, okay, look, this system leverages S3 as the primary storage. But what I feel is that a streaming database basically brings the database, brings the data storage, to stream processing, and actually creates many more use cases than just a stream processing engine. Well, think about this: what can we do with a stream processing engine like Flink or Spark? Okay, we can probably do real time ETL. Right? Everyone knows that. If I want to do, let's say, batch ETL, I can probably use Spark; that's the default option. Then people say, okay, probably I want to do real time ETL, and I can use Spark Streaming. Right? And if Spark Streaming doesn't really work, then I can choose Flink. That's all about the computation engine: they can extract data from one place, do transformations, and then send the data out to some other systems. That's real time ETL. And RisingWave, as a stream processing system, can definitely do real time ETL, but that's not the entire story. What I feel is that RisingWave can do much more than that.
So, besides real time ETL, we are focusing on two use cases. The first one is real time analytics over Kafka data. Conventionally, if people have Kafka data and they want to do analytics over that Kafka data, what do they do? They probably need to bring in, let's say, a stream processing engine, Spark Streaming or probably Flink, do the analytics, and then send the data into a database. Let's say Cassandra, let's say Redis, or Postgres, or even Oracle. Right? They can do this.
But the thing is that the database, the Cassandra or probably Redis or Postgres, is a different system than Flink. Right? These are two different systems. If they really care about freshness, if they really care about consistency, they have to figure out, okay, how do I integrate these two different systems? What if one system fails but the other one doesn't? And what if, let's say, there are some network issues between these two different systems? Right? So what RisingWave, as a database with its own data storage, does in this scenario is replace that entire data stack. You don't really need a Flink plus a database. You don't need a Spark plus, let's say, a Postgres.
It's too complicated. RisingWave is a database, and it can just directly consume data from Kafka, do transformations, and then display the results directly in your dashboard, in your, let's say, Grafana, in your Metabase, Superset, Tableau, whatever. So that can greatly simplify the architecture. That's real time analytics over Kafka data, and that's one of the use cases. And another interesting use case, which I didn't really expect 3 years ago, is that RisingWave, as a streaming database, can actually power or boost people's Postgres experience, or database experience, no matter whether it's Postgres or MySQL or probably MongoDB or whatever. Let's just take Postgres as the example. Okay, I take the Postgres example because I'm a fan of Postgres, and, definitely, MySQL is also great. And, also, RisingWave speaks the Postgres language. I mean, it's Postgres compatible, so we actually have better compatibility with Postgres. But, anyway, if people have Postgres, they will probably have some performance issues, because probably they want to do materialized views inside of Postgres, or probably they want to do some analytics inside of Postgres. In Postgres, materialized views require manual maintenance; everyone knows that. And Postgres is also not that good for analytical queries. Right? So what we find is that, essentially, people can use RisingWave as a Postgres booster, which means that RisingWave can continuously consume data from the Postgres change log, also called CDC, do the real time computation, I mean, build the materialized views, and, optionally, send data back to Postgres so that people can just query the real time results inside of Postgres. So that's a pretty interesting use case.
And we find that you cannot do this kind of thing using just a stream processing engine, or if you do, things become much more complicated, because you have to think about, okay, how do I use, let's say, Flink or Spark to consume the CDC, send it to a computation engine, do the computations, and then how do I send the data back to Postgres? RisingWave is a database itself, so it can create a materialized view, and as a stream processing engine it can continuously ingest data from Postgres and send data back to Postgres. So that's definitely another very interesting use case we have seen, yeah, over the last few years.
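As a rough sketch of that "Postgres booster" loop, under the assumption of RisingWave's documented `postgres-cdc` and `jdbc` connectors (all names, hosts, and credentials below are made up, and the exact connector parameters vary by version):

```sql
-- Hypothetical: ingest a Postgres table through native CDC,
-- with no Debezium or Kafka deployment in between.
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    amount NUMERIC,
    created_at TIMESTAMP
) WITH (
    connector = 'postgres-cdc',
    hostname = 'pg-host',
    port = '5432',
    username = 'rw_user',
    password = 'secret',
    database.name = 'shop',
    schema.name = 'public',
    table.name = 'orders'
);

-- The "booster": an incrementally maintained aggregate that Postgres
-- would otherwise have to refresh by hand.
CREATE MATERIALIZED VIEW revenue_per_day AS
SELECT created_at::DATE AS day, SUM(amount) AS revenue
FROM orders
GROUP BY created_at::DATE;

-- Optionally push the fresh results back, so the application keeps
-- talking to one single database: Postgres.
CREATE SINK revenue_to_pg FROM revenue_per_day
WITH (
    connector = 'jdbc',
    jdbc.url = 'jdbc:postgresql://pg-host:5432/shop?user=rw_user&password=secret',
    table.name = 'revenue_per_day',
    type = 'upsert',
    primary_key = 'day'
);
```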
[00:29:26] Unknown:
And I imagine, if it's not already available out of the box, it should be fairly trivial to even set up RisingWave as a target for a foreign data wrapper, so that you can actually just use Postgres to query directly against the RisingWave instance and retrieve the latest output of that materialized view, without even having to do the round trip back into the Postgres database.
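For illustration, since RisingWave speaks the Postgres wire protocol, the host's suggestion could in principle look like plain `postgres_fdw` pointed at a RisingWave endpoint. This is an untested sketch with made-up names; whether it works end to end is exactly what the conversation goes on to discuss:

```sql
-- Hypothetical: run inside Postgres, treating RisingWave as a foreign server.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER risingwave FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'rw-host', port '4566', dbname 'dev');

CREATE USER MAPPING FOR CURRENT_USER SERVER risingwave
    OPTIONS (user 'root', password '');

-- Expose the incrementally maintained view as a local foreign table.
CREATE FOREIGN TABLE revenue_per_day_live (day DATE, revenue NUMERIC)
    SERVER risingwave OPTIONS (table_name 'revenue_per_day');

-- Fresh results, queryable (and joinable) from Postgres itself.
SELECT * FROM revenue_per_day_live ORDER BY day DESC LIMIT 7;
```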
[00:29:48] Unknown:
Well, that's definitely an amazing idea. I have to say that we are actually working on that. To be honest, I said that we can send data back to Postgres because people just want to interact with one single database when they build their application. They don't really want to, let's say, bring in a Snowflake or Redshift, bring in another database, and ask the application to retrieve data from a second database or a second data warehouse. They just want to use Postgres, interact with Postgres. So we said, okay, optionally you can send data back to Postgres. But then we found that people can actually use a foreign data wrapper to directly access RisingWave data using Postgres, and not just access the data; they can also join RisingWave results with a Postgres table. Right? Yep. That's definitely something we are currently working on, and I believe we will deliver it in Q1 2024. Yeah. And so digging deeper now into RisingWave itself and some of the design and implementation of the system: can you give a bit of an overview of the architectural principles of how you've approached it, some of the specifics of how you've built it, and some of the ways that the overall goals and scope have changed from when you first began working on the project? Well, definitely, as we have already discussed, we use the so-called decoupled compute-storage architecture. There are a lot of advantages, right, but I can tell you something about the limitations.
One of the limitations is that S3 is super slow. People have the impression that S3 is super slow, and that's true, so we cannot allow the system to frequently access data in S3. Right? What we do is bring in a caching layer, which is a tunable caching layer: as an advanced user, you can definitely set the cache size. So that's the limitation, and we address it with the caching system. But, actually, an interesting thing is that when we say we store data in S3, people feel that your system should never run into the so-called OOM problem, right, the out-of-memory problem, because, they think, you store data in S3, so you should never hit an out-of-memory problem. But that's not the case, because we still need to use memory to do the computation.
So every time we run into the out-of-memory problem, we have to explain to our users why we use S3 as the primary storage and yet you may still run into the OOM problem: you are still doing computation using your memory. Okay, so that's about the limitations of the architecture. But if you ask me whether we changed the design principles, the goals or scope of RisingWave: I used to be an engineer, so I focused a lot on the technical details. But over the last 3 years, I gradually found something. Initially we felt that RisingWave's goal was to democratize stream processing, right, to make stream processing more accessible to everyone. And I have to say that we will never change that goal; we still want to democratize stream processing. But what I feel now is that we should not focus only on the technology.
We should not focus only on, I mean, how to make stream processing easier to use. I think that's pretty important from an engineer's perspective, I totally agree with that, and we work hard to make things much easier to use. But the other thing I would like to say is that we are actually pushing hard to advocate for stream processing's use cases. Because when I talk to people, people always say: probably I don't need streaming data, I don't need stream processing, because I don't know where I can use your system. So instead of pitching people about, okay, we are building a better stream processing system, nowadays we talk a lot about: you don't need to know what stream processing is. You can use RisingWave in this scenario, to do SQL stream processing over your Kafka data, or probably just to boost your Postgres performance. I think that's more about the product messaging, right, and how our product messaging evolved over time. But, definitely, I still have faith in stream processing, and I do believe that the entire market should think about, okay, how we can educate people and how we can convince people that stream processing is not a monster.
Stream processing is just a technology that everyone can use.
[00:35:06] Unknown:
Another aspect of RisingWave that gets called out in the documentation is the fact that it is a distributed query engine as well, so you're not limited to single node performance. And I'm wondering, what are some of the engineering challenges that you've had to work around between managing the scale-out capabilities of the system along with the performance considerations around S3 as the permanent storage? And just some of the ways that you think about what the priorities are that you're focusing on, and then maybe some of the ways that the CAP theorem comes into play along those considerations.
[00:35:44] Unknown:
Well, there are so many questions there, but, okay, look. RisingWave is a distributed system, and people also talk about scale-out, right, but I have to say that in many cases we do scale up first. That is, we change to a better machine first, we try a better machine first, rather than just scaling out. Essentially, that's how the cloud is different from big data, because in the big data world, a company already has a set of machines, commodity machines, right, and they need to think about, okay, how do I scale from 1 node to, let's say, 10 nodes, and then from 10 nodes to 100 nodes. But in the cloud, instead of scaling out, we actually have the option of scaling up, which is much more cost efficient, because you do not need to deal with, let's say, the network latency.
So we always recommend scale-up before recommending scale-out, but we are still building a distributed engine. The reason is that in the most challenging cases, when handling big-state stream processing, let's say joins and aggregations, you will easily run out of memory, so you have to distribute from one machine to several other machines. And, also, scaling up has another limitation, which is that if you want to, let's say, do dynamic scaling based on your streaming workload fluctuation, you actually have to stop the machine and migrate the entire state from one machine to another. Right? If we scale out, it makes things much easier, because we can just migrate part of the state to the newly booted machine. Okay, so let's talk about distributed stream processing. I think the key challenge is not how we interact with S3, but how we handle data skewness. Because stream processing is about real time data, and real time data can definitely skew. Right? If you use, let's say, a batch processing system, let's say Snowflake or ClickHouse or Redshift, they can just rebalance, right, rebalance in batches.
But in stream processing, you have no idea what the data skewness will look like. You have to adjust for the skewness, you have to adjust the data distribution, at runtime. So that's much more challenging than just rebalancing your batch data. So we actually made a lot of effort in rebalancing streaming data, and it's more about how we handle the shuffling and some other things. Sorry, what was the second question?
[00:38:44] Unknown:
Mostly, I was just interested in the engineering challenges that you've had to address in being able to build this real time processing on top of S3 and do it in a distributed fashion. And then the last curveball that I threw in there was the question about how the CAP theorem comes into play in the context of trying to address all of these challenges.
[00:39:04] Unknown:
Okay, I see. All the things I mentioned were more about the engineering challenges. But, essentially, we have also put a lot of effort into optimizing the user experience for the local machine use case. What we find is that in many cases people start from, let's say: I want to start using RisingWave. They first check out our GitHub and then download RisingWave into their local environment, which is probably just a laptop, a Mac. Right? So we actually lowered the bar for people to use RisingWave on their local machine, and we did a lot of optimizations for that. On your local machine, let's say your Mac, you definitely don't have access to, yeah, S3 or probably HDFS. Right? So we pack everything together into one single binary, so that you can just run RisingWave on a small machine, and people can directly test RisingWave's functionality instead of its scalability.
They test RisingWave's functionality on their local machine, and if they feel that, okay, look, RisingWave is the system I want, then they will think about, okay, whether I can install RisingWave in my own cluster using Kubernetes, because nowadays everyone uses Kubernetes. So, yeah, anyway, though RisingWave is a distributed system, we actually put a lot of effort into optimizing the user experience, especially for people using RisingWave on their local machine.
[00:40:44] Unknown:
And another element of that user experience and developer experience that I noticed when I was preparing for this conversation is around change data capture. As you referred to earlier, most of the time a system will say: oh, yeah, we support CDC as long as you're running Debezium. So that's at least one more system that you're running, where you have maybe Kafka with Kafka Connect and Debezium. But you opted to actually build your own CDC processing for Postgres, and I think you said MySQL as well. And I'm wondering what were some of the motivations for engineering that into the RisingWave system itself versus deferring to other existing solutions.
[00:41:25] Unknown:
Well, first of all, I have to say that we definitely love Debezium. Debezium is super powerful, and we actually also use Debezium; that's for sure. But the thing is that RisingWave is quite different from the other stream processing systems in handling CDC, in that we do transactions. We actually do transactional CDC. Look, if you do CDC, let's say Postgres CDC or MySQL CDC, these databases are transactional, so people actually want to preserve the transactional semantics from these databases. Right? But the existing stream processing systems, let's say Spark, and I don't know whether Spark or Spark Streaming can do CDC, or probably Flink and some others, they cannot preserve the transactional semantics.
So people, especially in the fintech domain, or probably some other domains that leverage transactions, actually cannot accept this. So we put a lot of effort into building so-called transactional CDC, to allow RisingWave to understand the transaction boundaries issued by the upstream systems like Postgres and MySQL. And that's how RisingWave's CDC is more powerful, and actually more popular, than the other systems' CDC.
[00:42:52] Unknown:
In terms of the overall workflow of teams who are adopting RisingWave, you mentioned that the onboarding path is that somebody downloads it, they play around with it, and then they say: hey, this is really cool, let me put it in production. Once you've got a RisingWave instance deployed and you're starting to build workflows on top of it, I'm wondering how teams are thinking about managing that overall workflow, some of the other elements of the ecosystem that they're bringing to bear, and other investments that you and the rest of the RisingWave team are making to make it a more natural citizen of the overarching data engineering community.
[00:43:34] Unknown:
Well, I think it's about two things. One is ease of use, and the second one is the ecosystem. Ease of use I just mentioned: we have to show people the value of RisingWave, fast. People will not invest, let's say, a week or probably a month into your system. They will probably just check out your GitHub page for 5 minutes or probably 10 minutes. If they're interested, they will probably dive deeper into it and probably download it. And they probably don't even want to deploy your system in Kubernetes; they just really want to play around on their laptop. If they feel that, okay, it's not the system for them, then they will stop using it; they will just delete it right away from their laptop. Right? So we made things much easier to use on their own laptop. And about the ecosystem: if people want to bring a system into their data stack, they actually care a lot about whether the new system they want to bring in can integrate with their existing data stack.
That's why we actually built a bunch of integrations with both upstream systems and downstream systems. And some people say: no, you probably don't need this, because we can probably use Kafka Connect. Right? But the thing we find is that, look, people don't really want to bring in an additional system in order to use your system. They just want to use one single system to address their whole problem. That's how people think. Right? People don't really want to bring in troubles; people don't really want to bring in complexity. They just want one single system that works out of the box, fits into their data stack, and runs. That's the thing. So we actually built a bunch of connectors, direct connectors, without Kafka Connect, without any additional system's help. We just built direct connections to all these upstream systems and downstream systems, to make RisingWave much easier to use and to make people more comfortable integrating RisingWave into their existing data stack.
[00:45:51] Unknown:
And as you have been working on RisingWave, working with the community, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:46:01] Unknown:
Well, if we're talking about innovative, I can give one example, which I think is more of an unexpected way of using RisingWave. In the RisingWave community, we do see a lot of, let's say, trading firms, banks, or probably SaaS companies, tech companies. Definitely, they know what stream processing is. They know they have real time data, and they need real time data processing. But what we find is that, actually, RisingWave is also kind of popular in the manufacturing industry. I can give you one example which, to me, was kind of a surprise. This manufacturing company is one of the world's largest manufacturers producing TV mainboards. This kind of company gives people the impression that, okay, look, these kinds of companies are kind of old school, they do not really have a modern data stack, and they are probably still using some old fashioned data systems. But they actually adopted RisingWave to monitor their production lines. This is a big company, an enterprise company, and they have multiple factories across the world.
And before introducing RisingWave, they needed to monitor the production lines across the world, and they actually had to make phone calls to the on-duty personnel and ask them: okay, is everything good? Is there anything I can help with? Is anything wrong? Do we need to handle some emergency issue? Every 6 hours, they made such a phone call. But now, since this company introduced RisingWave into their stack, they can just sit in their office. They installed these sensors on their production lines, so that, essentially, the sensors can send their IoT data into the database installed in the factory.
And the factory continuously sends data back to their office, back to their RisingWave instance, and RisingWave directly displays the dashboard to their IT personnel. So they know, okay, what happened in my factories across the world, whether there are any emergency issues I need to deal with, and whether there are any, let's say, flaws or anomalies I need to deal with. Right? This kind of use case really impressed me, because I used to think that these kinds of factories are not interested in new technology.
But this kind of use case really refreshed my mind. And I do really think that more and more factories, and more and more companies, will introduce new technology to help them improve efficiency as well as save cost.
[00:49:01] Unknown:
And in your own experience of working on the RisingWave engine, contributing to this environment and ecosystem, and building a business on top of an open source product, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:16] Unknown:
The biggest lesson I learned is: do not just focus on technology. Especially in the first 2 years, I spent a lot of effort thinking about, okay, what stream processing is and how we can build a better stream processing system. That's pretty important. I used to be an engineer, and I really love that. I really love hacking code, I really love learning new technologies, and I really love learning, I mean, how to reduce the S3 latency. These are cute questions. But what I feel now is: better understand users. Not just the users that use your product, but the users that do not use your product. We need to think about the user journey. Not the user journey of using your product, but the user journey of building a data stack. Right? Let's say someone builds a new company. The first system they introduce will be Postgres, or probably some OLTP database. And the second system they will probably introduce is a Redshift or probably a Snowflake, right, an OLAP database. But then if you tell them: look, do you need stream processing? I actually have a better stream processing system that can replace your Flink. They won't buy it, because they have no idea what stream processing is and what Flink is. They only care about their own business, and whether you can bring more value to that business. So the key lesson I learned is that I actually have to think more about the user journey: where they are, when RisingWave can bring value to their business, and when RisingWave can be a good fit in their data stack. That's the biggest lesson I learned over the past 3 years.
[00:51:03] Unknown:
And for people who are interested in trying to gain a more up to date view of the data that they have, what are the cases where RisingWave is the wrong choice?
[00:51:14] Unknown:
Well, don't use RisingWave to replace your Postgres if you are looking for a transactional database. I mean, RisingWave can process your transactional CDC, but it doesn't really support read and write transactions. In that case, you'd better use Postgres or probably MySQL or probably even Oracle. And the second thing is that RisingWave is probably not that great for OLAP, ad hoc workloads. If you want to do, let's say, large scans over your base tables, you probably need, let's say, Snowflake, Redshift, or probably ClickHouse.
But RisingWave can be part of your data stack, even if you already have a Postgres and a ClickHouse, because it brings the unique value of stream processing
[00:52:06] Unknown:
to your data stack. As you continue to build and iterate on the RisingWave project and build a business around it, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to dig into?
[00:52:20] Unknown:
I think there are two projects I'm super excited about. The first one is the so-called unified batch and stream processing. So, yeah, RisingWave is a streaming database, and its focus is on stream processing. But what we find is that people are also interested in batch processing. I'm not saying that, okay, RisingWave should directly compete against a Snowflake, Redshift, or whatever. But the thing is that people do not just want to process their real time data; sometimes they also want to join their real time data with their batch data. So we actually need to build the muscle for processing batch data. And, also, as I just mentioned earlier in this podcast, we find that as more and more people store their data in RisingWave, we actually need to provide tools for them to better process their historical data.
So that's another motivation for why we need to build the batch processing capability. That's one project. And the other project I'm super excited about is being more integrated with the so-called data lake ecosystem. As I just mentioned, we have the integration with Iceberg, and we are also looking at other data lakes, like Hudi and Delta Lake. We feel that in the future, people can just store data, no matter what system it is, in S3, and they can store that data in some open data format so that any other data system can access it. That's why we think we should have better integration with the data lake ecosystem. Earlier in the conversation, I also mentioned that we have a project with the Iceberg community called iceberg-rust. Through that project, we actually help people better integrate Rust-based projects with Iceberg.
These are the two projects I'm super excited about.
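As a small illustration of the unified batch-and-stream idea: the same view that RisingWave maintains incrementally can also be hit with one-off batch queries through its batch engine. Reusing the hypothetical `revenue_per_day` view from the earlier sketch:

```sql
-- Ad hoc batch query over a continuously maintained view:
-- no separate warehouse round trip required for recent data.
SELECT day, revenue
FROM revenue_per_day
WHERE day >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY day;
```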
[00:54:27] Unknown:
Are there any other aspects of the RisingWave project itself, the overall streaming data and data lake ecosystem, or the work that you're doing to build a business on top of RisingWave, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:43] Unknown:
Well, definitely, I would actually say that, from the business perspective, we have to think about, okay, how we can educate people about stream processing. Because what I think is that, right now, stream processing is still kind of a niche. And not just us, not just RisingWave, but also some other vendors, even Confluent, actually need to think about, okay, how we can better educate people about stream processing. And that's the only thing I can probably think of. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. So I think people are building great technology, but we have to wrap the technology in some product that is super easy to use for developers
[00:55:37] Unknown:
on their local machine, on their own laptop, probably. That's the gap. I believe that's something we need to think about and probably fill in. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on RisingWave, helping to bring streaming data processing to a broader audience and making it easier to combine stream processing with analytical use cases. I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Yeah. Thank you, Tobias. Thank you so much. Yeah.
[00:56:14] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and The Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Yingjun Wu and RisingWave
Building a Stream Processing Engine on S3
RisingWave's Niche in the Streaming Database Market
Architectural Principles and Cloud-Native Design
Unique Capabilities and Use Cases of RisingWave
Design and Implementation of RisingWave
User Experience and Ecosystem Integration
Innovative Applications and Lessons Learned
Future Plans and Exciting Projects