Summary
There have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different areas of focus. Pulsar is one of the recent entrants which has quickly gained adoption and an impressive set of capabilities. In this episode Sijie Guo discusses his motivations for spending so much of his time and energy on contributing to the project and growing the community. His most recent endeavor at StreamNative is focused on combining the capabilities of Pulsar with the cloud native movement to make it easier to build and scale real time messaging systems with built in event processing capabilities. This was a great conversation about the strengths of the Pulsar project, how it has evolved in recent years, and some of the innovative ways that it is being used. Pulsar is a well engineered and robust platform for building the core of any system that relies on durable access to easily scalable streams of data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
- Your host is Tobias Macey and today I’m interviewing Sijie Guo about the current state of the Pulsar framework for stream processing and his experiences building a managed offering for it at StreamNative
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what Pulsar is?
- How did you get involved with the project?
- What is Pulsar’s role in the lifecycle of data and where does it fit in the overall ecosystem of data tools?
- How has the Pulsar project evolved or changed over the past 2 years?
- How has the overall state of the ecosystem influenced the direction that Pulsar has taken?
- One of the critical elements in the success of a piece of technology is the ecosystem that grows around it. How has the community responded to Pulsar, and what are some of the barriers to adoption?
- How are you and other project leaders addressing those barriers?
- You were a co-founder at Streamlio, which was built on top of Pulsar, and now you have founded StreamNative to offer Pulsar as a service. What did you learn from your time at Streamlio that has been most helpful in your current endeavor?
- How would you characterize your relationship with the project and community in each role?
- What motivates you to dedicate so much of your time and energy to Pulsar in particular, and the streaming data ecosystem in general?
- Why is streaming data such an important capability?
- How have projects such as Kafka and Pulsar impacted the broader software and data landscape?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Pulsar used?
- When is Pulsar the wrong choice?
- What do you have planned for the future of StreamNative?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Apache Pulsar
- StreamNative
- Streamlio
- Hadoop
- HBase
- Hive
- Tencent
- Yahoo
- BookKeeper
- Publish/Subscribe
- Kafka
- Zookeeper
- Kafka Connect
- Pulsar Functions
- Pulsar IO
- Kafka On Pulsar
- Pulsar Protocol Handler
- OVH Cloud
- Open Messaging
- ActiveMQ
- Kubernetes
- Helm
- Pulsar Helm Charts
- Grafana
- BestPay
- Lambda Architecture
- Event Sourcing
- WebAssembly
- Apache Flink
- Pulsar Summit
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
[00:00:51] Unknown:
You monitor your website to make sure that you're the first to know when something goes wrong. But what about your data? Tidy Data is the DataOps monitoring platform that you've been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, PagerDuty, and custom webhooks, you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required. Your host is Tobias Macey, and today I'm interviewing Sijie Guo about the current state of the Pulsar framework for stream processing and his experiences building a managed offering for it at StreamNative. So, Sijie, can you start by introducing yourself? Hi. Hi, everyone.
[00:01:32] Unknown:
Thank you for having me on the Data Engineering Podcast. My name is Sijie Guo, and I'm currently the CEO and co-founder of StreamNative. StreamNative is a San Francisco based startup, and we are providing a cloud native event streaming platform powered by Apache Pulsar. We also offer a fully managed service for Pulsar on different public clouds, and the managed service can run either in our cloud account or in the customer's account. So,
[00:02:02] Unknown:
yeah, thank you for having me here. And do you remember how you first got involved in the area of data management?
[00:02:08] Unknown:
Yeah. So I started my journey working on distributed cluster systems. About 10 years ago, Hadoop was gaining traction in China, and I was among the first set of contributors there who contributed to Hadoop, HBase, and Hive. I was part of the initial team that built the Tencent data warehouse based on Hive, and Tencent is one of the largest internet companies in China. After working on the Tencent data warehouse, I moved to Yahoo, and that's where I got involved in a lot of development on BookKeeper and later on Pulsar, and got into the whole messaging and streaming space.
[00:03:00] Unknown:
And the separation of the storage from the broker in Pulsar is definitely one of the things that I find most interesting about it from the architectural perspective, and I know that BookKeeper is being used by a number of other systems as well. For people who are interested in more of the background and early days of Pulsar and some of the architectural principles, I did interview a couple of the other core committers on the project a couple of years ago, so I'll put a link to that in the show notes. And for anybody who hasn't listened to that, can you just give a bit more of an overview about what Pulsar is and how you first got involved with the project? For Pulsar, we usually use one sentence to describe
[00:03:46] Unknown:
the capability provided by Pulsar: it's a pub/sub messaging system backed by a durable log storage. So you can use Pulsar as a normal messaging system, like you would use Kafka or RabbitMQ or ActiveMQ. But the second half of the sentence basically tells you how Pulsar is different from many other messaging systems: it's backed by a durable log storage, and that durable log storage is the BookKeeper project you mentioned. I was one of the first engineers involved in the BookKeeper project. BookKeeper was originally started in Yahoo Research, and it was designed to address the high availability issue of the HDFS NameNode. Its core replication mechanism was abstracted out of the distributed consensus algorithm used by ZooKeeper.
It then evolved into a distributed log storage that you can use for building out many different systems. At that point, maybe 10 years ago, we tried to build the first pub/sub messaging system based on BookKeeper, which was called Hedwig. That project has since been retired, but it basically set the foundation for the whole architecture of Pulsar and many other followers in this space, especially the separation of the broker serving layer from the message storage. So you have two separate layers that you can scale independently, which also improves high availability and failover time.
I actually wrote a bunch of articles a few years ago talking about the architectural advantages of this layered, segment-centric storage, so feel free to check out those articles on the internet.
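[Editor's note] To make the pub/sub model concrete, here is a minimal sketch using the Pulsar Java client. The broker address, topic, and subscription name are placeholder values for illustration; it assumes a broker running locally on the default port.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class PulsarQuickstart {
    public static void main(String[] args) throws Exception {
        // Connect to a broker; the durable log storage (BookKeeper) sits behind it.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Create the subscription first so the published message is retained for it.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/my-topic")
                .subscriptionName("my-subscription")
                .subscribe();

        // Publish a message to the topic.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/my-topic")
                .create();
        producer.send("hello pulsar".getBytes());

        // Receive and acknowledge; acknowledgments are tracked per subscription.
        Message<byte[]> msg = consumer.receive();
        System.out.println("Received: " + new String(msg.getData()));
        consumer.acknowledge(msg);

        client.close();
    }
}
```

The same topic can carry many independent subscriptions, which is part of what lets Pulsar serve both queuing and streaming consumption patterns.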
[00:06:00] Unknown:
And in terms of the overall life cycle of data, where does Pulsar fit in the overall ecosystem of different data tools? I know that it is sometimes compared to Kafka, or maybe used in conjunction with, or instead of, things like Spark Streaming or Flink. I'm wondering if you can give a bit more of a picture of the different ways that Pulsar is being used and some of the use cases that it's optimized for. So I think to get started,
[00:06:28] Unknown:
I will try to clarify a bit about the capabilities that Pulsar provides. As I said, Pulsar was originally a flexible pub/sub messaging system, so it offers all the capabilities of a messaging system. But after Pulsar incubated in the Apache Foundation for about two years, it evolved into more of a messaging plus streaming system, what we usually call a cloud native event streaming platform. What that means is the core abstraction within Pulsar is a distributed log, an event stream, and it can be used for storing infinite streams of events. So the capability provided by Pulsar is that you are able to use Pulsar to ingest events into topics.
You are able to keep the events for a longer duration based on your retention policy, and you are able to use different data processing tools: you can integrate with Spark and Flink to do unified data processing, and you can use Presto or Hive to do interactive queries. We also introduced Pulsar Functions for lightweight computation. With that being said, in terms of its role in the whole ecosystem: first, since it provides the ingestion capability for people to get data into Pulsar, you can use it as a messaging system to connect services with your whole data infrastructure, so it becomes a kind of integration platform. And since we provide the capability of storing events for a longer duration, you can use it as a stream storage. In my opinion, it has evolved into a kind of streaming database, because it provides schemas, so you can treat those event streams as structured event streams. When we do the integration with Flink, we actually map those topics into tables in the Flink catalog, so you are able to use those data processing engines to query and process the data. So in short, to summarize: it's a messaging platform that you can use for data ingestion, and, I would say, a stream storage that you can use for data processing. So that is the idea.
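[Editor's note] As a small sketch of the "structured event streams" idea: the Pulsar Java client can attach a schema to a topic so that every message is a typed record rather than raw bytes, which is what lets engines like Flink treat a topic as a table. The PageView type and topic name here are hypothetical.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SchemaExample {
    // Hypothetical event type; Pulsar derives and registers an Avro schema from the POJO.
    public static class PageView {
        public String url;
        public long timestampMillis;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Messages on this topic are now schema-checked, structured events.
        Producer<PageView> producer = client.newProducer(Schema.AVRO(PageView.class))
                .topic("persistent://public/default/page-views")
                .create();

        PageView view = new PageView();
        view.url = "/home";
        view.timestampMillis = System.currentTimeMillis();
        producer.send(view);

        producer.close();
        client.close();
    }
}
```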
[00:09:11] Unknown:
And the Functions capability seems to be one of the more interesting ways that it can be used, because I know that, for instance, Kafka has support for Kafka Streams and the Kafka Connect plugins, where Pulsar has Pulsar IO as its analog to that. But it seems that the Functions capability is a bit more tightly integrated into the capabilities of Pulsar. So I'm wondering if you could talk a bit to that, and some of the other capabilities of Pulsar that make it stand out from some of the other options that people might consider for this durable pub/sub use case. Yeah. So,
[00:09:42] Unknown:
I think the Pulsar project has evolved and changed a lot over the past two years, and Functions is definitely one of the most attractive features that a lot of people love to use. A function is basically a very lightweight computing, I would say event processing, framework that brings the whole serverless idea into event streaming. You can write your event processing logic using the language you like: you can write a function using Java if you're a Java developer, or using Python, or using Go. You can write functions as you like, and you don't need to learn a new framework; for every engineer, the first thing you learn is how to write a function. This reduces the barrier for people who want to add processing capability to an existing pub/sub messaging system. One of the reasons is that, I would say, about 50% of the workloads a messaging system is used for are basically connecting services within an infrastructure, and in order to provide an easy way for people to express that logic, a function is definitely the simplest way, because you don't have extra dependencies: you can just write a function as you want and submit it. So it is a bit different from a traditional data processing engine; it is more focused on lightweight computing use cases like ETL, transformation, routing, and maybe simple aggregation. So that is Functions.
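[Editor's note] To make that concrete, a minimal Pulsar Function in Java is just a class implementing the Function interface, with no other framework to learn. The exclamation logic here is a stand-in for any transformation or routing step.

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Consumes a string from the configured input topic, transforms it,
// and the return value is published to the configured output topic.
public class ExclamationFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        context.getLogger().info("processing {}", input);
        return input + "!";
    }
}
```

A class like this is typically packaged as a jar and submitted with the pulsar-admin functions tooling, pointing it at input and output topics.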
Besides Functions, Pulsar has added many features in the past couple of years, and I can share some of them. One is tiered storage. Tiered storage basically provides the ability to extend the storage capability provided by BookKeeper into much cheaper forms of storage, like S3, GCS, Azure, or even HDFS on prem. This allows you to keep the data in the system in the form of an infinite event stream, so you don't need to dump the data out of your messaging system into some other storage format. And since tiered storage lets you keep data for a much longer duration, it actually provides a unified abstraction of your data, which is the infinite event stream. When you integrate this data model with Flink, you can create a unified data processing stack; that is the whole idea behind it. I can call out some other features, like the Key_Shared subscription type, which is an interesting one, and also the protocol handler, which allows Pulsar to plug in different messaging protocols. Those features are driven by the use cases, driven by the adoption in the community.
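[Editor's note] For reference, this is roughly what the Key_Shared subscription mentioned above looks like from the Java client: several consumers share one subscription, and messages with the same key are always dispatched to the same consumer, preserving per-key ordering while scaling out. Topic and subscription names are placeholders.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class KeySharedExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Each consumer attached this way receives a key-partitioned slice of the stream.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/orders")
                .subscriptionName("order-processors")
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();

        consumer.close();
        client.close();
    }
}
```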
[00:13:16] Unknown:
And what are some of the other characteristics of the community that has grown up around Pulsar that you would see as being distinct from some of the other streaming systems that are being used by people?
[00:13:19] Unknown:
So in the past two years, Pulsar has been community driven and use case driven, and from what we have seen, most of the successful adoptions of Pulsar come from three main categories. One is existing RabbitMQ and ActiveMQ users. That adoption comes more from building out core applications, and it drives a lot of development of messaging oriented features like TTL, dead letter topics, and scheduled or delayed messages. Those are features more commonly seen in traditional messaging and queuing systems. The second category is driven more by data processing use cases, like integrating with Flink and integrating with Spark. That introduced a lot of features like the Key_Shared subscription, tiered storage, and columnar offload, to provide an efficient way for a data processing engine to process the events within Pulsar. And the third category comes from IoT use cases. I would say that is what led to the creation of Pulsar Functions, and it drives a lot of development around bringing serverless, or lightweight computing, features into Pulsar. And that's how the community helped the whole Pulsar team and the Pulsar PMC materialize the project as a product.
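[Editor's note] As an illustration of the queue-oriented features mentioned above (delayed messages and dead letter topics), here is a hedged sketch with the Java client; the topic names and thresholds are made up for the example.

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.DeadLetterPolicy;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class QueueFeaturesExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Delayed delivery: the message becomes visible to consumers
        // on shared subscriptions only after the delay elapses.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/tasks")
                .create();
        producer.newMessage()
                .value("retry-later".getBytes())
                .deliverAfter(10, TimeUnit.MINUTES)
                .send();

        // Dead letter topic: after maxRedeliverCount failed deliveries,
        // the message is routed aside instead of redelivering forever.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/tasks")
                .subscriptionName("workers")
                .subscriptionType(SubscriptionType.Shared)
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(3)
                        .deadLetterTopic("persistent://public/default/tasks-dlq")
                        .build())
                .subscribe();

        consumer.close();
        producer.close();
        client.close();
    }
}
```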
[00:14:46] Unknown:
And over the past two years, I know that some of the features that have been incorporated since the last time I talked about this project on the podcast are things like the function workers, and the integrated SQL layer is new as well. I'm wondering if you can talk about how the overall growth of the data ecosystem, and the focus on streaming as a core architectural principle of these systems, has influenced some of the product direction and the functionality development of Pulsar? Just, sort of, how have some of the recent trends in the overall data industry influenced the decisions around Pulsar and the direction that it's taken in the past two years since I last had it on the podcast? We have observed two trends while helping people adopt Pulsar. One trend is happening more in the data processing area, especially the rise of
[00:15:57] Unknown:
the adoption of Flink, as well as Spark, which are able to do both streaming and batch processing. We find that the increase in use cases like machine learning and deep learning creates a challenge for the existing data processing stack: you need a processing engine that is able to do both batch and stream processing. All these use cases don't just need the historical data, they also need the real time data; they need to combine historical data and real time data in one data processing engine. Flink and Spark already do a great job of providing an abstract API, a unified processing engine, but there's a lack of a data management system that can provide a unified data system for those engines to efficiently process the data. The core abstraction provided by Pulsar is an infinite event stream, and that led us to the creation of things like tiered storage, which is able to support this unified data processing stack. So that is the first category.
And the second trend we have observed is, with the rise of IoT use cases and connected cars, you see more and more edge data centers and more and more smart devices. The events or data from those devices are collected at the edge, but the edge doesn't have enough resources to process those events. Hence, you need to provide a lightweight computing engine so people can just easily write functions to process those events at the edge. This kind of edge oriented, or IoT oriented, use case has become the main driver of adoption for Pulsar Functions.
[00:18:05] Unknown:
One of the critical elements of the success of any piece of technology, particularly open source, is the rate of adoption by users and the overall ecosystem that grows up around it. I'm wondering if you can talk a bit about how the user community has responded to Pulsar, some of the barriers to adoption that have existed, and the work being done to drive those down. So I think,
[00:18:30] Unknown:
in terms of the community: Pulsar graduated around late 2018, and 2019 was a very wonderful year for Pulsar. Just a couple of metrics: the number of stars has already doubled, and we have seen the users of the Slack channel grow from around 500 to, right now, close to 1,700. We have seen contributors go from around 70 to, right now, 250. So across different metrics, we see the community has doubled or even tripled. And on the adoption side, we have seen crazy adoption in 2019, and we see this happening in Asia, North America, and Europe. In Asia, one of the largest internet companies, Tencent, has gone all in on Pulsar: basically their whole billing platform is now built on Pulsar. What that means is every purchase transaction that happens in Tencent's products goes through Pulsar first.
And it has been processing tens of billions of transactions every day. In North America, we also see Pulsar being adopted in different industries, and we have a whole Powered By page for people to check out. We also did a user survey; the PMC ran one around the end of 2019, and we published the survey report recently to share the current state of adoption, how people use Pulsar, and their plans to grow their Pulsar usage in the coming year. So feel free to check that out. It's available on the Pulsar website, and you can also go to the StreamNative website to download the user report. And I know that
[00:20:36] Unknown:
the Kafka ecosystem has grown up quite a bit because of the fact that it was one of the first movers in this space, and so a lot of the existing systems that might integrate with a streaming system already have capabilities for working with Kafka. And one of the projects that you and some of your collaborators rolled out recently is an implementation of the Kafka protocol running on top of Pulsar. So I'm wondering if you can talk a bit about how that's implemented, how that fits into the overall architecture of Pulsar itself, and what you think are going to be some of the benefits of that to the Pulsar community. Yeah. I think that is an interesting question.
[00:21:13] Unknown:
Something I kind of missed in the previous question: we had a very wonderful 2019, but there are still some barriers for people adopting Pulsar, because there are already existing messaging systems like Kafka, as you mentioned, as well as RabbitMQ and ActiveMQ, and those are written to standard messaging protocols like AMQP. Hence, we still see a bunch of barriers for people adopting Pulsar. We have been thinking about how to reduce the barrier for people to use Pulsar and enjoy all the features provided by Pulsar, like multi-tenancy, tiered storage, and Functions. The first attempt we made, which was also tried by OVH Cloud, was implementing a proxy. That is what people commonly try when they want to adapt a new system to an existing system: they write a proxy with some logic to translate the wire format from one messaging protocol to the other. But we found that is not a natural way to do it, and there's a bunch of overhead and challenges.
So we stepped back and thought about the real value provided by Pulsar. As I mentioned before, Pulsar is actually an event stream storage. The core abstraction provided by Pulsar is an infinite event stream; in our world, it's called a distributed log. Kafka is built around a similar abstraction; it's also a distributed log. So we found there's a lot of similarity between Pulsar and Kafka, and we thought that maybe what we should do is make Pulsar a reliable and scalable event stream storage and allow developers to customize their own messaging protocol.
First, this helps people create adapters to fit into existing messaging ecosystems. Second, it allows developers to innovate on new messaging protocols while leveraging all the fundamental advantages provided by Pulsar. So we introduced a framework within Pulsar called the protocol handler. The protocol handler provides a way for an implementation of a messaging protocol to interact with the whole event stream storage of Pulsar, and this led to the creation of Kafka-on-Pulsar: we basically used the protocol handler framework to implement the Kafka protocol.
It is shipped as a plugin, so you can download the plugin, install it into your existing Pulsar cluster, and your Pulsar brokers are able to speak the Kafka protocol. With this capability, your existing Kafka applications or Kafka services don't need to change any code: you can just point your Kafka application or service at the Pulsar cluster and you are able to go. We did the work in cooperation with OVH Cloud, and right now Tencent is also trying out Kafka-on-Pulsar; they are going to make Pulsar their fundamental messaging infrastructure.
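[Editor's note] To illustrate what "no code change" means in practice: a stock Kafka client keeps using the ordinary Kafka API and is simply pointed at a Pulsar broker that has the Kafka-on-Pulsar protocol handler enabled. The broker hostname and listener port below are assumptions; the Kafka listener is configured on the broker side.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaClientOnPulsar {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Points at the Pulsar broker's Kafka listener (assumed host/port),
        // not at a Kafka cluster; the application code is otherwise unchanged.
        props.put("bootstrap.servers", "pulsar-broker:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        }
    }
}
```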
We expect this will help grow the community and reduce the barrier for people trying out Pulsar. I did a webinar with Pierre, who is the tech lead at OVH Cloud, a couple of weeks ago, and the video is available on the
[00:25:05] Unknown:
StreamNative website, as well as the YouTube channel. So for people who are interested in Kafka-on-Pulsar, feel free to check it out. And because of the fact that you have this protocol handler layer in Pulsar, and it opens up the possibility of adding new protocols, I'm wondering if there's any work being done to integrate with the Open Messaging specification that's being put forward as a common standard for different messaging systems to be able to interoperate more easily. Yeah. So, right now,
[00:25:36] Unknown:
what we have been working on is actually integrating with two other popular messaging protocols. One is AMQP; the other one is MQTT. AMQP is very popular in traditional messaging workloads, and MQTT is popular in IoT messaging workloads. We hope this will simplify a lot of use cases for people moving from existing traditional messaging and queuing workloads, and from IoT messaging. That is one effort we are doing now, and it's also a cooperation with China Mobile. I think the interesting thing about doing this in open source is that we are able to work with a lot of end users to deliver what the end users need and serve the best use cases.
And going back to the Open Messaging protocol: I was actually involved in the initial creation of the Open Messaging specification. Right now, I think the Open Messaging standard is still an API level standard; it doesn't get into the wire protocol layer. We are still pushing that effort forward, and if an open messaging wire protocol comes out, we should be able to support it very quickly. And another interesting
[00:27:10] Unknown:
aspect of Pulsar and its relation to Kafka is that there is a decent amount of overlap in terms of the use cases that they provide for. And as both projects are still very active and have large and growing communities, I'm wondering what you have seen as being some of the ideas that are being passed back and forth, and some of the lessons that are being learned from each other's communities and each other's technical implementations?
[00:27:35] Unknown:
Based on my experience helping people adopt Pulsar, I see that Pulsar is commonly used by two categories of users. One, I would say, comes more from data pipelines and data processing, where Kafka is mainly used. The other category comes more from online core business services and event driven workflows, where people are more used to traditional messaging and queuing systems. What I have been seeing is that adoption can happen either way. People can come from traditional messaging and queuing and look into Pulsar because Pulsar is able to provide scalability; it's more scalable than a traditional messaging and queuing system. Other use cases come more from Kafka, and in the Kafka world most of the pain points are operational: especially when you want to operate multiple clusters, or want to scale beyond a certain point, you will see the operational pain points. So the adoption of Pulsar comes from these two different categories. But I see a trend: when people adopt Pulsar for their online business use cases, they start pushing Pulsar into data pipelines and data processing.
And if people adopt Pulsar for data processing, they might push it into their online services. So I do see that Pulsar is able to merge these two different ecosystems, and this also leads to enhancement and development on both sides. Put another way, Pulsar is also learning from the different ecosystems how to address the issues that have been seen in existing systems. And I do see, in the Kafka ecosystem and Kafka community, people also looking into how to adopt the features and architectural advantages provided by Pulsar. For example, I see the Kafka community has been talking about tiered storage for a while, and that tiered storage idea was originally brought in by Pulsar.
So I would say these two communities are still growing in their own ways, but they keep learning from each other. That is my take on these questions.
[00:30:20] Unknown:
And then in terms of your involvement with Pulsar, you mentioned that you've been working on it for quite some time, and you were one of the co-founders of Streamlio, which was one of the early companies built around Pulsar, driving it forward in terms of its development and growing the ecosystem. And now you have founded StreamNative as a company to build a managed service for Pulsar and its own distribution. I'm wondering if you can talk a bit about some of the lessons that you learned from Streamlio that have been most helpful in your current endeavor, and how you would characterize your relationship with the project and the community at each of those stages of your career? I think, for the first
[00:31:00] Unknown:
question: working at Streamlio, trying to help people adopt Pulsar and seeing the project grow, was definitely a very wonderful journey, and I learned a lot of lessons from that experience. Based on those lessons, and also the experience of running StreamNative, especially in 2019, we have been really focused on helping people adopt Pulsar and growing the community. I would say the most important lesson I have learned is to first find the project's community fit. What that means is you need to find out why people need Pulsar and how Pulsar can address people's pain points, and that has to come from working with the early adopters. Sometimes you need to work with the large internet companies, because they have the influence, and they have the scale, to help you verify that Pulsar is able to support everything from small scale to large scale, and across different industries.
The other thing I find super important is to find the position of your software in the whole ecosystem. I really like the question you asked earlier about the role of Pulsar in the whole life cycle of data management, and I think that is the most important lesson I have learned running my own company: you need to fit into the whole ecosystem. As you can see, in the past year we have been doing a lot of integrations with Flink and with Spark, because that is the fit for Pulsar in the whole big data ecosystem. Pulsar is the messaging system, so you are able to get data into the system; it is also a stream storage, so you are able to keep data for a much longer duration. That is the advantage provided by Pulsar.
And in order for people to be aware of Pulsar, you need to do the integrations with the big ecosystem. With that kind of experience, we are moving to a product growth strategy that is mostly focused on learning from customers, and also from community users, what their requirements and use cases are, and how we can incorporate those requirements and use cases into developing the project, as well as adding the features into the whole product. So those are the most important lessons I have learned in the past.
And you asked a second question: since I'm a vendor in this market, how do I characterize my relationship with the project and the community in each role? The most important thing I want to raise here is that, as a project running in the Apache Foundation, we work in the Apache way. What that means is everyone in the project and the community wears multiple hats. Taking me as an example: I'm an individual acting as a PMC member and also a committer for both Pulsar and BookKeeper, so I have to give independent opinions from a PMC member and committer perspective, because when I am talking to community users, I am representing the Apache Software Foundation. At the same time, I'm also a vendor of Pulsar, the owner of StreamNative. So what we try to do is our best to help people adopt Pulsar, more from a partnership and collaboration perspective, because we believe we have to grow the community in order to grow any business built around Pulsar. By helping people adopt Pulsar, we get a lot of use cases, we can incorporate those requirements into developing Pulsar, and that in return helps grow the community of Pulsar. So we play multiple roles in the community, we develop those relationships in a collaborative way, and we make sure the main focus of the project is on growing adoption and making sure people are able to use Pulsar in different industries.
[00:35:53] Unknown:
And I know that one of the ways that you're helping to drive that adoption is by being a spokesperson for the community. I know that you release the biweekly notes of what's been happening within the community, and you also have a StreamNative distribution of Pulsar. I'm wondering if you can talk a bit about what's included in that distribution, and some of the work that you're doing to help simplify the operability of the platform, because of the fact that it does have so many different moving pieces.
[00:36:27] Unknown:
So from a StreamNative product perspective, we do provide the StreamNative Platform, which is powered by Apache Pulsar. Currently, the main difference between the StreamNative Platform and upstream Apache Pulsar is basically that we provide a lot of operations related tooling to simplify running Pulsar in different environments, mostly focused on the Kubernetes environment. We provide a Helm chart, we provide Golang based administration tools, and we provide Pulsar Manager. We also offer an enhanced version of the Grafana dashboards so people can really understand what's going on in the platform. That is the main focus of the first version of the StreamNative Platform.
Besides that, we also bundle Kafka-on-Pulsar natively into the platform, so for people who want to use Kafka-on-Pulsar, you can download the StreamNative Platform and get started easily. At this moment, the StreamNative Platform is purely a community edition, so everyone is free to use it. We might develop some more enterprise oriented, closed source features in the future, but we haven't decided yet. Our main focus is still on developing our cloud service and providing the
[00:37:59] Unknown:
managed service in the cloud. And I'm wondering what motivates you to dedicate so much of your time and energy to Pulsar in particular, and the streaming data ecosystem in general, because a significant portion of your career has been focused around this project and this problem domain. So I'm wondering what is keeping you interested and motivated throughout. Yeah. So,
[00:38:21] Unknown:
as I mentioned, I started my career about 10 years ago, and I saw how Hadoop, and infrastructure technology in general, can grow and influence a whole industry. Basically, the whole growth of the economy in China, especially the whole internet industry, owes a lot to Hadoop and the whole big data ecosystem. So I have seen how a technology can influence an entire industry. Then I moved from Yahoo to Twitter, and Twitter is kind of a messaging platform for the whole internet; you can think of Twitter as one of the first companies to use a lot of streaming technology. So I got into this space and saw how streaming technology can be used to help an enterprise like Twitter become very successful.
And I want that kind of technology, that kind of streaming mindset, to be delivered to more industries, to help them be successful. We have seen that some of the existing technologies didn't address this in a very great way; there are still some shortcomings, some drawbacks. So we want to use our experience, and the technology we have been developing, to help more industries, more enterprises, enjoy the power of streaming technology. That's what drives me into this space and to dedicate my energy to it. Yeah. And one of the interesting
[00:40:10] Unknown:
impacts of projects such as Pulsar and Kafka, and the overall focus on streaming data as a core component of a lot of these data systems, is that that overall design is starting to leak out into other areas of software and technology. And I'm wondering if you could just talk about some of the ways that you have seen streaming data as being an important core competency of different technology industries, and the ways that projects such as Kafka and Pulsar are impacting how those systems are architected.
[00:40:44] Unknown:
In terms of streaming, what we have been seeing is that a software usage pattern has been shifting within the enterprise. Initially, an enterprise would build out a team of people working with a database: you put in a data layer, you provide some query interface for people to query, and that created the whole database ecosystem. That evolved into the big data, or batch processing, ecosystem. But the use cases and the requirements have been shifting toward more event driven workflows.
Events can be generated from different sources. For example, when you browse a web page and click on it, you generate different click events. Those events can be used by the enterprise to analyze user behavior, to do better targeting and better marketing, and to provide better services. So we see the use cases shifting toward more event driven, or streaming driven, use cases, and that means the whole software architecture of an enterprise has been shifting into an event driven architecture, an event driven workflow. In that way, the mindset shifts from processing static datasets to processing dynamically changing data streams. Once you have this mindset shift, you need new tools and new capabilities, and that creates the whole messaging ecosystem, streaming ecosystem, and streaming toolchains. We have seen these toolchains and this ecosystem be very successful, from internet companies to financial services to retailers and to IoT.
And this is playing a very important role in current
[00:43:11] Unknown:
enterprise software architecture. I'm wondering what you have seen as being some of the most interesting, innovative, or unexpected ways that you've seen Pulsar used, and the applications
[00:43:21] Unknown:
of streaming data. Yeah. So I think the common impression that the industry has of Pulsar is that it's basically a message queue. We have a Pulsar user in China called BestPay; BestPay is the third largest payment company in China. Their use case is very interesting: they use Pulsar for a real time risk control pipeline. In the traditional data processing stack, people usually use the Lambda architecture: you build out a batch layer using HDFS or Hive, you build out a speed layer using Kafka and Storm, and you combine these two together into a Lambda architecture.
In the use case at BestPay, they basically tried to get rid of these two layers and get to a unified data processing stack. On the storage side, they standardized on using Pulsar as the source of truth: they put everything into Pulsar, so they have one common center keeping all the event streams, both the historical data and the real time data. And they standardized the computing engine on Spark, so they can do Spark Structured Streaming as well as Spark batch jobs. So they reduced the system from four components to two. By shifting the capability of Pulsar beyond a messaging queue this way, it becomes more of a streaming data warehouse. That is the most interesting and innovative use of Pulsar I have seen. And I'm excited about this use case because it is used for real time risk control, in a core business pipeline, where it really delivers significant impact to the business. So that is the most interesting use case I have seen of how
[00:45:38] Unknown:
Pulsar has been used. And for people who are adopting Pulsar, what are some of the edge cases or design elements that they are most challenged by in terms of figuring out how to architect and design their own solutions and use cases around Pulsar? So I think one of the common
[00:45:59] Unknown:
patterns, or common questions I have received in the community, comes especially from the event sourcing perspective. People have the impression that Pulsar is able to keep event data, and that you are able to keep the events for a longer duration by leveraging tiered storage, but then they look at it from a lookup perspective: they want to use Pulsar as a storage for point lookups. But Pulsar is mainly designed for streaming workloads; in other words, it's designed for scan based access. You are streaming data, you are able to process the streaming data in sequence, and you can rewind your data processing job to an earlier point and re-scan the data. So Pulsar was designed for scan oriented workloads, not point lookups. I think that is the common misuse of Pulsar, and I would like people to realize it before making any design around Pulsar.
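[Editor's note] A short sketch of the scan oriented access pattern Pulsar is designed for: the subscription is rewound to an earlier point in time and the stream is re-read in sequence, rather than fetching individual records by key. The names and the rewind window are illustrative.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class ReplayExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/events")
                .subscriptionName("replay")
                .subscribe();

        // Rewind the subscription 24 hours, then re-scan the stream in order.
        consumer.seek(System.currentTimeMillis() - 24 * 60 * 60 * 1000L);

        for (int i = 0; i < 100; i++) { // process a bounded batch for the example
            Message<byte[]> msg = consumer.receive();
            // ... sequential processing of each event goes here ...
            consumer.acknowledge(msg);
        }

        consumer.close();
        client.close();
    }
}
```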
[00:47:04] Unknown:
And for people who are evaluating Pulsar, or considering it as a component of their architectures, what are the cases where Pulsar is the wrong choice, and they might be better served with either an entirely different approach or a different set of tooling? So
[00:47:23] Unknown:
if you want point lookups on the event store, for sure, that is the first wrong choice. I don't think Pulsar is capable of doing that kind of operation well at this moment, though that might change in the future; who knows? But right now, doing any point lookups in Pulsar is the wrong choice. The second one is a pattern that also comes up a lot: Pulsar is able to support millions of topics, so people end up trying to map devices or users to individual topics, and they try to grow the number of topics to maybe millions or tens of millions. Based on the current Pulsar implementation, that is still a bad design. You should try to re-architect in a way that at least reduces the number of topics used by a single application.
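[Editor's note] One common way to keep the topic count down, sketched here under assumed names, is to fan many logical keys (devices, users) into a single partitioned topic and key the messages, rather than creating one topic per device; the key preserves per-device ordering and pairs naturally with a Key_Shared subscription on the consumer side.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class PartitionedTopicExample {
    public static void main(String[] args) throws Exception {
        // One partitioned topic for all devices, instead of one topic per device.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
                .build();
        admin.topics().createPartitionedTopic(
                "persistent://public/default/device-events", 16);

        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed broker URL
                .build();
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/device-events")
                .create();

        // The key routes the message to a partition and keeps per-device order.
        producer.newMessage()
                .key("device-42")
                .value("reading".getBytes())
                .send();

        producer.close();
        client.close();
        admin.close();
    }
}
```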
We can definitely support millions, but not tens of millions, and even operating millions of topics is still a bit challenging. So when designing, people need to think through how to use Pulsar topics and how to leverage all the good features provided by Pulsar. And the last one I would say is that Pulsar also provides non-persistent topics, a non-persistent capability. In order to use that non-persistent capability, people have to understand the delivery guarantees, the dispatching guarantees, to make sure they are not surprised by the guarantees provided by non-persistent topics. So those are the three common patterns of misuse I have seen.
As for what is next: in terms of the service and product provided by StreamNative, we are fully focused on developing StreamNative Cloud, which is the fully managed Pulsar service running on the public clouds. We want to give people a firsthand, very smooth experience for getting started with Pulsar easily. That is on the product side. On the project side, as I mentioned before, Pulsar has been evolving beyond a pub/sub messaging system, and there are three main capabilities: you are able to ingest data into Pulsar, you are able to store data, and you are able to process data. So on the project side: on the ingestion side, we want to integrate with more messaging protocols, so people are able to integrate with their existing messaging applications. On the storage side, we want to do more in the offloaders, in the tiered storage, by bringing in some additional data processing oriented capabilities, like columnar storage, bringing in indexes, and being able to leverage topic compaction. Those capabilities can help Pulsar provide better performance for the unified data processing story. And on the processing side, we want to improve Pulsar Functions by introducing an orchestration framework to combine multiple functions into a pipeline, so people can write a simple function pipeline to chain multiple functions together. We are also looking into integrating with WebAssembly, to be able to easily support different languages for functions.
In terms of integrating with Flink, we have already made Pulsar a source and sink for both Flink and Spark, and we have made Pulsar a catalog for Flink as well. The next step is how we want to deal with state management, and state management comes into play both for Pulsar Functions as well as the Flink integration. So there's a lot of things to do around
[00:51:53] Unknown:
state management. Are there any other aspects of the Pulsar platform, its community, the ecosystem that's growing up around it, or the work that you're doing at StreamNative, that we didn't discuss that you would like to cover before we close out the show? So
[00:52:12] Unknown:
one thing to mention: we had planned Pulsar Summit for April, but due to the worsening coronavirus situation, we pushed the conference to August. At this moment, the organizers are also exploring a different approach of providing a purely virtual conference for Pulsar Summit. Please follow us on Twitter, and we'll keep everyone posted. We are very excited, and we are confident that we will be able to hold a virtual conference and be able
[00:52:45] Unknown:
to show more Pulsar oriented use cases to the broader community. Well, for anybody who does want to follow along with you, or get in touch and see the other work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'd just like to ask a final question: what do you see as being the biggest gap in the tooling or technology that's available for data management today? I think the
[00:53:07] Unknown:
biggest gap is that, right now, in the whole data management space, in the whole big data ecosystem, there are still many, many components in the whole pipeline, and there isn't a good way to glue the different types of systems together and provide a uniform operations and management experience. In other words, the ability to trace an event going from the data source all the way through to the analytics in the data warehouse: I haven't seen good tooling for that. And I wish
[00:53:49] Unknown:
to see more efforts happening in this space. Well, thank you very much for taking the time today to join me and share your experience working on Pulsar and building a business around it. It's definitely a very interesting tool, and one that I've been exploring for my own purposes. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day. Thank you for having me here; it's my pleasure to share all the experience and knowledge around this project, as well as the company. And if you want to,
[00:54:20] Unknown:
like, chat with me more about Pulsar, or about streaming technology in general, you can find me on Slack or Twitter. Yeah. Thank you.
[00:54:33] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Sijie Guo and StreamNative
Sijie Guo's Journey in Data Management
Overview of Pulsar Framework
Pulsar's Role in Data Ecosystem
Pulsar Functions and Features
Community and Use Cases of Pulsar
Kafka Protocol on Pulsar
Integrating with Open Messaging Specification
Lessons from Kafka and Pulsar Communities
Sijie Guo's Experience with Streamlio and StreamNative
StreamNative Product and Community Efforts
Motivation and Impact of Streaming Data
Innovative Use Cases of Pulsar
Challenges and Edge Cases in Pulsar Adoption
Future Directions for Pulsar and StreamNative
Closing Remarks and Contact Information