Summary
There have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different areas of focus. Pulsar is one of the recent entrants which has quickly gained adoption and an impressive set of capabilities. In this episode Sijie Guo discusses his motivations for spending so much of his time and energy on contributing to the project and growing the community. His most recent endeavor at StreamNative is focused on combining the capabilities of Pulsar with the cloud native movement to make it easier to build and scale real time messaging systems with built in event processing capabilities. This was a great conversation about the strengths of the Pulsar project, how it has evolved in recent years, and some of the innovative ways that it is being used. Pulsar is a well engineered and robust platform for building the core of any system that relies on durable access to easily scalable streams of data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
- Your host is Tobias Macey and today I’m interviewing Sijie Guo about the current state of the Pulsar framework for stream processing and his experiences building a managed offering for it at StreamNative
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what Pulsar is?
- How did you get involved with the project?
- What is Pulsar’s role in the lifecycle of data and where does it fit in the overall ecosystem of data tools?
- How has the Pulsar project evolved or changed over the past 2 years?
- How has the overall state of the ecosystem influenced the direction that Pulsar has taken?
- One of the critical elements in the success of a piece of technology is the ecosystem that grows around it. How has the community responded to Pulsar, and what are some of the barriers to adoption?
- How are you and other project leaders addressing those barriers?
- You were a co-founder at Streamlio, which was built on top of Pulsar, and now you have founded StreamNative to offer Pulsar as a service. What did you learn from your time at Streamlio that has been most helpful in your current endeavor?
- How would you characterize your relationship with the project and community in each role?
- What motivates you to dedicate so much of your time and energy to Pulsar in particular, and the streaming data ecosystem in general?
- Why is streaming data such an important capability?
- How have projects such as Kafka and Pulsar impacted the broader software and data landscape?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Pulsar used?
- When is Pulsar the wrong choice?
- What do you have planned for the future of StreamNative?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Apache Pulsar
- StreamNative
- Streamlio
- Hadoop
- HBase
- Hive
- Tencent
- Yahoo
- BookKeeper
- Publish/Subscribe
- Kafka
- Zookeeper
- Kafka Connect
- Pulsar Functions
- Pulsar IO
- Kafka On Pulsar
- Pulsar Protocol Handler
- OVH Cloud
- Open Messaging
- ActiveMQ
- Kubernetes
- Helm
- Pulsar Helm Charts
- Grafana
- BestPay
- Lambda Architecture
- Event Sourcing
- WebAssembly
- Apache Flink
- Pulsar Summit
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
[00:00:51] Unknown:
You monitor your website to make sure that you're the first to know when something goes wrong. But what about your data? Tidy Data is the DataOps monitoring platform that you've been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, PagerDuty, and custom webhooks, you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required. Your host is Tobias Macey, and today I'm interviewing Sijie Guo about the current state of the Pulsar framework for stream processing and his experiences building a managed offering for it at StreamNative. So, Sijie, can you start by introducing yourself? Hi. Hi, everyone.
[00:01:32] Unknown:
Thank you for having me on the Data Engineering Podcast. My name is Sijie Guo, and I'm currently the CEO and co-founder of StreamNative. StreamNative is a San Francisco based startup, and we are providing a cloud native event streaming platform powered by Apache Pulsar. We also offer a fully managed service for Pulsar on different public clouds, and the managed service can run either in our cloud account or in the customer's account. So,
[00:02:02] Unknown:
yeah, thank you for having me here. And do you remember how you first got involved in the area of data management?
[00:02:08] Unknown:
Yeah. So I started my journey working on distributed cluster systems. About 10 years ago, Hadoop was gaining traction in China, and I was among the first set of contributors there who contributed to Hadoop, HBase, and Hive. I was part of the initial team that built the Tencent data warehouse based on Hive, and Tencent is one of the largest internet companies in China. After working on the Tencent data warehouse, I moved to Yahoo, and that's where I got involved in a lot of development on BookKeeper and later on Pulsar, and got into the whole messaging and streaming space.
[00:03:00] Unknown:
And the separation of the storage from the broker in Pulsar is definitely one of the things that I find most interesting about it from the architectural perspective, and I know that BookKeeper is being used by a number of other systems as well. For people who are interested in more of the background and early days of Pulsar and some of the architectural principles, I did interview a couple of the other core committers on the project a couple of years ago, so I'll put a link to that in the show notes. And for anybody who hasn't listened to that, can you just give a bit more of an overview about what Pulsar is and how you first got involved with the project? For Pulsar, we usually use one sentence to describe
[00:03:46] Unknown:
the capability provided by Pulsar: it's a pub/sub messaging system backed by a durable log storage. So you can use Pulsar as a normal messaging system, like you would use Kafka or RabbitMQ or ActiveMQ. But the second half of the sentence basically tells you how Pulsar is different from many other messaging systems: it's backed by a durable log storage, and that durable log storage is the BookKeeper project you mentioned. I was one of the first engineers involved in the BookKeeper project. BookKeeper was originally started in Yahoo Research, and it was designed to address the high availability issue of the HDFS NameNode. Its core replication mechanism was abstracted out of the distributed consensus algorithm used by ZooKeeper.
It then evolved into a distributed log storage that you can use for building out many different systems. At that point, maybe 10 years ago, we tried to build the first pub/sub messaging system based on BookKeeper, which was called Hedwig. That project has since been retired, but it basically set the foundation for the whole architecture of Pulsar and many other followers in this space, especially the separation of the broker serving layer from the message storage. So you have two separate layers that you can scale independently, which also improves high availability and failover time.
I actually wrote a bunch of articles a few years ago talking about the architectural advantages of this layered, segment-centric storage, so feel free to check out those articles on the internet.
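[Editor's note] To make the pub/sub model concrete, here is a minimal sketch using the Pulsar Java client. The broker address, topic, and subscription name are placeholder values for illustration; it assumes a broker running locally on the default port.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class PulsarQuickstart {
    public static void main(String[] args) throws Exception {
        // Connect to a broker; the durable log storage (BookKeeper) sits behind it.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Create the subscription first so the published message is retained for it.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/my-topic")
                .subscriptionName("my-subscription")
                .subscribe();

        // Publish a message to the topic.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/my-topic")
                .create();
        producer.send("hello pulsar".getBytes());

        // Receive and acknowledge; acknowledgments are tracked per subscription.
        Message<byte[]> msg = consumer.receive();
        System.out.println("Received: " + new String(msg.getData()));
        consumer.acknowledge(msg);

        client.close();
    }
}
```

The same topic can carry many independent subscriptions, which is part of what lets Pulsar serve both queuing and streaming consumption patterns.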
[00:06:00] Unknown:
And in terms of the overall life cycle of data, where does Pulsar fit in the overall ecosystem of different data tools? I know that it is sometimes compared to Kafka, or maybe used in conjunction with, or instead of, things like Spark Streaming or Flink. I'm wondering if you can give a bit more of a picture of the different ways that Pulsar is being used and some of the use cases that it's optimized for. So I think to get started,
[00:06:28] Unknown:
I will try to clarify a bit about the capabilities that Pulsar provides. As I said, Pulsar was originally a flexible pub/sub messaging system, so it offers all the capabilities of a messaging system. But after Pulsar incubated in the Apache Foundation for about two years, it evolved into more of a messaging plus streaming system, what we usually call a cloud native event streaming platform. What that means is the core abstraction within Pulsar is a distributed log, an event stream, and it can be used for storing infinite streams of events. So the capability provided by Pulsar is that you are able to use Pulsar to ingest events into topics.
You are able to keep the events for a longer duration based on your retention policy, and you are able to use different data processing tools: you can integrate with Spark and Flink to do unified data processing, and you can use Presto or Hive to do interactive queries. We also introduced Pulsar Functions for lightweight computation. With that being said, in terms of its role in the whole ecosystem: first, since it provides the ingestion capability for people to get data into Pulsar, you can use it as a messaging system to connect services with your whole data infrastructure, so it becomes a kind of integration platform. And since we provide the capability of storing events for a longer duration, you can use it as a stream storage. In my opinion, it has evolved into a kind of streaming database, because it provides schemas, so you can treat those event streams as structured event streams. When we do the integration with Flink, we actually map those topics into tables in the Flink catalog, so you are able to use those data processing engines to query and process the data. So in short, to summarize: it's a messaging platform that you can use for data ingestion, and, I would say, a stream storage that you can use for data processing. So that is the idea.
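[Editor's note] As a small sketch of the "structured event streams" idea: the Pulsar Java client can attach a schema to a topic so that every message is a typed record rather than raw bytes, which is what lets engines like Flink treat a topic as a table. The PageView type and topic name here are hypothetical.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SchemaExample {
    // Hypothetical event type; Pulsar derives and registers an Avro schema from the POJO.
    public static class PageView {
        public String url;
        public long timestampMillis;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Messages on this topic are now schema-checked, structured events.
        Producer<PageView> producer = client.newProducer(Schema.AVRO(PageView.class))
                .topic("persistent://public/default/page-views")
                .create();

        PageView view = new PageView();
        view.url = "/home";
        view.timestampMillis = System.currentTimeMillis();
        producer.send(view);

        producer.close();
        client.close();
    }
}
```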
[00:09:11] Unknown:
And the Functions capability seems to be one of the more interesting ways that it can be used, because I know that, for instance, Kafka has support for Kafka Streams and the Kafka Connect plugins, where Pulsar has Pulsar IO as its analog to that. But it seems that the Functions capability is a bit more tightly integrated into the capabilities of Pulsar. So I'm wondering if you could talk a bit to that, and some of the other capabilities of Pulsar that make it stand out from some of the other options that people might consider for this durable pub/sub use case. Yeah. So,
[00:09:42] Unknown:
I think the Pulsar project has evolved and changed a lot over the past two years, and Functions is definitely one of the most attractive features that a lot of people love to use. A function is basically a very lightweight computing, I would say event processing, framework that brings the whole serverless idea into event streaming. You can write your event processing logic using the language you like: you can write a function using Java if you're a Java developer, or using Python, or using Go. You can write functions as you like, and you don't need to learn a new framework; for every engineer, the first thing you learn is how to write a function. This reduces the barrier for people who want to add processing capability to an existing pub/sub messaging system. One of the reasons is that, I would say, about 50% of the workloads a messaging system is used for are basically connecting services within an infrastructure, and in order to provide an easy way for people to express that logic, a function is definitely the simplest way, because you don't have extra dependencies: you can just write a function as you want and submit it. So it is a bit different from a traditional data processing engine; it is more focused on lightweight computing use cases like ETL, transformation, routing, and maybe simple aggregation. So that is Functions.
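[Editor's note] To make that concrete, a minimal Pulsar Function in Java is just a class implementing the Function interface, with no other framework to learn. The exclamation logic here is a stand-in for any transformation or routing step.

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Consumes a string from the configured input topic, transforms it,
// and the return value is published to the configured output topic.
public class ExclamationFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        context.getLogger().info("processing {}", input);
        return input + "!";
    }
}
```

A class like this is typically packaged as a jar and submitted with the pulsar-admin functions tooling, pointing it at input and output topics.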
Besides Functions, Pulsar has added many features in the past couple of years, and I can share some of them. One is tiered storage. Tiered storage basically provides the ability to extend the storage capability provided by BookKeeper into much cheaper forms of storage, like S3, GCS, Azure, or even HDFS on prem. This allows you to keep the data in the system in the form of an infinite event stream, so you don't need to dump the data out of your messaging system into some other storage format. And since tiered storage lets you keep data for a much longer duration, it actually provides a unified abstraction of your data, which is the infinite event stream. When you integrate this data model with Flink, you can create a unified data processing stack; that is the whole idea behind it. I can call out some other features, like the Key_Shared subscription type, which is an interesting one, and also the protocol handler, which allows Pulsar to plug in different messaging protocols. Those features are driven by the use cases, driven by the adoption in the community.
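[Editor's note] For reference, this is roughly what the Key_Shared subscription mentioned above looks like from the Java client: several consumers share one subscription, and messages with the same key are always dispatched to the same consumer, preserving per-key ordering while scaling out. Topic and subscription names are placeholders.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class KeySharedExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Each consumer attached this way receives a key-partitioned slice of the stream.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/orders")
                .subscriptionName("order-processors")
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();

        consumer.close();
        client.close();
    }
}
```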
[00:13:16] Unknown:
And what are some of the other characteristics of the community that has grown up around Pulsar that you would see as being distinct from some of the other streaming systems that are being used by people?
[00:13:19] Unknown:
So in the past two years, Pulsar has been community driven and use case driven, and from what we have seen, most of the successful adoptions of Pulsar come from three main categories. One is existing RabbitMQ and ActiveMQ users. That adoption comes more from building out core applications, and it drives a lot of development of messaging oriented features like TTL, dead letter topics, and scheduled or delayed messages. Those are features more commonly seen in traditional messaging and queuing systems. The second category is driven more by data processing use cases, like integrating with Flink and integrating with Spark. That introduced a lot of features like the Key_Shared subscription, tiered storage, and columnar offload, to provide an efficient way for a data processing engine to process the events within Pulsar. And the third category comes from IoT use cases. I would say that is what led to the creation of Pulsar Functions, and it drives a lot of development around bringing serverless, or lightweight computing, features into Pulsar. And that's how the community helped the whole Pulsar team and the Pulsar PMC materialize the project as a product.
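[Editor's note] As an illustration of the queue-oriented features mentioned above (delayed messages and dead letter topics), here is a hedged sketch with the Java client; the topic names and thresholds are made up for the example.

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.DeadLetterPolicy;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class QueueFeaturesExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Delayed delivery: the message becomes visible to consumers
        // on shared subscriptions only after the delay elapses.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/tasks")
                .create();
        producer.newMessage()
                .value("retry-later".getBytes())
                .deliverAfter(10, TimeUnit.MINUTES)
                .send();

        // Dead letter topic: after maxRedeliverCount failed deliveries,
        // the message is routed aside instead of redelivering forever.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/tasks")
                .subscriptionName("workers")
                .subscriptionType(SubscriptionType.Shared)
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(3)
                        .deadLetterTopic("persistent://public/default/tasks-dlq")
                        .build())
                .subscribe();

        consumer.close();
        producer.close();
        client.close();
    }
}
```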
[00:14:46] Unknown:
And over the past two years, I know that some of the features that have been incorporated since the last time I talked about this project on the podcast are things like the function workers, and the integrated SQL layer is new as well. I'm wondering if you can talk about how the overall growth of the data ecosystem, and the focus on streaming as a core architectural principle of these systems, has influenced some of the product direction and the functionality development of Pulsar? Just, sort of, how have some of the recent trends in the overall data industry influenced the decisions around Pulsar and the direction that it's taken in the past two years since I last had it on the podcast? We have observed two trends while helping people adopt Pulsar. One trend is happening more in the data processing area, especially the rise of
[00:15:57] Unknown:
the adoption of Flink, as well as Spark, which are able to do both streaming and batch processing. We find that the increase in use cases like machine learning and deep learning creates a challenge for the existing data processing stack: you need a processing engine that is able to do both batch and stream processing. All these use cases don't just need the historical data, they also need the real time data; they need to combine historical data and real time data in one data processing engine. Flink and Spark already do a great job of providing an abstract API, a unified processing engine, but there's a lack of a data management system that can provide a unified data system for those engines to efficiently process the data. The core abstraction provided by Pulsar is an infinite event stream, and that led us to the creation of things like tiered storage, which is able to support this unified data processing stack. So that is the first category.
And the second trend we have observed is, with the rise of IoT use cases and connected cars, you see more and more edge data centers and more and more smart devices. The events or data from those devices are collected at the edge, but the edge doesn't have enough resources to process those events. Hence, you need to provide a lightweight computing engine so people can just easily write functions to process those events at the edge. This kind of edge oriented, or IoT oriented, use case has become the main driver of adoption for Pulsar Functions.
[00:18:05] Unknown:
One of the critical elements of the success of any piece of technology, particularly open source, is the rate of adoption by users and the overall ecosystem that grows up around it. I'm wondering if you can talk a bit about how the user community has responded to Pulsar, some of the barriers to adoption that have existed, and the work being done to drive those down. So I think,
[00:18:30] Unknown:
in terms of the community: Pulsar graduated around late 2018, and 2019 was a very wonderful year for Pulsar. Just a couple of metrics: the number of stars has already doubled, and we have seen the users of the Slack channel grow from around 500 to, right now, close to 1,700. We have seen contributors go from around 70 to, right now, 250. So across different metrics, we see the community has doubled or even tripled. And on the adoption side, we have seen crazy adoption in 2019, and we see this happening in Asia, North America, and Europe. In Asia, one of the largest internet companies, Tencent, has gone all in on Pulsar: basically their whole billing platform is now built on Pulsar. What that means is every purchase transaction that happens in Tencent's products goes through Pulsar first.
And it has been processing tens of billions of transactions every day. In North America, we also see Pulsar being adopted in different industries, and we have a whole Powered By page for people to check out. We also did a user survey; the PMC ran one around the end of 2019, and we published the survey report recently to share the current state of adoption, how people use Pulsar, and their plans to grow their Pulsar usage in the coming year. So feel free to check that out. It's available on the Pulsar website, and you can also go to the StreamNative website to download the user report. And I know that
[00:20:36] Unknown:
the Kafka ecosystem has grown up quite a bit because of the fact that it was one of the first movers in this space, and so a lot of the existing systems that might integrate with a streaming system already have capabilities for working with Kafka. And one of the projects that you and some of your collaborators rolled out recently is an implementation of the Kafka protocol running on top of Pulsar. So I'm wondering if you can talk a bit about how that's implemented, how that fits into the overall architecture of Pulsar itself, and what you think are going to be some of the benefits of that to the Pulsar community. Yeah. I think that is an interesting question.
[00:21:13] Unknown:
Something I kind of missed in the previous question: we had a very wonderful 2019, but there are still some barriers for people adopting Pulsar, because there are already existing messaging systems like Kafka, as you mentioned, as well as RabbitMQ and ActiveMQ, and those are written to standard messaging protocols like AMQP. Hence, we still see a bunch of barriers for people adopting Pulsar. We have been thinking about how to reduce the barrier for people to use Pulsar and enjoy all the features provided by Pulsar, like multi-tenancy, tiered storage, and Functions. The first attempt we made, which was also tried by OVH Cloud, was implementing a proxy. That is what people commonly try when they want to adapt a new system to an existing system: they write a proxy with some logic to translate the wire format from one messaging protocol to the other. But we found that is not a natural way to do it, and there's a bunch of overhead and challenges.
So we stepped back and thought about the real value provided by Pulsar. As I mentioned before, Pulsar is actually an event stream storage. The core abstraction provided by Pulsar is an infinite event stream; in our world, it's called a distributed log. Kafka is built around a similar abstraction; it's also a distributed log. So we found there's a lot of similarity between Pulsar and Kafka, and we thought that maybe what we should do is make Pulsar a reliable and scalable event stream storage and allow developers to customize their own messaging protocol.
First, this helps people create adapters to fit into existing messaging ecosystems. Second, it allows developers to innovate on new messaging protocols while leveraging all the fundamental advantages provided by Pulsar. So we introduced a framework within Pulsar called the protocol handler. The protocol handler provides a way for an implementation of a messaging protocol to interact with the whole event stream storage of Pulsar, and this led to the creation of Kafka-on-Pulsar: we basically used the protocol handler framework to implement the Kafka protocol.
It is shipped as a plugin, so you can download the plugin, install it into your existing Pulsar cluster, and your Pulsar brokers are able to speak the Kafka protocol. With this capability, your existing Kafka applications or Kafka services don't need to change any code: you can just point your Kafka application or service at the Pulsar cluster and you are able to go. We did the work in cooperation with OVH Cloud, and right now Tencent is also trying out Kafka-on-Pulsar; they are going to make Pulsar their fundamental messaging infrastructure.
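[Editor's note] To illustrate what "no code change" means in practice: a stock Kafka client keeps using the ordinary Kafka API and is simply pointed at a Pulsar broker that has the Kafka-on-Pulsar protocol handler enabled. The broker hostname and listener port below are assumptions; the Kafka listener is configured on the broker side.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaClientOnPulsar {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Points at the Pulsar broker's Kafka listener (assumed host/port),
        // not at a Kafka cluster; the application code is otherwise unchanged.
        props.put("bootstrap.servers", "pulsar-broker:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        }
    }
}
```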
We expect this will help grow the community and reduce the barrier for people trying out Pulsar. I did a webinar with Pierre, who is the tech lead at OVH Cloud, a couple of weeks ago, and the video is available on the
[00:25:05] Unknown:
StreamNative website, as well as the YouTube channel. So for people who are interested in Kafka-on-Pulsar, feel free to check it out. And because of the fact that you have this protocol handler layer in Pulsar, and it opens up the possibility of adding new protocols, I'm wondering if there's any work being done to integrate with the Open Messaging specification that's being put forward as a common standard for different messaging systems to be able to interoperate more easily. Yeah. So, right now,
[00:25:36] Unknown:
what we have been working on is actually integrating with two other popular messaging protocols. One is AMQP; the other one is MQTT. AMQP is very popular in traditional messaging workloads, and MQTT is popular in IoT messaging workloads. We hope this will simplify a lot of use cases for people moving from existing traditional messaging and queuing workloads, and from IoT messaging. That is one effort we are doing now, and it's also a cooperation with China Mobile. I think the interesting thing about doing this in open source is that we are able to work with a lot of end users to deliver what the end users need and serve the best use cases.
And going back to the Open Messaging protocol: I was actually involved in the initial creation of the Open Messaging specification. Right now, I think the Open Messaging standard is still an API level standard; it doesn't get into the wire protocol layer. We are still pushing that effort forward, and if an open messaging wire protocol comes out, we should be able to support it very quickly. And another interesting
[00:27:10] Unknown:
aspect of Pulsar and its relation to Kafka is that there is a decent amount of overlap in terms of the use cases that they provide for. And as both projects are still very active and have large and growing communities, I'm wondering what you have seen as being some of the ideas that are being passed back and forth, and some of the lessons that are being learned from each other's communities and each other's technical implementations?
[00:27:35] Unknown:
Based on my experience helping people adopt Pulsar, I see that Pulsar is commonly used by two categories of users. One, I would say, comes more from data pipelines and data processing, where Kafka is mainly used. The other category comes more from online core business services and event driven workflows, where people are more used to traditional messaging and queuing systems. What I have been seeing is that adoption can happen either way. People can come from traditional messaging and queuing and look into Pulsar because Pulsar is able to provide scalability; it's more scalable than a traditional messaging and queuing system. Other use cases come more from Kafka, and in the Kafka world most of the pain points are operational: especially when you want to operate multiple clusters, or want to scale beyond a certain point, you will see the operational pain points. So the adoption of Pulsar comes from these two different categories. But I see a trend: when people adopt Pulsar for their online business use cases, they start pushing Pulsar into data pipelines and data processing.
And if people adopt Pulsar for data processing, they might push it into their online services. So I do see that Pulsar is able to merge these two different ecosystems, and this also leads to enhancement and development on both sides. Put another way, Pulsar is also learning from the different ecosystems how to address the issues that have been seen in existing systems. And I do see, in the Kafka ecosystem and Kafka community, people also looking into how to adopt the features and architectural advantages provided by Pulsar. For example, I see the Kafka community has been talking about tiered storage for a while, and that tiered storage idea was originally brought in by Pulsar.
So I would say these two communities are still growing in their own ways, but they keep learning from each other. That is my take on these questions.
[00:30:20] Unknown:
And then in terms of your involvement with Pulsar, you mentioned that you've been working on it for quite some time, and you were one of the co-founders of Streamlio, which was one of the early companies built around Pulsar, driving it forward in terms of its development and growing the ecosystem. And now you have founded StreamNative as a company to build a managed service for Pulsar and its own distribution. I'm wondering if you can talk a bit about some of the lessons that you learned from Streamlio that have been most helpful in your current endeavor, and how you would characterize your relationship with the project and the community at each of those stages of your career? I think, for the first
[00:31:00] Unknown:
question: working at Streamlio, trying to help people adopt Pulsar and seeing the project grow, was definitely a very wonderful journey, and I learned a lot of lessons from that experience. Based on those lessons, and also the experience of running StreamNative, especially in 2019, we have been really focused on helping people adopt Pulsar and growing the community. I would say the most important lesson I have learned is to first find the project's community fit. What that means is you need to find out why people need Pulsar and how Pulsar can address people's pain points, and that has to come from working with the early adopters. Sometimes you need to work with the large internet companies, because they have the influence, and they have the scale, to help you verify that Pulsar is able to support everything from small scale to large scale, and across different industries.
The other thing I find super important is to find the position of your software in the whole ecosystem. I really like the question you asked earlier about the role of Pulsar in the whole life cycle of data management, and I think that is the most important lesson I have learned running my own company: you need to fit into the whole ecosystem. As you can see, in the past year we have been doing a lot of integrations with Flink and with Spark, because that is the fit for Pulsar in the whole big data ecosystem. Pulsar is the messaging system, so you are able to get data into the system; it is also a stream storage, so you are able to keep data for a much longer duration. That is the advantage provided by Pulsar.
And in order for people to be aware of Pulsar, you need to do the integrations with the big ecosystem. With that kind of experience, we are moving to a product growth strategy that is mostly focused on learning from customers, and also from community users, what their requirements and use cases are, and how we can incorporate those requirements and use cases into developing the project, as well as adding the features into the whole product. So those are the most important lessons I have learned in the past.
And you asked a second question: since I'm a vendor in this market, how do I characterize my relationship with the project and the community in each role? The most important thing I want to raise here is that, as a project running in the Apache Foundation, we work in the Apache way. What that means is everyone in the project and the community wears multiple hats. Taking me as an example: I'm an individual acting as a PMC member and also a committer for both Pulsar and BookKeeper, so I have to give independent opinions from a PMC member and committer perspective, because when I am talking to community users, I am representing the Apache Software Foundation. At the same time, I'm also a vendor of Pulsar, the owner of StreamNative. So what we try to do is our best to help people adopt Pulsar, more from a partnership and collaboration perspective, because we believe we have to grow the community in order to grow any business built around Pulsar. By helping people adopt Pulsar, we get a lot of use cases, we can incorporate those requirements into developing Pulsar, and that in return helps grow the community of Pulsar. So we play multiple roles in the community, we develop those relationships in a collaborative way, and we make sure the main focus of the project is on growing adoption and making sure people are able to use Pulsar in different industries.
[00:35:53] Unknown:
And I know that one of the ways that you're helping to drive that adoption is by being a spokesperson for the community. I know that you release the biweekly notes of what's been happening within the community, and you also have a StreamNative distribution of Pulsar. I'm wondering if you can talk a bit about what's included in that distribution, and some of the work that you're doing to help simplify the operability of the platform, because of the fact that it does have so many different moving pieces.
[00:36:27] Unknown:
So from a StreamNative product perspective, we do provide the StreamNative Platform, which is powered by Apache Pulsar. Currently, the main difference between the StreamNative Platform and upstream Apache Pulsar is basically that we provide a lot of operations related tooling to simplify running Pulsar in different environments, mostly focused on the Kubernetes environment. We provide a Helm chart, we provide Golang based administration tools, and we provide Pulsar Manager. We also offer an enhanced version of the Grafana dashboards so people can really understand what's going on in the platform. That is the main focus of the first version of the StreamNative Platform.
Besides that, we also bundle Kafka-on-Pulsar natively into the platform, so for people who want to use Kafka-on-Pulsar, you can download the StreamNative Platform and get started easily. At this moment, the StreamNative Platform is purely a community edition, so everyone is free to use it. We might develop some more enterprise oriented, closed source features in the future, but we haven't decided yet. Our main focus is still on developing our cloud service and providing the
[00:37:59] Unknown:
managed service in the cloud. And I'm wondering what motivates you to dedicate so much of your time and energy to Pulsar in particular, and the streaming data ecosystem in general, because a significant portion of your career has been focused around this project and this problem domain. So I'm wondering what is keeping you interested and motivated throughout. Yeah. So,
[00:38:21] Unknown:
as I mentioned, I started my career about 10 years ago, and I saw how Hadoop, and infrastructure technology in general, can grow and influence a whole industry. Basically, the whole growth of the economy in China, especially the whole internet industry, owes a lot to Hadoop and the whole big data ecosystem. So I have seen how a technology can influence an entire industry. Then I moved from Yahoo to Twitter, and Twitter is kind of a messaging platform for the whole internet; you can think of Twitter as one of the first companies to use a lot of streaming technology. So I got into this space and saw how streaming technology can be used to help an enterprise like Twitter become very successful.
And I want that kind of technology, that kind of streaming mindset, to be delivered to more industries, to help them be successful. We have seen that some of the existing technologies didn't address this in a very great way; there are still some shortcomings, some drawbacks. So we want to use our experience, and the technology we have been developing, to help more industries, more enterprises, enjoy the power of streaming technology. That's what drives me into this space and to dedicate my energy to it. Yeah. And one of the interesting
[00:40:10] Unknown:
impacts of projects such as Pulsar and Kafka, and the overall focus on streaming data as a core component of a lot of these data systems, is that that overall design is starting to leak out into other areas of software and technology. And I'm wondering if you could just talk about some of the ways that you have seen streaming data as being an important core competency of different technology industries, and the ways that projects such as Kafka and Pulsar are impacting how those systems are architected.
[00:40:44] Unknown:
In terms of streaming, what we have been seeing is that a software usage pattern has been shifting within the enterprise. Initially, an enterprise would build out a team of people working with a database: you put in a data layer, you provide some query interface for people to query, and that created the whole database ecosystem. That evolved into the big data, or batch processing, ecosystem. But the use cases and the requirements have been shifting toward more event driven workflows.
Events can be generated from different sources. For example, when you browse a web page and click on it, you generate different click events. Those events can be used by the enterprise to analyze user behavior, to do better targeting and better marketing, and to provide better services. So we see the use cases shifting toward more event driven, or streaming driven, use cases, and that means the whole software architecture of an enterprise has been shifting into an event driven architecture, an event driven workflow. In that way, the mindset shifts from processing static datasets to processing dynamically changing data streams. Once you have this mindset shift, you need new tools and new capabilities, and that creates the whole messaging ecosystem, streaming ecosystem, and streaming toolchains. We have seen these toolchains and this ecosystem be very successful, from internet companies to financial services to retailers and to IoT.
And this is playing a very important role in current
[00:43:11] Unknown:
enterprise software architecture. I'm wondering what you have seen as being some of the most interesting, innovative, or unexpected ways that you've seen Pulsar used, and the applications
[00:43:21] Unknown:
of streaming data. Yeah. So I think the common impression that the industry has of Pulsar is that it's basically a message queue. We have a Pulsar user in China called BestPay; BestPay is the third largest payment company in China. Their use case is very interesting: they use Pulsar for a real time risk control pipeline. In the traditional data processing stack, people usually use the Lambda architecture: you build out a batch layer using HDFS or Hive, you build out a speed layer using Kafka and Storm, and you combine these two together into a Lambda architecture.
In the use case at BestPay, they basically tried to get rid of these two layers and get to a unified data processing stack. On the storage side, they standardized on using Pulsar as the source of truth: they put everything into Pulsar, so they have one common center keeping all the event streams, both the historical data and the real time data. And they standardized the computing engine on Spark, so they can do Spark Structured Streaming as well as Spark batch jobs. So they reduced the system from four components to two. By shifting the capability of Pulsar beyond a messaging queue this way, it becomes more of a streaming data warehouse. That is the most interesting and innovative use of Pulsar I have seen. And I'm excited about this use case because it is used for real time risk control, in a core business pipeline, where it really delivers significant impact to the business. So that is the most interesting use case I have seen of how
[00:45:38] Unknown:
Pulsar has been used. And for people who are adopting Pulsar, what are some of the edge cases or design elements that they are most challenged by in terms of figuring out how to architect and design their own solutions and use cases around Pulsar? So I think one of the common
[00:45:59] Unknown:
patterns, or common questions I have received in the community, comes especially from the event sourcing perspective. People have the impression that Pulsar is able to keep event data, and that you are able to keep the events for a longer duration by leveraging tiered storage, but then they look at it from a lookup perspective: they want to use Pulsar as a storage for point lookups. But Pulsar is mainly designed for streaming workloads; in other words, it's designed for scan based access. You are streaming data, you are able to process the streaming data in sequence, and you can rewind your data processing job to an earlier point and re-scan the data. So Pulsar was designed for scan oriented workloads, not point lookups. I think that is the common misuse of Pulsar, and I would like people to realize it before making any design around Pulsar.
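[Editor's note] A short sketch of the scan oriented access pattern Pulsar is designed for: the subscription is rewound to an earlier point in time and the stream is re-read in sequence, rather than fetching individual records by key. The names and the rewind window are illustrative.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class ReplayExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/events")
                .subscriptionName("replay")
                .subscribe();

        // Rewind the subscription 24 hours, then re-scan the stream in order.
        consumer.seek(System.currentTimeMillis() - 24 * 60 * 60 * 1000L);

        for (int i = 0; i < 100; i++) { // process a bounded batch for the example
            Message<byte[]> msg = consumer.receive();
            // ... sequential processing of each event goes here ...
            consumer.acknowledge(msg);
        }

        consumer.close();
        client.close();
    }
}
```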
[00:47:04] Unknown:
And for people who are evaluating Pulsar, or considering it as a component of their architectures, what are the cases where Pulsar is the wrong choice, and they might be better served with either an entirely different approach or a different set of tooling? So
[00:47:23] Unknown:
if you want point lookups on the event store, for sure, that is the first wrong choice. I don't think Pulsar is capable of doing that kind of operation well at this moment, though that might change in the future; who knows? But right now, doing any point lookups in Pulsar is the wrong choice. The second one is a pattern that also comes up a lot: Pulsar is able to support millions of topics, so people end up trying to map devices or users to individual topics, and they try to grow the number of topics to maybe millions or tens of millions. Based on the current Pulsar implementation, that is still a bad design. You should try to re-architect in a way that at least reduces the number of topics used by a single application.
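[Editor's note] One common way to keep the topic count down, sketched here under assumed names, is to fan many logical keys (devices, users) into a single partitioned topic and key the messages, rather than creating one topic per device; the key preserves per-device ordering and pairs naturally with a Key_Shared subscription on the consumer side.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class PartitionedTopicExample {
    public static void main(String[] args) throws Exception {
        // One partitioned topic for all devices, instead of one topic per device.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
                .build();
        admin.topics().createPartitionedTopic(
                "persistent://public/default/device-events", 16);

        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed broker URL
                .build();
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/device-events")
                .create();

        // The key routes the message to a partition and keeps per-device order.
        producer.newMessage()
                .key("device-42")
                .value("reading".getBytes())
                .send();

        producer.close();
        client.close();
        admin.close();
    }
}
```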
We can definitely support millions, but not tens of millions, and even operating millions of topics is still a bit challenging. So when designing, people need to think through how to use Pulsar topics and how to leverage all the good features provided by Pulsar. And the last one I would say is that Pulsar also provides non-persistent topics, a non-persistent capability. In order to use that non-persistent capability, people have to understand the delivery guarantees, the dispatching guarantees, to make sure they are not surprised by the guarantees provided by non-persistent topics. So those are the three common patterns of misuse I have seen.
As for what is next: in terms of the service and product provided by StreamNative, we are fully focused on developing StreamNative Cloud, which is the fully managed Pulsar service running on the public clouds. We want to give people a firsthand, very smooth experience for getting started with Pulsar easily. That is on the product side. On the project side, as I mentioned before, Pulsar has been evolving beyond a pub/sub messaging system, and there are three main capabilities: you are able to ingest data into Pulsar, you are able to store data, and you are able to process data. So on the project side: on the ingestion side, we want to integrate with more messaging protocols, so people are able to integrate with their existing messaging applications. On the storage side, we want to do more in the offloaders, in the tiered storage, by bringing in some additional data processing oriented capabilities, like columnar storage, bringing in indexes, and being able to leverage topic compaction. Those capabilities can help Pulsar provide better performance for the unified data processing story. And on the processing side, we want to improve Pulsar Functions by introducing an orchestration framework to combine multiple functions into a pipeline, so people can write a simple function pipeline to chain multiple functions together. We are also looking into integrating with WebAssembly, to be able to easily support different languages for functions.
In terms of integrating with Flink, we have already made Pulsar a source and sink for both Flink and Spark, and we have made Pulsar a catalog for Flink as well. The next step is how we want to deal with state management, and state management comes into play both for Pulsar Functions as well as the Flink integration. So there's a lot of things to do around
[00:51:53] Unknown:
state management. Are there any other aspects of the Pulsar platform, its community, the ecosystem that's growing up around it, or the work that you're doing at StreamNative, that we didn't discuss that you would like to cover before we close out the show? So
[00:52:12] Unknown:
one thing to mention: we had planned Pulsar Summit for April, but due to the worsening coronavirus situation, we pushed the conference to August. At this moment, the organizers are also exploring a different approach of providing a purely virtual conference for Pulsar Summit. Please follow us on Twitter, and we'll keep everyone posted. We are very excited, and we are confident that we will be able to hold a virtual conference and be able
[00:52:45] Unknown:
to show more Pulsar oriented use cases to the broader community. Well, for anybody who does want to follow along with you, or get in touch and see the other work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'd just like to ask a final question: what do you see as being the biggest gap in the tooling or technology that's available for data management today? I think the
[00:53:07] Unknown:
biggest gap is that, right now, in the whole data management space, in the whole big data ecosystem, there are still many, many components in the whole pipeline, and there isn't a good way to glue the different types of systems together and provide a uniform operations and management experience. In other words, the ability to trace an event going from the data source all the way through to the analytics in the data warehouse: I haven't seen good tooling for that. And I wish
[00:53:49] Unknown:
to see more efforts happening in this space. Well, thank you very much for taking the time today to join me and share your experience working on Pulsar and building a business around it. It's definitely a very interesting tool, and one that I've been exploring for my own purposes. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day. Thank you for having me here; it's my pleasure to share all the experience and knowledge around this project, as well as the company. And if you want to,
[00:54:20] Unknown:
like, chat with me more about Pulsar, or about streaming technology in general, you can find me on Slack or Twitter. Yeah. Thank you.
[00:54:33] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Sijie Guo and StreamNative
Sijie Guo's Journey in Data Management
Overview of Pulsar Framework
Pulsar's Role in Data Ecosystem
Pulsar Functions and Features
Community and Use Cases of Pulsar
Kafka Protocol on Pulsar
Integrating with Open Messaging Specification
Lessons from Kafka and Pulsar Communities
Sijie Guo's Experience with Streamlio and StreamNative
StreamNative Product and Community Efforts
Motivation and Impact of Streaming Data
Innovative Use Cases of Pulsar
Challenges and Edge Cases in Pulsar Adoption
Future Directions for Pulsar and StreamNative
Closing Remarks and Contact Information