Exploring NATS: A Multi-Paradigm Connectivity Layer for Distributed Applications

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.

DataFold's AI powered migration agent changes all that.

Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

And they're so confident in their solution, they'll actually guarantee your timeline in writing.

Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds

today for the details. Your host is Tobias Macy. And today, I'm interviewing Derek Collison about NATS, a multi paradigm connectivity layer for distributed applications. So, Derek, can you start by introducing yourself?

Hi. Thanks, for having me Tobias. Yeah. I'm Derek Collison.

Known mostly stays as the creator of NATS, but founder and CEO

of Zenadia.

But in prior lives, I I spent some time at Google and VMware writing a system called Cloud Foundry and,

way back in the nineties at a company called TIBCO.

For for those that were in Wall Street or the world's financial systems,

TIBCO was kind of a a heavy hitter back in the day. And do you remember how you first got started working in data?

You know, it wasn't necessarily

data per se.

But when I graduated university,

I went to work for a group called the Applied Physics Labs of Johns Hopkins University.

And so instead of the introduction to data, it was the introduction to distributed systems. But but I'll tell you that I felt kind of penalized because I got selected by the sec second best physicist at the lab, and so they had not as much supercomputer time. Right? And,

for the audience, you know, the used to either scale vertically or

or horizontally wasn't really even a thing back in, you know, the late eighties, early nineties. And,

I remember the physicist saying, well, you know, why can't you make those 12 spark pizza boxes in the basement do the same thing as, like, a Cray supercomputer or, you know, connection machine? And I thought that I was, you know, in the penalty box. But shortly after leaving

that job to move to California,

I got thrown right into the same thing. And it was kind of

very lucky by chance that I did because now everything's

a horizontally scalable distributed system, right, these days. But at the time, I just kept saying, you know, why am I getting penalized? You know, why do I have to keep doing this type stuff? And so this was mostly around moving data and access to data. And so a lot of the systems that I designed, even through the early days of TIBCO, were mostly moving

things around.

It wasn't really until

kind of RBCM,

certified messaging, which most people won't even remember, but after that, I designed a system called TIBCOZMS,

which was kind of like Kafka is today. Right? The big the big player at the time in the space of storing messages and being able to replay them through consumer patterns

was probably IBM's MQ. And so we went down the path with Sun and IBM and others around a standard called JMS and Sun, obviously. And then TIBCO's EMS was, you know, JMS compatible. But the big change step change there was is that we would store and hold on to data, not just move it through pipes. So That's definitely a lot of useful history and context for a product like NATS and what you're trying to solve there. I imagine that you also had a lot of lessons learned in the process of building and working on Cloud Foundry because to my understanding, that team was also

very closely linked with the team that was responsible for RabbitMQ,

which was a very popular queuing system and still is for the publish subscribe message pattern.

And I'm wondering if you can just describe a little bit about what it is that NATS is and some of the core problems that you're aiming to solve with it.

Yeah. So, originally, NATS came into existence

as part of Cloud Foundry, my work at at VMware designing that system. And, you know, we didn't want

Cloud Foundry to

kind of just be,

an enterprise pass, but that's how I pitched it to the the senior leadership, Paul Maritz and and, Steve Parad and and Todd Nielsen at the time that it was like a heroku for the enterprise. But we couldn't just have, like, Java apps and Spring apps and MySQL. And so we wanted to do some cool things. And so we brought things in like MongoDB and Node, but at the same time, we also brought in Redis.

And we also needed a messaging system simply because that's how I designed systems. Right? I was used to designing over top of messaging systems. And at the time, the only one out there was RabbitMQ. So we, you know, had VMware also form a relationship and actually purchased the consulting group that was in charge of RabbitMQ. And a lot of people probably know Alexis who was kind of running that at the time. But what's interesting from a pattern perspective is is that I think RabbitMQ

was designed around the AMQP protocol, which I know sounds very familiar to a lot of your your audience. But in some ways, we you know, our our time at TIBCO indirectly

resulted in the AMQP spec. Right? The financial markets and technologies were trying to make sure that they didn't feel vendor lock in with TIBCO at the time and TIBCO rendezvous, which, you know, I helped co design and and things like that, and then, of course, TIBCO's

EMS.

But what Rabbit would do is is if you ask it to do something, it would try to do its best to do what you asked it to do. And that, on its surface, sounds really, really good. Right? Sounds like something that you would want to have happen in a system. However, imagine if the metaphor is, like, I could plug in my crazy blender that I designed on my three d printing machine and blow the whole neighborhood's electrical power out. Right? You know, people would kinda look at me like, no. I mean, you know, it'll trip a breaker. It'll throw a GFI. You know, there's ways to protect that. And what I was really looking for was kind of a dial tone, a a system that no matter what was being asked of it, it would protect itself at all cost. And believe it or not, I know it sounds very subtle, but there's a profound difference in the way NATS came to life around that principle and a couple of key features that I needed that that Rabbit simply didn't. And so we made a decision at the end to

utilize NATS for telemetry,

command and control, and querying of a Cloud Foundry, you know, the Cloud Foundry components,

which in the early designs were pretty simple as a cloud controller and a bunch of DEA's, which is an interesting,

way to say engineers should not name things. Right? We were it was, I think, 2AM, and me and Vadim Spivik, who

came over with Mark Lakovsky and I to VMware from Google. We're sitting around 2AM and, like, what are we gonna call this thing? It's like, oh, well, it's cloud. There's droplets in cloud. We'll call it a droplet execution agent. So it was called DEA. And so that's how NATS was born. So, again, it was protect itself at all cost, be like a digital dial tone, or be like the electrical grid, meaning it's always on for everybody, but one bad actor can't take the whole system down and be destructive for everyone else. And there was a couple of the patterns that RabbitMQ did not do that I really wanted, which most people in the modern NATS ecosystem

don't take advantage of. But one of the ones was I wanna ask a

unknown set size a question, and I only want the first answer. And you can imagine a world where if the messaging system is pretty fast, and let's say there's 10,000 or a hundred thousand of these things, that the client that asked the question

gets the first answer, but has to throw away 99,999

other answers. Right? So there's a CPU spike and, of course, there's network costs and things like that. And so the original NAT had protected sub at all cost as part of its DNA,

do the normal things, PubSub, which for the for those that kind of think PubSub is outdated and I would agree with you, don't think of it as publish subscribe. Think of it as location independent addressing, whereas, like, IPs and DNS, for the most part, are location dependent. And if they're not, they're doing unnatural acts to get around that, any cast or DNS tricks or whatever, load balancers GSLBs. It had the circuit breaker pattern that I talked about, which is I'm gonna ask a question only with the first answer, and I want the system to circuit break all the way through the system regardless of how many components are in there so that me as the actual client, I might get, like, three responses.

You know? And I only really want one, but I own the client only has to drop, let's say, two extras or something like that. And then the last one was,

distribute queuing, meaning not queue in terms of store something on disk and and replay it, which came later in the NETs, evolution. But from a simplistic standpoint, it's I can have any number of

entities listening, let's say, on foo, but they all join a certain group, let's say, bar. And what it's telling the system is is that when a message comes in on foo, only deliver to one of the bar recipients. Right? And so this is

not necessarily brand new. It it exists in other systems. But the fact that it was totally dynamic, meaning you didn't have to change any configuration of the NAT servers or NATS components or anything. It just kind of worked from an application perspective, and you could create, you know, hundreds of thousands or millions of these all over the place. That was a really good one. And and for those in in the audience, right, this is our version of batteries included where you don't need load balancers and GSLBs and things like that. And so that's how kind of NETs came to fruition. We originally thought and and pushed pretty hard that Cloud Foundry would be the first open source project that VMware did, and that was a big step change for the company. To coincide with that, you know, I released the original NETs project as MIT. I think it still powers, you know, most of the telemetry and command and control and clearability of of cloud founder systems as far as I know. I I'm not totally sure, but I think it does. And in messaging and distributed

data systems, there have been a vast number of different protocols, patterns,

implementations

of those different core capabilities

from the perspective of lightweight,

distributed,

you know, easy to interoperate with. Another system that comes to mind is ZeroMQ,

which doesn't have that centralized

command and control capability as much as it is just you implement it and build whatever pattern you want, and it's application dependent there. But you also alluded to Kafka, which has taken the the data ecosystem

by storm from its inception, and there have been numerous other versions of that and implementations

to target that same

interface for being able to be compatible with the broad ecosystem that has grown up around it, but address some of the operational shortcomings that were inherent to its initial design.

And I'm just curious

from

your first implementation of NATS and as the evolution of messaging systems and use cases that they're powering have gone on through a few different generational shifts, what are some of the other sources of inspiration that you've drawn on to understand

where to bring NATS from a feature perspective, but also any of the

ecosystem

investment around it that has been core to maintaining its viability and appeal to a broad audience?

Yeah. I mean, that's a great question. I think the the next big piece in the NATS ecosystem that that we really bet on was is that and it was the first kind of tagline that Sanadia had, which was connect everything. And what that meant for us is is that the servers couldn't look like these massive brokers that were in printer security models and could only run-in big machines on, you know, on the cloud or in a data center or whatever. It was literally that we had multiple ways to connect servers together into pretty much any topology you could imagine that could stretch between any region, any cloud provider or multiple cloud providers, which is a a pretty common one for our customers in Europe these days, out to the edge. And

when we were looking at the evolution

of NATS in this way, we felt that whatever edge was and we kinda defined it on our own as near, far, and beyond, near edge's cloud providers saying we can do that too.

A little bit of FOMO. And from

my perspective, cloud is is

not going away anytime soon, so don't misinterpret what I'm about to say. But I think cloud is already the new mainframe. People just don't know it yet. When you interact with technologies and you say, oh, I'm interacting with the cloud, I can probably tell you that 90% of the time you're not. Right? You're operating with CDNs or or more specialized edge providers, but and that's kind of the the the far classification is the Akamai's and Fast Lease and some of the newer entries in there like Netlify and Dino Deploy and Purcell, of course. But where it became really interesting is is what happens if Edge is what I define an Edge as. You know, it's my web browser. It's my phone. It's my factory and my distribution center. It's my cellular tower. It's my electric vehicle. Right?

Then the rules become very different. And and I really do believe that when we went about making, you know, these servers more like cattle than pets, if you remember that that thing, we also really worked hard on how you can glue these things together to form a topology across any of those, deployment infrastructures.

So that was kind of the the the the big next step. And then to the meat of what you were asking about was and we knew this from our time at TIBCO, which is we eventually have to store messages.

But in our opinion, that had to be a compliment of the at most once delivery mechanism. So zero and q is mostly at most once. I I I mean, and of course, Kafka is at least once or exactly once. And so what happens is that if you're in a Kafka ecosystem and we've seen this through some of our customers where they say, oh, we wanted to use our microservices, but we're storing everything on disk, the request and the responses,

and it's just too slow. It's kinda like asking Google to do all of their logging and processing before they give you Google search results, you know. I don't think you would get anyone inside of Google that would look at you and think that's a good idea. Similarly, with AMQP and NATS as it originally existed, you know, eventually, you need to persist data. Right? And for the evolution within the NATS ecosystem, that was a technology called Jetstream. Now we did do an experiment with something called Stand, which was very interesting because we wanted to dip our toe in the water and see, you know, if people were really, you know, wanting this,

functionality, and and

the clear answer was yes. And so Jetstream now allows us to do both at least once and at most or at least once and exactly once processing, but NAT score still exists. So if you wanna do high speed global micro services where all you're trying to do and by the way, when people ask me what Cineadia does and it's a very short elevator ride, I say, we're a technology stack that allows you to dynamically

and securely

reduce latency to access services and data from anywhere. I mean, that's really all we're really, really doing. And so you can imagine you're trying to vertically scale or horizontally scale, right, a problem. But if the problem exists all the way on the other side of the world, right, and you're tromboning back and forth, let's say, you know, I'm on the East Coast Of The United States, you know, tromboning to, let's say, Japan,

you know, you feel that that's an experience that that you're gonna detect that latency, and you still can't, break the speed of light. And so that topology and then, of course, the data piece with Jetstream were the two big big things that we kind of started Sanadia, you know, little over seven years ago. There are so many different directions I would love to take this conversation. But given the audience that we're addressing, I think another very pertinent topic to explore

given the core technology and the technology and the availability of a persistence layer is the idea of what has typically been a fairly clean divide between application and operational systems.

This is something transactional. I need high throughput. And your data systems, which are typically batch oriented, obviously, there has been the introduction of stream processing, etcetera. There's also the the trend of doing user facing analytics, but those are typically still powered by different systems than your application stack. And the default mode of thinking about these is that you store all of your transactional data, and then you copy it somewhere else, then you do some other stuff with it, and maybe it makes a round trip eventually.

Core to that is the idea that the technologies powering those stacks are largely disparate. They're focused on different scalability patterns, different access patterns.

Kafka is one of those technologies that changed the thinking a bit, but I'm wondering because of the fact that NATS does support both that AMQP style or Redisq style, I just need to push something and get something. I need fast messaging as well as I can persist this data and replay it over and

over, how that changes the ways that teams think about their overall architecture going from application through to data systems and maybe some of the ways that they can reduce that unnecessary

copying.

Yeah. I mean, the the the way we approach the problem was we definitely felt that inside of NATS, we call, you know, something that collects messages a stream. Right? The low level concept way underneath the covers. That stream with a bunch of messages that you can, you know, replay at will, you can do all kinds of interesting things in terms of consumption models and at very high speeds would definitely still be applicable.

But we felt that there should be something in between

until you go to something like Kinesis, which abstracts out the semantics of of what's in the stream. And so when we designed Jetstream, we said, you know, you're gonna be able to store messages, obviously. You're gonna be able to syncurously replicate them for You're gonna be able to asynchronously replicate them in multiple different patterns, for, like, digital twins. We call them mirrors or source mux and demux, which are really good for, like, IoT or or fleet types of,

design patterns. But what we also felt we wanted to do was

build materialized views. And so we've only built two, but they're fairly powerful. One's a key value store, and so you can treat Jetstream as a construct as just a key value. And because it's built on nets, which, you know, underneath the covers is powering a lot of the functionality,

things that

may or or may not be difficult in other systems were trivial for us to do, which is you could have a history as long as you want it. Now you can say, oh, I want a hundred of these values. So every time you update foo, keep, you know, the last hundred values. And more importantly, the notion of an async push based watcher or pull based, but push based, meaning, I don't wanna keep pulling for something. Just tell me anytime Tobias changes this key or someone changes this key. And the key can be changed from anywhere in the world. The key could be living, you know, somewhere, let's say, in in Asia, Pacific region, but there's digital twins in North America and Europe. And I'm in North America, and I just wanna be notified whenever that key changes. So for us, that was very trivial to do. Now we didn't think, and and by the way, Salvatore, for those online, Antares, right, the guy, who's coming back to Redis, but, you know, was the original creator of Redis, is a friend of mine. And we worked closely together,

when we have Redis as a component of Cloud Foundry and sponsored a lot of the early work and brought Salvatore over to The States and stuff. We did not think that anyone would ditch Redis for what we were doing because we're look more like Memcache. Right? The values were an opaque blob, you know, x y z.

But we were very wrong, and what we found out very quickly was is that people do value the functionality of Redis a lot, especially the value semantics.

Meaning, the value can have a shape. Right? It can be a counter. It can be a list. It could be a map, you know, those kind of things. But in critical production scenarios,

clustering and stable storage trump it. Right? And so they these early customers said, we really want you to go into value semantics, but we're gonna switch now anyway because

we need to be able to depend and have flexibility in the clustering, the way the system could cluster together across different deployment models, and then also the notion of stable storage, you know, being a first class citizen. And so we are leaning into value semantics for key values, but key value is a really, really big pattern for us. Global microservices

without having to have all the load balancers and service messages and API gateways and, you know, 15,000,000 different things,

going on there. Those are really two big, big key areas where people look at what NASS provides as a potential central nervous system, and the sole purpose is to be able to securely access

services and data from anywhere at low latency, you know, as we discussed. Another interesting aspect

it can potentially replace whole categories

of technical systems,

albeit with different operational characteristics or different guarantees, but it can also be very

complementary to those same systems. So, again, I'm thinking in terms of Kafka, Pulsar, etcetera.

You mentioned Redis.

You also mentioned service meshes. And I'm wondering for teams who especially have already done the substantial investment in their operational capabilities. They've built their application

quite fit the patterns that they have addressed, that doesn't quite fit their technology stack. They turn to NATs and say, oh, hey. This actually solves my problem perfectly.

What do you see as the progression

of their of the rest of their technology stack once they bring NATs into

their operational,

characteristics?

Yeah. And so it depends on almost the the breadth of the investment from an a scope and and operational aspect. So what I mean by that is is we do see lots of

situations where folks are trying to get data on the edge into the cloud into Kafka. And, you know, the first thing that people might consider is, oh, we can run a Kafka broker on the edge, and then we can use MirrorMaker or some other technology, right, to kind of do that. And a lot of least Sanadius customers found that extremely unwieldly. And so they said, Nets is doesn't need any special security stuff to be secure, so it doesn't need a perimeter security model. It's not Java based, so it's, you know, binary is a lot smaller versus, you know and it's a single binary versus setting up something like Kafka. Now I don't think Kafka requires Zookeeper anymore going forward, but, you know, used to. And so we would see a lot of customers saying we want our applications to talk to NAT slash Jetstream at the edge and store it locally, and then NAT is responsible for getting it to the cloud. Right? So that the application doesn't have to kind of worry about that. Now, of course, if you have really clean pipes to the cloud or you're running in the cloud, right, your edge is actually, you know, that near category that we were talking about, that might not be as

big of a problem.

But when you're going across cell links or satellite links or combination of multiples of those, right, depending on which one's working better, what they really wanted was this separation of the applications running at the edge, not having to be bothered with, I generated the data. You just take it. You're responsible for getting it to the cloud. Don't make me responsible for that. So with Kafka, we've seen a lot of complimentary things where we will run at the edge, and they're still running massive Kafka clusters, or a a cluster within the cloud. In microservices,

we see, you know, a slow sea change when they kind of drink the Kool Aid and they go, oh, wow. We don't need all these, you know, extra moving parts.

We'll see slowly

migrating across there, but it usually replaces or at least customers indicate that they're gonna replace everything. And some customers indicate they're gonna replace Kafka too once they get comfortable with the technology, and they go, well, why are we, you know, spending this much money and all these people to babysit this massive Kafka cluster when we think

Nadia's tech, you know, the net stuff could could do it. The other one, though, that's interesting is, like, with KBs,

most of them, if they they talk to us, they'll leave spread us around. But, again, they will signal that, hey. Eventually, we'd rather just run this. And I and I do believe, at least for the audience, you know, my career started in the eighties, so I I'm I'm getting up there. But what it has given me is an interesting perspective that the technology landscape goes through these massive waves. And one wave is on the way up, which is I want more tools in the toolbox. And then we reach an apex, and then we go, holy smokes. We wanna simplify. We don't want all these moving parts. We wanna you know? And what's interesting is right now, we're in a massive simplification

down spiral. Right? And so customers that come to Synadia, a lot of times are in pain. They can't seem to solve a problem. They're usually out at the edge. But once they've solved it and they realize that the system can take on a lot of different functionalities,

it helps them in that mindset of, hey. We're trying to simplify, reduce costs, things like that. Point of when you're running

far on the edge on a mobile device,

in electric cars, you mentioned there are

huge variabilities in terms of network latencies, network availability.

And so that brings in the question of being able to

store messages locally

and deliver them opportunistically

as you have connectivity,

which a lot of the data oriented systems typically don't factor in because they're assuming that everything is running on

massive cloud hardware. You have high bandwidth network. You have reliable connectivity.

If you do have a network outage, then it's a bigger outage than just one message, and so you have bigger problems to deal with. And I'm wondering how that also

maybe changes some of the risks of managing messages and some of the ways that teams think about how to manage deliverability, where if they say, oh, I can throw it at my local NATS agent. It'll eventually make its way to the rest of the cluster. I don't have to deal a lot with error handling retries

versus

teams who are dealing with maybe a a a Kafka client, and they say, I have to make sure that this message gets act by the Kafka cluster when I generate it. Otherwise, I need to retry just some of how that changes also the the velocity at which teams are able to move.

Yeah. And I I mean, I think you bring a good a good point, and and we used to call this tromboning. Right? You always wanna try to reduce tromboning. So even if you have a fast pipe and a decent RTT, let's say your RTT is 10

to the cloud. Right? At least within NATS, when people look at Kafka and NATS, a lot of times people go, oh, Kafka can ingest messages faster than than NATS. And it's because it's an apples to oranges comparison. And that's today, every message you sent is acknowledged by the system. It's it so it says every message is important. You know what I mean?

What Kafka usually does is it it immediately batches. And so it introduces latency to publish things as it's putting them together, and it has, like, a, I think, a temporal window, and it figures out which, you know, broker has, you know, that partition x y z. So we don't necessarily do that. But in general,

what I think people

look at is they don't want core data or services to have to be trombone to the cloud in, let's say, a factory or distribution center where, you know, fifteen minutes of downtime is extremely expensive. Right? So even though they go, yeah, we pay for all this good networking equipment, unless something physically happens in the factory, that thing has to be able to operate no matter what else is going on, whether, I don't know, Azure or one of the other big hyperscalers has a blip or there's a network outage or something like that. And so the fact that everything within the NATS ecosystem can run on big hardware to small hardware, all the all the way down to ECUs running inside of a vehicle that are running both core messaging and JetStream, you know, with synchronous replication, consensus algorithms, everything, you know, kind of got coming and going, that seems to be very attractive to our customers. They like that. And so what they then eliminate the trombinding to is the NAT system that can self heal itself. It puts itself all back together automatically. And so you can imagine a world where, you know, the application knows it has to produce

some piece of data. And let's say, you know, it says that I'm I'm supposed to persist this data.

What

we like pattern wise is don't ever assume who's gonna use that data tomorrow. Right? So if an application

that's

creating the data or publishing the data or, you know, whatever you wanna call it, if you are looking at the code for that application and it is aware of who's receiving it, I think that's an anti pattern. I think that's bad in in modern distributed systems. And so what we allow applications running in a factory to do is say, I have to, you know, store this thing. And they literally just send a message just as if they were gonna send a message for a microservices interaction, which is just NAT score. And they wanna get a response back. And they get a response going, got it. And they're like, great. I can go and do my other stuff. But underneath the covers, that was stored locally, and let's say in a closet, you know, an IT closet or or on the the shop floor. And then Nat takes over and says, oh, wait a minute. There's supposed to be a digital twin, you know, a a Muxer, meaning it pulls from all of the factories or all the cars or whatever, and it puts itself together. In some other words, it self heals itself. It makes sure there's no duplication, and so it will get it to there. And again, that's that don't assume who's gonna use the that message tomorrow because

there might be another process in the cloud that's running it. We might wanna stick it in Kafka. We might wanna run some of our own analytics, whatever that is. And so I think that those choices and those patterns

lend well to edge based distributed systems, which I think, in my opinion, the stepwise change from data center or server rooms. I'm dating myself, but I used to, you know, in server rooms with a big park, you know, hat on or whatever. When we think about how differently we architect in the cloud and cloud native or whatever like that is, I really do believe that the step change from cloud to edge will be even bigger.

Meaning, all of the rules that you have in the back of your head around, oh, I can do this in the cloud, a lot of times they just simply don't apply. Right? Because cloud providers,

and this is not a bad thing, it's just not spoken a lot. You know, they want to be Hotel California. They want you to check-in, but they do not want you to check out. Right? And so it's very hard for you to pick up not only your apps and your data, but all the stuff they provide that made it easy to get started, you have to figure out how to duplicate that at the edge. Right? And so if you come into a small IT closet with a small little nook running and go, yeah, we're gonna set up Kubernetes and we're gonna have ZooKeeper and JVM containers and Kafka clusters, and they kinda look at you like, that's not how this edge at least works. Right? Think about if you wanted to do that inside of a vehicle, you know, and run these things on ECUs. It's just not gonna happen. And so Sanity offers a solution there. But what we've seen with some of our customers that have been with us for two, three years now or so is that the pain point was an edge initiative. Right? Something. Right? But once they kind of

get that working, they start to actually back port the architecture

back into their cloud presence. And so they start removing things like SQS and, you know, any type of thing that the cloud provider has built in that you can't just pick up and move when you want to move. And so we've seen that quite a bit with some of our longer customers. There are a couple of interesting patterns that come to mind talking about the reliability, deliverability question,

as well as some challenges from a system design perspective in terms of the NATS architecture.

So just briefly touching on that, what comes to mind is if you have clients

that have

unpredictable connectivity,

they could be, although, always available, or they could be heart beating once a day or or longer is not knowing not being able to predict whether that client is just having unpredictable network coverage or if it's just completely gone and dead, and you should never expect to hear anything back from it again, which makes it complex from a application and data perspective of, I think I'm gonna get data from this client, or I'm just never gonna get any more data. And then on the producer side, I'm thinking in terms of some of the patterns of, like, click stream analytics where you might send them to a segment or a rudder stack, and I wanna make sure that that message always gets through. I never want to drop any messages because this is core to my business observability, and I need to make sure that I know every event that every customer,

engages with. And so I can imagine that NATS is definitely a very useful utility in that regard, but, also, you still have the problem of, well, what if that server goes away and it had a a batch of messages that it didn't get to send because of some network

connectivity problem.

Yeah. I mean and you raise great points there. And I I think, you know, modern distributed system architecture,

you know, we still draw the same boxes and squares and triangles on whiteboards, and, Sanadia and,

NETs don't wanna change that. Right? So you're still microservices and key value stores and object stores and relational stores, which we don't do, but we power a lot of the clustering technology for a lot of relational stores out there. But I think it's interesting

when you think hard about, let's say, a microservice. Right? And let's say I we're talking about the the responder. Right? So I can get a request and I can send a response back. Earlier in in the podcast, I talked about one of the core tenants of NATS Core was distribute queue groups. Right? And so for us, just combining

that and packaging up to to what we called micro. So micro is just a a mini framework inside of our clients, but they automatically

do, you know, distribute the queue groups, you know, meaning you can run these everywhere around the world. And what's nice is is that the NAT system is updating itself all the time in real time. And so if I blip and I go offline and you send a message that's supposed to be for a Derek, but let's say there's 500 of us running around, you know, if I go offline, the system knows that pretty much instantly, and it so it won't deliver it. And if all of a sudden, I reconnect, and it's like, okay, you're back, and, you know, you're you're available for for new messages. Combine that with obviously Jetstream, which moves from at most once to at least once, and then exactly once. Exactly once is is a is a tough one. Right? I've been in those

wars, so to speak, from very early on. You know, for our perspective, it's kind of a bifurcated problem. There is a how do you do very efficient and semantically correct dedupe. In other words, I sent it, but I never got a pub back, and so it was a blip, and now I'm reconnecting, and I send it again. I don't want it in the stream twice. Right? And so that's one part of the problem. The second part of the problem is is that you only receive or more technically process the message once on the consumption side. But in that, which is different than some of these other systems is we went down the path that there could be multiple consumers for any given message. And so things like very simplistic, easy to rationalize

DLQs

don't exist out of the box in NATS because, believe it or not, it's an anti pattern. In other words, if all of a sudden I Yeah. You know, I think I'm the center of the universe and I couldn't process the message, so I want this message to go into DLQ. In the NAT system, there could be lots of people that they they had no problem processing a message. It's just me. Right? But when you combine this notion of this the the NAT systems are updating, they always know when things are offline or online all the time in real time or the speed of light real time, combined with the ability to move from at most once to at least once and exactly once semantics, gives you at least a a decent amount of patterns to figure out what you wanna do. And so a lot of times, you know, we we get on calls with customers and or users who have been using that for a while, and we always try to start the conversation with, alright. You're gonna tell us how you've deployed the net servers and what the apps are doing, but let's start with what problem are you trying to solve. You know what I mean? And what are the requirements of of of the system? And then the last question I always ask is if everything goes perfectly right, how big will the traffic be in two years? Which nowadays, it used to be five years. Now, probably, it'll be six months the the way the technology landscape is going so fast. But we try to to cultivate all of those and both design the platform side in conjunction with the domain experts, the customer or the user, but also,

help them understand

the what's versus the how. So the what's is microservices, key value accesses, things like that. The how is where Natch changes the game, but it's not changing the what. It's just changing the how. And so in very

sophisticated, very, very high level critical

deployments,

you know, we spend a lot of time with the users and the customers around how we're actually doing the things under the covers and what to look for, you know, when when things kind of go wrong. One of the things that NATS did that was by design, but I kinda regret was we try to make it so easy to spin one of these servers up to create a cluster or whatever. And we have a lot of folks that come to us and go, yeah, you know, we we're having issues, you know, it's been running fine for six months, but they never change it. They set it up. They never change it. The app tier is changing and growing and expanding, and they're like it's kinda like taking a pickup truck and loading it with two tons of stuff and go, why are my tires flat? And so that's kinda bit us a little bit, with people going, this is still a very complex distributed system where you might be running in multiple regions, cloud providers out to edge locations with, you know again, in most what semantics, at least what semantics,

mirroring digital twins, all this stuff. But we love working with super hard problems with these customers, so it's a good thing. But it is funny to see. It it it repeats itself. Another element of the overall problem is the

the guarantees that end users should expect where, again, with distributed data problems, there's the cap theorem. It's unavoidable.

And so you can either say you have,

serializable consistency. It's always read after write no matter where you're writing to or reading from, or there's eventual consistency. And so you can read your own rights as long as you're talking to the same node. And then there's also the challenge of, as you were alluding to at the beginning, horizontal scalability where you want to be elastic. You don't want to say, I start small and then, oh, I need to be be able to get bigger. So I go the next size bigger, but now I can never collapse back down because then I start losing data. I start losing messages.

And so just some of the ways that the architecture of NATS addresses some of those problems of horizontal scalability and elasticity

of the of load as well as some of the guarantees

that it offers beyond just the message semantics.

Yeah. I mean, we you know, for for core NATs, right, you could probably scale up and down, but we usually don't recommend people to do that, because most NAT systems these days are running Jetstream.

And so we have, one customer that just yesterday was asking, hey. You know, our system got into a kind of a funky state. And what they were trying to do was they were trying to scale it up. And so the way we designed the Jetstream layer, it has when it's in clustered mode, it has the notion of what's called a meta layer, which is a

membership layer. Right? And it also does all the CRUD operations for streams and consumers and things like that. But it can add things. So if it sees something new, it goes, oh, okay. Well, I guess, Tobias is a new member of our group. Great. You are welcome in because we saw you. We saw you. You're connected to us, meaning you had the credentials to get connected. We saw you heartbeat. We know who you are, x y z. However, the inverse is not true. If you all of a sudden go away,

meaning I can't see you anymore, I can't assume that that means you're gone gone and remove you. Right? And so I think what this customer did was they then shrank the system down and it just locked up. And the reason it locked up was is, like, I can't make any progress because I can't make a quorum. We used to have six, and now we only have three. We need four. And it's very counterintuitive

when we say, hey. Start one more server, and you'll be fine. They're like, what? You know?

But they they go, well, why wouldn't you just auto remove it? So we walk them through. Okay. Let's say one of your things goes offline. When should we remove it? After thirty seconds? After two minutes? After an hour? You know? And I and I tell them, hey. Every customer will tell us a different number. So the answer is we can't. But we have architect systems that can shrink down because we have the mechanism to say no.

Tobias is leaving. Remove him from the set. Right? There are operations that can do that. But if someone who looks at Kubernetes and looks at NATs and just goes, oh, I just wanna scale from a three node cluster to a six node and back down to a three node, NATs kinda you can do it, but you need to know all the moving pieces. You can't just hit an easy button on both sides and say, yeah, that should should work fine. And we do, to your point of reads after writes and things like that, we do quorum based semantics right now. So we can do stale reads if we're in what's called direct get high speed, direct gets used DQ. So that same thing from that score, and hopefully, the audience is discovering a pattern that our data layer, in our opinion, is special because it is based on the core layer. It's taking advantage of all the stuff, location independence, the DQ, all that kinda of stuff. But in that instance, right, you know, if all of a sudden you have five members,

you only need three to acknowledge the message, and we can get back to the publisher saying you're good. And then they could issue a direct get read saying, I wanna see that sequence number. And it could go to one of the rep because that's online, but it just hasn't processed it yet. Now, obviously, you could say, no. I only wanna talk to the leader. But a lot of times, people are and this is not new, but people are always in conflict of speed versus, you know, consistency

or quickness. Although, you know, every time I see someone write a message and then try to read it right back, I go, you have the message in your hand. Why are you doing that? But they go, I need to read back exactly what I write. And so we can set those up, and then they go, hey. Why did the performance just drop? It's like, well, all your reads now have to go to, you know, the leaders. And so we're working with customers around some of those issues. Nothing has been fatal, but we have been called out a couple times now, you know, being totally transparent and saying, hey, I tried to read what I just wrote and it was different, and I have a five or a seven node cluster and, you know, the replication factor is, let's say, seven. And so we're looking at that. We also get a lot of people going, I just wanna throw data at you as fast as I can and have you store as fast as you possibly can. Meaning, they want kind of the Kafka ingest, meaning I don't care if it's takes a long time to get an app back. I wanna be able to batch. I want you to be doing all kinds of tricks. And if something really bad happens, I can I'll lose data. And it reminds me of a a a story at,

Google where I was look I didn't work in the ads group, but I looked at some of the ads stuff for a while on the behalf of my boss at the time. And I said, hey. You know, if x, y, and z happens, you guys are gonna lose everything. And he goes, how much would we lose? And so I said, well, I think you'll lose, you know, at least an hour, maybe two hours worth of stuff. And he goes, do you know what we would lose if we didn't go at that speed?

And the number was astronomical, meaning you know what I mean? They chose speed over if something really bad happens. They might lose some of the the CPCs and all that other stuff. And so it's always a trade off, and that's again

why I really if I'm on a call for the first time with a customer, I always go, I I get it. You guys have done a lot of stuff, but let's back up. What's the system supposed to be doing, and what do you care the most about? You know what I mean? What's the most important? And it usually makes folks think slightly differently, you know. And and then I'm not saying they're minutiae, you know, issues down below, but they think very differently of, oh, yeah. I was trying to get it through this, and I just should be doing it a totally different way, which I think is is healthy. Right? And it's it's good. And all of our best results, by the way, come from, you know, the Snady team and our expertise at but not only NATS, but distributed systems with the domain experts. Right? You know, that those those are always the the best solutions. Digging a bit more into the ecosystem

around NATS, obviously, it's very straightforward to get started with. It's very flexible.

There are

a lot of layers of software that have been built up over the decades. And so if somebody has a solution that is mostly working and they say, actually, I need what NATS is giving, but I don't wanna have to rewrite this portion of my stack. I just wanna be able to plug it in and start using it right away. So I already have Kafka, but I just wanna use NATS because it gives me better flexibility.

Or

I'm already, you know, using something like a

a a log Taylor, but I actually just wanna send my messages directly to NATS instead of worrying about writing to disk and then having it get read back. And just wondering, what are some of the characteristics of the ecosystem around NATS to be able to provide that flexibility of adoption

without having to do a huge investment specific to NATS versus some other generalized pattern?

Yeah. I mean, that's a excellent question. And I think, you know, internally, we call that moving from

greenfield to brownfield. And so

what I mean by that to the to the

folks listening is

greenfield is the pain you're experiencing is so high that you'll rewrite everything to fix it. Right? You know what I mean? It's like you'll take a clean sheet of paper. And, again, we don't change the what. So all the boxes and squares and triangles all look the same, but the how is very different. And Synadia

and the NATS ecosystem in general are definitely in a transition from greenfield to brownfield. And so there's three things,

that that are going on there. So right off the get go, and this has been around for a while, we built a bidirectional Kafka NATS bridge. Right? That now has turned in an ecosystem of I think we're up to a hundred different connectors, which we're about to release

to connect bidirectionally to pretty much anything, s three, Kinesis,

SQS, you know, Mongo, Splunk. You know, we got all these types of things. So that's kind of one area of concentration there. The other big one is is that and and we told this to investors early on. One, we're gonna be misunderstood for a while, and you have to be patient with that. And two, if we're ultimately successful,

most people won't interact directly with the the NATS technology itself, meaning we will brownfield the last mile. And so about three years ago, my my numbers might be off here, Tobias. But I think about three years ago or so, we introduced native MQTT support into the server. So you can just take an MQTT client, just point it right at a net system and it works. And so that's that last mile for the factories, and and I think there's a renaissance going on in manufacturing right now where it's combining the high level information flows with the low level data coming off of all the machines, the sensors, things like that, which a lot of them not all of them, but a lot of them are MQTT enabled. And then, of course, the other big one, which

this might

surprise

folks who know of NATS or or know of my opinions, I don't think NATs everywhere makes sense. The last model is HTTP, which we have for Sanadio customers. We have an h bidirectional HTTP gateway, so you can literally curl commands to put stuff in streams, take things out, interact with kVs, you know, x y z. I think all of those are good. Where I get where I struggle or get confused, I guess, at at what people are really trying to do is when they go, no. We'll we'll do

HTTP for everything. And again, I'm not necessarily against HTTP. I just use

use the right tool for the job.

And I was at a conference,

a few months back, and someone said, can you give me a good

example of that? And I said, think if AT and T used, you know, wireless technology for everything, not just, you know, to to last mile connect to your phone. Or think about if NVIDIA's GPU cluster just used, you know, 10 base t networking, you know what I mean, which is the equivalent of, like, an HTTP type thing. And it kind of starts to resonate, and people go, oh, I get what you mean. So I think there's a place for something like NATS that is, you know, at its core, location independent. You know, it's end to end, not one to one, and it's not request reply. It's multiple paradigms,

meaning push and pull. Then combined with that, the data layer on top of it, that can

be very, very powerful. But at the same time, if you wanna just use fetch from a, you know, a browser to get something, you know, you should be able to kinda do that. Now we do have JavaScript clients that run-in the browser, and so we do have customers that take NATS all the way into the browser, right, and are even synchronizing databases through CDC feeds right over, you know, into the browsers type stuff. But we also have people that say, no. We want the browser to to a customized HTTP

API to the back end, and then it flips over to to NATS back there. And both of those are correct depending on what you're trying to do. And so things like WebSockets

and, you know, MQTT and HTTP as these last mile extensions make a lot of sense. Now where we start to

not butt heads, but kind of have philosophical differences is that when people go, hey, we expect everything to be architect on HTTP, and then we build our complete security story around that. Everything has to go through a WAF or something like that. That's where we kinda scratch our heads and go, that that doesn't feel right to us. Right? And I even believe one of the big paradigm shifts going from cloud at out to edge, you know, the far edge is is that I don't believe

that world,

perimeter security models will function the same way or at the same level of intensity they do in the cloud. And again, for for the audience, it's kind of like NATS is more like

your iPhone or your Android phone. If you paid your bill and you have your SIM card in, you are a production user. Doesn't matter if you're at your house, at work, on a plane, you know, traveling.

It just works. And that's just kind of like that. Now we work with perimeter security models. Right? We can definitely, you know, mix and match, but we believe that the weight that people put on perimeter security models today

is too much, at least in our and and the same thing with everything has to be HTTP. That way, we know our security story, which is

a WAF man in the middle, you know, essentially,

approach. We just feel that that might be a little short sighted. Your mention of interacting with NATS via the HTTP

bridge and being able to

effectively treat it as a back end or a a portion of your back end system also put me in mind of some of the patterns that have been growing up in some of these streaming systems

of having sort of an embedded functions as a service

use case. So I'm thinking in terms of Red Panda being able to use their WebAssembly

binaries that you can plug in to say, perform this operation on every message that hits this stream or

the Pulsar functions capabilities or also just generally the Lambda or OpenFaaS. And I'm wondering how you think about that in the context of NATS and effectively

decomposing your application back end as well to just a series of operations on messages that just get mutated and put back into at the same or a different queue.

Yeah. I mean and that's one. We recognize the importance of that.

And I even was like when WebAssembly started coming into, you know, Vogue, I was like, let's let's maybe consider this. But the reason that we haven't done that yet is, one, being able to debug something like that. You know, you know, let's say, us trying to debug something running in a factory or an air gap facility or something like that. I I think a lot of people at Cenadia might quit, right, if we did that. But the other one that that might not be as

obvious to the audience is is that if the central nervous system and, again, most of our customers will say we're their central nervous system. The central nervous system's components that are moving things around and storing things are going to be very small compared to the number of things interacting with it. To give you an example, one customer has, I think, six servers, you know, and they're supporting a hundred thousand connections

of this fleet. Right? And so

where we've come down on is this bunch of our enterprise customers were really asking us to think harder about auth z and auth n. We have our own root of trust. We have our own things, but we can interrupt with m t l s and things like that. But they're like, no. We need to be more opinionated

on auth z or more opinionated on auth n. And so what we designed was a system, this is about two years ago, called auth call out, which you tell the system, hey. I'm instructing you

to not do auth z and auth end, and I want you to send a message on this subject in this account, and what you're gonna get back has to be digitally signed. It's encrypted. So, you know, it's a zero trust construct. So there's a lot of, a lot of,

complexity,

unfortunately, to get that level. But essentially, whatever I come back with, that's what your Aussie and auth end constructs are. So what we've designed, but we haven't started implementing it yet, but it might come in 02/12, is this notion of generic callouts, which is any message inbound on a subject can be dictated, say, you have to call out to this protected

subject that no one can access. You can listen for it, but you can't ever publish messages on it. And then you return

a indicator of what you want us to do. So you might be able to return just a little check mark on the edge. It's fine. Just send it. Or you might say, no. No. Replace it with this. You know, I reread a whole bunch of fields and, you know, I knew what it was. Because NATS, for the most part, at its core level is is payload agnostic. It doesn't understand what the payload is, but, of course, systems and applications do. But what's nice about that is that these callouts, generic callouts, you know, use that micro framework that I talked about.

You can run thousands of these for a single server. It can be pumping messages out. And that server, I haven't done the math lately, but the last numbers I remember is core raw routing processing speed is about 20,000,000 messages a second.

But let's say your auth n l z takes five milliseconds for each one. Right? Even if I had that Wasm module inside of the server, right, it's only one server. Right? And so it's running I think Wasm is single threaded still. Right? They're still working on the multithreaded and networks and all of those stuff. So we think these these zero trust generic callouts is probably gonna be the answer, and then Sanity can package up prepackage things. Right? And and say, hey. You you know, you're good to go there. And so that's kind of the way we're probably going to be dealing with that. So I guess that's our version of Lambda. But embedding stuff into the server,

it sounds like a great idea. You know, even I was considering it, even though I got burned in the nineties around some of these issues.

But when you really step back and look at mission critical production systems, they have to be deterministic.

You have to be able to triage and debug them in the production setting, and they also have to scale. Right? And so we would hate to see, you know, a NATS cluster that's 500 servers just because

that's how many they needed just to process, let's say, you know, some schema check, right, for for requests or responses or something like that. So I believe, you know, we've the team has settled largely on generic callouts being the mechanism for things like schema in in, you know, validation,

both advisory mode, but more importantly enforcement mode. One of, the Sanadia

folks, actually our chief architect, Ari, you know, he's a big fan of things like OPA, but he also believes OPA shouldn't believe inside of the NAT Server. But we should have a very clear interrupt with it and call outs could allow that. Right? So you could call out. I have a request to create a stream. I hand that thing to OPA with, you know, the auth n and auth z, which is you, Tobias. Let's say you're making the request, and Opa says, yep. He's good to go. But if I sent it, Opa comes back and says, no. Derek can't create a stream in production to do whatever he wants. That's probably the approach that that we're going to roll out, you know, hopefully, fingers crossed with with 02/12, which is, I think, sometime around the September timeframe. In your experience and history of building that,

growing it, using it both in terms of solving your own problems, but now growing it into a service and building a business around it and working with their clients, what are some of the most interesting or innovative or unexpected ways that you've seen the technology applied?

That's a good question. We have a a bunch of diverse use cases.

We most of the folks that started the company, you know, had

some affiliation with TIBCO, you know, or Apcera, the the startup I did before Sunadia. And TIBCO's products obviously were very big in the financial markets, and we do have a lot of financial customers for sure. But what took us by surprise was the adoption around manufacturing,

the connected car. So we do a lot of work around connected cars where, again, we try to eliminate the tromboning. And so you can imagine a world where the communication patterns are NATs in the factory, inside of the cloud, and inside of the vehicle, whether you're talking to the cloud or not. So if you're just talking to a service that's running in the vehicle, that's still all over NATS.

And then the last one, which we kind of knew or we thought that we could play an interesting role here, was on, the AI

inference stack at the edge. You know, again, so you can imagine the problems that are happening there, which is I need access to data to augment my prompt. You know, right now I access a model, but I might not know where it is. And now, of course, we're quickly going into a world where we're accessing multiple models, which we might not know where they are, but we need to trust them. And, conversely, those might be walking through multiple agents, you know, tooling and things like that. And so we've been doing quite a bit of work with some of the very large players in the space around inference as a service and having, you know, Sanadia and that's kind of power that that functionality,

not only from

prompt augmentation,

you know, flipping RAG from a pool model to a push model, you know, where you don't have to scale every single RAG component with your inference layer

to multiple traversals of models or or agents,

as well as data collection. And so that was kinda surprising. But we see, you know,

somebody, I think, has it running inside of a toothbrush.

Someone put it into home assistant, so they they use home assistant, but powered all by NATS and MPTT to control everything in their house. We've done a tail scale clone internally just because we could. You know what I mean? Where everything goes over NATS, but it just looks like a a a ton tap, you know, interface, and it's just you know, everything is just TCP as that last mile, pure TCP, but it's routing and stuff like that. And the cool thing about NATS is we could actually do, like, broadcast and multicast

if we wanted to because that's just naturally, you know, Hub Sub type of of stuff. We've had people plug it into Ollama so that they could send NATS messages and and get Ollama responses from a a farm of Ollama instances running different models, things like that. We have a lot of database customers that use us to

do their clustering tech. We have customers that use us to do CDC and and filter down database

from, let's say, Postgres or Cockroach to, let's say, SQLite in a browser type stuff.

Those were kind of interesting for sure. And in your experience of building this technology,

building a business around it, investing in the ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?

That's a good one.

One we already alluded to, which is the the decision to make it very approachable and easy to start with, kinda seems to be biting us now as people are trying to do very complex things, and they don't realize that, you know, complex issue systems need, a lot of expertise to understand how to set up right and to monitor correctly. You know, we have, you know, customers who come to us and say, hey. You know, we had an issue. The system's fine now, but something happened last night. And we'll look at their system. We're like, your system is not fine. It's been screaming. Because that is not quiet. It screams. But if you're not listening for all of the advisories flying around going, hey. I don't like something, you won't see it. So that was very surprising. The other one, to be honest with you,

that, again, could be a podcast on its own, is this notion that you can start a company based on

support models around open source. Right? But I don't think you can necessarily

grow a business around that.

But to start one, I made

the assumption, which I was convinced would be true, is if we are mission critical, which I would say 99%

of our customers are all mission critical. Meaning, if we go down, my my phone will go off. Right? So I put on do not disturb and everything, but my phone will still buzz if if there's a production outage with either our global cloud service or any of our our customers.

You know, I understood that part. What I didn't understand and that they would want a commercial agreement. Right? The one neck to choke or whatever you wanna call it. What I found, though, that has really been occupying a lot of my mental energy over the last eighteen months or so is that I see a very

interesting trend, which is if our incentives are to make the software as good as possible, the least amount of bugs as possible, the docs as best as possible, reference architectures as best as possible, then our potential customer set is disincentivized

to actually pay us. Right? And so we have a managed platform. Right? And we'll manage it for you. And your psychology is, oh, I'm paying them to every day look at our system and make sure it's up to date. And if there's any security things or if there's trends that they're watching and seeing, they can tell us, hey. You're gonna run out of blank resource, you know, network disk, whatever, and and we'll handle it type stuff. That psychology is very different than the psychology of I paid this amount of money and I just filed one support ticket. You know, I'm wasting my money. And it's almost a soapbox that I get on, but I I don't do very well on, but it's kind of like comparing eastern and western medicine. Western medicine, you play the doctor when you're sick. Eastern medicine is you pay them when you're well. When you're sick, they're not doing their job. Right? And so any type of OSS ecosystem where there's a single company that's driving it and they want to make a business out of said technology, not ancillary stuff like Basecamp. Right? They put a lot of effort and DHH into rails, but he made a very conscious decision that that's always free and that they're gonna build stuff on top that'll never be free. You always have to pay for it. But for any company like Saniya that started saying, hey, we're we're pushing really hard and we, you know, I think we tallied up. I think investment wise in that, it's been, I don't know, twenty five, thirty, forty million dollars to date. You know what I mean? That incentive misalignment concerns me where people go, and I've even heard rumblings of, oh, well, if you're in a foundation and and the project graduates, then perfect. That's when we just we can stop paying you. And this isn't Bob and Vinny's Pizza Shop out of Venice, California. This is Fortune 500 companies that you would know the names if I mentioned them that say, yeah. We're not gonna pay you, you know, type stuff. And I don't think enough people are looking at that problem. Now the reason they might not be is is that things like Kubernetes where all the hyperscalers have, you know, a hand in the pie, and they all make money off of other things besides that. Right? They can they can do that. But I think a lot of the really revolutionary things are companies creating an idea. They wanna make it open source because they wanna have that community engagement, but they're business. And, you know, when they start out, they're getting VC funding, but they have to flip that over where they can self sustain themselves.

And when the consumer biases, this should be free or I'm not gonna pay you once, you know, you get all the bugs, you know, worked out type stuff, that was kind of a big eye opener for me. I was like, oh, this doesn't feel like we're heading in the right direction.

And last word, and then I'll stop on that because you can tell I can go on for a while on the subject is, you know, charity is not a business model. Right? And so if you're sitting there with a donate page on your thing and you're like, that's how I'm making all my money,

you know, if you're an individual, you know, or one or two people and stuff like that, that kinda works, but I still feel pretty convinced that Verity is not a business model. And so that and the OSS incentive misalignment,

those were the biggest surprises that I've seen as Nat says I don't know how many downloads we're at to now. It's some big number,

in 400,000,000,

5 I don't know. Somewhere like that. That was the biggest surprise. The other one is that's not as important as that, but it was interesting to me was when I designed Jetstream originally, I was like, yeah. If someone goes crazy, they might have a hundred consumers

that are looking at data in the stream. And

that

proved to be very untrue. And so we have people that say, no. We wanna have hundreds of thousands of observers to the stream or even millions of of observers. And if you look at the way, you know, TIBCO's EMS did it, which which I architected or even the version the early versions of Jetstream,

consumers are heavyweight.

Right? They're they're doing a lot of things. They've got their own consensus algorithms. They're, you know, tracking a lot of state unless they're just a simple fire and forget. And so we had to adapt very quickly at some of these early Fortune 500 customers that said, what if I want 40,000

consumers on the stream? Or what if I want, you know, a million observables on a k v key? You know, how does that work? And what they did was they merely just reached for a a NATS consumer, a Jetstream consumer, which again are very heavyweight in the system, you know. Hey, I have a three node and it's running in Kubernetes and I gave it half a CPU and two gig of RAM. Why is it falling over type stuff? And so that was a little not surprising that they tried it that way. That that that made sense to us. But the fact that they

were expecting, oh, yeah. We can just throw hundreds of thousands

of consumers at at a stream,

and it should just work. And so we've worked really hard at delivering different constructs

within

the system to

do those extremely high numbers.

But right now, they feel a lot of one offs and and interactions with our customers. And so the team is slowly seeing the patterns that are resonating with everybody, and we're starting to kind of collate those up into some concrete ideas that we can put into either directly into the clients or what we call orbit, which is kind of our experimental,

client additions where they can say, I want a consumer, but I want a super lightweight consumer. You know what I mean? And it's like, oh, yeah. We can run millions of those. Whereas, I want a full blown

durable

redeliveries,

individual acts, you know, all this stuff,

consumer with dedupes and all this, you know, crazy stuff going. And they go, I want, you know, a hundred thousand of those. And it's like, yeah, you can do it. You're just gonna have to throw a lot of hardware at it. Right? There's a lot of machinery to do that. But that surprised me as well. But I guess I should have seen that one coming. And then, of course, we already mentioned the KB, where I was like, nobody is gonna use KB for anything crazy, like, to replace Redis. And I think we released an alpha version with a short video, and by the next day, someone who was saying, hey, I've got 2,800,000

keys in this thing, and it's starting to slow down. And we were like, whoops. Okay. Yo. We misread that. And so now we support, you know, hundreds of millions of keys and, you know, with big customers and stuff, but that surprised us as well.

And for people who have been listening, they said, great. NATS is gonna solve all my problems. I can just throw away all my other tech. What are the cases where NATS NATS is the wrong choice and they should stick with some other pattern or technology stack?

You know, I think when you're doing super high ingest rates, when you're taking advantage of, like, Kafka batching and and the partitioning schemes that they do again, when we actually artificially did the batching on our own, just as a, hey, or because we think we're pretty fast at a lot of stuff, but we process every single individual message. But once we batched messages and made it more of an apples to apples, I think we were actually

two x faster than Kafka, I believe. And so we're gonna go ahead and do that. But right now, you know, that doesn't make sense. If you have massive investments, you've got a team of 20 people running a Kafka cluster, yeah, it might look nice to, you know, repurpose all those people and do something cheaper,

but all of the processes and all the operational stuff around it might not make sense. If you're talking about petabytes or exabytes of data, you know what I mean, and and hundreds of servers and stuff like that with a team of 20, you know, people with a Kafka cluster, We're probably not yet the best for that, although we do have visions of of getting there. But right now, we would tell you not to do that. So those are probably the the two big ones that that I can think of off top of my head. So super fast ingest.

Again, it's not apples to apples comparison, so we understand why, you know, it looks different. But we're we're gonna solve that one. And then, for k v, if if your value proposition is not in clustering

or stable storage, you know, that's not your pinpoint. It's it's you're really taking advantage of the value semantics. Today, we would say don't replace Redis with us because right now, we look more like Memcached than Redis. But for those that do care really about clustering

agility and the ability to dynamically

create topologies,

in very random fashions and stable storage, then we do have a a bunch of customers that have

moved to to NATS for that because that was their pain point, and they're working around, hopefully, a short term while we don't necessarily do value semantics, like Redis. And as you continue to build and iterate on NATS, you've mentioned a lot of the things that you have that are forward looking and changes that you have planned. Are there any other aspects of the future trajectory of NATS or the work that you're doing at Synadia that you want to share with the listeners?

Yeah. I mean, it's it's an awesome time to to check it out, especially as edge is becoming so dominant. And if, you know, if if you're in, you know,

AI inference at the edge types of problems or in manufacturing types of problems or distribution centers or or even connected car, or connected fleet of any kind. Those we have a lot of customers in those spaces, and we've not only, I think, delivered really cool tech to those ecosystems, we've learned along with those ecosystems of the things that we needed to do differently or change around. And so I think that's great. And going forward, right, we're gonna do the key key value semantics.

So counters and lists and maps and things like that. Lots of massive massive scale. Right now, if you do event sourcing,

then you every single message has a different subject. You know, they're all unique.

We keep that embedded information in memory. Alright? And so once you get to about 40,000,000, you're using up about 20 gig of just meta information to just hold us so that when you ask for a subject, which is a key, we know which block to pull up and x y z. And so but we have plans on really

making that massively scalable. And we again, to make it apples to apples, we have notions of loosening consistency algorithms and core algorithms internally.

When we originally designed the system, we could do about a million disk writes per second in a cluster, but we had a customer that said, hey, what happens if every disk runs out of space at the same time? You have multiple ones having a memory, but they can't write it anywhere. And so at the time, we didn't have time to do the trick that I did in with TIBCO's EMS. And so we said, okay, we'll turn that part off. And so we punch through to the kernel and the disk subsystem every time, and so that brought us down about 250,000

per second, you know what I mean, that we can write. And so we're going to offer in an opt in version saying, I'm okay if it's a little bit less consistent,

but you need to be able to ingest a million messages a second in a single stream versus partitioning it out kinda like the Kafka world, right, where you've got

sometimes they've got so many partitions, and then someone says, hey, we need to add another one. And then apparently, that's that's not a fun day in Kafka land, apparently, when you have to repartition everything. Are there any other aspects of the work that you're doing at NATS, the patterns that it enables,

the adoption or implementation

that we didn't discuss yet that you'd like to cover before we close out the show?

No. I think, you know, for for closing thoughts for for the audience and the listeners is, Sanadia's approach with taking that sense of technology, a foundational technology, is to

modernize how distributed systems are built that can cross regions, cloud providers, and and most importantly, out to to edge. And that the what doesn't change. You're still designing microservices and key value stores and object stores. And, again, we don't build relational stores, but they're always a component, right, in these these whiteboard,

sessions.

But how that changes is is a big deal. And then the last one is is that we are making the transition, the greenfield to brownfield. We've already done MQTT and HTTP, but you'll see us in the next couple weeks or so release,

our connectors,

technology, which allows you to connect pretty much anything, right, out there. So we think that this is gonna be a a really big, big deal for a lot of our users and ecosystem. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

You know, I think,

data management, in my opinion so there's gonna be a lot of people that that probably don't agree with this is but it it aligns with what Snedi tries to do, which is we need a way to be able to, in

real time, decrease latency to access a service or data.

And let's let's take data, for example, in that case. Where things start to break down isn't necessarily I mean, Nats does this very easily. We just kind of gets a lot of this stuff for free. But where it becomes really, really important is is that regardless of where things are deployed, that you have a consistent and dependable security story. That if that piece of data lives in the cloud and

Tobias, you know, our host can access it but Derek can't, that rule better be the exact same if the data gets duplicated into a factory. And all of a sudden, I show up at the factory and I log in as Derek, and I go, oh, can I read this message? And she said, no. You you you can't. And so when you look at the security around data access and then you look at how things within cloud native mostly being

reliant on perimeter based security models, that's probably the biggest

mind shift. The second biggest mind shift,

that I've seen is

security companies that or sorry, security groups within companies that just say everything has to be HTTP because all of our tooling is just HTTP. That's all we know how to control. And I don't believe HTTP is bad, but I believe it's bad if you use it for everything. And again, it's like, you know, NVIDIA

using HTTP to connect their GPUs, you know, in a in a in a super cluster, you know. No one would think that's a good idea, Yet software people design these big distributed systems, and all they're using is ACDP or an ACDP derivative like gRPC, right, which is still is I have to know where you are. You know what I mean? And if you can move around,

we have to do unnatural acts to make that happen with load balancers and DNS tricks and things like that. And so, you know, security

movement with consistent security across different deployment paradigms, and I'm not talking Kubernetes versus system d versus just Docker. I'm talking about cloud, different regions, different cloud providers, different edge locations, all the way down to very, very small resource limited environments like ECUs inside of vehicles type stuff. And then HDP being the end all, be all of communications

is it it's I don't agree with it, but I'm I'm also gonna say if you're building a really small app that gets, you know, five or six, you know, requests a second, you know, okay. Fine. Where I get concerned is is that the perimeter security model folks are also moving to everything has to be HTTP, and we have a playbook for how do we think we secure that internally.

And I don't think that that's that's a good thing, or it needs it needs to be looked at again. How about that? Alright. Well, thank you very much for taking the time today to join me and talk about the work that you've been doing on NATS and all of the capabilities

that it enables. I could definitely talk to you about it all day. So, I I appreciate you taking the time and all the work that you and your team are putting into

bringing this capability into the world. So thank you again, and I hope you have a good rest of your day. Thanks, Tobias. Thanks for having me on. I

appreciate

it.

Thank you for listening, and don't forget to check out our other shows. Podcast.net

covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.