Summary
Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.
- Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Decodable is and the story behind it?
- What are the notable changes to the Decodable platform since we last spoke? (October 2021)
- What are the industry shifts that have influenced the product direction?
- What are the problems that customers are trying to solve when they come to Decodable?
- When you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL?
- What are the developer experience challenges that are particular to working with streaming data?
- How have you worked to address that in the Decodable platform and interfaces?
- As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced?
- What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
- When is Decodable the wrong choice?
- What do you have planned for the future of Decodable?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Decodable
- Understanding the Apache Flink Journey
- Flink
- Debezium
- Kafka
- Redpanda
- Kinesis
- PostgreSQL
- Snowflake
- Databricks
- StarTree
- Pinot
- Rockset
- Druid
- InfluxDB
- Samza
- Storm
- Pulsar
- ksqlDB
- dbt
- GitHub Actions
- Airbyte
- Singer
- Splunk
- Outbox Pattern
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png) NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks: - Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript - Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI) - Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation) Don’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to [Neo4j.com/NODES](https://Neo4j.com/NODES) today to see the full agenda and register!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today, I'm interviewing Eric Sammer about starting your stream processing journey with Decodable. So, Eric, first off, welcome back to the show. Good to have you on. And for folks who haven't listened to your previous appearance, can you give a bit of an introduction?
[00:01:38] Unknown:
Sure. Thanks so much again for having me. So I'm the CEO and founder here at Decodable. We're a stream processing platform based on Apache Flink
[00:01:46] Unknown:
and Debezium for change data capture. And do you remember how you first got started working in data?
[00:01:51] Unknown:
You know, maybe. I have to think. I have to go into the wayback machine on this. Yeah. So I think I very early got started working on database systems and data platforms and infrastructure. You know, I've been doing this for 25, you know, probably creeping up on 30 years. The place where it sort of really switched over for me was when I went from building internal systems at places like Experian and various other organizations to working at Cloudera, which I joined in late 2009, early 2010. So I was a pretty early employee there and worked on a bunch of what I would lovingly call the 1st generation big data stuff with Hadoop and, you know, eventually things like Spark and Kafka, things like that. And so that's probably where I really cut my teeth.
Ever since then, you know, 2010, I've been more focused on building platforms for other people as sort of a vendor. That's probably the only way to say it.
[00:02:48] Unknown:
And so bringing us to decodable itself, we had you on, looks like in October of 2021 early in your journey. But for folks who aren't familiar with it, can you give a bit of an overview about what Decodable is, the story behind it, and why you decided that this is where you wanted to spend your time and energy?
[00:03:08] Unknown:
Yeah. I mean, the last part of that is probably the funniest. I would rather never spend my time on this, but I hate this problem so much that I am determined to solve it or die trying. So, I mean, Decodable is, you know, a relatively sort of simple concept. I mean, there's a bunch of different sources of data. There's an increasing, especially on the destination side, an increasing number of destinations of data. And in our world, we think primarily on the source side about event streaming platforms like Kafka and Redpanda, Kinesis, GCP Pub/Sub, things like that. Operational databases like Postgres, MySQL, Mongo, Oracle, SQL Server, those kinds of systems. And then on the destination side, analytic database systems, so all of the same systems on the destination side that I just mentioned on the source side, plus the analytic database systems, cloud data warehouse, data lake systems, Snowflake, Databricks, even things like S3 with, like, Parquet data and those kinds of things. And increasingly, there's, like, a long tail of more specialized data infrastructure, and we would include, like, the real-time OLAP systems like StarTree and Imply or Apache Druid, Apache Pinot, those kinds of systems, Rockset, so on and so forth, as well as things like Elastic and, you know, full text search systems, you know, InfluxDB, sort of like telemetry, like metric stores, and those kinds of things. And so Decodable really exists to sort of sit in between those source and destination systems, including microservices that are hanging off of Kafka topics, and be able to process data in real time, filter, join, enrich, aggregate, sort of all the verbs that people would use either in SQL or through, you know, Java APIs. And so the open source projects that we're based on, again, primarily Apache Flink on the stream processing side, Debezium on the change data capture side, so we're built on top of those 2 systems. And so that's the set of APIs and capabilities that we're exposing through the commercial product, you know, that is Decodable. The genesis of the company is, you know, also probably not surprising. It's just that, like, as the source systems grow and get more complex, as the destination systems grow and get more complex, and the processing in between them, and especially the fact that the data is coming from operational systems and driving other operational systems, including microservices, including caches like Redis, you know, full text search systems, but also the analytical database systems, Snowflake, Databricks, so on and so forth. I think that the tooling that sits between them has historically been super powerful, really robust, but quite frankly, really hard to use and reason about, whether it's how do I reprocess data, how do I handle schema migrations, how do I make schema changes, what's the right way to think about data quality and observability, what's the right way to build and deploy those pipelines as part of applications and CI/CD processes. And so, you know, we really wanted to solve this problem and make it relatively simple, or as simple as possible, to be able to process the data, to add new sources and destinations.
That's a long explanation. But, fundamentally, like, that's the space in which we in which we play. And like I said, I think that this problem is just far more complicated than it should be, and I will die trying, you know, to drain the complexity out of this problem. It's, it's just it doesn't need to be as complex, you know, as it is.
[00:06:47] Unknown:
And to that point of complexity, streaming systems started to become in vogue, I wanna say, maybe creeping up on 10 years ago now, as a lot of the so called hyperscalers were getting to that point of hyperscale where they said, oh, we need to be able to start processing all of this data in real time because all of our batch jobs are taking a week to complete. So I know Twitter was one of the early ones with, I don't remember if they were the ones for Storm or Samza, but I remember, like, hearing about Storm, Samza, Flink was a little bit later on, Spark Streaming. You know, there were dozens of streaming systems. Now it seems like it's coalesced mostly down to a handful of 2 or 3.
Within the overall space of complexity in streaming, you hear a lot about things like checkpointing, windowing, at-least-once consistency, exactly-once semantics, challenges of, kind of, late-arriving data. Wondering, given the fact that we've been dealing with these problems for the better part of a decade now, how many of them are still something that somebody coming into the space really needs to understand still? How much of it are you still dealing with in building Decodable, and how much of it has been solved through just kind of force of will or new updates and understanding to how we can address these systems in a more
[00:08:05] Unknown:
sane fashion? I mean, that's the spiciest of spicy questions. Because, I mean, that's the complexity, I think, that we're talking about, all the things that you mentioned. So I think that that's right on the money. And I'll be really honest with you. I think that we've we've handled some of those things. So there's quite a bit of configurability in this is in these systems. And in fact, you're talking to Robert Metzger, the the PMC chair of Flink. He's he's 1 of our engineers, and he and I were talking about this problem. And he's like, you gotta understand that, like, Flink was built in a way that allows quite a bit of tuning where you know a lot about the workload.
And if you compare that to something like Postgres, where, like, you just fire queries at these things and, you know, and, like, yes, there's there's tuning quite a bit of tuning you could do on Postgres, but for the most part, it just kind of does what you expect it to do. So I think I think that's the right question. So a lot of what we do is trying to, you know, solve exactly those issues. So some of them, things like checkpointing and memory tuning and buffers and and and sort of configuring at least once versus exactly once and getting the semantics right and the retention and the timeouts. Those kinds of things, I think that we've actually done a pretty good job, you know, here at Decodable, and I think it is possible to paper over a lot of that complexity. Some of the challenges around late arriving data and state management in terms of, like, replay and, like, what happens if you have bad bad data and, like, how do you sort of compensate for that? You know, those things, I think, are intrinsic to data engineering.
They're not even strictly stream processing problems, although they do tend to rear themselves in stream processing in ways that you probably don't encounter as much in batch. So I think that there's definitely some of that. And there, it's not so much about making that problem go away. Because like I said, I think it is, like, an intrinsic thing that you have to think about, but I think you can build workflows and processes and systems around that to at least guide people down the path of making sane decisions. Right? So, like, if you care about late-arriving data, like, there's actually a way of presenting the concept of, like, watermarks and, like, all this deep complexity to somebody in a way that, like, is more intuitive than what, I think, might come out of the box with some of these projects. And, again, not for lack of trying, these things are incredibly sophisticated and complicated and have to handle this, like, bevy of use cases that vary in their needs with respect to timing and state and, you know, late-arriving data and all these other kinds of things. So I think that we've actually come quite a long way, especially in the Flink ecosystem where I think things are starting to really coalesce just in the last couple of years. I mean, Flink has been around, you know, as a research project, I think, since, like, 2010.
I think in 2014, it actually became the open source project, Apache Flink, you know, from the research project. But, like, Samza out of LinkedIn and Storm, which I think primarily came out of Twitter, there's been so many systems that I think have carved a wonderful path here, but I really think that the industry is coalescing around Flink as being sort of the de facto winner, you know, maybe the way that Kafka has. I don't know that I'm allowed to say that out loud, but, like, you know, maybe there's a parallel there.
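For readers new to the concepts mentioned above, here is a small, hedged sketch of how event time, watermarks (the mechanism for tolerating late-arriving data), and windowing are typically expressed in Flink SQL. The table name, columns, and the five-second lateness bound are illustrative assumptions, not anything specific to Decodable.

```sql
-- Hypothetical Flink SQL: declare event time and a watermark, then aggregate
-- over one-minute tumbling windows. Events arriving more than 5 seconds
-- behind the watermark are treated as late for their window by default.
CREATE TABLE clicks (
  user_id     STRING,
  url         STRING,
  event_time  TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen'   -- stand-in source used only for illustration
);

SELECT
  user_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS clicks_per_minute
FROM clicks
GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```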
[00:11:26] Unknown:
With all of that complexity, with all of the different systems that are and were necessary to be able to even think about addressing these streaming use cases, that raised the barrier to entry quite substantially so that most people said, oh, streaming would be great, and then they started to dig into the problem and said, oh, heck no. I'm not gonna deal with that. I'll just deal with maybe smaller batch sizes and figure it out. And now that we do have platforms like Decodable and many others focusing on different aspects of this streaming ecosystem, how does that influence the types of problems, the types of businesses that are actually starting to implement streaming data as a core capability of their business?
[00:12:11] Unknown:
It's a great question. I think we certainly are seeing it objectively more and more. There are companies, you know, just even a couple of years ago who said, like, we don't have a streaming use case that have, like, come around to the idea, like logistics, for instance. Like, I the joke I always make with the team is, like, nobody really thought that they needed to know exactly where their pizza or Chinese food was until, like, Grubhub and DoorDash and, like, all these other companies started, like, showing you the little dot on your on your mobile device. And now customers just expect that. So I think the use cases have really pushed a bunch of people into thinking about these things, especially in retail, Fintech, logistics, inclusive of, like, these delivery services and those kinds of things, and gaming. You know, systems like Fortnite or companies like Epic Games and stuff like that have done quite a bit with real time processing based on on things like Flink.
I think the first step for a lot of people was adopting an event streaming system like Kafka. Right? And I think that, like, when you look at, like, the Fortune 500 or even, like, the Global 5,000 or however you wanna slice the market, I think, like, most people have done that. Well, maybe that's an exaggeration, but, like, most of the Fortune 500 have done that for sure. You know, obviously, companies like Confluent and AWS have really popularized the notion of event streaming in general. And then, like, we look at that the way I think people look at, like, S3. Right? Like, that's the primitive that enables a whole bunch of other things to follow, and I think stream processing is the natural thing that follows that. You know, first I move the data, then I might be able to process that data without actually having to write microservices that listen or produce directly to those Kafka topics, which is a natural extension. And I think that we're sort of at that phase now where people have, like, figured out event streaming in general, and now they're like, oh, yeah. This handles ingesting to analytical database systems. This handles connectivity between microservices.
And, like, now they're starting to get to, like, okay, but, like, it doesn't connect to everything I need it to connect to, or it doesn't handle change data capture versus append-only streams. And a lot of the things that need to talk to each other don't speak the same language in terms of, like, the payload or the semantics, or there's governance and compliance stuff. Like, that service isn't allowed to have PII data, but this one produces it. But they still need to talk. So, like, how do you deal with that? Well, the answer is stream processing. Like, that's where that fits. So, use cases like that, I think, have pushed people into it. And so, you know, I think it's a little bit of a renaissance, like a second renaissance.
We just came out of Confluent Current last week, or a week and a half ago. There's, like, 2,500 people, and it's like all the stalwarts. It's every company you could ever imagine, you know, in that Fortune 500 space and some of the sort of higher end of what we think of as, like, the more sophisticated mid market. I don't think we're there yet in the sense of, like, it's nowhere near as robust a market or a technology stack as, like, the data warehouse. You know? People go, like, okay. Like, it's either Snowflake or it's Teradata or it's BigQuery or something like that. I don't think stream processing is quite there yet, but I think it's on its way to that. And we're very lucky to have been I feel like we built Decodable at, like, exactly the right time where we're sort of cloud only, and, like, we're able to really take advantage of, you know, kind of where people are at and meeting them where they are.
[00:15:39] Unknown:
And an interesting thread to pull on as well is you have appropriately used the right semantic terms for separating these 2 concepts of streaming that will often get muddled by people who are less aware of the space: event streaming and stream processing. And I'm wondering if we can maybe clearly draw the line between those 2 because particularly with things like Kafka or with Pulsar, where they have some measure of transformation available, or Redpanda in particular with their inline WASM modules being able to do some measure of transformation on events as they get, you know, entered into a topic or move between topics, as opposed to what you would do with a Flink. Wondering if we can maybe kind of make that line a little bit brighter for people who are confused.
[00:16:24] Unknown:
Yeah. I mean, I think you're absolutely right. So when we say event streaming, and I understand the language is confusing on this. I kinda wish that these 2 phrases had fewer words in common. Right? You know? But when we say event streaming, what we really mean is the durable storage and movement of data in real time. And so some of these projects, like you mentioned, like Redpanda with its WASM support or Pulsar with its Pulsar Functions, you know, have different capabilities that are sort of either squished together in one box or, in certain cases, are sort of effectively 2 different boxes. So I think that further makes things mushy for people. So we think about it as storage and durable storage and movement of data, and then we think about the processing and connectivity of that data. And connectivity, of course, being, like, connect to my source systems, connect to my destination systems, we can kinda get it onto and off of, like, a Kafka topic or something like that. So, like, using the Kafka ecosystem as an example, storage and movement is the Kafka broker.
The connectivity would be Kafka Connect. And then the processing, that's where it starts to get a little bit sort of interesting. You know, there's Kafka Streams. There's ksqlDB. There's obviously Flink. Our, you know, weapon of choice in this at Decodable is Apache Flink. And there's other systems that you can sort of pair with that. And so that's the way we think about it. And I think I don't know that these are proper nouns, event streaming, event processing, or stream processing. Some people, like you said, sort of abbreviate that as saying, like, Kafka, and, like, to them, that means everything. To me, that means the broker. Right? So, like, you know, there's certainly some squishiness in there, but I think that that's the right way to think about it because there are cases where, like, we love Kafka, the broker. We really do. In terms of connectivity, I think that, you know, quite frankly, like, you know, I'll just say the quiet part out loud. Like, you know, I think that the way we do connectivity, which is actually based on Apache Flink, has advantages over doing that with Kafka Connect. I think that for processing, Flink has advantages over Kafka Streams. I know there are friends of mine who will throw rocks at me for saying that, you know, who are sort of diehard Kafka Streams people, which I appreciate. You know, we could debate that all day, but I actually think that there are cases where it makes sense to think about those things as individual blocks that you can sort of stitch together.
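As a rough illustration of that separation of concerns, here is a hedged Flink SQL sketch in which the Kafka topic provides the storage and movement, the table declarations play the connectivity role, and the query itself is the stream processing. Topic names, fields, and broker addresses are made up for the example, and the exact dialect a managed platform exposes may differ.

```sql
-- Connectivity: a source backed by a Kafka topic (the storage/movement layer)
CREATE TABLE orders (
  order_id    STRING,
  customer_id STRING,
  amount      DECIMAL(10, 2),
  order_time  TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',                           -- hypothetical topic
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Connectivity: a sink backed by another Kafka topic
CREATE TABLE large_orders (
  order_id    STRING,
  customer_id STRING,
  amount      DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'large-orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);

-- Processing: the actual stream processing logic between the two topics
INSERT INTO large_orders
SELECT order_id, customer_id, amount
FROM orders
WHERE amount > 1000;
```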
[00:19:06] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. And in the product journey that you have taken at Decodable since we last spoke in October of 2021, I'm wondering how the adoption curve has influenced your thinking around your product focus of how to reduce the level of upfront complexity that people need to address to start using these technologies and experimenting with these techniques? And some of the ways that the overall industry trends have also factored into the ways that you think about the solution space that you're targeting?
[00:20:21] Unknown:
Yeah. Absolutely. I mean, so there's a couple of things here that I think have been, you know, just things we've learned and, like, you know, I don't know. Maybe somebody would tell me I shouldn't do this, but I'm gonna tell you all the stuff that we got wrong, you know, and that we've, like, since fixed since 2021, not just, like, in our product, but, like, in our understanding of the space. So, like, you know, in 2021, just for context, Decodable was about 8 months old. We had just raised our Series A. We're now 2 and a half years in and have spent quite a bit of additional time with, you know, customers and things like that. So a couple of things. One, I think, initially, we thought that this space, the stream processing space in general, was at the point where it was really going probably more mainstream than it really was. Like, we thought it was further along, and so we kind of went, like, well, obviously, the right thing to do is focus on SQL because everybody knows it and, like, you know, that was the right way to think about it. And I think we were correct, but incomplete. You know? So, like, one of the things that I think we have learned since then is that when we talk to customers, it's about 70%, 80%, 90% of workloads that can be expressed in SQL. Because, like, a lot of what people do with stream processing is, like, I just need to, like, route this stuff, or I need to knock out some PII data, or I need to filter records for just the successful HTTP events. Like, whatever it is, it's, like, really simple stuff that is, like, a one-line SQL statement that, like, effectively replaces, like, a whole, like, ream of, like, Java code and, like, all these other kinds of things. It's one less service to explicitly, like, build and instrument and monitor. Like, you can sort of push that to a vendor, and, like, that has sort of advantages. And then there's, like, this 10, 20% of use cases that are just different. And they're different for a few reasons. One, purely from an expressiveness perspective, SQL is just hard, and that's typically because there's, like, more sophisticated state management. Really, really complex sessionization use cases or, excuse me, data enrichment use cases, for instance, can sometimes require very specific kinds of, like, time management and window functions and state management.
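As a concrete illustration of the "one-line SQL statement" category described here (routing, PII masking, filtering), here is a hedged sketch in generic streaming SQL. The stream names, columns, and masking rule are hypothetical.

```sql
-- Hypothetical: keep only successful HTTP events and knock out a PII field
-- on the way from a raw stream to a cleaned stream.
INSERT INTO http_events_clean
SELECT
  event_time,
  request_path,
  status_code,
  REGEXP_REPLACE(client_ip, '\d+$', 'xxx') AS client_ip_masked  -- crude PII masking
FROM http_events_raw
WHERE status_code BETWEEN 200 AND 299;   -- filter to successful requests
```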
There are cases where people have to reach into third party libraries, and I think as, like, LLMs and AI and, like, things that are above my pay grade are sort of, like, all the rage, I think those kinds of things, like, there aren't SQL functions for them or they're difficult to express in SQL for various reasons, or they have deeply, deeply, deeply nested data that is complex to think about in the relational model for a variety of reasons. You know, some of those things could probably be SQL, but it's more natural for people to think about it as imperative code. And so people just, like, were like, we have to write code. So we had to wind up exposing full support for the Flink APIs, and that allowed at least 50% of the people that we were talking to who said, we love what you're doing, but we can't use you because you're SQL only, to be able to use us. And, like, maybe that's internal secret sauce at Decodable, but that was, like, a major thing that, like, in retrospect, you're like, yeah. Of course. But the market just wasn't so far along that, like, everything was SQL. And so, like, I truly believe that you do need both. Flink has, of course, recognized this from very early on. We wound up exposing that functionality and sort of building support around that in a way we think is sort of super nice to use. But that was one big thing. The other big trend is this issue of, especially as more of the older or highly regulated businesses or just more complex businesses move to cloud with their data platforms, not just, like, their application stack, but their data platforms, I think this question of, like, can I actually ship all of my event data to a third party vendor to be processed and then have it returned to me? Some companies, banks, insurance companies, health care providers just, like, sort of, like, recoil when you're like, yeah. Just, like, give us all your data. Unless you're, like, a Snowflake, unless you're, like, a Salesforce or somebody like that. But, like, let's be honest.
You know, Decodable is not a Snowflake at this stage. Right? And so, like, the level of trust we enjoy with customers is probably not as robust as, like, a Snowflake or somebody like that. And so I think that customers rightfully want to be able to maintain control of their data. And so our answer to this, and Redpanda has done something similar and a couple of other vendors in the space have done something similar, is just, like, a bring your own cloud model where we cut the platform into a control plane and a data plane, and the control plane only carries, you know, command and control messages, and the data plane is the part that actually processes data and touches customer infrastructure. And the data plane runs inside the customer's cloud accounts. So it's still cloud. It still has a lot of the features of a managed offering because we're able to collaborate on, like, the management of that, but it is resident within the customer's account. And so, like, they have full control over the data and the processing that happens there. And I think that I don't know whether there's some debate raging online right now about, like, whether or not that's, like, the future of cloud services.
Frankly, some people prefer one over the other. And I think that that was an interesting thing that we've learned. I think data infrastructure in the cloud in particular has a high wall to climb until you are as mature and well developed
[00:25:54] Unknown:
as a Snowflake. Sorry. That's a super long answer. Hopefully, that makes sense. No. That makes perfect sense. That's great. I love the detail. And as to the types of problems that people are trying to solve when they come to Decodable. You mentioned that you targeted SQL. You said, this will be great. Everybody will know SQL. This will solve everybody's problems. Oh, shoot. They actually wanna program. And I'm curious what the process looked like from that realization to the point where you said, okay. Here you go. I've given you what you want, and some of the minefields that you had to traverse on that journey. So I mean, the issue,
[00:26:30] Unknown:
in that particular instance is not so much, like, how do we support it? Because like I said, we're sort of standing on the shoulders of giants here with Flink. So, like, Flink already has the super robust DataStream API and Table API that are different levels of abstraction that are, quite frankly, like, really ergonomic and nice to use for most use cases, maybe for all use cases. I'm hedging a little bit because there's gonna be somebody who's like, well, I hate it. Okay. Fine. Like, you know, but I think that Flink already has a lot of this capability. The issue for us was, like, how do we offer it in a way that is safe? Because, like, the real issue for a cloud provider is that, like, that is untrusted third party code. Right? You basically have customers uploading, like, arbitrary code to a cloud service and saying, run this for me. And for us to build a service around this, and we've even talked to people internally building data platforms who are like, I might trust the users not to do sort of adversarial things, but we do worry about supply chain attacks. We do worry about, like, exfiltration of sensitive information.
Or more often, we worry about that code just, like, chewing up arbitrary resources in a way that, like, impacts, like, other workloads inside of, like, the bank or the, you know, the retailer or whatever it might be. So I think resource management and then, like, security and safety and, basically, isolation. Resource isolation is the biggest challenge there. And so we spent so much time trying to find the safe, correct, performant, isolated, but still cost efficient way to do that, and wound up having to thread a pretty rough needle on that. We think that we've actually done the right thing in that, like, foremost is the security and safety. And so, like, we spent a lot of time and effort basically trying to figure out how to do that inside of Decodable.
I'll just say that, like I won't go into a ton of detail. I'll just say that, like, if you upload an existing Apache Flink job to Decodable, it effectively gets isolated hardware on an isolated subsegment of a network. Right? And all that infrastructure is dynamic. And we, you know, we had to work pretty hard to find a reasonably efficient and cost effective way for us to do that. In the BYOC model, it's a little bit easier because that's actually the part that's running on the customer's infrastructure. And so, like, it's easy for them to have, like, isolated infrastructure because it's single tenant by definition.
But, basically, user code inside of the managed, fully managed version of Decodable is also a single tenant for that job. And so there's there's quite a bit of infrastructure that goes into that. And then the resource isolation from a, like, an efficiency perspective, like, if you're doing it from a security standpoint, you're kind of, like, also doing it from a just like an isolation like, a workload isolation perspective. But that was, like, the biggest nut for us to crack by far. And, and then there's, like, secondary stuff, which is, like, what if people touch parts of Flink that we sort of have to control and manage for a variety of reasons for, like, observability infrastructure and those kinds of things. Those are those are secondary and probably easier to solve. They're quite a bit of work, but, like, conceptually easier to solve.
But, man, that that that user provided code thing is just, you know, it's rough to get right. But once you get it right, it is so, so exciting. Because, like, the UX on it is is basically the same as the SQL UX. Like, you upload a chunk of SQL, you upload a JAR file that's built against the open source APIs, and, like, magic happens. Right? Like, it just works. And, like, that has actually been really, really nice to watch customers be able to take preexisting stuff, and without making, like, any code changes, without even recompiling, just kind of go, like, run this. And it sort of gets all of the managed goodness, you know, that sort of comes from that. But, man, is it a lot of work?
[00:30:29] Unknown:
Well, I mean, at least they actually have the deliverable of a JAR file that you can plug in. Imagine if it was PHP or Python.
[00:30:37] Unknown:
Oh my goodness. You know, I mean, so, you know, this is a place, I think, of more deep research on our part. We talk a lot about expanding the language support. So, like, Flink, for better and worse, is Java. You know, it enjoys a rich ecosystem of tooling and compatibility, and the runtime performance is arguably better than it should be. I'll be honest with you. Modern Java is, like, incredible. Sure. Somebody is like, wow. I could hand code it in Rust better. Yeah. Yeah. Yeah. You know, I agree. I'm a C++ and Rust person myself, but, like, you know, yes. But, you know, it sort of, like, strikes the right balance.
But, like, opening Flink and those capabilities to languages like Python, Go, JavaScript, TypeScript is actually real, you know, maybe even PHP. No, I'm not biased. But anything is possible. But, I mean, I think those kinds of things are actually really important because, certainly, for data platforms at larger organizations, you're typically not homogeneous on language. Right? You know, polyglot programming is like a whole thing these days. And so I think that being able to support that is something that we know we need to improve, not just within Decodable, but the larger, I think, Flink ecosystem. And I think other people in that community probably would agree with that. Python support is okay. It needs to be better. You know, it really does. To that point of the experience for people who are working with Decodable, you mentioned upload some SQL or upload a JAR. I'm just wondering if you can talk through the overall surface area of developer experience
[00:32:20] Unknown:
and the onboarding process, the concepts people need to be aware of and factor into their work as they're using decodable and some of the challenges that you've had as far as how to improve the overall experience and reduce points of friction in their journey?
[00:32:37] Unknown:
Yeah. I think this is, again, 1 of those areas where I don't think the work is ever done. You know? Systems like Postgres have, like, existed forever. And so the UX on that or the DX on that is the developer experience is is relatively sophisticated, excuse me, and well known. There's patterns around it. And I think stream processing, because it is effectively an integration technology. It's all about connecting to sort of upstream systems, downstream systems, and it has the overhead of effectively being, like, a a query engine. Like, you know, it's not a database per se, but, like, it has all the hallmarks of the query engine portion of of the database system.
Arbitrary workload definition, those kinds of things. There's quite a bit of sophistication and complexity in that. So inside of Decodable, we've tried to distill this down to, like, as few concepts for somebody to learn as possible. So we think about the world in terms of connections to things, streams that are produced or consumed by connections, and then what we call pipelines, you know, for lack of a better word, that actually process data in between those streams. And then, sort of, like, that allows us to then compose those 3 core primitives into effectively sequences or DAGs of, like, this connection goes to this stream, which feeds these 5 pipelines, which feed maybe these other pipelines, which feed these other connections, and so on and so forth, and give people the ability to effectively design the end-to-end data flow from, you know, arbitrary source systems to destinations.
But there's, like, issues of discoverability and impact analysis. If I change this connection, what does it mess with downstream? There's, I have malformed data that's, like, stuck in this one part of the DAG. Like, what is the implication of that? Helping people to, like, navigate that and to discover, diagnose, trace, whatever the verb is, I think, is always laden with a certain amount of friction because you sometimes wanna think about the end-to-end data flow and infrastructure through different lenses. One is, like, what processing supports this microservice or, like, you know, flows into this Snowflake table. There's how is this data used, which is more of, like, a source side concern.
There is the dependency analysis portion of this that is so difficult. So we've actually tried to create this, like, for every object, what's upstream of it, what's downstream of it, and, like, basically, lineage information to be able to walk the graph as you sort of go through this, and then given people tooling, command line, API, UI, dbt. So there's, like, lots of different interfaces to this to be able to operate on different resources within that graph. And so, like, if you're a SQL person and you're thinking about using Decodable to get data from, like, Postgres to Snowflake with a bunch of transformation in the middle to, like, cleanse it or to, like, better structure it to flow into Snowflake so that you're not sort of burning warehouse credits, you know, to be able to do things, you're probably thinking in, like, a dbt and SQL mode. And in that case, you wanna be able to fit into the existing workflow of dbt. You don't wanna rip somebody out of that and try and teach them a different way of working. So thinking about it from the perspective of, like, jobs to be done and sort of, like, the tooling, the different personas, which are really different flavors of the same persona. They're, like, roughly someone operating in, like, a data engineering or application, like, back end developer type role, but they might be, like, coming at it from the lens of data warehousing or microservice development. Like, you know, like, there's, like, different dimensions on that. And I think as much as we can fit to someone's existing workflow, that's our goal. So, like, we have to offer APIs for absolutely everything. We have to offer command line tools to, like, script CI/CD, you know, stuff that happens in GitHub Actions and, you know, Harness and, like, all these other kinds of systems.
We have to provide dbt tooling for the data warehousing flavor of data engineer. We have to provide a UI to allow people to very quickly sort of navigate and understand the way that they would, like, in the AWS console. Right? Like and, like, let's be honest. There's probably people who do more inside of those UIs and are like, I know I should be doing this in Terraform, but, like, I'm not. And, like, you know, someone's gonna catch me. But I think it is the right environment to be able to, like, develop and experiment and try things, and then you kinda go back and codify it once you know that it works. And so I think we still have tons of work to do on this, but I think we're starting, you know, like, we're down a path where a lot of people are able to be pretty productive in short order. You know?
But, lots of work left to do there for sure.
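To ground the connection / stream / pipeline vocabulary above, here is a hedged sketch of how two chained pipeline statements might look in SQL, where each INSERT reads from one named stream and feeds another. The stream names are hypothetical, and the surrounding connection definitions (the actual source and sink systems) are assumed to be configured separately.

```sql
-- Pipeline 1: cleanse the raw stream coming in from a source connection.
INSERT INTO orders_clean
SELECT
  order_id,
  customer_id,
  CAST(amount AS DECIMAL(10, 2)) AS amount,
  order_time
FROM orders_raw
WHERE order_id IS NOT NULL;

-- Pipeline 2: aggregate the cleansed stream; a sink connection downstream
-- could deliver this to a warehouse table or another topic.
INSERT INTO revenue_by_customer
SELECT customer_id, SUM(amount) AS total_amount
FROM orders_clean
GROUP BY customer_id;
```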
[00:37:52] Unknown:
As more people start using AI for projects, 2 things are clear. It's a rapidly advancing field, and it's tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI powered apps. Attend the dev and ML talks at Nodes 2023, a free online conference on October 26th featuring some of the brightest minds in tech. Check out the agenda and register today at neo4j.com/nodes. That's n e o, the number 4, j.com/nodes.
And to the point of the APIs integrating with people's existing workflows, and also the much maligned ClickOps approach of just do it in the UI, that also brings up the question of, I guess, the control surfaces of where do you want people to be thinking about how to interact with Decodable? Should it be fully managed via their data orchestration system? Does Decodable want to own the whole interaction? You know, should Decodable be something that people don't even know exists and it just plugs away and does its thing? I'm just curious how the thinking around that also influences the ways that you think about how to present Decodable and the types of control capabilities that you want native to the platform versus just exposing those capabilities to the other orchestration systems that people wanna bring to bear?
[00:39:34] Unknown:
You know, I think it's a really good question. And this is a place I think where streaming is a little bit different from the batch side. So, like, we don't necessarily think in terms of orchestration because these things, although I suppose there's a life cycle. Right? Like, you start a connection, you start a pipeline, it runs forever, until you have to, like, update it, you know, or something like that with, like, a new version of the code or something. But, I mean, so the orchestration doesn't look like an Airflow. The control surface, to your point, doesn't look like an Airflow. It looks a whole lot more like a code deploy. So, like, we have customers today who are like, we think about this as being something we do through Terraform, you know, and, like, that's the way in which they manage or want to manage connections and pipelines and those kinds of things. We also have customers who think about this as, like, I'll be honest with you. Like, I think it's part of, like, scripting, like, inside of a Makefile. Like, I know at least a couple of customers that run, like, make pipelines, and, like, what it actually does is call our CLI tool to, like, go, like, you know, sort of provision a bunch of things. And then somewhere, I shouldn't rat people out, there's definitely a ClickOps demographic.
You know what I mean? Like, I think there are people who are like, look, man. Like, sure. I should do all that, but, like, what if I just run decodable create stream, you know, decodable create connection, you know, from the command line, or I go into the UI and I, like, plug in my Kafka parameters and click start. You know? And, like, there are people who do it. And then, typically, they do go back and, like, they figure out how to do it. And then there are people who have intensely sophisticated internal infrastructure that is part of their own sort of deployment and development stuff, development processes.
And they just directly touch REST APIs. They really do. And so for some people, their developers are using Decodable and they don't even know. So, like, they have this: I define some SQL and some connection information inside of a YAML file, and then, like, that gets committed to, like, Git or GitHub, and something wakes up and takes that YAML file, parses it, and then makes a bunch of API calls to Decodable, which is plus or minus what our CLIs do, give or take. But I think that the important thing for us is that it almost doesn't matter. I mean, we have an opinion about sort of, like, what we think are, like, acceptable ways to run this stuff in production. But, I mean, like, I don't know if you've ever tried to tell a bank how they should do code deploys, but, like, let me tell you, like, it is a losing proposition.
I think as vendors, like, we should have a vision. We should have a way of thinking about things. So, like, if you don't have an opinion, we'll tell you, like, well, here's, like, a good way to do it. But, I mean, you have to fit the internal platforms that people have built on top of you within a bank, an insurance company, a retailer. Otherwise, you don't get to sell to them. Right? You just don't. If you're like, well, you have to do it this way, your opinion does not matter, you know. And I think, I mean, I think that that to me is developer centricity and empathy. It's, like, asking, like, how do you guys wanna see this? Like, and if you have no opinion, then, like, here's 3 things we see customers do. You can pick one. But, you know, I think we constantly get feedback around that. I would say that most people use some portion of the UI during the development process, some portion of the CLI for deployment and things like that, and then maybe they use it directly or maybe they build tooling on top of it. You know? But, like, it's approximately like that. And then, like I said, dbt is kind of like a special workflow, and, you know, Terraform is probably a special workflow. But those things are, like, approximately built, they are built on top of our APIs. Right? So, like, that's not, like, really that special, but, like, it's a workflow. It's a lifestyle. It's not just the technology. It's a lifestyle.
[00:43:43] Unknown:
Yeah. It's funny how they grow to be that way. And another element that we haven't dug into yet, but that keeps popping up throughout, is these connections where you have the source systems you're pulling data from. You have the destination systems you're pushing data into. And earlier on, that was, I have a lot of source systems. I have one destination system. So all of the complexity goes into the destination system, and I just make all the source systems look mostly the same. That's the approach of things like Airbyte, Singer, etcetera. Now that set of destination systems keeps growing, and then the idea of reverse ETL just ballooned it even further, and I'm wondering how you're thinking about how to make that a manageable problem without having to just spend all of your engineering time on writing connectors for all the different source and destination systems, and then the whole rest of the system just stagnates.
[00:44:33] Unknown:
Yeah. No. I I think anytime there's connectors you know, prior to starting in Decodable, I was at Splunk, and Splunk has, like, this incredible long tail of connectors for sort of various various systems. And so I think anytime you're building connectors, you have to be careful of the long tail problem. So I think that, like, we look for points of leverage in this, and we think customers should look for points of leverage in this. There are natural aggregators, and what I mean by that are sort of places where things natively integrate and get sort of first class support. And I think that there are 2, maybe 3 natural points of gravity in the data platform.
The operational database, the event streaming layer, and the data warehouse. Right? Those are your 3 natural aggregators. There are an increasing number of systems, obviously, that support systems like Snowflake and things like that. There's obviously a bunch of things that directly support systems like Postgres, you know, MySQL, Mongo, Cassandra, whatever your flavor is. And there are increasingly a bunch of things that support Kafka, or Kafka, Pulsar, Kinesis, whatever, as natural aggregation points. Like, as an example, getting, like, the equivalent of NetFlow data out of something like AWS, you can get that on a Kinesis stream. Right? And, like, you know, or, like, security audit logs, you can naturally get, like, in a, you know, in a Kinesis stream. And so, like, we think that those things are sort of natural aggregators. I should also include object stores, S3 and GCS and those kinds of things. And so I think to whatever degree we can, we spend most of our time thinking about the natural aggregators and then how to get data in and out of those natural aggregators.
Then some of those are natural source systems and natural sink systems. So the reverse ETL thing... listen, I don't think the data warehouse is the right source to drive operational systems. I will die on this hill. Typically, that team is not on call. They don't have the same uptime SLAs. Feeding microservices from your data warehouse just doesn't make a ton of sense to me. Now, I think there are vendors who will always try and consolidate in those places, but they're not going to meet the SLAs on latency. They're not going to meet the SLAs on a bunch of these other things. Uptime, maybe.
I won't speculate on that. So I think that a lot of the things that are done through reverse ETL should almost certainly be driven from the event streaming layer. If you're getting operational events from Postgres databases or Kafka, that should not pass through the data warehouse on its way to a HubSpot or a Salesforce. It can also go to the data warehouse. Right? So I think the event streaming layer and the stream processing layer is the point where you actually do distribution of this data to terminal systems, of which the data warehouse is mostly a terminal system.
There are exceptions; there's model training and those kinds of things that happen there. But this push toward, well, let's sync faster to and from the data warehouse, or at least from the data warehouse... I think to the data warehouse probably makes sense, but from the data warehouse, I think, is a losing battle. It isn't cost efficient, and I don't think that approach holds up. So, I mean, this is a horrible answer to your question, but I'm down this path and I'm not letting go of the bone. I think the stream processing layer is really the right place to do this collection, cleanse, and distribution.
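As a rough sketch of what that distribution looks like in practice, a single cleansed stream can fan out to both a terminal warehouse sink and an operational sink in one job. This is illustrative Flink-style SQL only; the orders_cleansed source and both sink tables are invented names assumed to be declared elsewhere:

    -- One cleansed stream, two destinations: the warehouse stays a
    -- terminal system while the operational tool is fed directly from
    -- the stream.
    EXECUTE STATEMENT SET
    BEGIN
      INSERT INTO warehouse_orders        -- e.g. a Snowflake or S3 sink
        SELECT * FROM orders_cleansed;

      INSERT INTO crm_order_events        -- e.g. a sink feeding a CRM tool
        SELECT customer_id, order_id, amount
        FROM orders_cleansed
        WHERE amount > 0;
    END;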
And then there can be further refinement of data inside of the data warehouse. I think that makes sense. Like I said, I think I did a poor job of answering your question, but that's sort of where my brain goes when I hear that. No, I think that's a great point, and I appreciate your
[00:48:56] Unknown:
opinions on this matter. It's one of those things where people hear something, it mostly makes sense, but they don't take the time to evaluate it in detail, and so that's how things grow. Somebody says, oh, of course reverse ETL comes from the data warehouse, because where else would it come from? So I think that's definitely a useful insight for people, and there are lots of interesting things to dig into. On that point of the streaming system being what's responsible for pushing to the reverse ETL destinations, HubSpot, Mailchimp, Salesforce, what have you: the converse to that is, well, it needs to go through the data warehouse because that's where all my other data lives, and that's how I enrich those records.
And to which I assume the response is, well, then you just use the data warehouse for the enrichment, but you pull that through the streaming system to inline that data into the event as it comes from Postgres on its way to whatever its eventual destination is. I'm wondering if you can give a bit more color to that.
[00:49:52] Unknown:
Yeah. I mean, this is where I think it gets interesting. The question then becomes, well, you could do the enrichment in the data warehouse, but if the streaming layer is the thing that actually collects data from all of the operational systems and is feeding the data warehouse, it is also capable of performing that enrichment for the data going to those other systems. So I think it is very unlikely that there is one place that fits the uptime, the SLAs, the latency requirements, the throughput requirements of all these things. So I don't think it's actually a bad thing to be able to do enrichment with operational information from a Postgres in the streaming layer and also to have that same data inside of the data warehouse. Right? A lot of our customers will use Decodable, or something like it if you don't use Decodable, to collect data from all these operational systems. They put effectively the raw data, or mostly raw data that is slightly massaged, into the data warehouse, and batch workloads inside of the data warehouse can, of course, use that data directly.
But when they are feeding online operational systems, like transactional messaging with a Mailchimp or a SendGrid or something like that, they may also be doing that enrichment in the streaming engine. This is why we support things like joins and all of this kind of sophisticated, stateful processing inside of the stream processing layer: so that you can do data enrichment at the streaming layer and then directly feed the operational system, effectively bypassing the data warehouse for that use case.
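A minimal sketch of that kind of in-stream enrichment, again in generic Flink-style SQL with hypothetical names (the orders_cdc and customers_cdc sources and the sendgrid_events sink are stand-ins assumed to be declared elsewhere), might look like this:

    -- Enrich raw order events with customer attributes from an
    -- operational table before feeding a transactional-messaging tool
    -- directly, bypassing the warehouse on this path. This is a regular
    -- (stateful) streaming join over two change streams.
    INSERT INTO sendgrid_events
    SELECT
      o.order_id,
      o.amount,
      c.email,
      c.plan_tier
    FROM orders_cdc AS o
    JOIN customers_cdc AS c
      ON o.customer_id = c.customer_id;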
So I don't think people need to be thinking about this as either-or. And I'll be honest with you, people go, well, what about data duplication, and what about logic duplication? And our point is, well, that's why we support SQL. The same SQL that you can run in the data warehouse, you can also run in the stream processing layer. That's why that is important. That's why we also support things like dbt, so that you can actually push that processing upstream when you need to, not all the time, but when you need to in order to fulfill that. And the upside is that you get decreased latency, better cost efficiency, richer data that may be coming from even more systems, at scale, in the hot path. And then you can feed the microservices.
You can feed SendGrid or Mailchimp. You can feed all of these other systems in parallel, and we've tried to make that as painless as possible. But I think people, engineers in particular, myself included, will sometimes over-rotate on, well, you're sort of repeating yourself, and these kinds of things. But for online operational infrastructure, which is what that is, transactional messaging or updating records in Salesforce, having your sales team have up-to-date information about who's inside of your product or something like that, that is an operational use case, not an analytical use case.
It supports analytics, but the infrastructure there is operational. And I think we're probably over-rotated on these issues of, well, but you're storing that record twice. And it's like, if you think that's bad, wait until you learn what indexes in database systems are. Right? They're really just different organizations of a lot of the same data, or materialized views, or any of these other things. They're just different copies of the data. So what we should be thinking about is, how do I meet the operational requirements, the cost requirements, or the cost optimizations, especially in this macroeconomic situation. I think this architecture actually has so many advantages, and it is not dissimilar from what folks like Jay Kreps talked about back in, what, 2014 with the Kappa architecture and all these other kinds of things. That's what it's getting at: you can actually drive multiple systems with the same data, which allows those systems to be optimized for specific use cases and workloads.
[00:54:18] Unknown:
And as you have been building Decodable, working with customers, helping them figure out how best to apply streaming to the problems that they're approaching, what are some of the most interesting or innovative or unexpected ways that you've seen your product used?
[00:54:33] Unknown:
I'm frequently surprised by what people are doing with Decodable. I mean, we definitely see what I would call bread and butter use cases: the filter, route, transform, aggregate, trigger type of use cases that happen in the stream processing layer. But one of the things that I have come to really appreciate, and I don't think this is all that unique these days, but at the time it surprised me how simple it made things, is this change data capture from an operational database system, through the stream processing engine for munging of that data, and then pumping that back into other operational systems like caches and full text search. So we work on a use case with a company that takes in resume data from MySQL, does a bunch of parsing and cleansing and transformation, and then indexes that data back into Elasticsearch to make resumes searchable by, basically, different features of a candidate. And this process wound up being zero code for them. It's one SQL statement that does very low latency processing of this data and then indexes it back into the full text search system.
And if you extend that to things like caches, Redis and those kinds of things, or ElastiCache, you actually start to see all this opportunity to create what amount to materialized views that are optimized for different workloads. Again, that's probably not super interesting these days, but the kinds of applications that you can build and how fast you can build them, that's the thing that surprises me. We worked with a telco provider to take data out of a legacy ERP system, which I won't name, with hundreds of tables, and cook that data for fast lookup by things like customer ID so that it can be served out of specialized systems like caches to support different kinds of applications.
And what would have been weeks of microservice development and batch processing stuff winds up being, I don't know, a day or two of a pretty gnarly SQL statement. That thing had 17 joins and correlated subqueries. It was 220 lines to take this legacy ERP, highly third-normal-form database schema and cook a denormalized record for fast lookup by customer ID, so you can see all the devices and all these other kinds of things. So the speed and the time to market, I think, is the thing that is just so interesting.
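For flavor, a drastically simplified version of that "cook a denormalized record for fast lookup" pattern, in the same generic Flink-style SQL as above with invented table names and only a couple of the many joins the real job had, might look like this:

    -- Flatten a normalized schema into one denormalized record per
    -- customer so a cache or key-value store can serve it by customer ID.
    -- customers, addresses, devices, and customer_lookup are assumed to
    -- be declared elsewhere (CDC sources, and a sink that accepts
    -- updates/upserts, e.g. toward Redis or ElastiCache).
    INSERT INTO customer_lookup
    SELECT
      c.customer_id,
      c.name,
      a.city,
      COUNT(d.device_id) AS device_count
    FROM customers AS c
    LEFT JOIN addresses AS a ON a.customer_id = c.customer_id
    LEFT JOIN devices   AS d ON d.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, a.city;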
But, man, people get creative. I'll tell you that. They can do some stuff that surprises me, the scale of it or the complexity of it, and the fact that it kind of just works for people. I would have killed for this stuff 10 years ago. I really would have. But, yeah, I think every week we probably see something new and interesting. It's exciting.
[00:58:03] Unknown:
And in your work of building this system, building the product, working in this ecosystem, what are the most interesting or unexpected or challenging lessons that you've learned personally?
[00:58:14] Unknown:
Oh, man. That's a good question. I think the big thing is, especially as an engineer, you look at these open source projects, and they are so sophisticated, so powerful. Debezium, Flink, Kafka. And everybody kind of goes, yeah, our stack is Debezium plus Kafka plus Flink. Those two pluses are doing a lot of work in that sentence. And the thing that has surprised me is just how much time and effort we spend on the glue between these otherwise fantastic LEGO bricks. So I'll be honest with you, here's probably the dirty secret about Decodable: our differentiation is not that we're three milliseconds faster than the other guy. Our differentiation is all the glue in between those systems, the DevX, the deployment, the APIs, the orchestration, the observability.
Oh my goodness, the observability alone around data pipelines in a streaming context. Data quality looks different in a streaming context. It functions differently in a streaming context. So a lot of those kinds of things are just much harder than I would have expected, and maybe that's me being naive. But the joke is like the old "draw the rest of the owl" meme, where step one is draw two circles and step two is draw the rest of the owl. I feel like that with: step one, deploy Kafka, Flink, and Debezium; step two, build the rest of the platform. The amount of time and effort is just daunting, especially given that this is our full-time job here at Decodable. We spend all our time thinking about it. And if anything, it sort of shows the need for these data platform teams inside of organizations and the value that they provide. Because the usability difference between "here are three open source projects, good luck" and "here are the three open source projects with all of the tooling and the glue and the UX and the observability around them" is the difference between being successful with these platforms and not. It really, really is.
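To give a tiny, concrete taste of one of those pluses, here is roughly what teaching Flink to read Debezium-formatted change events off a Kafka topic looks like in generic Flink SQL. The topic, broker address, and schema are placeholders, and the real glue (deployment, orchestration, observability, schema management) is far larger than this snippet:

    -- One small piece of the "Debezium plus Kafka plus Flink" glue:
    -- declare a table over a Kafka topic carrying Debezium-formatted
    -- change events so downstream SQL can treat it as a changelog.
    CREATE TABLE users_changes (
      user_id BIGINT,
      email   STRING,
      status  STRING
    ) WITH (
      'connector' = 'kafka',
      'topic'     = 'dbserver1.public.users',       -- placeholder topic
      'properties.bootstrap.servers' = 'kafka:9092',
      'format'    = 'debezium-json',
      'scan.startup.mode' = 'earliest-offset'
    );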
[01:00:43] Unknown:
Absolutely. And so with that context, for people who are considering Decodable or considering streaming, what are the cases where that's the wrong choice?
[01:00:52] Unknown:
I don't think that people should think about, excuse me, Decodable or stream processing as, I'm gonna shut down my data warehouse. Like, I'm gonna take my Airflow, dbt, and Snowflake and shoot it into the moon. That's just not what's gonna happen, and we don't make that claim. I think that stream processing, at its core, is something that typically sits between two other systems, some source and some destination, and processes data to make it available in that destination. I don't actually think it's meant to be a serving system. So we purposely decouple ourselves from an Apache Pinot or an Imply or Snowflake or Databricks or an S3 or a Redis, because we believe you actually want those as two separate systems, so that you can cook the data and then serve it in whatever storage and query engine makes the most sense for a particular workload.
So I don't think people should be saying, I'm going to replace Snowflake with Decodable, or Databricks with Decodable, or Postgres with Decodable. That is not how you should think about stream processing, so don't do that. But there are plenty of other cases where, if you're thinking about how do I get data from Postgres through the outbox pattern to be available for microservices, or into HubSpot or something like that, or into Snowflake or S3, those are the places where Decodable and stream processing fit.
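As one hedged illustration of that outbox-pattern shape, sticking with the same generic Flink-style SQL and invented names (an outbox_cdc change stream over the application's outbox table and an order_events_topic Kafka sink, both assumed to be declared elsewhere):

    -- The outbox pattern, sketched: the application writes domain events
    -- to an "outbox" table in Postgres, CDC picks them up, and the
    -- stream processor routes them to a topic that microservices or
    -- sinks like Snowflake and S3 can consume.
    INSERT INTO order_events_topic
    SELECT aggregate_id, event_type, payload, created_at
    FROM outbox_cdc
    WHERE aggregate_type = 'order';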
[01:02:37] Unknown:
And as you continue to build and iterate on and scale the Decodable platform, what are the things you have planned for the near to medium term, or any particular problem areas or projects you're excited to dig into?
[01:02:49] Unknown:
Yeah. I think there's a lot we can do. There's some stuff that's evergreen: we'll always be adding new connectors, we'll always be improving the SQL dialect, we'll always be enriching the APIs to give people more expressive capabilities. I think there are some higher order functions we can provide, like more sophisticated management of time and those kinds of things, that codify some standard ways people use this technology and basically make it easier to build the jobs that run on top. Developer experience, I think, is always top of mind for us. It is not where it should be. There are certain things that are probably harder to do than I wish they were.
Think operational flows: making a backward-incompatible schema change, for instance, is a thing that sometimes needs to happen. You should avoid it at all costs, but when it does need to happen, it's messier than it should be. Reprocessing of data, I think that's an area, and bootstrapping net new things. Like, I want to bring up a new Elasticsearch instance, replay the last 12 months of data or something like that into it, and then switch over to the stream. You can do those things today, but I think they can be easier, so we're thinking about those kinds of things. And then there are probably a couple of things we're working on that we're not quite ready to talk about, that we think will be really exciting for people, that are a little bit more nascent, maybe a little bit more moonshot. We'll figure out when the right time is to come out with those things. Absolutely. Interesting stuff. Are there any other aspects of the work that you're doing at Decodable, the overall
[01:04:34] Unknown:
ecosystem of stream processing, and when and how to apply it that we didn't discuss yet that you'd like to cover before we close out the show?
[01:04:41] Unknown:
Honestly, I feel like this is probably one of the better overviews; I really appreciate the questions. I think we probably hit all the critical stuff. I would say that a lot of people are still struggling to understand where stream processing fits into the stack and which workloads should move to it, and we're probably spending more and more time trying to think about how to help people understand that. At its core, people can think about this as low latency ETL that can power both analytical systems and microservices.
I wouldn't stress about it much harder than that. But operational-to-analytical and operational-to-operational data infrastructure is probably the mental model that people should have for stream processing. And that might be one of the areas where we have some more work
[01:05:41] Unknown:
to do, and it's probably an evergreen topic to talk about when we talk about this stuff. But, otherwise, it's been a real pleasure, man. I appreciate it. Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:06:02] Unknown:
Oh, man. What is the biggest gap? I think the largest portion of the problem here is really one around developer experience, and my anecdotal evidence for that is that every team at every Fortune 500 company has what I would call the company-specific data platform layer that glues this set of things together in a way that makes sense for a particular company, for a particular team. And I don't think there are yet developed standards the way we have in application development, the layers above. I think that creates so much six-of-one, half-a-dozen-of-the-other, different-for-the-sake-of-being-different development effort that burns time and money, and I think that hurts people.
I don't know the solution to that, but, man, I wish we could come up with better patterns and practices for this stuff, encoded in a way that vendors can then more tightly integrate with, so
[01:07:11] Unknown:
that there is less work for these data platform teams to do. Because they really are doing yeoman's work out there. It's rough. Absolutely. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Decodable and the changes that have happened since we last spoke. I definitely appreciate all of the time and energy that you and your team are putting into making stream processing a more tractable and approachable problem for people who don't want to have to invest their entire lives into it. So, I appreciate that, and I hope you enjoy the rest of your day. Thank you so much for having me. It's always a
[01:07:46] Unknown:
pleasure. Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Eric Sammer: Starting Your Stream Processing Journey
Eric Sammer's Background and Journey in Data Engineering
Overview of Decodable and Its Purpose
Complexity in Streaming Systems
Adoption of Streaming Data in Businesses
Distinguishing Event Streaming and Stream Processing
Product Journey and Adoption Curve of Decodable
Supporting Both SQL and Code in Decodable
Developer Experience and Onboarding with Decodable
Control Surfaces and Integration with Existing Workflows
Managing Source and Destination Systems
Stream Processing for Operational and Analytical Workloads
Interesting Use Cases of Decodable
Lessons Learned in Building Decodable
When Not to Use Stream Processing
Future Plans for Decodable
Closing Remarks