Summary
Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.
- Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Decodable is and the story behind it?
- What are the notable changes to the Decodable platform since we last spoke? (October 2021)
- What are the industry shifts that have influenced the product direction?
- What are the problems that customers are trying to solve when they come to Decodable?
- When you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL?
- What are the developer experience challenges that are particular to working with streaming data?
- How have you worked to address that in the Decodable platform and interfaces?
- As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced?
- What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
- When is Decodable the wrong choice?
- What do you have planned for the future of Decodable?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Decodable
- Understanding the Apache Flink Journey
- Flink
- Debezium
- Kafka
- Redpanda
- Kinesis
- PostgreSQL
- Snowflake
- Databricks
- StarTree
- Pinot
- Rockset
- Druid
- InfluxDB
- Samza
- Storm
- Pulsar
- ksqlDB
- dbt
- GitHub Actions
- Airbyte
- Singer
- Splunk
- Outbox Pattern
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png) NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks: - Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript - Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI) - Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation) Don’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to [Neo4j.com/NODES](https://Neo4j.com/NODES) today to see the full agenda and register!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today, I'm interviewing Eric Sammer about starting your stream processing journey with Decodable. So, Eric, first off, welcome back to the show. Good to have you on. And for folks who haven't listened to your previous appearance, can you give a bit of an introduction?
[00:01:38] Unknown:
Sure. Thanks so much again for having me. So I'm the CEO and founder here at Decodable. We're a stream processing platform based on Apache Flink
[00:01:46] Unknown:
and Debezium for change data capture. And do you remember how you first got started working in data?
[00:01:51] Unknown:
You know, maybe. I have to think. I have to go into the wayback machine on this. Yeah. So I think I very early got started working on database systems and data platforms and infrastructure. You know, I've been doing this for 25, you know, probably creeping up on 30 years. The place where it sort of really switched over for me was when I went from building internal systems at places like Experian and various other organizations to working at Cloudera, which I joined in late 2009, early 2010. So I was a pretty early employee there and worked on a bunch of what I would lovingly call the 1st generation big data stuff with Hadoop and, you know, eventually things like Spark and Kafka, things like that. And so that's probably where I really cut my teeth.
Ever since then, you know, 2010, I've been more focused on building platforms for other people as sort of a vendor. That's probably the only way to say it.
[00:02:48] Unknown:
And so bringing us to decodable itself, we had you on, looks like in October of 2021 early in your journey. But for folks who aren't familiar with it, can you give a bit of an overview about what Decodable is, the story behind it, and why you decided that this is where you wanted to spend your time and energy?
[00:03:08] Unknown:
Yeah. I mean, the last part of that is probably the funniest. I would rather never spend my time on this, but I hate this problem so much that I am determined to solve it or die trying. So, I mean, Decodable is, you know, a relatively sort of simple concept. I mean, there's a bunch of different sources of data. There's an increasing, especially on the destination side, an increasing number of destinations of data. And in our world, we think primarily on the source side about event streaming platforms like Kafka and Redpanda, Kinesis, GCP Pub/Sub, things like that. Operational databases like Postgres, MySQL, Mongo, Oracle, SQL Server, those kinds of systems. And then on the destination side, analytic database systems, so all of the same systems on the destination side that I just mentioned on the source side, plus the analytic database systems, cloud data warehouse, data lake systems, Snowflake, Databricks, even things like S3 with, like, Parquet data and those kinds of things. And increasingly, there's, like, a long tail of more specialized data infrastructure, and we would include, like, the real-time OLAP systems like StarTree and Imply or Apache Druid, Apache Pinot, those kinds of systems, Rockset, so on and so forth, as well as things like Elastic and, you know, full text search systems, you know, InfluxDB, sort of like telemetry, like metric stores, and those kinds of things. And so Decodable really exists to sort of sit in between those source and destination systems, including microservices that are hanging off of Kafka topics, and be able to process data in real time, filter, join, enrich, aggregate, sort of all the verbs that people would use either in SQL or through, you know, Java APIs. And so the open source projects that we're based on, again, primarily Apache Flink on the stream processing side, Debezium on the change data capture side, so we're built on top of those 2 systems. And so that's the set of APIs and capabilities that we're exposing through the commercial product, you know, that is Decodable. The genesis of the company is, you know, also probably not surprising. It's just that, like, as the source systems grow and get more complex, as the destination systems grow and get more complex, and the processing in between them, and especially the fact that the data is coming from operational systems and driving other operational systems, including microservices, including caches like Redis, you know, full text search systems, but also the analytical database systems, Snowflake, Databricks, so on and so forth. I think that the tooling that sits between them has historically been super powerful, really robust, but quite frankly, really hard to use and reason about, whether it's how do I reprocess data, how do I handle schema migrations, how do I make schema changes, what's the right way to think about data quality and observability, what's the right way to build and deploy those pipelines as part of applications and CI/CD processes. And so, you know, we really wanted to solve this problem and make it relatively simple, or as simple as possible, to be able to process the data, to add new sources and destinations.
That's a long explanation. But, fundamentally, like, that's the space in which we in which we play. And like I said, I think that this problem is just far more complicated than it should be, and I will die trying, you know, to drain the complexity out of this problem. It's, it's just it doesn't need to be as complex, you know, as it is.
[00:06:47] Unknown:
And to that point of complexity, streaming systems started to become in vogue, I wanna say, maybe creeping up on 10 years ago now, as a lot of the so called hyperscalers were getting to that point of hyperscale where they said, oh, we need to be able to start processing all of this data in real time because all of our batch jobs are taking a week to complete. So I know Twitter was one of the early ones with, I don't remember if they were the ones for Storm or Samza, but I remember, like, hearing about Storm, Samza, Flink was a little bit later on, Spark Streaming. You know, there were dozens of streaming systems. Now it seems like it's coalesced mostly down to a handful of 2 or 3.
Within the overall space of complexity in streaming, you hear a lot about things like checkpointing, windowing, at-least-once consistency, exactly-once semantics, challenges of, kind of, late-arriving data. Wondering, given the fact that we've been dealing with these problems for the better part of a decade now, how many of them are still something that somebody coming into the space really needs to understand still? How much of it are you still dealing with in building Decodable, and how much of it has been solved through just kind of force of will or new updates and understanding to how we can address these systems in a more
[00:08:05] Unknown:
sane fashion? I mean, that's the spiciest of spicy questions. Because, I mean, that's the complexity, I think, that we're talking about, all the things that you mentioned. So I think that that's right on the money. And I'll be really honest with you. I think that we've we've handled some of those things. So there's quite a bit of configurability in this is in these systems. And in fact, you're talking to Robert Metzger, the the PMC chair of Flink. He's he's 1 of our engineers, and he and I were talking about this problem. And he's like, you gotta understand that, like, Flink was built in a way that allows quite a bit of tuning where you know a lot about the workload.
And if you compare that to something like Postgres, where, like, you just fire queries at these things and, you know, and, like, yes, there's there's tuning quite a bit of tuning you could do on Postgres, but for the most part, it just kind of does what you expect it to do. So I think I think that's the right question. So a lot of what we do is trying to, you know, solve exactly those issues. So some of them, things like checkpointing and memory tuning and buffers and and and sort of configuring at least once versus exactly once and getting the semantics right and the retention and the timeouts. Those kinds of things, I think that we've actually done a pretty good job, you know, here at Decodable, and I think it is possible to paper over a lot of that complexity. Some of the challenges around late arriving data and state management in terms of, like, replay and, like, what happens if you have bad bad data and, like, how do you sort of compensate for that? You know, those things, I think, are intrinsic to data engineering.
They're not even strictly stream processing problems, although they do tend to rear themselves in stream processing in ways that you probably don't encounter as much in batch. So I think that there's definitely some of that. And there, it's not so much about making that problem go away. Because like I said, I think it is, like, an intrinsic thing that you have to think about, but I think you can build workflows and processes and systems around that to at least guide people down the path of making sane decisions. Right? So, like, if you care about late-arriving data, like, there's actually a way of presenting the concept of, like, watermarks and, like, all this deep complexity to somebody in a way that, like, is more intuitive than what, I think, might come out of the box with some of these projects. And, again, not for lack of trying, these things are incredibly sophisticated and complicated and have to handle this, like, bevy of use cases that vary in their needs with respect to timing and state and, you know, late-arriving data and all these other kinds of things. So I think that we've actually come quite a long way, especially in the Flink ecosystem where I think things are starting to really coalesce just in the last couple of years. I mean, Flink has been around, you know, as a research project, I think, since, like, 2010.
I think in 2014, it actually became the open source project, Apache Flink, you know, from the research project. But, like, Samza out of LinkedIn and Storm, which I think primarily came out of Twitter, there's been so many systems that I think have carved a wonderful path here, but I really think that the industry is coalescing around Flink as being sort of the de facto winner, you know, maybe the way that Kafka has. I don't know that I'm allowed to say that out loud, but, like, you know, maybe there's a parallel there.
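For readers new to the concepts mentioned above, here is a small, hedged sketch of how event time, watermarks (the mechanism for tolerating late-arriving data), and windowing are typically expressed in Flink SQL. The table name, columns, and the five-second lateness bound are illustrative assumptions, not anything specific to Decodable.

```sql
-- Hypothetical Flink SQL: declare event time and a watermark, then aggregate
-- over one-minute tumbling windows. Events arriving more than 5 seconds
-- behind the watermark are treated as late for their window by default.
CREATE TABLE clicks (
  user_id     STRING,
  url         STRING,
  event_time  TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen'   -- stand-in source used only for illustration
);

SELECT
  user_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS clicks_per_minute
FROM clicks
GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```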
[00:11:26] Unknown:
With all of that complexity, with all of the different systems that are and were necessary to be able to even think about addressing these streaming use cases, that raised the barrier to entry quite substantially so that most people said, oh, streaming would be great, and then they started to dig into the problem and said, oh, heck no. I'm not gonna deal with that. I'll just deal with maybe smaller batch sizes and figure it out. And now that we do have platforms like Decodable and many others focusing on different aspects of this streaming ecosystem, how does that influence the types of problems, the types of businesses that are actually starting to implement streaming data as a core capability of their business?
[00:12:11] Unknown:
It's a great question. I think we certainly are seeing it objectively more and more. There are companies, you know, just even a couple of years ago who said, like, we don't have a streaming use case that have, like, come around to the idea, like logistics, for instance. Like, I the joke I always make with the team is, like, nobody really thought that they needed to know exactly where their pizza or Chinese food was until, like, Grubhub and DoorDash and, like, all these other companies started, like, showing you the little dot on your on your mobile device. And now customers just expect that. So I think the use cases have really pushed a bunch of people into thinking about these things, especially in retail, Fintech, logistics, inclusive of, like, these delivery services and those kinds of things, and gaming. You know, systems like Fortnite or companies like Epic Games and stuff like that have done quite a bit with real time processing based on on things like Flink.
I think the first step for a lot of people was adopting an event streaming system like Kafka. Right? And I think that, like, when you look at, like, the Fortune 500 or even, like, the Global 5,000 or however you wanna slice the market, I think, like, most people have done that. Well, maybe that's an exaggeration, but, like, most of the Fortune 500 have done that for sure. You know, obviously, companies like Confluent and AWS have really popularized the notion of event streaming in general. And then, like, we look at that the way I think people look at, like, S3. Right? Like, that's the primitive that enables a whole bunch of other things to follow, and I think stream processing is the natural thing that follows that. You know, first I move the data, then I might be able to process that data without actually having to write microservices that listen or produce directly to those Kafka topics, which is a natural extension. And I think that we're sort of at that phase now where people have, like, figured out event streaming in general, and now they're like, oh, yeah. This handles ingesting to analytical database systems. This handles connectivity between microservices.
And, like, now they're starting to get to, like, okay, but, like, it doesn't connect to everything I need it to connect to, or it doesn't handle change data capture versus append-only streams. And a lot of the things that need to talk to each other don't speak the same language in terms of, like, the payload or the semantics, or there's governance and compliance stuff. Like, that service isn't allowed to have PII data, but this one produces it. But they still need to talk. So, like, how do you deal with that? Well, the answer is stream processing. Like, that's where that fits. So, use cases like that, I think, have pushed people into it. And so, you know, I think it's a little bit of a renaissance, like a second renaissance.
We just came out of Confluent Current last week, or a week and a half ago. There's, like, 2,500 people, and it's like all the stalwarts. It's every company you could ever imagine, you know, in that Fortune 500 space and some of the sort of higher end of what we think of as, like, the more sophisticated mid market. I don't think we're there yet in the sense of, like, it's nowhere near as robust a market or a technology stack as, like, the data warehouse. You know? People go, like, okay. Like, it's either Snowflake or it's Teradata or it's BigQuery or something like that. I don't think stream processing is quite there yet, but I think it's on its way to that. And we're very lucky to have been I feel like we built Decodable at, like, exactly the right time where we're sort of cloud only, and, like, we're able to really take advantage of, you know, kind of where people are at and meeting them where they are.
[00:15:39] Unknown:
And an interesting thread to pull on as well is you have appropriately used the right semantic terms for separating these 2 concepts of streaming that will often get muddled by people who are less aware of the space: event streaming and stream processing. And I'm wondering if we can maybe clearly draw the line between those 2 because particularly with things like Kafka or with Pulsar, where they have some measure of transformation available, or Redpanda in particular with their inline WASM modules being able to do some measure of transformation on events as they get, you know, entered into a topic or move between topics, as opposed to what you would do with a Flink. Wondering if we can maybe kind of make that line a little bit brighter for people who are confused.
[00:16:24] Unknown:
Yeah. I mean, I think you're absolutely right. So when we say event streaming, and I understand the language is confusing on this. I kinda wish that these 2 phrases had fewer words in common. Right? You know? But when we say event streaming, what we really mean is the durable storage and movement of data in real time. And so some of these projects, like you mentioned, like Redpanda with its WASM support or Pulsar with its Pulsar Functions, you know, have different capabilities that are sort of either squished together in one box or, in certain cases, are sort of effectively 2 different boxes. So I think that further makes things mushy for people. So we think about it as storage and durable storage and movement of data, and then we think about the processing and connectivity of that data. And connectivity, of course, being, like, connect to my source systems, connect to my destination systems, we can kinda get it onto and off of, like, a Kafka topic or something like that. So, like, using the Kafka ecosystem as an example, storage and movement is the Kafka broker.
The connectivity would be Kafka Connect. And then the processing, that's where it starts to get a little bit sort of interesting. You know, there's Kafka Streams. There's ksqlDB. There's obviously Flink. Our, you know, weapon of choice in this at Decodable is Apache Flink. And there's other systems that you can sort of pair with that. And so that's the way we think about it. And I think I don't know that these are proper nouns, event streaming, event processing, or stream processing. Some people, like you said, sort of abbreviate that as saying, like, Kafka, and, like, to them, that means everything. To me, that means the broker. Right? So, like, you know, there's certainly some squishiness in there, but I think that that's the right way to think about it because there are cases where, like, we love Kafka, the broker. We really do. In terms of connectivity, I think that, you know, quite frankly, like, you know, I'll just say the quiet part out loud. Like, you know, I think that the way we do connectivity, which is actually based on Apache Flink, has advantages over doing that with Kafka Connect. I think that for processing, Flink has advantages over Kafka Streams. I know there are friends of mine who will throw rocks at me for saying that, you know, who are sort of diehard Kafka Streams people, which I appreciate. You know, we could debate that all day, but I actually think that there are cases where it makes sense to think about those things as individual blocks that you can sort of stitch together.
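As a rough illustration of that separation of concerns, here is a hedged Flink SQL sketch in which the Kafka topic provides the storage and movement, the table declarations play the connectivity role, and the query itself is the stream processing. Topic names, fields, and broker addresses are made up for the example, and the exact dialect a managed platform exposes may differ.

```sql
-- Connectivity: a source backed by a Kafka topic (the storage/movement layer)
CREATE TABLE orders (
  order_id    STRING,
  customer_id STRING,
  amount      DECIMAL(10, 2),
  order_time  TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',                           -- hypothetical topic
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Connectivity: a sink backed by another Kafka topic
CREATE TABLE large_orders (
  order_id    STRING,
  customer_id STRING,
  amount      DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'large-orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);

-- Processing: the actual stream processing logic between the two topics
INSERT INTO large_orders
SELECT order_id, customer_id, amount
FROM orders
WHERE amount > 1000;
```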
[00:19:06] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. And in the product journey that you have taken at Decodable since we last spoke in October of 2021, I'm wondering how the adoption curve has influenced your thinking around your product focus of how to reduce the level of upfront complexity that people need to address to start using these technologies and experimenting with these techniques? And some of the ways that the overall industry trends have also factored into the ways that you think about the solution space that you're targeting?
[00:20:21] Unknown:
Yeah. Absolutely. I mean, so there's a couple of things here that I think have been, you know, just things we've learned and, like, you know, I don't know. Maybe somebody would tell me I shouldn't do this, but I'm gonna tell you all the stuff that we got wrong, you know, and that we've, like, since fixed since 2021, not just, like, in our product, but, like, in our understanding of the space. So, like, you know, in 2021, just for context, Decodable was about 8 months old. We had just raised our Series A. We're now 2 and a half years in and have spent quite a bit of additional time with, you know, customers and things like that. So a couple of things. One, I think, initially, we thought that this space, the stream processing space in general, was at the point where it was really going probably more mainstream than it really was. Like, we thought it was further along, and so we kind of went, like, well, obviously, the right thing to do is focus on SQL because everybody knows it and, like, you know, that was the right way to think about it. And I think we were correct, but incomplete. You know? So, like, one of the things that I think we have learned since then is that when we talk to customers, it's about 70%, 80%, 90% of workloads that can be expressed in SQL. Because, like, a lot of what people do with stream processing is, like, I just need to, like, route this stuff, or I need to knock out some PII data, or I need to filter records for just the successful HTTP events. Like, whatever it is, it's, like, really simple stuff that is, like, a one-line SQL statement that, like, effectively replaces, like, a whole, like, ream of, like, Java code and, like, all these other kinds of things. It's one less service to explicitly, like, build and instrument and monitor. Like, you can sort of push that to a vendor, and, like, that has sort of advantages. And then there's, like, this 10, 20% of use cases that are just different. And they're different for a few reasons. One, purely from an expressiveness perspective, SQL is just hard, and that's typically because there's, like, more sophisticated state management. Really, really complex sessionization use cases or, excuse me, data enrichment use cases, for instance, can sometimes require very specific kinds of, like, time management and window functions and state management.
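As a concrete illustration of the "one-line SQL statement" category described here (routing, PII masking, filtering), here is a hedged sketch in generic streaming SQL. The stream names, columns, and masking rule are hypothetical.

```sql
-- Hypothetical: keep only successful HTTP events and knock out a PII field
-- on the way from a raw stream to a cleaned stream.
INSERT INTO http_events_clean
SELECT
  event_time,
  request_path,
  status_code,
  REGEXP_REPLACE(client_ip, '\d+$', 'xxx') AS client_ip_masked  -- crude PII masking
FROM http_events_raw
WHERE status_code BETWEEN 200 AND 299;   -- filter to successful requests
```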
There are cases where people have to reach into third party libraries, and I think as, like, LLMs and AI and, like, things that are above my pay grade are sort of, like, all the rage, I think those kinds of things, like, there aren't SQL functions for them or they're difficult to express in SQL for various reasons, or they have deeply, deeply, deeply nested data that is complex to think about in the relational model for a variety of reasons. You know, some of those things could probably be SQL, but it's more natural for people to think about it as imperative code. And so people just, like, were like, we have to write code. So we had to wind up exposing full support for the Flink APIs, and that allowed at least 50% of the people that we were talking to who said, we love what you're doing, but we can't use you because you're SQL only, to be able to use us. And, like, maybe that's internal secret sauce at Decodable, but that was, like, a major thing that, like, in retrospect, you're like, yeah. Of course. But the market just wasn't so far along that, like, everything was SQL. And so, like, I truly believe that you do need both. Flink has, of course, recognized this from very early on. We wound up exposing that functionality and sort of building support around that in a way we think is sort of super nice to use. But that was one big thing. The other big trend is this issue of, especially as more of the older or highly regulated businesses or just more complex businesses move to cloud with their data platforms, not just, like, their application stack, but their data platforms, I think this question of, like, can I actually ship all of my event data to a third party vendor to be processed and then have it returned to me? Some companies, banks, insurance companies, health care providers just, like, sort of, like, recoil when you're like, yeah. Just, like, give us all your data. Unless you're, like, a Snowflake, unless you're, like, a Salesforce or somebody like that. But, like, let's be honest.
You know, Decodable is not a Snowflake at this stage. Right? And so, like, the level of trust we enjoy with customers is probably not as robust as, like, a Snowflake or somebody like that. And so I think that customers rightfully want to be able to maintain control of their data. And so our answer to this, and Redpanda has done something similar and a couple of other vendors in the space have done something similar, is just, like, a bring your own cloud model where we cut the platform into a control plane and a data plane, and the control plane only carries, you know, command and control messages, and the data plane is the part that actually processes data and touches customer infrastructure. And the data plane runs inside the customer's cloud accounts. So it's still cloud. It still has a lot of the features of a managed offering because we're able to collaborate on, like, the management of that, but it is resident within the customer's account. And so, like, they have full control over the data and the processing that happens there. And I think that I don't know whether there's some debate raging online right now about, like, whether or not that's, like, the future of cloud services.
Frankly, some people prefer one over the other. And I think that that was an interesting thing that we've learned. I think data infrastructure in the cloud in particular has a high wall to climb until you are as mature and well developed
[00:25:54] Unknown:
as a Snowflake. Sorry. That's a super long answer. Hopefully, that makes sense. No. That makes perfect sense. That's great. I love the detail. And as to the types of problems that people are trying to solve when they come to Decodable. You mentioned that you targeted SQL. You said, this will be great. Everybody will know SQL. This will solve everybody's problems. Oh, shoot. They actually wanna program. And I'm curious what the process looked like from that realization to the point where you said, okay. Here you go. I've given you what you want, and some of the minefields that you had to traverse on that journey. So I mean, the issue,
[00:26:30] Unknown:
in that particular instance is not so much, like, how do we support it? Because like I said, we're sort of standing on the shoulders of giants here with Flink. So, like, Flink already has the super robust DataStream API and Table API that are different levels of abstraction that are, quite frankly, like, really ergonomic and nice to use for most use cases, maybe for all use cases. I'm hedging a little bit because there's gonna be somebody who's like, well, I hate it. Okay. Fine. Like, you know, but I think that Flink already has a lot of this capability. The issue for us was, like, how do we offer it in a way that is safe? Because, like, the real issue for a cloud provider is that, like, that is untrusted third party code. Right? You basically have customers uploading, like, arbitrary code to a cloud service and saying, run this for me. And for us to build a service around this, and we've even talked to people internally building data platforms who are like, I might trust the users not to do sort of adversarial things, but we do worry about supply chain attacks. We do worry about, like, exfiltration of sensitive information.
Or more often, we worry about that code just, like, chewing up arbitrary resources in a way that, like, impacts, like, other workloads inside of, like, the bank or the, you know, the retailer or whatever it might be. So I think resource management and then, like, security and safety and, basically, isolation. Resource isolation is the biggest challenge there. And so we spent so much time trying to find the safe, correct, performant, isolated, but still cost efficient way to do that, and wound up having to thread a pretty rough needle on that. We think that we've actually done the right thing in that, like, foremost is the security and safety. And so, like, we spent a lot of time and effort basically trying to figure out how to do that inside of Decodable.
I'll just say that, like I won't go into a ton of detail. I'll just say that, like, if you upload an existing Apache Flink job to Decodable, it effectively gets isolated hardware on an isolated subsegment of a network. Right? And all that infrastructure is dynamic. And we, you know, we had to work pretty hard to find a reasonably efficient and cost effective way for us to do that. In the BYOC model, it's a little bit easier because that's actually the part that's running on the customer's infrastructure. And so, like, it's easy for them to have, like, isolated infrastructure because it's single tenant by definition.
But, basically, user code inside of the managed, fully managed version of Decodable is also a single tenant for that job. And so there's there's quite a bit of infrastructure that goes into that. And then the resource isolation from a, like, an efficiency perspective, like, if you're doing it from a security standpoint, you're kind of, like, also doing it from a just like an isolation like, a workload isolation perspective. But that was, like, the biggest nut for us to crack by far. And, and then there's, like, secondary stuff, which is, like, what if people touch parts of Flink that we sort of have to control and manage for a variety of reasons for, like, observability infrastructure and those kinds of things. Those are those are secondary and probably easier to solve. They're quite a bit of work, but, like, conceptually easier to solve.
But, man, that that that user provided code thing is just, you know, it's rough to get right. But once you get it right, it is so, so exciting. Because, like, the UX on it is is basically the same as the SQL UX. Like, you upload a chunk of SQL, you upload a JAR file that's built against the open source APIs, and, like, magic happens. Right? Like, it just works. And, like, that has actually been really, really nice to watch customers be able to take preexisting stuff, and without making, like, any code changes, without even recompiling, just kind of go, like, run this. And it sort of gets all of the managed goodness, you know, that sort of comes from that. But, man, is it a lot of work?
[00:30:29] Unknown:
Well, I mean, at least they actually have the deliverable of a JAR file that you can plug in. Imagine if it was PHP or Python.
[00:30:37] Unknown:
Oh my goodness. You know, I mean, so, you know, this is a place, I think, of more deep research on our part. We talk a lot about expanding the language support. So, like, Flink, for better and worse, is Java. You know, it enjoys a rich ecosystem of tooling and compatibility, and the runtime performance is arguably better than it should be. I'll be honest with you. Modern Java is, like, incredible. Sure. Somebody is like, wow. I could hand code it in Rust better. Yeah. Yeah. Yeah. You know, I agree. I'm a C++ and Rust person myself, but, like, you know, yes. But, you know, it sort of, like, strikes the right balance.
But, like, opening Flink and those capabilities to languages like Python, Go, JavaScript, TypeScript is actually real, you know, maybe even PHP. No, I'm not biased. But anything is possible. But, I mean, I think those kinds of things are actually really important because, certainly, for data platforms at larger organizations, you're typically not homogeneous on language. Right? You know, polyglot programming is like a whole thing these days. And so I think that being able to support that is something that we know we need to improve, not just within Decodable, but the larger, I think, Flink ecosystem. And I think other people in that community probably would agree with that. Python support is okay. It needs to be better. You know, it really does. To that point of the experience for people who are working with Decodable, you mentioned upload some SQL or upload a JAR. I'm just wondering if you can talk through the overall surface area of developer experience
[00:32:20] Unknown:
and the onboarding process, the concepts people need to be aware of and factor into their work as they're using decodable and some of the challenges that you've had as far as how to improve the overall experience and reduce points of friction in their journey?
[00:32:37] Unknown:
Yeah. I think this is, again, 1 of those areas where I don't think the work is ever done. You know? Systems like Postgres have, like, existed forever. And so the UX on that or the DX on that is the developer experience is is relatively sophisticated, excuse me, and well known. There's patterns around it. And I think stream processing, because it is effectively an integration technology. It's all about connecting to sort of upstream systems, downstream systems, and it has the overhead of effectively being, like, a a query engine. Like, you know, it's not a database per se, but, like, it has all the hallmarks of the query engine portion of of the database system.
Arbitrary workload definition, those kinds of things. There's quite a bit of sophistication and complexity in that. So inside of Decodable, we've tried to distill this down to, like, as few concepts for somebody to learn as possible. So we think about the world in terms of connections to things, streams that are produced or consumed by connections, and then what we call pipelines, you know, for lack of a better word, that actually process data in between those streams. And then, sort of, like, that allows us to then compose those 3 core primitives into effectively sequences or DAGs of, like, this connection goes to this stream, which feeds these 5 pipelines, which feed maybe these other pipelines, which feed these other connections, and so on and so forth, and give people the ability to effectively design the end-to-end data flow from, you know, arbitrary source systems to destinations.
But there's, like, issues of discoverability and impact analysis. If I change this connection, what does it mess with downstream? There's, I have malformed data that's, like, stuck in this one part of the DAG. Like, what is the implication of that? Helping people to, like, navigate that and to discover, diagnose, trace, whatever the verb is, I think, is always laden with a certain amount of friction because you sometimes wanna think about the end-to-end data flow and infrastructure through different lenses. One is, like, what processing supports this microservice or, like, you know, flows into this Snowflake table. There's how is this data used, which is more of, like, a source side concern.
There is the dependency analysis portion of this that is so difficult. So we've actually tried to create this, like, for every object, what's upstream of it, what's downstream of it, and, like, basically, lineage information to be able to walk the graph as you sort of go through this, and then given people tooling, command line, API, UI, dbt. So there's, like, lots of different interfaces to this to be able to operate on different resources within that graph. And so, like, if you're a SQL person and you're thinking about using Decodable to get data from, like, Postgres to Snowflake with a bunch of transformation in the middle to, like, cleanse it or to, like, better structure it to flow into Snowflake so that you're not sort of burning warehouse credits, you know, to be able to do things, you're probably thinking in, like, a dbt and SQL mode. And in that case, you wanna be able to fit into the existing workflow of dbt. You don't wanna rip somebody out of that and try and teach them a different way of working. So thinking about it from the perspective of, like, jobs to be done and sort of, like, the tooling, the different personas, which are really different flavors of the same persona. They're, like, roughly someone operating in, like, a data engineering or application, like, back end developer type role, but they might be, like, coming at it from the lens of data warehousing or microservice development. Like, you know, like, there's, like, different dimensions on that. And I think as much as we can fit to someone's existing workflow, that's our goal. So, like, we have to offer APIs for absolutely everything. We have to offer command line tools to, like, script CI/CD, you know, stuff that happens in GitHub Actions and, you know, Harness and, like, all these other kinds of systems.
We have to provide dbt tooling for the data warehousing flavor of data engineer. We have to provide a UI to allow people to very quickly sort of navigate and understand the way that they would, like, in the AWS console. Right? Like and, like, let's be honest. There's probably people who do more inside of those UIs and are like, I know I should be doing this in Terraform, but, like, I'm not. And, like, you know, someone's gonna catch me. But I think it is the right environment to be able to, like, develop and experiment and try things, and then you kinda go back and codify it once you know that it works. And so I think we still have tons of work to do on this, but I think we're starting, you know, like, we're down a path where a lot of people are able to be pretty productive in short order. You know?
But, lots of work left to do there for sure.
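To ground the connection / stream / pipeline vocabulary above, here is a hedged sketch of how two chained pipeline statements might look in SQL, where each INSERT reads from one named stream and feeds another. The stream names are hypothetical, and the surrounding connection definitions (the actual source and sink systems) are assumed to be configured separately.

```sql
-- Pipeline 1: cleanse the raw stream coming in from a source connection.
INSERT INTO orders_clean
SELECT
  order_id,
  customer_id,
  CAST(amount AS DECIMAL(10, 2)) AS amount,
  order_time
FROM orders_raw
WHERE order_id IS NOT NULL;

-- Pipeline 2: aggregate the cleansed stream; a sink connection downstream
-- could deliver this to a warehouse table or another topic.
INSERT INTO revenue_by_customer
SELECT customer_id, SUM(amount) AS total_amount
FROM orders_clean
GROUP BY customer_id;
```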
[00:37:52] Unknown:
As more people start using AI for projects, 2 things are clear. It's a rapidly advancing field, and it's tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI powered apps. Attend the dev and ML talks at Nodes 2023, a free online conference on October 26th featuring some of the brightest minds in tech. Check out the agenda and register today at neo4j.com/nodes. That's n e o, the number 4, j.com/nodes.
And to the point of the APIs integrating with people's existing workflows, and also the much maligned ClickOps approach of just do it in the UI, that also brings up the question of, I guess, the control surfaces of where do you want people to be thinking about how to interact with Decodable? Should it be fully managed via their data orchestration system? Does Decodable want to own the whole interaction? You know, should Decodable be something that people don't even know exists and it just plugs away and does its thing? I'm just curious how the thinking around that also influences the ways that you think about how to present Decodable and the types of control capabilities that you want native to the platform versus just exposing those capabilities to the other orchestration systems that people wanna bring to bear?
[00:39:34] Unknown:
You know, I think it's a really good question. And this is a place I think where streaming is a little bit different from the batch side. So, like, we don't necessarily think in terms of orchestration because these things, although I suppose there's a life cycle. Right? Like, you start a connection, you start a pipeline, it runs forever, until you have to, like, update it, you know, or something like that with, like, a new version of the code or something. But, I mean, so the orchestration doesn't look like an Airflow. The control surface, to your point, doesn't look like an Airflow. It looks a whole lot more like a code deploy. So, like, we have customers today who are like, we think about this as being something we do through Terraform, you know, and, like, that's the way in which they manage or want to manage connections and pipelines and those kinds of things. We also have customers who think about this as, like, I'll be honest with you. Like, I think it's part of, like, scripting, like, inside of a Makefile. Like, I know at least a couple of customers that run, like, make pipelines, and, like, what it actually does is call our CLI tool to, like, go, like, you know, sort of provision a bunch of things. And then somewhere, I shouldn't rat people out, there's definitely a ClickOps demographic.
You know what I mean? Like, I think there are people who are like, look, man. Like, sure. I should do all that, but, like, what if I just run decodable create stream, you know, decodable create connection, you know, from the command line, or I go into the UI and I, like, plug in my Kafka parameters and click start. You know? And, like, there are people who do it. And then, typically, they do go back and, like, they figure out how to do it. And then there are people who have intensely sophisticated internal infrastructure that is part of their own sort of deployment and development stuff, development processes.
And they just directly touch REST APIs. They really do. And so for some people, their developers are using Decodable and they don't even know. So, like, they have this: I define some SQL and some connection information inside of a YAML file, and then, like, that gets committed to, like, Git or GitHub, and something wakes up and takes that YAML file, parses it, and then makes a bunch of API calls to Decodable, which is plus or minus what our CLIs do, give or take. But I think that the important thing for us is that it almost doesn't matter. I mean, we have an opinion about sort of, like, what we think are, like, acceptable ways to run this stuff in production. But, I mean, like, I don't know if you've ever tried to tell a bank how they should do code deploys, but, like, let me tell you, like, it is a losing proposition.
I think as vendors, like, we should have a vision. We should have a way of thinking about things. So, like, if you don't have an opinion, we'll tell you, like, well, here's, like, a good way to do it. But, I mean, you have to fit the internal platforms that people have built on top of you within a bank, an insurance company, a retailer. Otherwise, you don't get to sell to them. Right? You just don't. If you're like, well, you have to do it this way, your opinion does not matter, you know. And I think, I mean, I think that that to me is developer centricity and empathy. It's, like, asking, like, how do you guys wanna see this? Like, and if you have no opinion, then, like, here's 3 things we see customers do. You can pick one. But, you know, I think we constantly get feedback around that. I would say that most people use some portion of the UI during the development process, some portion of the CLI for deployment and things like that, and then maybe they use it directly or maybe they build tooling on top of it. You know? But, like, it's approximately like that. And then, like I said, dbt is kind of like a special workflow, and, you know, Terraform is probably a special workflow. But those things are, like, approximately built, they are built on top of our APIs. Right? So, like, that's not, like, really that special, but, like, it's a workflow. It's a lifestyle. It's not just the technology. It's a lifestyle.
[00:43:43] Unknown:
Yeah. It's funny how they grow to be that way. And another element that we haven't dug into yet, but that keeps popping up throughout, is these connections where you have the source systems you're pulling data from. You have the destination systems you're pushing data into. And earlier on, that was, I have a lot of source systems. I have one destination system. So all of the complexity goes into the destination system, and I just make all the source systems look mostly the same. That's the approach of things like Airbyte, Singer, etcetera. Now that set of destination systems keeps growing, and then the idea of reverse ETL just ballooned it even further, and I'm wondering how you're thinking about how to make that a manageable problem without having to just spend all of your engineering time on writing connectors for all the different source and destination systems, and then the whole rest of the system just stagnates.
[00:44:33] Unknown:
Yeah. No. I I think anytime there's connectors you know, prior to starting in Decodable, I was at Splunk, and Splunk has, like, this incredible long tail of connectors for sort of various various systems. And so I think anytime you're building connectors, you have to be careful of the long tail problem. So I think that, like, we look for points of leverage in this, and we think customers should look for points of leverage in this. There are natural aggregators, and what I mean by that are sort of places where things natively integrate and get sort of first class support. And I think that there are 2, maybe 3 natural points of gravity in the data platform.
The operational database, the event streaming layer, and the data warehouse. Right? Those are your 3 natural aggregators. There are an increasing number of systems, obviously, that support systems like Snowflake and things like that. There's obviously a bunch of things that directly support systems like Postgres, you know, MySQL, Mongo, Cassandra, whatever your flavor is. And there are increasingly a bunch of things that support Kafka, or Kafka, Pulsar, Kinesis, whatever, as natural aggregation points. Like, as an example, getting, like, the equivalent of NetFlow data out of something like AWS, you can get that on a Kinesis stream. Right? And, like, you know, or, like, security audit logs, you can naturally get, like, in a, you know, in a Kinesis stream. And so, like, we think that those things are sort of natural aggregators. I should also include object stores, S3 and GCS and those kinds of things. And so I think to whatever degree we can, we spend most of our time thinking about the natural aggregators and then how to get data in and out of those natural aggregators.
Then some of those are natural source systems and natural sink systems. So the reverse ETL thing... listen, I don't think the data warehouse is the right source to drive operational systems. I will die on this hill. Typically, that team is not on call. They don't have the same uptime SLAs. Feeding microservices from your data warehouse just doesn't make a ton of sense to me. Now, I think there are vendors who will always try and consolidate in those places, but they're not going to meet the SLAs on latency. They're not going to meet the SLAs on a bunch of these other things. Uptime, maybe.
I won't speculate on that. So I think that a lot of the things that are done through reverse ETL should almost certainly be driven from the event streaming layer. If you're getting operational events from Postgres databases or Kafka, that should not pass through the data warehouse on its way to a HubSpot or a Salesforce. It can also go to the data warehouse. Right? So I think the event streaming layer and the stream processing layer is the point where you actually do distribution of this data to terminal systems, of which the data warehouse is mostly a terminal system.
There are exceptions; there's model training and those kinds of things that happen there. But this push toward, well, let's sync faster to and from the data warehouse, or at least from the data warehouse... I think to the data warehouse probably makes sense, but from the data warehouse, I think, is a losing battle. It isn't cost efficient, and I don't think that approach holds up. So, I mean, this is a horrible answer to your question, but I'm down this path and I'm not letting go of the bone. I think the stream processing layer is really the right place to do this collection, cleanse, and distribution.
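As a rough sketch of what that distribution looks like in practice, a single cleansed stream can fan out to both a terminal warehouse sink and an operational sink in one job. This is illustrative Flink-style SQL only; the orders_cleansed source and both sink tables are invented names assumed to be declared elsewhere:

    -- One cleansed stream, two destinations: the warehouse stays a
    -- terminal system while the operational tool is fed directly from
    -- the stream.
    EXECUTE STATEMENT SET
    BEGIN
      INSERT INTO warehouse_orders        -- e.g. a Snowflake or S3 sink
        SELECT * FROM orders_cleansed;

      INSERT INTO crm_order_events        -- e.g. a sink feeding a CRM tool
        SELECT customer_id, order_id, amount
        FROM orders_cleansed
        WHERE amount > 0;
    END;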
And then there can be further refinement of data inside of the data warehouse. I think that makes sense. Like I said, I think I did a poor job of answering your question, but that's sort of where my brain goes when I hear that. No, I think that's a great point, and I appreciate your
[00:48:56] Unknown:
opinions on this matter. It's one of those things where people hear something, it mostly makes sense, but they don't take the time to evaluate it in detail, and so that's how things grow. Somebody says, oh, of course reverse ETL comes from the data warehouse, because where else would it come from? So I think that's definitely a useful insight for people, and there are lots of interesting things to dig into. On that point of the streaming system being what's responsible for pushing to the reverse ETL destinations, HubSpot, Mailchimp, Salesforce, what have you: the converse to that is, well, it needs to go through the data warehouse because that's where all my other data lives, and that's how I enrich those records.
And to which I assume the response is, well, then you just use the data warehouse for the enrichment, but you pull that through the streaming system to inline that data into the event as it comes from Postgres on its way to whatever its eventual destination is. I'm wondering if you can give a bit more color to that.
[00:49:52] Unknown:
Yeah. I mean, this is where I think it gets interesting. The question then becomes, well, you could do the enrichment in the data warehouse, but if the streaming layer is the thing that actually collects data from all of the operational systems and is feeding the data warehouse, it is also capable of performing that enrichment for the data going to those other systems. So I think it is very unlikely that there is one place that fits the uptime, the SLAs, the latency requirements, the throughput requirements of all these things. So I don't think it's actually a bad thing to be able to do enrichment with operational information from a Postgres in the streaming layer and also to have that same data inside of the data warehouse. Right? A lot of our customers will use Decodable, or something like it if you don't use Decodable, to collect data from all these operational systems. They put effectively the raw data, or mostly raw data that is slightly massaged, into the data warehouse, and batch workloads inside of the data warehouse can, of course, use that data directly.
But when they are feeding online operational systems, like transactional messaging with a Mailchimp or a SendGrid or something like that, they may also be doing that enrichment in the streaming engine. This is why we support things like joins and all of this kind of sophisticated, stateful processing inside of the stream processing layer: so that you can do data enrichment at the streaming layer and then directly feed the operational system, effectively bypassing the data warehouse for that use case.
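A minimal sketch of that kind of in-stream enrichment, again in generic Flink-style SQL with hypothetical names (the orders_cdc and customers_cdc sources and the sendgrid_events sink are stand-ins assumed to be declared elsewhere), might look like this:

    -- Enrich raw order events with customer attributes from an
    -- operational table before feeding a transactional-messaging tool
    -- directly, bypassing the warehouse on this path. This is a regular
    -- (stateful) streaming join over two change streams.
    INSERT INTO sendgrid_events
    SELECT
      o.order_id,
      o.amount,
      c.email,
      c.plan_tier
    FROM orders_cdc AS o
    JOIN customers_cdc AS c
      ON o.customer_id = c.customer_id;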
So I don't think people need to be thinking about this as either-or. And I'll be honest with you, people go, well, what about data duplication, and what about logic duplication? And our point is, well, that's why we support SQL. The same SQL that you can run in the data warehouse, you can also run in the stream processing layer. That's why that is important. That's why we also support things like dbt, so that you can actually push that processing upstream when you need to, not all the time, but when you need to in order to fulfill that. And the upside is that you get decreased latency, better cost efficiency, richer data that may be coming from even more systems, at scale, in the hot path. And then you can feed the microservices.
You can feed SendGrid or Mailchimp. You can feed all of these other systems in parallel, and we've tried to make that as painless as possible. But I think people, engineers in particular, myself included, will sometimes over-rotate on, well, you're sort of repeating yourself, and these kinds of things. But for online operational infrastructure, which is what that is, transactional messaging or updating records in Salesforce, having your sales team have up-to-date information about who's inside of your product or something like that, that is an operational use case, not an analytical use case.
It supports analytics, but the infrastructure there is operational. And I think we're probably over-rotated on these issues of, well, but you're storing that record twice. And it's like, if you think that's bad, wait until you learn what indexes in database systems are. Right? They're really just different organizations of a lot of the same data, or materialized views, or any of these other things. They're just different copies of the data. So what we should be thinking about is, how do I meet the operational requirements, the cost requirements, or the cost optimizations, especially in this macroeconomic situation. I think this architecture actually has so many advantages, and it is not dissimilar from what folks like Jay Kreps talked about back in, what, 2014 with the Kappa architecture and all these other kinds of things. That's what it's getting at: you can actually drive multiple systems with the same data, which allows those systems to be optimized for specific use cases and workloads.
[00:54:18] Unknown:
And as you have been building Decodable, working with customers, helping them figure out how best to apply streaming to the problems that they're approaching, what are some of the most interesting or innovative or unexpected ways that you've seen your product used?
[00:54:33] Unknown:
I'm frequently surprised by what people are doing with Decodable. I mean, we definitely see what I would call bread and butter use cases: the filter, route, transform, aggregate, trigger type of use cases that happen in the stream processing layer. But one of the things that I have come to really appreciate, and I don't think this is all that unique these days, but at the time it surprised me how simple it made things, is this change data capture from an operational database system, through the stream processing engine for munging of that data, and then pumping that back into other operational systems like caches and full text search. So we work on a use case with a company that takes in resume data from MySQL, does a bunch of parsing and cleansing and transformation, and then indexes that data back into Elasticsearch to make resumes searchable by, basically, different features of a candidate. And this process wound up being zero code for them. It's one SQL statement that does very low latency processing of this data and then indexes it back into the full text search system.
And if you extend that to things like caches, Redis and those kinds of things, or ElastiCache, you actually start to see all this opportunity to create what amount to materialized views that are optimized for different workloads. Again, that's probably not super interesting these days, but the kinds of applications that you can build and how fast you can build them, that's the thing that surprises me. We worked with a telco provider to take data out of a legacy ERP system, which I won't name, with hundreds of tables, and cook that data for fast lookup by things like customer ID so that it can be served out of specialized systems like caches to support different kinds of applications.
And what would have been weeks of microservice development and batch processing stuff winds up being, I don't know, a day or two of a pretty gnarly SQL statement. That thing had 17 joins and correlated subqueries. It was 220 lines to take this legacy ERP, highly third-normal-form database schema and cook a denormalized record for fast lookup by customer ID, so you can see all the devices and all these other kinds of things. So the speed and the time to market, I think, is the thing that is just so interesting.
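For flavor, a drastically simplified version of that "cook a denormalized record for fast lookup" pattern, in the same generic Flink-style SQL as above with invented table names and only a couple of the many joins the real job had, might look like this:

    -- Flatten a normalized schema into one denormalized record per
    -- customer so a cache or key-value store can serve it by customer ID.
    -- customers, addresses, devices, and customer_lookup are assumed to
    -- be declared elsewhere (CDC sources, and a sink that accepts
    -- updates/upserts, e.g. toward Redis or ElastiCache).
    INSERT INTO customer_lookup
    SELECT
      c.customer_id,
      c.name,
      a.city,
      COUNT(d.device_id) AS device_count
    FROM customers AS c
    LEFT JOIN addresses AS a ON a.customer_id = c.customer_id
    LEFT JOIN devices   AS d ON d.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, a.city;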
But, man, people get creative. I'll tell you that. They can do some stuff that surprises me, the scale of it or the complexity of it, and the fact that it kind of just works for people. I would have killed for this stuff 10 years ago. I really would have. But, yeah, I think every week we probably see something new and interesting. It's exciting.
[00:58:03] Unknown:
And in your work of building this system, building the product, working in this ecosystem, what are the most interesting or unexpected or challenging lessons that you've learned personally?
[00:58:14] Unknown:
Oh, man. That's a good question. I think the big thing is, especially as an engineer, you look at these open source projects, and they are so sophisticated, so powerful. Debezium, Flink, Kafka. And everybody kind of goes, yeah, our stack is Debezium plus Kafka plus Flink. Those two pluses are doing a lot of work in that sentence. And the thing that has surprised me is just how much time and effort we spend on the glue between these otherwise fantastic LEGO bricks. So I'll be honest with you, here's probably the dirty secret about Decodable: our differentiation is not that we're three milliseconds faster than the other guy. Our differentiation is all the glue in between those systems, the DevX, the deployment, the APIs, the orchestration, the observability.
Oh my goodness, the observability alone around data pipelines in a streaming context. Data quality looks different in a streaming context. It functions differently in a streaming context. So a lot of those kinds of things are just much harder than I would have expected, and maybe that's me being naive. But the joke is like the old "draw the rest of the owl" meme, where step one is draw two circles and step two is draw the rest of the owl. I feel like that with: step one, deploy Kafka, Flink, and Debezium; step two, build the rest of the platform. The amount of time and effort is just daunting, especially given that this is our full-time job here at Decodable. We spend all our time thinking about it. And if anything, it sort of shows the need for these data platform teams inside of organizations and the value that they provide. Because the usability difference between "here are three open source projects, good luck" and "here are the three open source projects with all of the tooling and the glue and the UX and the observability around them" is the difference between being successful with these platforms and not. It really, really is.
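To give a tiny, concrete taste of one of those pluses, here is roughly what teaching Flink to read Debezium-formatted change events off a Kafka topic looks like in generic Flink SQL. The topic, broker address, and schema are placeholders, and the real glue (deployment, orchestration, observability, schema management) is far larger than this snippet:

    -- One small piece of the "Debezium plus Kafka plus Flink" glue:
    -- declare a table over a Kafka topic carrying Debezium-formatted
    -- change events so downstream SQL can treat it as a changelog.
    CREATE TABLE users_changes (
      user_id BIGINT,
      email   STRING,
      status  STRING
    ) WITH (
      'connector' = 'kafka',
      'topic'     = 'dbserver1.public.users',       -- placeholder topic
      'properties.bootstrap.servers' = 'kafka:9092',
      'format'    = 'debezium-json',
      'scan.startup.mode' = 'earliest-offset'
    );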
[01:00:43] Unknown:
Absolutely. And so with that context, for people who are considering Decodable or considering streaming, what are the cases where that's the wrong choice?
[01:00:52] Unknown:
I don't think that people should think about, excuse me, Decodable or stream processing as, I'm gonna shut down my data warehouse. Like, I'm gonna take my Airflow, dbt, and Snowflake and shoot it into the moon. That's just not what's gonna happen, and we don't make that claim. I think that stream processing, at its core, is something that typically sits between two other systems, some source and some destination, and processes data to make it available in that destination. I don't actually think it's meant to be a serving system. So we purposely decouple ourselves from an Apache Pinot or an Imply or Snowflake or Databricks or an S3 or a Redis, because we believe you actually want those as two separate systems, so that you can cook the data and then serve it in whatever storage and query engine makes the most sense for a particular workload.
So I don't think people should be saying, I'm going to replace Snowflake with Decodable, or Databricks with Decodable, or Postgres with Decodable. That is not how you should think about stream processing, so don't do that. But there are plenty of other cases where, if you're thinking about how do I get data from Postgres through the outbox pattern to be available for microservices, or into HubSpot or something like that, or into Snowflake or S3, those are the places where Decodable and stream processing fit.
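As one hedged illustration of that outbox-pattern shape, sticking with the same generic Flink-style SQL and invented names (an outbox_cdc change stream over the application's outbox table and an order_events_topic Kafka sink, both assumed to be declared elsewhere):

    -- The outbox pattern, sketched: the application writes domain events
    -- to an "outbox" table in Postgres, CDC picks them up, and the
    -- stream processor routes them to a topic that microservices or
    -- sinks like Snowflake and S3 can consume.
    INSERT INTO order_events_topic
    SELECT aggregate_id, event_type, payload, created_at
    FROM outbox_cdc
    WHERE aggregate_type = 'order';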
[01:02:37] Unknown:
And as you continue to build and iterate on and scale the Decodable platform, what are the things you have planned for the near to medium term, or any particular problem areas or projects you're excited to dig into?
[01:02:49] Unknown:
Yeah. I think there's a lot we can do. There's some stuff that's evergreen: we'll always be adding new connectors, we'll always be improving the SQL dialect, we'll always be enriching the APIs to give people more expressive capabilities. I think there are some higher order functions we can provide, like more sophisticated management of time and those kinds of things, that codify some standard ways people use this technology and basically make it easier to build the jobs that run on top. Developer experience, I think, is always top of mind for us. It is not where it should be. There are certain things that are probably harder to do than I wish they were.
Think operational flows: making a backward-incompatible schema change, for instance, is a thing that sometimes needs to happen. You should avoid it at all costs, but when it does need to happen, it's messier than it should be. Reprocessing of data, I think that's an area, and bootstrapping net new things. Like, I want to bring up a new Elasticsearch instance, replay the last 12 months of data or something like that into it, and then switch over to the stream. You can do those things today, but I think they can be easier, so we're thinking about those kinds of things. And then there are probably a couple of things we're working on that we're not quite ready to talk about, that we think will be really exciting for people, that are a little bit more nascent, maybe a little bit more moonshot. We'll figure out when the right time is to come out with those things. Absolutely. Interesting stuff. Are there any other aspects of the work that you're doing at Decodable, the overall
[01:04:34] Unknown:
ecosystem of stream processing, and when and how to apply it that we didn't discuss yet that you'd like to cover before we close out the show?
[01:04:41] Unknown:
Honestly, I feel like this is probably one of the better overviews; I really appreciate the questions. I think we probably hit all the critical stuff. I would say that a lot of people are still struggling to understand where stream processing fits into the stack and which workloads should move to it, and we're probably spending more and more time trying to think about how to help people understand that. At its core, people can think about this as low latency ETL that can power both analytical systems and microservices.
I wouldn't stress about it much harder than that. But operational-to-analytical and operational-to-operational data infrastructure is probably the mental model that people should have for stream processing. And that might be one of the areas where we have some more work
[01:05:41] Unknown:
to do, and it's probably an evergreen topic to talk about when we talk about this stuff. But, otherwise, it's been a real pleasure, man. I appreciate it. Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:06:02] Unknown:
Oh, man. What is the biggest gap? I think the largest portion of the problem here is really one around developer experience, and my anecdotal evidence for that is that every team at every Fortune 500 company has what I would call the company-specific data platform layer that glues this set of things together in a way that makes sense for a particular company, for a particular team. And I don't think there are yet developed standards the way we have in application development, the layers above. I think that creates so much six-of-one, half-a-dozen-of-the-other, different-for-the-sake-of-being-different development effort that burns time and money, and I think that hurts people.
I don't know the solution to that, but, man, I wish we could come up with better patterns and practices for this stuff, encoded in a way that vendors can then more tightly integrate with, so
[01:07:11] Unknown:
that there is less work for these data platform teams to do. Because they really are doing yeoman's work out there. It's rough. Absolutely. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Decodable and the changes that have happened since we last spoke. I definitely appreciate all of the time and energy that you and your team are putting into making stream processing a more tractable and approachable problem for people who don't want to have to invest their entire lives into it. So, I appreciate that, and I hope you enjoy the rest of your day. Thank you so much for having me. It's always a
[01:07:46] Unknown:
pleasure. Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Eric Sammer: Starting Your Stream Processing Journey
Eric Sammer's Background and Journey in Data Engineering
Overview of Decodable and Its Purpose
Complexity in Streaming Systems
Adoption of Streaming Data in Businesses
Distinguishing Event Streaming and Stream Processing
Product Journey and Adoption Curve of Decodable
Supporting Both SQL and Code in Decodable
Developer Experience and Onboarding with Decodable
Control Surfaces and Integration with Existing Workflows
Managing Source and Destination Systems
Stream Processing for Operational and Analytical Workloads
Interesting Use Cases of Decodable
Lessons Learned in Building Decodable
When Not to Use Stream Processing
Future Plans for Decodable
Closing Remarks