Summary
Streaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He also explains why he started Decodable to address that limitation and the work that he and his team have done to let data engineers build streaming pipelines entirely in SQL.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Eric Sammer about Decodable, a platform for simplifying the work of building real-time data pipelines
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Decodable is and the story behind it?
- Who are the target users, and how has that focus informed your prioritization of features at launch?
- What are the complexities that data engineers encounter when building pipelines on streaming systems?
- What are the distributed systems concepts and design optimizations that are often skipped over or misunderstood by engineers who are using them? (e.g. backpressure, exactly once semantics, isolation levels, etc.)
- How do those mismatches in understanding and expectation impact the correctness and reliability of the workflows that they are building?
- Can you describe how you have architected the Decodable platform?
- What have been the most complex or time consuming engineering challenges that you have dealt with so far?
- What are the points of integration that you expose for engineers to wire in their existing infrastructure and data systems?
- What has been your process for designing the interfaces and abstractions that you are exposing to end users?
- What are some of the leaks in those abstractions that have either started to show or are anticipated?
- What have you learned about the state of data engineering and the costs and benefits of real-time data while working on Decodable?
- What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
- When is Decodable the wrong choice?
- What do you have planned for the future of Decodable?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Decodable
- Cloudera
- Kafka
- Flink
- Spark
- Snowflake
- BigQuery
- Redshift
- ksqlDB
- dbt
- MillWheel Paper
- Dremel Paper
- Timely Dataflow
- Materialize
- Software Defined Networking
- Data Mesh
- OpenLineage
- DataHub
- Amundsen
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey. And today, I'm interviewing Eric Sammer about Decodable, the platform for simplifying the work of building real time data pipelines. So Eric, can you start by introducing yourself? Yeah. Absolutely. My name is Eric Sammer. I'm the CEO and the founder here at Decodable.
[00:01:45] Unknown:
And like you said, I mean, our job is to make stream processing accessible to mere mortals. What we like to joke about is that it's no longer the realm of PhDs and PhD dropouts. That's sort of where we wanna be able to help people.
[00:01:57] Unknown:
Absolutely. And do you remember how you first got involved in data management?
[00:02:00] Unknown:
Yeah. Like it was yesterday. In fact, you know, I originally wound up in a lot of ad tech out of New York City, which is where I'm originally from, and built a lot of internal systems for ad targeting and ad serving and campaign management and those kinds of things. Eventually, you know, sometime around 2009 or 2010, I got connected to the founding team at Cloudera, and that's sort of where I made the flip from building internal systems to being more sort of on the, you know, the evil vendor side of the equation and sort of building enterprise software for other people. You know, sort of ever since then, you know, like late 2009 or early 2010, I've mostly been involved in open source data infrastructure, everything from, you know, the Hadoop world, you know, to sort of nowadays probably, you know, more Kafka, Flink, Spark, you know, that sort of universe of technologies, and have been working with that stuff ever
[00:02:58] Unknown:
since. And that brings us to what you're building at Decodable. I'm wondering if you can describe a bit about what it is that you're focusing on there and why it is that you decided to spend your time and energy on simplifying the work of interacting with and building platforms on top of streaming architecture.
[00:03:16] Unknown:
Yeah. Absolutely. It really came from the last, you know, 10, 12 years or so, watching people, you know, build this sort of first generation of what I think, you know, we wound up calling, like, the big data ecosystem. And initially, it was really focused on analytical workloads and, like, the data warehouse and sort of growing and scaling that, you know, was sort of popularized, at least in the open source world with things like Hadoop and all that other kind of stuff. But increasingly, as that world sort of worked itself out, what we saw was, like, the advent of, you know, of Kafka and, like, this renewed interest in, you know, today, what we probably call real time data engineering or stream processing and that universe, but historically has been messaging and event driven architectures and, dare I say, service oriented architectures, as we would have called it, you know, in sort of the mid-2000s or so. It's not strictly about just moving data around, but it's actually, you know, caused this resurgence in building these, like, reactive or sort of event driven applications.
And I think it's sort of interesting to see the adoption of Kafka in the world, you know, and what that has done, especially with the, you know, birth of things like Kubernetes and microservices. So now you have sort of this decentralized way of building applications which solve message passing and things like that. And it is maybe not easy, but much easier to stand up and build individual microservices that are sort of responding, you know, to API calls and event data. So the 2 things that we saw were, 1, as people started to do that, all of a sudden, this sort of system of record moved to the stream.
It really became the event stream that was coming out of applications, whether that was click stream data, whether that was explicitly instrumented events in sort of various kinds of applications. And that data was kinda teeing off in 2 directions. 1 was it became the ingest stream into the analytical systems. You know, these days, that's Snowflake, BigQuery, Redshift, Athena, S3, all these other kinds of things. But also feeding a bunch of real time microservices, ML pipelines for scoring and enrichment, and all these other kinds of things. The short answer is that, like, right now, it is kind of where, you know, the big data ecosystem was in 2010 or the relational database ecosystem in the eighties where it's really low level. Right? We're asking engineers to deal with, like, messaging guarantees and distributed checkpointing systems, you know, job recovery semantics.
And, like, most engineers either don't want to or just don't have the time to deal with that level of complexity. They kind of want something that is simpler and kind of just does what it's supposed to do on the tin. And so, you know, while lots of people are building this stuff from Kafka and Flink and Spark Streaming and all these other kinds of systems, which are all great systems, but they're really low level building blocks. And so we said, well, you know, can we take advantage of what has happened in the open source ecosystem? But then add back the higher level data engineering challenges, like making schema management possible, making reprocessing of data possible, you know, and really up leveling this away from dealing with, like, clusters, Java class loaders, and messaging guarantees and toward SQL pipelines.
[00:06:52] Unknown:
Not that we love SQL, but everybody knows it, you know, and enable people to sort of quickly build real time data pipelines the way they do in the batch world so that we get away from this idea that, like, streaming is just way too hard for people or isn't worth it. Does that make sense? Yeah. It absolutely makes sense. And I think your mention of SQL is interesting as well because, yeah, nobody well, some people might love it. A lot of people don't, but it's, you know, the best tool we have for the job because everybody knows it, and it has become the entrenched player in the space. And so you have to use it in order to be able to take advantage of it because the ecosystem has gained so much momentum similar to where Kubernetes is starting to go within the infrastructure space.
I think it's interesting too that I remember, you know, when I was first starting this show maybe about 4 years ago now, there was a lot of conversation about the idea of streaming SQL and adding different semantics to SQL to be able to do continuous evaluation of queries, and then that subsided for a while. And now SQL on top of streaming engines has gained a sort of a new focus, but without necessarily focusing on that continual evaluation of the query with things like Flink SQL and Spark SQL and ksqlDB and all of these different products.
[00:08:08] Unknown:
Right. I mean, I'm with you 100%. I mean, look, you know, like I said, I think SQL is, like, deceptively simple or deceptively complicated depending on how you look at it. And when people say they know it, what they really mean is that they know sort of, like, the bare sort of minimum of that. I mean, you know, I have been dealing with it for a very long time. You know, I have lots of gray hair over it. And every time I have to, like, write a window function, I still have to look up the syntax over certain things. Right? So, like, there's just, like, endless complexity there. What I will say, though, is the time to value. Like, with my vendor hat on just for a second, like, the fact that somebody can do a hello world ETL job, you know, in, like, 30 seconds or less or your money back, you know, kind of situation is high value, not just for us because, you know, it sort of shows well, but also for people who I mean, like, let's face it. You know, 80% of the data engineering jobs that we write are stripping PII data, filtering data, routing data to the right place. It's not necessarily launching rockets into space. There's lots of complex business logic that goes on in it. Right? I'm not trying to badmouth it. I'm just saying that, like, we over rotate on these, like, hyper interesting and sort of theoretically, you know, exciting pieces of stream processing semantics when in fact 80% of the jobs are just, like, I need to rename a field and, like, destructure or normalize, like, a nested JSON blob. I think we've, in a lot of ways, as engineers are wont to do, like, we've overcomplicated things, you know, for sure. And so our perspective is, look, we don't think that we replace Spark or Flink or any of these things. Like, if you need to build really complex, like, online ML infrastructure, you need sort of, like, low level access and that level of API sophistication.
That said, like I said, 80% of the time, like, prepping the data to get into 1 of those systems or, you know, POC ing, you know, a new data product or product recommendation system or something like that. That actually doesn't have any aggregations. It often doesn't have joins. Right? It's often just the dumbest thing in the world, and what you want is for it to just work. And so we've taken a very pragmatic view here, which is, like, yes. We will go deal with, like, all the nuances of, you know, how to make the engine sort of do the right thing. But for the most part, people just don't bump into that stuff unless they're doing the really crazy stuff. And there, it's about being open and speaking defacto open source standards and integrating with sort of all the tools and technologies that you already use today so that you don't have to, like, replatform, and that we don't have to be sort of the hammer for every nail, if that makes sense.
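As a rough illustration of the "data janitor" jobs described above, here is a minimal sketch in Flink-style streaming SQL. The stream and field names (raw_orders, orders_clean, payload, and so on) are hypothetical, and function availability (JSON_VALUE, SHA256) varies by engine and version.

```sql
-- Hypothetical "data janitor" pipeline: rename a field, mask PII,
-- pull values out of a nested JSON blob, and filter records.
INSERT INTO orders_clean
SELECT
  order_id,
  user_id AS customer_id,                                    -- rename a field
  JSON_VALUE(payload, '$.items[0].sku') AS first_item_sku,   -- destructure nested JSON
  CAST(JSON_VALUE(payload, '$.total') AS DECIMAL(10, 2)) AS order_total,
  SHA256(customer_email) AS customer_email_hash              -- strip raw PII, keep a stable key
FROM raw_orders
WHERE region = 'EU';                                         -- route only the records you need
```

Nothing about that query is streaming-specific, which is the point: the same SQL a data engineer would write against a warehouse table becomes a continuously running pipeline.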
[00:10:58] Unknown:
And so in terms of the focus of what you're building at Decodable, who are the main group of users that you're focusing on, and how has that persona informed the prioritization of the features that you're building at launch?
[00:11:12] Unknown:
Yeah. No. It's a great question. So the 2 groups of people we see are application developers and data engineers. They sort of map to the 2 use cases that we see. 1, where Decodable is sort of a means to collect a whole bunch of data and pump it into various analytical systems, like the Snowflakes and the BigQueries and, like, all those kinds of things. And there, we're sort of, like, a very, very real time data ingest system that has, like, all of the stuff that you would expect, you know, with SQL and connectivity and that kind of stuff. The application developers tend to be working more around, like I said, building these event driven architectures or reactive architectures, whatever the buzzword du jour is on this stuff, where for security, for instance, it might be watching all failed logins and looking for dictionary attacks or looking for, you know, sources of, you know, source CIDR blocks where they're coming from, and then, like, automatically inserting firewall rules to, like, shut down, you know, API access in a web application firewall, like a WAF or something like that, from a particular CIDR block. You know, it could be product recommendation systems. It could be enrichment systems.
You know, it could simply be real time user facing analytics, you know, where you're using a social network and they're recommending content, you know, and stuff like that to you. So those are the kinds of systems that we see for application developers. And in terms of, like, what that does for us in terms of priority, like, 1 thing's for sure: it's developers. We're not as concerned with sort of the business analyst kind of persona. We are more concerned with, you know, data engineers who wanna store pipeline definitions and, you know, those kinds of things in Git and manage it through GitHub or GitLab or something like that and deploy pipelines as part of, you know, CI/CD processes and stuff like that. So we've designed the system around that, at least right now. I think we will have a UI in the future, but right now, we don't even have a UI.
Our 2 sort of things that we give people are a REST API to sort of, like, dynamically sort of, you know, create pipelines and activate them, or a command line tool that is, you know, somewhere and sort of accessible to people who know things like dbt, you know, where they're used to working in a particular way. And we wanna make sure that we're not sort of asking people to relearn, you know, how to build campfires or something like that. I think, you know, as we do more around pipeline observability, things like data quality and malformed data handling and sort of visualization of performance characteristics of pipelines, there, we get into going, like, okay, it makes sense to sort of have a UI to allow people there. But we actually specifically didn't start with drag and drop pipeline builders because every engineer in the world goes, like, well, that's not how I work. You know? And I think, like I said before, you know, we're sort of really practical around this. We made sure that we can knock those data prep and sort of ETL style use cases out of the park. You know, there are other vendors and open source projects working on, like, really fancy real time analytics out to, like, the dashboard and stuff like that. I think that stuff's really exciting.
I think we're natural sort of, like, feeders of those systems. But our sweet spot is really around, I need to prep data. I need to clean it. You know, it's data janitor work, you know, which I think is underserved. And there, the main features are when you say activate pipeline, it just works. You know, when you submit a chunk of SQL to us, we are going to ensure that you're not referencing columns and fields that don't exist. We're gonna tell you what the quality of the data looks like. We're gonna make, you know, the actual process of deployment and sort of previewing what that pipeline does really easy. So it's built for people like you and me who sort of have to build these pipelines on a daily basis.
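To make the failed-login example from earlier in this answer concrete, here is a hedged sketch of what such a detector might look like as a single streaming SQL pipeline, written with Flink-style group windows; the stream names, fields, and threshold are all illustrative assumptions, not anything from the episode.

```sql
-- Hypothetical detector: count failed logins per source CIDR block in
-- one-minute tumbling windows and emit candidates for a firewall/WAF rule.
INSERT INTO suspicious_cidrs
SELECT
  source_cidr,
  COUNT(*) AS failed_attempts,
  TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end
FROM login_events
WHERE outcome = 'FAILED'
GROUP BY
  source_cidr,
  TUMBLE(event_time, INTERVAL '1' MINUTE)
HAVING COUNT(*) > 100;  -- the threshold here is an arbitrary illustration
```

A downstream service could then consume the suspicious_cidrs stream and insert the actual block rules.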
[00:15:11] Unknown:
In terms of the complexities that crop up when you're dealing with these different streaming systems, you've touched on a few as far as, like, the exactly once or at least once or at most once semantics and, yeah, you know, dealing with back pressure. And what does that even mean? And how do I need to deal with that and the systems that are feeding into the streaming system and all of the sort of distributed systems complexities that come up with it. I'm wondering, what are some of the main pain points that engineers run up against when they're trying to build on top of streaming architectures and some of the reasons that they might throw their hands up and give up or, you know, some of the workarounds and wasted engineering cycles that they've had to spend being able to overcome some of those challenges?
[00:15:55] Unknown:
The complexity is not in actually building the pipelines. Right? All of these open source projects, you know, everything from Kafka Streams to Flink to Spark, all have, like, reasonably intuitive APIs. You know, you're sort of arguing the last 10% of complexity there. To your question, the complexity comes in from, as it always does, the operational characteristics. How do I make schema changes that are safe? How do I deal with, you know, actual, like, data quality issues? And in streaming systems, it is more complicated than batch systems because a single bad record will basically shut down a pipeline to maintain ordering guarantees and, like, you know, state depending on sort of, like, how the pipeline works, whether or not there needs to be state, and have things partitioned a certain way, which, you know, forces us to establish ordering guarantees and all these other kinds of things. You bump into quality issues. You bump into messaging semantics.
You know, if I had a dollar for every time, I went through the at least once, at most once, exactly once discussion and, like, you know, every time, like, you know, Twitter erupts into a debate about whether or not exactly once can even be a thing and, like, all these other kinds of things. So, like, I think people underappreciate the complexity of, like, job failure semantics and recovery. I actually believe that people underestimate the accuracy of their batch jobs. A lot of people talk about time in streaming as super complicated because of event time versus processing time. But, like, if you zoom out, that problem exists in the batch world. The only thing is that, like, you're crossing your fingers that your processing lag sort of, like, catches all the events. But, like, whenever I have dug into someone's batch infrastructure, you find just as many sort of time problems. So I don't know that that's necessarily exclusive to streaming, but it definitely comes to the fore. You know, people are sort of used to thinking about it, talking about it.
State management, back pressure, and then, like, you know, you get into any of the individual sort of systems that are popular around this. And tweaking and tuning those things is super complicated. I, you know, some number of years ago, wrote a book on Hadoop operations, and, like, 90% of my job there was documenting all the configuration parameters. And, like, if you look at I won't name names, but if you look at any of these systems, they all have, like, a couple 100 parameters that are really, really complicated, including things like perf tuning RocksDB databases that are used for state management and stuff like that. You know? No shade on those systems, they're good systems, but it's complicated.
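On the event-time versus processing-time point above, part of what engines like Flink ask of you is to declare which column carries event time and how much out-of-orderness to tolerate. A sketch, in Flink-SQL-style DDL with hypothetical names and with most required connector options elided:

```sql
-- Hypothetical source stream: event_time is the event-time attribute, and the
-- watermark tells the engine a window can close once it has seen events more
-- than 30 seconds past that window's end; anything arriving later is late data.
CREATE TABLE clickstream (
  user_id    STRING,
  url        STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic'     = 'clickstream',
  'format'    = 'json'
  -- bootstrap servers, startup mode, and other required options elided in this sketch
);
```

Getting that tolerance wrong is exactly the "crossing your fingers on processing lag" problem batch jobs have too; streaming just forces you to write the assumption down.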
And, you know, I would argue with hindsight now, you know, back in 2010, it wasn't that people wanted a distributed file system and MapReduce. They wanted a data warehouse. And, like, when you look at streaming systems today, I have a feeling that we're going to be thinking about it the same way. People don't want a stream processing engine. They want a data engineering platform for real time data. You know? Again, that's me with my vendor hat on. But, like, I think all of that complexity comes to the fore. We love it. That's what we spend our lives working on. And we're super excited about checkpointing algorithms and, you know, consistency.
But, you know, our hope is that most engineers
[00:19:14] Unknown:
don't have to deal with that if they don't really need to deal with that. And continuing on the sort of distributed systems concepts and some of the design optimizations and some of the engineering effort that's required but often overlooked or skipped because of time constraints? How do these different elements of distributed systems and streaming systems in particular start to bubble up when people are trying to build pipelines at a higher level to just get the business needs done, you know, talking again about back pressure. We talked about semantics. Isolation levels are another thing that people will talk about, you know, ad infinitum for people who are deep into the space. You know? Is it, you know, read after write consistency? Is it eventually consistent? You know? What are the delivery time and latency guarantees? What are your, you know, p99 tail latencies for message delivery end to end?
[00:20:09] Unknown:
Yeah. I think all of those things, yes, is the short answer. I think, you know I mean, the stream processing engine and probably inclusive of the modern scale out messaging systems like Kafka as well are really just the sort of, like, you know, almost the purest example of a distributed system where they're just a bunch of processors, you know, performing RPC between 1 another. And we're sort of arguing the correctness of the operations that they're performing. You know, whenever you have 2 machines, you have all the problems around message passing, you know, from the distributed systems literature and FLP, you know, uncertainty principles and, like, all these other kinds of things. There are people much smarter than me who have explored the universe of why those things are hard.
They look like database systems, you know, stateful stream processing systems and messaging systems. So you have all the challenges around durability and, like, what does it mean to actually write something to disk, and you get into whether or not fsync actually does what you think it does, you know, in kernels, you know, and I/O subsystems. And then you throw in the complexity of, like, the cloud where storage, you know, is an even thicker stack than it ever was with sort of object stores and ephemeral disk and attached disk and all these other kinds of things. You know, you have, like, a soup of really complex theoretical challenges to solve. I think there have been some really great papers in this space that we pay a lot of attention to.
The Google Cloud Dataflow team has put out some, like, really, you know, interesting papers in this space. Everything from, you know, the MillWheel paper to, you know, and I would even say the database literature, you know, the Dremel paper and, like, all these other kinds of things that overlap the space. Obviously, UC Berkeley has done a lot of research here. And like I said, I think that, you know, you look at Spark, you look at Flink, you look at GCP Cloud Dataflow, and things like Timely Dataflow, what the Materialize guys are doing, that's really exciting stuff that gets into, like, the nuts and bolts of this. But that is venturing into a land where, you know, you read all those papers and, like, you know, if you're smart enough to internalize it, which I'm not sure that I am, like, as you read it, at the end of it, you put the paper down and you kinda go, like, how do computers even work? Right? Like, what are we even doing here? And you're not sure that anything works, and so then you have to, like, start trimming down to, like, practical terms and removing, you know, adding back real world constraints.
And so I think finding the right set of constraints to add back to make that system practical is where things get super complicated,
[00:23:02] Unknown:
you know, and where you can make good trade offs and where you can make bad trade offs. That's a painful answer. It's probably not even a good answer. But No. That's a great answer. And to your point of, you know, digging through these papers and, like, how do any of these things ever work is about the point where you either double down and dig deeper into it, or you go live on a remote mountaintop away from any technology and never look at a screen again.
[00:23:25] Unknown:
I think we all know what the right answer to that is. That said, prior to choosing that answer, we'll all be here trying to solve these problems.
[00:23:36] Unknown:
And then in terms of what you're building at Decodable, I'm wondering if you can talk through some of the architecture that you've settled on for being able to paper over some of these complexities and provide a robust and understandable API to end users while still being able to actually I'm not gonna say guarantee any of these capabilities because, you know, guarantees in computer systems are, you know, worth about as much as the paper they're written on. But, you know, at least do best effort of being able to, you know, maintain the uptime and correctness of these systems that you're exposing to the engineers who are relying on them. There's a couple of different aspects to this that I think are particularly interesting. 1 is, what's the conceptual model that you expose to people? So we talk about pipelines.
[00:24:24] Unknown:
I am not convinced, or at least I wasn't convinced, obviously, before we built Decodable. But the relationship between connecting to external data infrastructure, the streams that represent, like, the durable buffer of data, which is these days almost exclusively backed by messaging systems. It certainly is in Decodable. And then the pipelines, which are sort of the active processing component. I think that there's a lot of systems that get the boundary of, like, what is a connection? What is a stream? What is a pipeline? Or sort of blur those things. Because of our audience, you know, being engineers more than sort of business analysts, we can actually make those relationships very explicit and sort of push the complexity to the part that makes the most sense. So for instance, in Decodable, pipelines only operate on streams. They get 1 or more streams as input.
They output 1 stream. Multiple inputs can be unions and joins and those kinds of things. And then connections bridge streams to the outside world. And so the complexity of where we touch an external system versus where we are operating internally, that boundary and getting that right actually reduces failures. It actually allows you to better support the messaging guarantees again to get back into that discussion, like, actually get that right. And so we settled on this connection stream pipeline sort of relationship that's served us really well for a variety of reasons.
Like I said, the pipelines themselves are singular SQL statements. Right? So we allow you to stack pipelines against each other, you know, and create effectively a bigger DAG, you know, directed acyclic graph. But all of them are attached via streams, and so, like, you understand very clearly sort of, like, how things break down and the unit of deployment, which makes versioning much easier, which makes schema changes, you know, much easier. I won't say very easy. Makes them easier, so on and so forth. I think the other thing, you know, now is sort of more about the architecture and the philosophy of the architecture is that there's 2 camps of thinking on this, and I don't know that we've been explicit about this. 1 is the centralized camp. All data flows to 1 place. You do all your stuff to it. And, like, that is consistent with the traditional data warehousing world.
The other case is the distributed case. It's that, like, life is complex, it's messy, and it more aligns with, like, the microservices world, where you kind of go, the best place for someone to build and maintain a pipeline is the person who owns the source of the data, and create curated streams for the people who receive that data or make that self-service. Right? We can sort of talk about that. We view it, especially where we sit in the architecture, as being decentralized. And so we purposely separated Decodable into a control plane, which is where you submit pipelines and you see lineage and, like, all the, you know, monitoring and observability of those pipelines. And then the data planes, which are distributed. So you would have a data plane, for instance, in each cloud region of each cloud provider and maybe even on prem, you know, if you have that kind of architecture.
This matches what the software defined networking people have figured out, like, some number of years ago, which is that switches and routers are wherever they are. But what you really want to be centralized is the control and management and observability. And that the control traffic is not in the hot path of the data traffic. And so we've designed Decodable around the separation of control plane that is centralized and distributed data plane, which I think gives us quite a bit of an advantage around how you manage these complex systems. You know, any company sort of bigger than a breadbox has, you know, GDPR compliance and data sovereignty challenges. This data can't leave that region and so on and so forth. And so we believe that this actually mirrors both the modern enterprise, you know, kind of company, but also jives with what people are starting to talk about with, like, data meshes. You know? Again, just to throw another buzzword onto the pile. Right? Like which I actually think does make a lot of sense. It does for data what microservices and rest APIs did for behavior or compute. So I think that there's something worth looking at there.
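As a sketch of the connection, stream, and pipeline relationship described above (all names hypothetical): connections populate and drain streams, and each pipeline is a single SQL statement that reads one or more streams and writes exactly one, so a larger DAG or a fan-out is just several pipelines sharing streams.

```sql
-- Pipeline 1: curate a raw stream (raw_events is fed by a source connection).
INSERT INTO events_curated
SELECT event_id, user_id, event_type, event_time
FROM raw_events
WHERE event_type IS NOT NULL;

-- Pipeline 2: one consumer of the curated stream, headed for an analytics
-- sink connection (warehouse ingest).
INSERT INTO events_for_warehouse
SELECT * FROM events_curated;

-- Pipeline 3: a second consumer of the same curated stream; fan-out is
-- expressed as another pipeline rather than one pipeline with many outputs.
INSERT INTO purchase_events
SELECT event_id, user_id, event_time
FROM events_curated
WHERE event_type = 'purchase';
```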
Under the hood, Decodable is built on all the same things that people are building on today. So, internally, we are very heavy users of, you know, currently, Apache Kafka. We have a deep investment in Apache Flink. We believe that, you know, of all the stuff that's out there, Flink's trade offs between, you know, performance and correctness and scale and, you know, especially, like, the community around Flink has just sort of had, I think, a really good eye to how these systems should work. And I'm biased. I've had prior experience with Flink now for a number of years where I've seen it in some really, really tough workloads and do reasonably well. So I have sort of high confidence in Flink and the Flink community.
So we're heavy users of Flink. We have to, you know, do some stuff to it to be able to handle, you know, some of the simplification that we've been talking about. But there's certainly more to it. Happy to answer other questions. But that's sort of like the high level tidbit on Decodable on the inside. Yeah. Definitely a few things to dig into. And 1 of the things that I think is interesting is you mentioned that
[00:30:04] Unknown:
as far as the pipeline processing, you're able to combine multiple streams into 1 output stream, but you explicitly did not say that you're able to take 1 input stream and split it into multiple output streams. I'm wondering if that was on purpose or if that's something that you intend to be able to explore for being able to do sort of fan out topologies, or if it's something that, you know, if you want to do that, then you need to, you know, jump to a different level of the sort of API abstraction.
[00:30:33] Unknown:
Yeah. No. Absolutely. I mean so to be clear, you can accomplish the same architecture by having 2 pipelines that filter the data in different ways or do whatever they need to do that consume from the same sources and then output to different destinations. So that's how we would accomplish, like, the fan out kind of routing or distribution kind of architecture. But we specifically did it that way rather than having 1 pipeline that outputs to multiple streams because, 1, SQL doesn't really do the latter. Right? Select statements have effectively 1 relation as its output.
And so, 1, we found that for the community of people here, this actually jived with sort of their natural way of thinking. If I want a different output, I have a different query. And for us, that's like a different pipeline because a query is a pipeline. And so, 1, it sort of matches the mental model of people, and you don't have to, like, untrain them from that way of thinking. But, 2, you start to think about the messaging guarantees of outputting to multiple streams. You get into these, like, weird failure cases where you can output to A but not to B, and then you fail and the power goes out, and now you have a duplicate. You know, like, life just gets super complicated in that 1 case, whereas on the consuming side, you know, we do have things like offsets and sort of, like, all the stuff that, you know, Kafka, I think, has probably popularized, you know, in that way of working. And so it's easier to get the correctness, the expected correctness, for the user's definition of correctness.
It's easier to hit that sweet spot for them without making them think too hard about all of the ways in which you can go wrong. And so the atomicity of the failure to them can feel like the pipeline versus a partial failure of a fan out pipeline. And so we recognize that that means that in order to do fan out to, like, a 100 things, you maybe actually have to build a 100 pipelines. That becomes harder. But the reasoning about the system becomes substantially easier. And we found that that makes sense both, you know, like I said, to get the system right and also sort of, you know, to bank on some preexisting knowledge. But we found that people sort of, like, mostly get that, you know, especially if they're coming from the batch world where they're sort of, like, dealing with, like, you know, SQL queries in Snowflake or BigQuery or Teradata or something like that. Another thing that you
[00:33:08] Unknown:
briefly alluded to earlier in the conversation is the idea of being able to do enforcement of elements within the schema in the processing and the definition of the pipeline. And I'm wondering if you can dig into some of the sort of schema metadata management, the introspection of the input messages to understand what are the elements that are contained within them so that you're not trying to select a value that doesn't exist in that nested JSON object. Or if it's supposed to exist, identify that. You know, sort of at what point do you understand where that is and sort of the enforcement of schema within the overall propagation of the pipelines?
[00:33:44] Unknown:
So within the engine itself, you know, within the stream processing engine, you're effectively in the SQL world, and so everything does have a type. You do have these open types like arrays, maps, those kinds of things. And to extract something out of a map, we sort of, like, can force you to say, like, hey. Like, that key may not exist. And so, therefore, you need to use this function, which takes a default value or something like that. So there, it's about providing the guidance through the right functions that you provide to people, you know, when they build the pipeline, and the right patterns. The place where it becomes really complicated is where you touch the connectors, you know, because, you know, just to give an example, you know, our Kafka connector. It's funny because, like, Kafka's, like, you know, a good example of this. In the actual general purpose Kafka schema, the only thing that's guaranteed to you is, you know, key, value, partition ID, offset, you know, and maybe timestamp depending on sort of, like, whether, you know, that's event time or processing time that's sort of inserted into the queue.
And then there's sort of, like, these sub schemas that exist within the value and the key of the payload, and that's where schema registries and, like, all these other kinds of crazy things come into play. Our perspective is that that dichotomy between, like, the user defined schema and the actual connector schema is, like, super complicated. Nobody wants to deal with it. And so what we've done is that connectors are sort of always going to output a well known schema. And by schema, I don't include format, like Avro versus JSON versus protobuf. So I'm talking just like the semantic structure of, you know, fields and their types and things like that. The connector can define that, and when I say the connector, I mean the user who configures the connector says, this is what should come out of this, you know, what should come out of the value. If anything doesn't match that, we ask the user to pick a strategy to deal with that. You can drop it. You can block forever and, like, fail, and, like, we will set off an alarm and wake you up in the middle of the night. You don't want that 1. Trust me.
Or the 3rd option is, like, you can dead letter queue this. Right? We can remove it from the stream. We can park it in another topic for you to examine when you do wake up. If you choose the 3rd option, which is, like, mostly the right option, we'll sort of capture that for you and give you the tools that you need to inspect that queue. And then you can decide to reinject that data back into the pipeline effectively as late data or discard it and just ignore it forever if it's, like, truly malformed. But, you know, it does break the ordering guarantee.
Right? Because we're pulling those events out of the stream. And so I don't know of a great 4th option. I'm taking sort of, like, all pitches on the 4th option. But I think the thing that we can do is create that structure for people and say, like, those are your options. These are the guarantees. This is the net result of this. And we can do the same thing sort of, you know, because the streams are internal and the pipelines are off the streams. At that point forward, it actually becomes just type checking all the way through. And so, like, Decodable will make sure that the pipeline only references fields available on the stream, and the stream only gets data that was valid as produced by the connector. And that goes all the way through to the sink side as well. There's lots of nice things that come from that.
1, not at pipeline activation time or job deployment time, but at the time you create that pipeline, we can immediately tell you the stream doesn't have that field or it's the wrong type. You know, you have to do some kind of conversion to be able to use it. And at runtime, we deal with that through this, you know, malformed data strategy that you can attach to different parts of the system. And then we can effectively guarantee that the pipeline never sees malformed data because it can't make it into the stream. So, like, that's how we sort of deal with that. The downside, of course, is that, like, there's a whole lot of schema management. Right? But I think, like, the idea that schema management doesn't exist with other systems is, at its core, the problem. And so we sort of promote that to the forefront and say, like, look. You just can't avoid it. There are things I think we can do going forward to make that process even easier, including things like coordinated schema changes.
For instance, I want to add a new field from the connector. Let's make sure that doesn't break anything downstream, and, like, maybe I wanna automatically make schema changes to everything downstream. Whether or not that's a good idea, like, I think there are different use cases. But I think we wanna be able to support what we call safe schema migrations, including things like, I wanna drop a column. There's actually a safe way to do that that involves a multistep process. We can actually enforce that. We'll let the user opt out of that. Don't get me wrong. If you really wanna, you know, be a sociopath and delete a column and break everybody downstream of you, You know, like, we're happy to let you do that, but, you know, there is a right way of doing that. And so we can put quite a bit of stuff around that. And then, of course, we have all this access to metadata. We can actually do lineage analysis in a streaming context. We can do data quality measurements and sort of, you know, all the stuff that you would expect with excellent batch systems, like, you know, I think Monte Carlo and Great Expectations and, like, all these other kinds of things that data engineers sort of know and love. You know, the techniques are equally
[00:39:32] Unknown:
viable in a streaming context, although the implementation does look a little different. A couple other directions I wanna go from there. So 1 of the things you mentioned is sort of the deployment of the jobs. I'm wondering if you can talk through some of the sort of software life cycle of building the pipelines, validating them, rolling them out, and sort of what that overall workflow looks like with Decodable.
[00:39:54] Unknown:
Yeah. Absolutely. So we version all of the pipelines internally. We basically have, like, CRUD style operations. You can create, you know, you can view, you can delete, you can update your pipelines. But we've added an activate and deactivate process. And so as you update your pipeline, that pipeline doesn't actually get deployed until you tell us to activate it. At which point, we will effectively roll over onto the new version of the pipeline. And because we've implemented the engine to be exactly once by default, we effectively don't even really expose the processing guarantees other than exactly once, you know, to the outside world because that's what most users want anyway.
But we will actually checkpoint and sort of save the system. The process for the users, we offer a real time preview function where you can actually give us a chunk of SQL, and we can show you, like, 10 records of output to help you figure out that, like, that schema is valid, that you're accessing the right fields, that the data looks the way you expect. And once you get it to that place, which is the equivalent of what you would do in the batch world at, like, a SQL shell, you know, with Postgres or any of the other database systems. Once that query looks the way you want it to, you can then actually tell us that, like, this should actually be the version of the pipeline. And then you simply tell us to activate that version, and, like I said, roll over to that new version of the pipeline.
You know, if the pipeline is not backward compatible, that's where you get into, like, actually, like, all the really sort of difficult elements of data engineering where maybe you need to make stream changes, you know, to coincide with that version of the pipeline and things like that. And there, you know, we maintain, you know, version relationships to different streams and stuff like that. We can actually stop you from making changes that would break other pipelines and other connections and things like that. So it's just a little bit of additional work there. But the big thing, I think, for users is when you from our command line or our API, you say, like, here's a new chunk of SQL, you know, save this as the pipeline.
It's at that point that we will tell you this pipeline is not valid, your SQL is not valid, you're missing fields, you have the wrong data types, all these other kinds of things. So we effectively won't let you save a pipeline that won't run. And so this means that at any time, if there's some kind of failure, you know, or the power goes out and we restart a bunch of nodes, we know we're always restarting from a well known and good place. And then, of course, people version control their pipelines and stuff like that, but, you know, we have sort of the next level of protection there. The preview feature on a pipeline works by attaching to the actual streams.
So it's not, like, fictitious data. You're actually seeing real world data passing through the pipeline, assuming you have access to do so. And we have found that people very quickly figure out that they're seeing more nulls than they expect or, you know, the dates aren't parsing quite the right way or something like that. And so I think that kind of iterative ability to inspect not just the schema of the output, but the actual data on the output eliminates quite a bit of challenge there. And in the streaming world, that's just much harder to do than in the batch world. Normally, that would be a fancy feature, but it is, let's say, an even fancier feature in the streaming world.
[00:43:37] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing the time to detection and resolution from weeks or days to just minutes.
Start trusting your data with Monte Carlo today. Visit dataengineeringpodcast.com/impact today to save your spot at Impact, the Data Observability Summit, a half day virtual event featuring the 1st US chief data scientist, the founder of the data mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 people who RSVP will be entered to win an Oculus Quest 2. Another interesting topic that's worth digging into is the question of the sort of lineage calculation and metadata propagation and how you're able to pull and push that into the other systems that people are building and relying on and the context within which Decodable is operating to be able to complete that overall metadata graph beyond the confines of your platform?
[00:45:07] Unknown:
Yeah. I mean, this is a place where, you know, transparently, we haven't propagated it beyond our platform just yet. Right? So we have that information. We can read it from certain external systems, you know, again, like, you know, schema registration, those kinds of things. But we don't yet have a good way of completing the larger graph. And I think part of the problem here is there aren't good standards for how you expose metadata information beyond, like, the SQL information schema, which is, you know, not as rich as you would like it to be, especially across systems. So I think we're still thinking about different ways that we can make that externally useful so that the schema of the output of a particular job in Decodable is guaranteed to match the schema of a snowflake table. You know what I mean? Like, being able to fully extend it there. Now that said, because we can guarantee that the output of a pipeline schematically matches that of a stream, the output stream, which is attached to a connector, and that schema has to be at least compatible, like a subset.
If that connection is actually to a relational database system or something like that that does have a well defined schema, we actually do have the opportunity to validate it. I don't know how we would visualize that and expose that. My guess is that there's another company being funded someplace in the valley right now to do that and to sort of take on the larger burden there. Several, in fact. Yeah. Yeah. My hope is that all of these data catalog companies that have popped up around things like OpenLineage, and I can't think of the 1 that came out of LinkedIn and the 1 that came out of Lyft, you know, the 2 different data catalogs, neither 1 of which I can remember. My guess is that, like, there's a role to do with, like, integration with those kinds of systems to help people, you know, manage this in the large across products.
[00:47:17] Unknown:
Absolutely. Yeah. DataHub is the 1 from LinkedIn, and Amundsen is the 1 from Lyft. That's it. The OpenLineage specification is definitely exciting to see. And there's another effort that started up recently called OpenMetadata, and it seems that, at least for now, the 2 are relatively disjoint where OpenMetadata doesn't cover lineage, and OpenLineage doesn't cover more sort of generic metadata. So it'd be interesting to see sort of if they merge or if they decide to expand to conflict with the others and, you know, the classic problem of, you know, I have 5 competing standards, so I created a new 1, and now there are 6.
[00:47:53] Unknown:
You know, coming most recently, you know, prior to starting Decodable, having spent some number of years at Splunk and looking at the observability space, you know, this problem exists no matter where you look. Vendors and customers are always trying to sort of tackle the proliferation of standards, and hopefully, 1 gains enough steam. But I think that that's really exciting.
[00:48:15] Unknown:
In your efforts of building the Decodable platform and determining what the sort of useful APIs and abstractions are, what are some of the most complex or time consuming engineering challenges that you've dealt with to date?
[00:48:30] Unknown:
Our internal mission is to build software that just works. And so all of the stuff that we were talking about earlier around challenges around the complexities of the engines and the messaging guarantees and state management, we spend an undue amount of time, you know, working on those kinds of problems and making sure that performance is above board and, you know, throughput looks the way that it should, you know, for all of the different measures of performance. But I think we've also spent a ton of time on the developer experience. Right? Just because it's APIs, just because it's a command line tool, doesn't mean that there isn't, like, actually a UX component to this.
And the systems I have seen, you know, over the 20 something years I've been doing this be the most successful, it's not strictly about being just simple or being just powerful. It's nailing the intersection of, like, you know, easy things are easy, hard things are possible. Like, that sort of adage actually has to be true, and the functionality needs to be really easy to discover and access for, you know, 100% of the people who need it. So I think we actually agonize over what we call things, over what pieces of metadata exist, and those kinds of things so that when you look at a pipeline or a stream or a connection, it's just, like, super intuitive, what that thing is and how it works and, you know, the functionality that lives there.
And so I think that's something that we think quite a bit about. You know, transparently, we are still early in our journey toward thinking about the right way to expose performance metrics and observability of the pipelines themselves, not necessarily the infrastructure. But, you know, do we actually need to provide sort of all the way down to field level statistics around data quality and, you know, value distributions and those kinds of things in order for people to be successful? You know? Or at the other end of the spectrum, is it really just bytes in, bytes out, events in and events out? You know? Is that actually useful? But I gotta tell you, like, you know, not a day goes by that we don't have some conversation about some new failure mode that we have sort of discovered that is possible and sort of work to seal that up. We're at a point now where I feel really confident around, you know, when we say that a SQL statement is good, it's good. And we're able to run it without failures, without issues, and things like that, which, you know, sounds like a low bar, but it turns out not to be in the streaming context.
So I think those are some of the places where we spend an awful lot of time, but it's really about guarantees and correctness.
[00:51:19] Unknown:
In your experience of building Decodable and working with design partners and early customers, what are some of the things that you've learned about the overall state of data engineering and some of the costs and benefits of working with real time systems?
[00:51:33] Unknown:
Yeah. This has been sort of really interesting. I was blown away by the simplicity of the pipelines and sort of how people spend their time. You know, we have seen pipelines that are effectively, like, you know, select a subset of columns from a source and then filter it by 1 or 2 fields, and that adds, like, a mountain of value for the ML engineers and the training datasets and things like that that sit on the other side of that pipeline. And so what we found is, like, the issue here isn't necessarily that people are running a small number of these really big, really complex pipelines, which is actually true in, like, the observability space. Right? You have logs, you have metrics, you have traces. You've basically got 3 datasets, and they are mammoth, you know, and they operate at petabytes per day, and, like, you know, there's a different set of challenges around that.
But in this case, we actually find that the complexity comes from customers having literally hundreds or thousands of pipelines that all do slightly different things for slightly different audiences, and maintaining operations around each one of those pipelines. And each of them actually winds up being really simple. Right? They're typically these, you know, small, simple SQL statements. And so giving people a system that lets them operate quickly over a large number of pipelines, you know, is something that I don't know that I fully expected. You know, just sort of that complexity and everything that goes along with it.
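To make the shape of those pipelines concrete, here is a minimal sketch of the kind of "select a few columns and filter on a field or two" job described above, written as generic streaming SQL; the stream and column names are purely hypothetical and not taken from any real deployment:

    -- Hypothetical streaming pipeline: project a handful of columns from a
    -- source stream and filter on one field before handing it to ML training.
    INSERT INTO checkout_events_for_training
    SELECT event_id, user_id, cart_total, event_time
    FROM checkout_events
    WHERE country = 'US' AND cart_total > 0;

The point is that the value of a pipeline like this comes not from its sophistication but from the fact that hundreds of similar variants have to be created, versioned, and operated side by side.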
I think the other thing is, like, with respect to real time data, what we found is that for people who are just ingesting data into analytic systems, I'll level with you: I think they're like, well, batch is good enough. That's where I hear that argument. For people who are building these event driven services, you know, various microservices that either produce to Kafka or consume from Kafka (and I say Kafka, but, you know, Google Cloud Pub/Sub, Kinesis, whatever), that's where those people sort of need real time. There isn't a lot of debate about that.
You know, the business basically doesn't work without that being real time. The joke I've made is, like, imagine if Lyft or Uber used batch processes for driver dispatch. You know, like, you order a Lyft, it shows up 3 days later, you know, because the batch was delayed. Like, that business doesn't work. And I think logistics, retail, oil and gas, gaming, all of these kinds of products and industries have real time components to them these days. You know, DoorDash, you know, food delivery doesn't work unless it's a real time system. So I think that, you know, the applications are going that way. You know? I think the analytics systems are gonna wind up having to catch up to the operational systems.
But the operational side of the house is very quickly going full real time if it isn't already.
[00:54:40] Unknown:
In your experience of working with your design partners and seeing how they're building on top of Decodable, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:54:54] Unknown:
I mean, most of the use cases we've seen have actually been about what we expected. It's data prep. It's streaming ETL. There's some folks looking at sort of aggregation and real time analytics and things like building and populating feature stores, you know, for ML pipelines. There, they wanna, you know, roll things up into aggregate counts and those kinds of things. But for the most part, I think the products are innovative, but the use of real time streaming is, like, I wanna say mundane, but, you know, it was needlessly complex. I think it was super exciting before they started using Decodable because they had this, like, crazy Rube Goldberg machine of things that were all stitched together. But, you know, these days, I think it's actually less exciting, and that to me is what success actually looks like: it's becoming a boring problem. Right? Can we make it a boring problem?
I think that that's what success looks like for us. So we're excited by that.
[00:55:54] Unknown:
On average, I'm curious how many different sorts of infrastructure and system components you've seen people replacing with Decodable.
[00:55:59] Unknown:
We've seen people shut down, you know, effectively (this is probably not fair, but) we've taken on the stream processing engine and, to some degree, parts of the messaging system. But, realistically, it's actually our job to stitch those systems together for you. And so for people who have something like Kafka, you know, in place simply to facilitate, like, the data integration and the event driven architecture back end, we see that get subsumed by Decodable. For people who use Kafka as effectively a part of their system, then I think it makes sense for them to continue to run that stuff. So, like, I don't know that we compete. Like, I would never say that we compete with Kafka or, you know, any other sort of messaging system and things like that. We have seen people shut down complex, you know, do it yourself streaming systems, and a whole lot of operational infrastructure built around those systems to, like, deploy pipelines safely or perform data quality checking and, you know, all these other kinds of things. So I think it's really about sort of simplifying the operational side of the house in a lot of ways by giving them a turnkey system for this.
[00:57:27] Unknown:
In your own experience of building the platform and going through the engineering and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:35] Unknown:
I think, like, as an engineer, I've always wanted to believe that, like, the pipeline's the hard part, and it's just not. It's the operations. And I think, again, like, in retrospect, this seems obvious. You know, you build a microservice and deploy it to Kubernetes. Like, you know, life is easy. And it turns out that, you know, using that as an analogy, that's when the hard part starts. Now you have, like, 1 more thing to monitor, and you've got, like, you know, liveness probes, and you've got, like, observability infrastructure, and you've got security stuff, and you've got key management, like, role based access control and namespace management, like, all that kind of stuff. I think in the data engineering world, the actual act of writing the pipeline is like the smallest part of our job.
It's the, you know, impact analysis and lineage tracing and sort of operational aspects around schema migration. Oops, we accidentally spilled a whole bunch of PII data into a system that shouldn't have it; now we need to fix that. Not gonna name names on where I saw that happen, but, you know, like, those are the kinds of things that I think data engineers spend most of their time on. And then just context switching between, like, the 5,000 customers that people have, you know, internally and sort of managing all those different use cases. So, you know, in a lot of ways, this is why I sort of emphasized, can we make it just work, you know, and do what it's supposed to do?
[00:58:58] Unknown:
Because I think, you know, we're not looking to replace or make it easier for people to build the pipeline. We're making it easier for them to run the pipeline, if that makes sense. Absolutely. Yeah. But that, I think, surprised me, you know, where the complexity of the problem is. Although, like I said, when I look back at it, I feel naive believing that, you know, the problem was anything but that. Yeah. Speaking as somebody who spends a lot of time in operations and infrastructure engineering, I can definitely concur that, you know, writing the thing, well, it has its challenges, but that's the easy part, and then you actually have to be able to run it. And that's why I always get frustrated when I look at these, you know, demo applications. They're like, oh, this is how you wire these things together, or, oh, this is how you get this thing running in Kubernetes. It's like, okay, it's running, technically, yes, but that's nowhere near complete. And now I need to add these 15 other systems and figure out how to get the credentials in there without, you know, exposing them. And, yeah. Absolutely.
[00:59:51] Unknown:
Yeah. And especially in the data world, as concerns around privacy and governance come, you know, to the fore, there is, like, a whole other aspect of, like, operational complexity that comes, as it should. Right? You know, somebody shows up and says, like, show me all the places you're using email address. Like, how do you even answer that question? I mean, like, people have to have the right infrastructure to answer that question, but, like, those kinds of things become really, really complicated. I'm with you. I think that, you know, hello world is where things start, not where they end, if that makes sense.
[01:00:28] Unknown:
And so for people who are interested in being able to build streaming pipelines and real time systems, or they already have some of that infrastructure in place and they wanna simplify, what are the cases where Decodable is the wrong choice?
[01:00:40] Unknown:
So I don't think we are a system of record for, like, storing data. I actually don't think that using messaging topics as, like, persistent databases is, like, the right sort of thing to do. And so we don't view ourselves as the data warehouse. We don't view ourselves as an analytics system. You know, I actually think that even in the streaming space, you know, we aren't like a streaming data warehouse. We don't sort of pitch ourselves that way. Instead, I think that if you need to, like, collect a whole bunch of data, turn it into a different kind of data, and, like, send it someplace, we are definitely the right system.
You know, we call it data engineering now, but in my day, we used to call that ETL. But don't tell anybody I said that. I think we are sort of data janitors. We're definitely not, like, an ML system, although we feed those systems. So we're not gonna be, like, operationalizing ML pipelines and stuff like that. On the other hand, if you need to pump data into those systems, there's a bunch of, like, really sophisticated MLOps kinds of systems that handle, like, model training and, you know, model deployment, basically doing all the same stuff but for models. You know, I think we complement those kinds of systems. And I don't necessarily think we're a general purpose, like, messaging replacement.
We're not just, like, Flink as a service or a Kafka replacement. There are plenty of people sort of working on those problems. We view those things as things that we work with, not things that we necessarily replace. Now I've probably talked myself out of a whole bunch of business, you know. That said, there's lots of cases where, you know, if what people need to do is stand up pipelines quickly that work, and, like, always work, and they don't wanna manage infrastructure or deal with, you know, tweaking and tuning all these distributed systems, like, you know, that's the place where we fit really well.
[01:02:34] Unknown:
And as you continue to build out the platform and as you go through your initial public launch, what are some of the things that you have planned for the near to medium term future of Decodable?
[01:02:43] Unknown:
We have focused on getting the basics really right, to the exclusion of fancy features. So when we demo it to people, it's like a SQL statement processing data, and everybody goes, well, you know, is that it? And, like, yeah, there's a whole lot there. But there's lots of things I think we're gonna wind up doing around connectivity to different systems and data observability. I think we are just scratching the surface around helping people to understand impact. Like, can I shut down this pipeline? If I do, who does that impact? You know, I think there's so much there that we can do, and quite a bit of, like, what I would think of as workflow.
The way I pitch it to investors is, like, you know, Git works really well. GitHub and GitLab add tons of functionality on top of that to, like, make it, again, safe for mere mortals, put all the workflow around it, integrate it with all the right systems, and things like that. I kind of view that as our relationship to things like Kafka and Flink. You know, we sort of use those systems, but there's a lot of work around, like I said, safe schema migrations, even dealing with, like, the equivalent of canary deploys for pipelines. We're actually working on some stuff there that I think is really interesting, where you can do something sort of like a Kubernetes rolling restart. You can basically say, here's my new pipeline, I wanna run a shadow version of this, watch the data quality, and if the data quality stays within some bounds, I wanna promote that to be the pipeline and have it sort of take over for the old one.
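As a rough illustration of that shadow idea (a sketch assuming a Flink-style SQL runtime and hypothetical stream names, not Decodable's actual workflow), the candidate version of a pipeline can write to a separate stream so its output quality can be compared against the live version before it is promoted:

    -- Live pipeline, already running in production.
    INSERT INTO orders_enriched
    SELECT order_id, customer_id, amount, order_time
    FROM orders
    WHERE amount > 0;

    -- Shadow version of a proposed change, writing to a separate stream so
    -- its data quality can be compared against orders_enriched before promotion.
    INSERT INTO orders_enriched_shadow
    SELECT order_id, customer_id, amount, order_time, currency
    FROM orders
    WHERE amount > 0 AND currency IS NOT NULL;

The interesting workflow piece is the promotion step itself: tracking quality metrics on both outputs and swapping the shadow in once it has stayed within bounds.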
So there's lots of, you know, interesting workflow stuff there, I think, that we can do to make it much, much easier and much safer to build data infrastructure at scale, you know, for companies out there. Are there any other aspects of what you're building at Decodable or the overall space of building pipelines on streaming systems that we didn't discuss yet that you'd like to cover before we close out the show? No. Just that I think it's a really interesting space. There are a bunch of people doing work to carry real time all the way out to applications with, like, AsyncAPI, you know, sort of like a standard to sort of make that happen. I think Slack and a bunch of other people are sort of, you know, working on that. I think there's lots of interesting stuff, and I do think that there's going to be a change in how data warehousing and analytics look as the world moves more and more real time.
I am excited about what other folks in the space are doing there. Materialize comes to mind. I think it's really exciting. I think it's an interesting space to watch.
[01:05:31] Unknown:
It's certainly incumbent on us as, you know, people who build tools and systems to make these things more accessible, you know, to the general public and just not have so many sharp edges. And, you know, I think we're sort of seeing a streaming renaissance there. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:05:53] Unknown:
I really think, and I don't know necessarily what causes this, but every company under the sun has a data platform team. And I can't help but wonder about the collective effort we spend replicating whatever this sort of same set of capabilities is across, like, the Global 2000 and even startups. It does sort of feel like we're doing something wrong and repetitive, especially in the data platform case. Right? I think the compute platform really centered around, you know, it was VMware, and then it was AWS, and then it became Kubernetes. Like, is there something that looks a whole lot more prescriptive but still flexible enough to be the kind of platform that comes not just with technology, but with, like, an architecture that is, like, effectively guaranteed to give you what the most tech savvy companies out there already have in place?
I don't necessarily know what that is. I do think there's, like, a data substrate component to that, and quite frankly, that's our vision: like, what does the software defined network for data look like? And that's what I think Decodable winds up, you know, growing up into becoming. But I don't know that that's the totality of the story. In fact, I'm positive that's not the totality of the story. So I don't know what that is, but, you know, we certainly have our work cut out for us.
[01:07:25] Unknown:
You know, that's for sure. Yeah. Well, if you look at all the blogs that have been coming out recently, it's the quote, unquote modern data stack. But I think that that's only the right answer up to a certain point of scale or complexity, where it's the 80% case for "I just need to get something up and running." But then when you actually start doing something interesting or specific to your business or organization, then it is no longer the complete solution.
[01:07:48] Unknown:
Right. So, I mean, not just something that scales up to, you know, like, Google scale or Facebook scale or something like that. Something that scales down in resource cost and complexity. Something that can easily be deployed by a 3 person startup and then sort of grow with them. I mean, MySQL and Postgres are actually really good at this.
[01:08:12] Unknown:
You know? I don't know what that looks like for the rest of the data platform. I think we have a lot of work to do to figure that out. Well, thank you very much for taking the time today to join me and share the work that you're doing at Decodable. It's definitely a very interesting platform and an interesting project, solving a very real need for people who have banged their heads against the semantics of streaming systems and distributed systems and real time guarantees. So I appreciate all the time and effort you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much. It was a real pleasure. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Eric Sammer: Introduction and Background
Decodable: Simplifying Real-Time Data Pipelines
Challenges in Stream Processing and Real-Time Data Engineering
Target Users and Use Cases for Decodable
Operational Complexities in Streaming Architectures
Decodable's Architecture and Design Philosophy
Handling Schema and Data Quality in Streaming Pipelines
Software Lifecycle and Deployment Workflow
Lineage Calculation and Metadata Management
Lessons Learned and Future Plans for Decodable
Future Directions and Closing Thoughts