Summary
Batch vs. streaming is a long-running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary has built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing David Yaffe and Johnny Graettinger about using streaming data to build a real-time data lake and how Estuary gives you a single path to integrating and transforming your various sources
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Estuary is and the story behind it?
- Stream processing technologies have been around for roughly a decade. How would you characterize the current state of the ecosystem?
- What was missing in the ecosystem of streaming engines that motivated you to create a new one from scratch?
- With the growth in tools that are focused on batch-oriented data integration and transformation, what are the reasons that an organization should still invest in streaming?
- What is the comparative level of difficulty and support for these disparate paradigms?
- What is the impact of continuous data flows on DAGs/orchestration of transforms?
- What role do modern table formats have on the viability of real-time data lakes?
- Can you describe the architecture of your Flow platform?
- What are the core capabilities that you are optimizing for in its design?
- What is involved in getting Flow/Estuary deployed and integrated with an organization's data systems?
- What does the workflow look like for a team using Estuary?
- How does it impact the overall system architecture for a data platform as compared to other prevalent paradigms?
- How do you manage the translation of poll vs. push availability and best practices for API and other non-CDC sources?
- What are the most interesting, innovative, or unexpected ways that you have seen Estuary used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Estuary?
- When is Estuary the wrong choice?
- What do you have planned for the future of Estuary?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Estuary
- Try Flow Free
- Gazette
- Samza
- Flink
- Storm
- Kafka Topic Partitioning
- Trino
- Avro
- Parquet
- Fivetran
- Airbyte
- Snowflake
- BigQuery
- Vector Database
- CDC == Change Data Capture
- Debezium
- MapReduce
- Netflix DBLog
- JSON-Schema
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost effective solution. Plus, it gives you more technical controls so you can fully unlock the power of your customer data. Visit dataengineeringpodcast.com/rudderstack today to take control of your customer data. Your host is Tobias Macey. And today, I'm interviewing David Yaffe and Johnny Graettinger about using streaming data to build a real-time data lake and how Estuary gives you a single path to integrating and transforming your various sources. So, David, can you start by introducing yourself?
[00:00:53] Unknown:
Yeah. Sure. I'm Dave Yaffe, cofounder of Estuary. I'm more on the business side. So myself and Johnny have been working together on various different projects for a number of years in different companies. We were formerly in the marketing space, built some technology that got us where we are now. And, you know, we've been working together now for about 15 years. And, Johnny, how about yourself?
[00:01:16] Unknown:
Sure. Hey. I'm Johnny. I'm the CTO of Estuary. As Dave mentioned, we've been working together for quite a while. I'd say in many respects, I'm kind of like a real-time data plumber. I'm just deeply invested in, like, the intricacies of moving data that's scaling quickly, and I've been spending a lot of my time there for about a decade or more at this point.
[00:01:37] Unknown:
And going back to you, David, do you remember how you first got started working in data?
[00:01:41] Unknown:
Yeah. So the last industry that I was in was marketing and before that, advertising. So we had built a company called Invite Media a long time ago, and that was a real-time bidding platform. And, you know, so in that space, the goal is to listen to all of the potential requests for ads on the web and then to respond to them. That's a very data-intensive problem. You have to know exactly, like, what you wanna show, where, and really that space just, like, set the framework for us getting into the real-time space. We did that through a number of other companies. So after that, we built a company called Arbor. Arbor was a data platform, and that kind of led to the next step in our evolution towards, you know, where we are now.
[00:02:30] Unknown:
And, Johnny, do you remember how you first got started working in data?
[00:02:34] Unknown:
Yeah. I had a brief stint in natural language processing, which, you know, even at that time, and this was over a decade ago, was a very data-intensive domain. It looks very different from where we are now with deep learning. But since then, as Dave mentioned, in the advertising technology space, there's just data everywhere. You have to move a lot of data very quickly in order to succeed in that business. So it was either, you know, sink or swim a little bit, and it essentially requires getting really into it. So digging now into
[00:03:09] Unknown:
Estuary, can you give a bit of an overview about what it is and some of the story behind how it came to be and why you decided that this was the problem that you wanted to spend your time and energy on?
[00:03:18] Unknown:
Sure. So I can probably start off, and I'm sure Johnny will add a bunch to what I say. But at a very high level, the last company that we built was a data platform. It helped advertisers and publishers share their data assets with each other. And, you know, it was in the marketing space, and it turns out that in the marketing space you really care about low-latency data. When someone's on a website, they express interest in a product. They want that product for the next maybe 10 or 15 minutes before they go and buy it. So if you get in front of them very quickly, you might have an opportunity to influence their purchasing behavior. Because of that, we ended up valuing real-time data very highly at the last company.
Marketing is an interesting problem too because you care about the history. You also care about what's happening right now. So we wanted to solve that problem well. And we ended up building a platform, a streaming platform for that called Gazette. Johnny built it for real. The reason that he built it, he can kinda get into those details, but, you know, I'll leave that there and have him kinda fill in the gaps.
[00:04:25] Unknown:
We built Gazette for the last business, and, honestly, the core reason for building it at the time, this was, like, 2013, 2014, was that we needed a lot more flexibility in terms of how we were moving data and being able to experiment and try things. And the main problem that Gazette solved and continues to solve is being able to work with sort of whole collections of data rather than buffers of data. We needed to be able to take a dataset, kind of transform it in a different way, push it out to a different buy-side partner, and we needed to be able to do that at scale and also really quickly. So once we decided that we wanted to transform and push data into a place, we needed to be able to backfill over months, potentially more, of history and also kind of stream from there in an ongoing fashion. And we needed to be able to do this without putting our production infrastructure at risk, because we were, you know, moving a ton of data through the system.
And, also, you know, we were a startup at the time. We needed it to be something that we could do quickly and easily. That was the genesis and the core sort of motivation for that business, and we've kind of been building on it ever since.
[00:05:33] Unknown:
And so the streaming ecosystem in general has been in development and in some measure of active use for on the order of a decade plus. Some of the kind of notable early entrants were things like Samza. You know, Flink is one of the current favorites. Spark, Storm, I think, was one of them. There are a number that have kind of, like, faded from recent memory. I could probably dig them up if I wanted to. But in the early days of streaming, it was very much a, you can do it, but only if you have a lot of engineers to throw at the problem, because it was kind of a new area, a lot of still undefined problems in terms of the distributed systems capabilities. You know, what are some of the edge cases that we need to discover?
Wondering if you can talk about what you see as the current state of that streaming ecosystem as it stands today and some of the lessons that we've been able to carry forward from some of those early experiments.
[00:06:31] Unknown:
Yeah. In many ways, it looks a little bit like a snapshot of where it was 10 years ago. In a lot of respects, it hasn't really moved that much, especially in comparison to the sort of integrations of technologies like the modern data stack. I think there are a few things that are missing still. One is that the streaming ecosystem is oriented around buffers. You have buffers of data. Your Kinesis topic has a buffer of data that's 24 hours. Kafka, the way you just set it up, is still giving you 7 days of buffer. And, yes, you can do segment offload, but there are some downsides of doing that as well.
So these systems are basically oriented to think in terms of very recent data and not sort of the history of the dataset. You want to be able to think in terms of an entire collection of data that is continuously evolving, but where you need to go visit all that history from time to time. So that's one big piece: just being able to think in terms of collections over buffers. Another is that state is still really hard. If you wanna do some kind of stateful transformation, you can do it with, you know, Kafka Streams, formerly Samza. You can do it with Flink and Spark and so on. But these frameworks really require that you adjust your thinking to the way that they model state.
So, if you want to work with a stateful streaming application, you basically need to design your application from the ground up to work with Flink's checkpointing mechanism, for example. And that's not really practical. You know, basically, it's at a place where some organizations use it for really, really crucial operational use cases, but it's never gonna be something you pick up for the third most important priority in your organization. Another challenge is that streaming is still very sort of event by event. Like, if you're writing a streaming application, you're picking up one event, processing it, and moving it along. But that doesn't really work with a lot of tools that you might wanna integrate with particularly well. Like, for example, how do you take a torrent of data and then maintain a view of that data in a Google Sheet, where you can only update that sheet once a second? You know, you're not gonna be successful moving an event at a time into that Google Sheet or updating it based on one event at a time. And then, finally, there's something that I think of as the tyranny of partitioning, which is if you have, like, a Kafka topic, for example, you have to choose a partitioning of that topic. And that partitioning is a super important knob, because once you start writing data into that topic, you're generally sort of mapping your data based on, like, hashing and, you know, doing modular arithmetic over some kind of routing key within your topic. And once you've arranged it into these partitions, it becomes a little bit set in stone, and it becomes very difficult to scale if you realize that you got that partitioning wrong or you need to scale up. You're basically having to kinda take down or pause entire workflows and bring up replacements.
So that can be quite awkward and it's something that the modern data stack has solved really well.
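To make the partitioning point concrete, here is a minimal Python sketch of the hash-and-modulo routing described above (an illustration of the general pattern, not Kafka's actual partitioner): changing the partition count silently remaps most keys, which is why repartitioning a live topic is so painful.

```python
import hashlib

def partition_for(routing_key: str, num_partitions: int) -> int:
    """Map a routing key to a partition with hashing and modular arithmetic."""
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

keys = ["user-1", "user-2", "user-3", "user-4"]

# With 4 partitions, each key lands somewhere and stays there...
before = {k: partition_for(k, 4) for k in keys}

# ...but scaling to 6 partitions reassigns most keys, which is why a live
# topic usually has to be paused and rebuilt rather than simply resized.
after = {k: partition_for(k, 6) for k in keys}

print(before)
print(after)
```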
[00:09:31] Unknown:
And one other thing to add to that: change data capture is actually a place that streaming shines. Right? If you're extracting data from a database, you wanna do it at the log level. You wanna do it at an individual row level, because that's just gonna be a much more efficient way of interacting with the database. So I think that's, you know, a positive place for streaming.
[00:09:52] Unknown:
Yeah. Change data capture is definitely the thing that comes up most frequently in conjunction with the idea of streaming and the kind of main entry point that most folks are going to grab for it. And so in terms of the overall ecosystem and what you're building at Estuary, I'm curious if there were any kind of core pain points that you were experiencing that pushed you into building a whole new streaming engine, because that's definitely not the kind of first answer to, hey, this thing doesn't quite do what I want. And I'm wondering kind of what were some of the fundamental architectural issues with some of the existing systems that prevented you from being able to just do an evolution from the current state of one of those platforms to saying, no, I actually need to do a ground-up redesign and rewrite to be able to achieve the specific kind of optimizations and capabilities that you wanted to focus
[00:10:44] Unknown:
on? I'll point at one place to kind of answer this question, and it's casting back to, like, why create a streaming system? Why create Gazette? And this is, again, 2014. You know, Kafka was on the scene at that time. Why create a whole other streaming system? There's one architectural capability that was pretty core to Gazette and still is core to what we do. What Gazette does is it's basically representing sort of a streaming log of your data, an append-only log. What it does is it's really just dealing with the really recent data that's being written to that log. And then pretty much as soon as possible, it's writing chunks of, like, historical log into cloud storage as regular files. So you have, like, regular files of JSON in cloud storage for all the historical data, and then just the really recent stuff is managed by the replicated brokers. And one of the key capabilities that that enables is when you need to go reread a topic of data from Gazette, what you are actually doing is you're just asking the broker, hey, where do I go find the next gigabyte of data in cloud storage? And then you're going and you're reading it directly from cloud storage. And this is a huge architectural advantage because it allows the brokers to just focus on the really near real-time stuff, what's happening right now.
And if you have really large-scale consumers of data, they are able to just go fetch as much data as they can from cloud storage, which is a highly elastic resource in terms of IO operations. So it's, like, crucial from an architectural perspective that you're able to get that kind of scale when consuming historical streams. And to this day, Kafka still, if you're trying to read a ton of historical data out of a topic, you've gotta go through the brokers to do it, and that's an operational scaling concern.
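A rough Python sketch of the architecture Johnny describes, with hypothetical names: the broker keeps only the recent tail of an append-only journal plus an index of historical fragments, and a reader asks the broker where a fragment lives and then fetches it straight from cloud storage.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional

@dataclass
class Fragment:
    begin_offset: int   # first journal offset covered by this file
    end_offset: int     # offset just past the last byte it covers
    storage_url: str    # e.g. "s3://bucket/journal/0000-0400.json.gz"

class Broker:
    """Tracks the recent, replicated tail plus an index of historical fragments."""

    def __init__(self, fragments: List[Fragment], recent: bytes, recent_offset: int):
        self.fragments = fragments          # files already persisted to cloud storage
        self.recent = recent                # bytes not yet persisted
        self.recent_offset = recent_offset  # journal offset where `recent` begins

    def locate(self, offset: int) -> Optional[Fragment]:
        """Tell a reader which cloud-storage file holds this offset, if any."""
        for frag in self.fragments:
            if frag.begin_offset <= offset < frag.end_offset:
                return frag
        return None  # offset falls in the broker's in-memory tail

def read_from(broker: Broker, offset: int,
              fetch_object: Callable[[str], bytes]) -> Iterator[bytes]:
    """Historical reads go straight to cloud storage; only the tail hits the broker."""
    while True:
        frag = broker.locate(offset)
        if frag is None:
            # Caught up: serve the recent tail from the broker itself.
            yield broker.recent[offset - broker.recent_offset:]
            return
        data = fetch_object(frag.storage_url)   # elastic: scales with storage IO, not brokers
        yield data[offset - frag.begin_offset:]
        offset = frag.end_offset
```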
[00:12:39] Unknown:
And I've definitely seen this kind of spill to disk, if you will, of being able to, when you get to the point of, I can't keep all of these events in memory or on local disk, I need to, you know, page that out into cloud storage. And when you mentioned saying, okay, when I need to replay everything through the broker, I'm wondering what you see as the trade-off of saying, when I do kind of page that out to cloud storage, do you do that in a fashion where it's in ORC files or something that can be accessed via something like a Trino? Or is it something where we want to actually page that out in a format that is optimized for the broker, because we want the broker to be the single point of access for working with this data, versus we actually want to age data out of the real-time stream into a batch-oriented mode where, if you want to do aggregate analysis on this, you just use a different query engine kind of a situation.
[00:13:38] Unknown:
Yeah. In truth, you can have both. So quite literally, a collection of data in Gazette, and in Flow as well, is, if you're writing JSON documents into that collection, you get files of your JSON documents in cloud storage, laid out in sort of a nice Hive metadata layout. So the true answer is you really can have both. JSON ends up being kind of a nice format for stream-oriented processing anyway. You know, Parquet, for example, is a great format for analysis. But if you are needing to read individual documents, like, row by row in a streaming fashion and you're doing some kind of data shuffle, the columnar representation of a Parquet file is not helping you. It's actually hindering you. So there are only a couple of formats that really make sense for true streaming
[00:14:28] Unknown:
reads of data, and that's, you know, JSON, Avro, couple others. I was just going to ask about Avro, so you beat me to it.
[00:14:36] Unknown:
Yeah. Yep. The other, you know, factor on this is that if you've got a layout of files of JSON, you can plug that right into BigQuery as an external table, for example. So interoperability is a design goal from the get-go. Absolutely. But going back to the question of batch versus streaming
[00:14:51] Unknown:
where everybody has their own opinionated viewpoints about it. Sometimes people just have no opinion. They just do whatever comes easiest. There's definitely been a lot of investment in optimizing for the batch-oriented workflow with tools like Fivetran, Airbyte, and then, obviously, the cloud data warehouses, the Snowflakes, BigQueries, Redshifts of the world. I'm wondering what you see as the default mode that most organizations today would fall into when they say, I need to be able to do a bunch of stuff with data. Do they go straight into streaming land because streaming actually solves all of their problems, makes everything easier, or is it still easier to just use these batch-oriented tools because that's where the ecosystem has invested all of the time and integration energy?
[00:15:38] Unknown:
That's actually a question that falls right into our wheelhouse. Right? Our goal is to make streaming as easy as batch. And that requires a couple of things: it requires being able to have historical data and not just real-time data, being able to look at the world in terms of collections so you get all of your data, including incremental updates. And also being able to take the paradigms that someone like a Fivetran presents and be able to present them in a streaming space. So that's really what our platform has tried to do. We have three legs to it. One is captures. Captures is, you know, taking data from external systems using configuration. The second is transformations. So it's not exactly a Fivetran where you're doing ELT as the only mode. You can actually do ETL or ELT.
And ETL, you know, it turns out, has some good benefits in the real world, where you don't necessarily wanna join all your data in the warehouse every single time. So sometimes it's good to do stuff like that before pushing it into the warehouse. And then the third leg is materializations, or pushing data into destination systems. One of the things that batch systems really focused on was data warehouses. Right? Like, that's what the Fivetrans of the world are focused on. And that's something you can do in a streaming fashion if you do some sort of batching. You actually have to batch, since most of these warehouses don't take every single event update. But, you know, you can also work with other types of systems. So think of NoSQL stores. You could work with search indexes or vector DBs. You know, a lot of different applications can come out of this, which you really couldn't do in the batch space.
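As a sketch of the batching point: warehouses generally won't accept one commit per event, so a streaming materialization accumulates documents and commits them in transactions. A minimal Python illustration, with the `commit` callback standing in for whatever bulk-load mechanism the destination offers:

```python
import time
from typing import Callable, Dict, Iterable, List

def materialize(events: Iterable[Dict],
                commit: Callable[[List[Dict]], None],
                max_batch: int = 5000,
                max_wait_s: float = 2.0) -> None:
    """Accumulate streamed documents and commit them to the warehouse in batches."""
    batch: List[Dict] = []
    deadline = time.monotonic() + max_wait_s
    for doc in events:                      # `events` may be an endless stream
        batch.append(doc)
        # Flush on size or age; the check runs as documents arrive.
        if len(batch) >= max_batch or time.monotonic() >= deadline:
            commit(batch)                   # e.g. one COPY / MERGE per transaction
            batch = []
            deadline = time.monotonic() + max_wait_s
    if batch:
        commit(batch)                       # flush whatever is left if the stream ends
```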
[00:17:23] Unknown:
I would also, just taking a step back and looking at the ecosystem, say there's very much a gray line between batch and streaming. If you look at what companies are doing with, you know, Fivetran, for example, it is streaming most of the time. You're doing sort of an incremental load, like, just moving new data from one place into another. That is a stream. So if you're making an architectural choice here, I would point out you can always go from a stream to a batch, but you can't go the other way. So if you have to pick one, you know, what does that mean? And especially given that you pretty much want streaming anyway, given the benefits of incremental processing and cost reduction.
[00:18:02] Unknown:
Yeah. It seems like streaming ingest as a thing is something that enterprises are really starting to latch onto. They're saying that they want a new reference architecture, which loads their systems using stream processing. That is probably limited for the most part to big enterprises at this point, but I think we're starting to see it trickle down a little bit.
[00:18:27] Unknown:
And in particular, for that kind of initial evaluation, the, you know, proofs of concept for a team that is building out a new capability or starting from scratch, what are some of the potential roadblocks or stumbling points that they're going to hit in the process of saying, we actually want to go with streaming? What are some of the problems that they have typically run up against that then push them into the batch paradigm and some of the ways that you're trying to address those shortcomings in the work that you're doing at Estuary?
[00:18:59] Unknown:
Yeah. I'll start with, let's talk about CDC for a moment. A challenge that often comes up with CDC, especially when you get into really large databases, is the difficulty of backfills. So that's been a challenge in the space for quite a while. You know, Debezium is a great tool, but if you hook it up to a, you know, two-terabyte or larger database, one challenge it has is that your database is gonna fill up its write-ahead log. It's gonna fill up the disk with write-ahead log segments while your backfill is running. So one particular difficulty is the challenge of doing CDC from larger databases.
That's something we've seen several times now, and it's something we've invested a lot in kind of addressing and fixing.
[00:19:46] Unknown:
That one isn't gonna make you go into the batch space, because there's no better solution in the batch space, unfortunately.
[00:19:51] Unknown:
Yeah. That's true. Unless you're doing a full refresh of your table every time, which is, you know, lots of money. So other challenges tend to be around transformation. You almost always want some kind of transformation of your data. As Dave mentioned, ELT is great, but I kind of think the pendulum is swinging back a little bit towards, like, ETLT, which is really kind of just saying, transform it where it makes sense. Transform it while it's in flight if that works and helps you reduce cost or achieve the outcome you want. Transform it in the warehouse if that's easiest and, you know, works for what you need. So it's really about flexibility and being able to transform what you want. But in practice, when organizations have deployed streaming, transformation has been challenging enough that very often they just end up, like, streaming everything into a data lake right away and then doing transformation there. And at that point, you've already kind of given up the potential benefit of streaming in the first place.
[00:20:54] Unknown:
Another thing is just trying to use streaming for the wrong workloads. Streaming is very good for data pipelining, you know, for reshaping your data and putting it into a batch system, actually. So you can do queries within a batch system, potentially. So I think a lot of the time, you'll see a big workload where they're trying to take everything and just put it into a stream processing space where, you know, that's probably not the right way. Starting small, like any big engineering project, is probably the right way to go, and, you know, starting maybe with one critical workload and then expanding from there.
[00:21:30] Unknown:
And one of the things that I'm particularly interested in as far as the paradigm mapping between batch and streaming is that for these batch workflows, there's been a lot of time and energy invested into these data orchestration utilities. So the Airflows, Dagsters, Prefects of the world. And I'm curious how that manifests in the streaming flow where you are doing these transformations. You know, maybe you need to be able to do a data enrichment in flight of that transformation, where you're either joining across streams or you're pulling in data from an API as part of that transformation or querying a table to be able to pull in some additional context, and managing the kind of dependency flows of data as it traverses, you know, one or multiple streams into and out of different batch systems. I'm just curious kind of what the data orchestration landscape looks like and its relative level of support for streaming and maybe any kind of custom solutions that you've had to build up to support that and kind of bridging those paradigms with Estuary?
[00:22:33] Unknown:
Streaming has generally been sort of very application oriented, which has made it very hard to sort of corral a comprehensive picture within the ecosystem of what your streaming data flow is. So by application oriented, I mean that you are writing a streaming application that is choosing the topics that it's going to read from at application run time, and it is choosing the topics that it's going to write to at application run time, which means that that application really understands sort of the provenance of data and, like, what it's doing. But if you are outside of that application looking in, it becomes really hard to understand, like, what is this application reading and, you know, for some topic that it's writing to, how did this data come to be?
What exactly produced this data into this particular topic? So that deficiency of sort of understanding, like, the data flow and provenance tracking of data is something we've thought a lot about and kind of factored quite a bit into our design for how we model transformations within the tool, where the transformation is instead sort of declaring the sources of data that it reads and the transformations that are being applied to that data, and a transformation is one and the same with a collection of data. So, essentially, in Flow, like, you build a collection of data, which is built through transformations of other collections of data. And one thing that allows for is a really tight understanding of provenance and exactly how that data product came to be, which is, you know, crucial.
[00:24:07] Unknown:
Yeah. The question of lineage is also interesting where, as you said, because you have all of the streams, you would know what those data flows are. But I'm curious if there are any interesting edge cases that you've had to manage as you move from the stream flow into the warehouse through some dbt transformations into a Hightouch or something, and kind of just being able to understand from a metadata perspective how to represent those different transition points as you move between these different paradigms and use cases.
[00:24:38] Unknown:
Yeah. I'm not gonna pretend that we've entirely figured this out, but the golden rule, I think, is really just consistency. Consistency in naming, consistency in, you know, essentially representing that this is the same collection materialized into multiple views in different places, which is really kind of how people think about it. You're not thinking about, like, this is my users dataset from my Postgres database or my MySQL database or in Snowflake. You're thinking, like, this is just my users, you know, my collection of users, and it happens to be in one of these three places, but it is the same data.
So encouraging fluidity and sort of understanding that it's the same data product and it's just being synchronized between these places is part of the answer here. And
[00:25:27] Unknown:
another element of this space is the rise of data lakes and data lakehouses as the place where you do these different batch transformations, particularly with the rise of different modern table formats. And I'm curious what you have seen as the impact of that evolution in the ecosystem and the viability and appetite for streaming as a data ingestion mechanism and some of the ways that those are playing off of each other?
[00:25:55] Unknown:
I think in some respects, the rise of modern table formats is organizations kind of wanting to own their destiny a little bit, just in terms of they have an enormous amount of data in these warehouses. The way in which that data is represented has huge implications on their cost, their cost to serve, and being able to sort of self-host and own that is appealing to a lot of organizations, quite reasonably. So in some respects, I see it as kind of just a progression of, you know, we've had BigQuery. We've had Snowflake on the landscape for a little while. Databricks as well. These are sort of an evolution, and, you know, a lot of companies are gonna wanna in-house this, and that's great. I would kind of lump them together a little bit categorically as a bit of the same thing, and you just have more choice now in the ecosystem of whether this is gonna be something you're gonna self-host or whether you're gonna use a vendor to manage your warehouse. But it's fundamentally the same techniques and the same architectures.
[00:26:59] Unknown:
And digging now into the specific implementation of what you're building at Estuary and with the Flow platform, curious if you can start by just describing some of the system architecture around that and the different optimizations that you're focused on in terms of the use cases that you're supporting?
[00:27:17] Unknown:
Sure. So some of the big architectural points are, number one, sort of thinking in terms of entire collections of data that are principally living in the customer's cloud storage, the user's cloud storage. You know, we call it a real-time data lake. And we mean that quite literally, where it's literally a data lake of files in the user's cloud storage that also has a millisecond-latency streaming component on top of it. So that's one key architectural capability. And when we talk about a collection of data, we're talking about this hybrid, you know, real-time data lake that's actually enabling the storage of that collection. One of the other big architectural investments that we've made is really kind of bringing the MapReduce paradigm, not MapReduce as in the implementation, but just the paradigm of map and reduce, into the streaming ecosystem, building it into the tool as a really first-class concern, making data reduction something that just happens all the time. So data shuffles and data reductions are just kind of built in and always there for you.
And that's been honestly a very challenging engineering task, because when you talk about a data reduction for, you know, moving data into a NoSQL store like a Redis cache, for example, that's very fast. You don't need a lot of data around for a data reduction you might be doing. But we have users who are managing materializations into Snowflake where we're needing to sort of roll up, like, tens of gigabytes of data every transaction, and essentially do reductions over everything that's about to go into that warehouse. So enabling that scaling has been a challenge. And then the final piece is probably connectors.
Part of the sort of overall philosophy of what we built is we really wanna get out of the user's way in terms of the user being able to articulate, you know, connectors for different systems they wanna capture from or push data to, and even connectors that are performing transformations. So transformation in our platform is also a connector, and, you know, we very much have a goal of just being able to bring your own transformations over time in your own arbitrary languages.
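To illustrate the map-and-reduce idea mentioned above, here is a simplified Python sketch of reducing documents by key before they reach a destination, so a warehouse sees one rolled-up row per key per transaction rather than every raw event. The key name and merge rules are made up for the example; they are not Flow's actual reduction annotations.

```python
from typing import Dict, Iterable

def reduce_by_key(docs: Iterable[Dict]) -> Dict[str, Dict]:
    """Combine all documents sharing a key into one rolled-up document per key."""
    rolled: Dict[str, Dict] = {}
    for doc in docs:
        key = doc["user_id"]                 # assumed reduction/shuffle key
        prev = rolled.get(key)
        if prev is None:
            rolled[key] = dict(doc)
        else:
            # Illustrative reduction strategy: sum counters, keep the latest timestamp.
            prev["events"] = prev.get("events", 0) + doc.get("events", 0)
            prev["updated_at"] = max(prev.get("updated_at", ""), doc.get("updated_at", ""))
    return rolled

# A warehouse transaction can then write len(rolled) rows instead of one row per raw event.
```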
[00:29:34] Unknown:
Yeah. The only thing that I would add to that is change data capture. That's a pretty important concept. And when you get into very high scale use cases, you know, there are some implications of the way that you do it. So for us, that was something that we focused on upfront and made sure that we were building our own framework, which is kind of built based on the Netflix DBLog implementation, to be able to do that very efficiently for both big backfills and, you know, instant transition over to reading the write-ahead log. Actually, doing both of those things at the same time so you never have issues with the transition.
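For reference, the Netflix DBLog approach interleaves chunked table scans with log reading by bracketing each chunk between watermarks, so backfill and write-ahead-log tailing can proceed together. A heavily simplified Python sketch of that idea, where `db`, `log`, and their methods are hypothetical stand-ins rather than Estuary's implementation:

```python
def backfill_with_watermarks(db, log, emit, chunk_size=1000):
    """Interleave a chunked backfill with write-ahead-log reading (DBLog-style sketch)."""
    last_key = None
    while True:
        low = db.insert_watermark()                                 # hypothetical helper
        chunk = db.select_chunk(after=last_key, limit=chunk_size)   # keyed table scan
        high = db.insert_watermark()

        # Stream log events up to the high watermark; track keys changed in the window.
        changed = set()
        for event in log.read_until(high):
            if low.position <= event.position:
                changed.add(event.key)
            emit(event)                         # log events always win over the scan

        # Emit only chunk rows that were NOT changed while the chunk was being read.
        for row in chunk:
            if row.key not in changed:
                emit(row.as_insert_event())

        if not chunk:
            break                               # backfill complete; keep tailing the log
        last_key = chunk[-1].key
```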
[00:30:13] Unknown:
Another interesting consideration in this streaming versus batch paradigm and the data lineage question is the way that you think about data modeling where in a streaming system, you know, with CDC, obviously, you're dealing with tabular structures. But if you're pulling in event based data, then it's largely going to be denormalized and probably nested and have different kind of variability as you evolve the event tracking system. And I'm curious how you think about the data modeling question in a streaming flow and managing those transformations and some of the ways that the downstream consumption factors into the considerations of how you think about the initial data structures that are entering into your streaming workflow?
[00:30:59] Unknown:
Yeah. That's a great question. We've built around the JSON object model. Just as you said, if you want to deal with sort of event-based data, denormalized data, it's a little bit table stakes that you'd be able to represent truly deeply nested data, potentially even sort of documents modeling graph structures. So we built around the JSON object model, and we leaned heavily on JSON Schema, actually. So JSON Schema, which is a standard, it's part of the OpenAPI specification, is essentially a lightweight language for describing how a document should be validated, and for representing, like, the structure and even some of the semantics of your documents.
So when we talk about a collection of data in Flow, that's a collection of JSON documents with a JSON schema, which is validated every time a document is touched within a collection, whether it's read or written. So, naturally, that leads to some interesting questions about how do you map that to sort of tabular systems, like databases, where they're thinking in terms of columns. And part of the answer here is what we call projections. It's essentially being able to define locations within a document and fields that name them. So, when you materialize into Snowflake, for example, there's a bit of a negotiation that's happening between Snowflake and the Flow platform to figure out, okay, what are the locations that we're gonna project into fields, into columns within that table?
Most of that is automated, but, basically, the user is sort of empowered to fine-tune exactly what it is that's gonna be mapped into columns and how.
[00:32:53] Unknown:
Yeah. So to expand on that a little bit, if there's a specific type of data structure that isn't well met by the destination system, the user can decide how they wanna map that in, if they want to, you know, do something to make it appear as just a raw JSON document, or, you know, something like that.
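A small Python sketch of the projection idea: each field name points at a location inside the nested JSON document, and anything not projected can fall back to the raw document. The projection names and the fallback column are illustrative assumptions, not Flow's exact behavior.

```python
import json
from typing import Any, Dict

def resolve_pointer(doc: Dict[str, Any], pointer: str) -> Any:
    """Fetch the value at a JSON-pointer-style location like '/address/city'."""
    value: Any = doc
    for part in pointer.strip("/").split("/"):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

# Hypothetical projections negotiated for a tabular destination like Snowflake.
PROJECTIONS = {
    "user_id": "/user/id",
    "city": "/address/city",
    "doc_json": "",              # fall back to the whole raw document as a JSON column
}

def project_row(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one nested JSON document into the negotiated columns."""
    return {
        field: (json.dumps(doc) if pointer == "" else resolve_pointer(doc, pointer))
        for field, pointer in PROJECTIONS.items()
    }

row = project_row({"user": {"id": "u1"}, "address": {"city": "Berlin"}})
# {'user_id': 'u1', 'city': 'Berlin', 'doc_json': '{"user": {...}, "address": {...}}'}
```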
[00:33:13] Unknown:
And so for teams who are interested in integrating Estuary into their data platform and into their data flows, can you talk through what that onboarding and integration path looks like and some of the ways that they should be thinking about the data modeling, system design, and workflow management as they go from, you know, idea to initial proof of concept into larger scale production use cases?
[00:33:40] Unknown:
At a very high level, we operate as a SaaS platform at this point. So, you know, getting started is pretty straightforward. You can essentially just log into it, configure your captures, get data into the system, and start working with it directly. It's designed to be a system that you can work with with different personas. So, for instance, if you have a DevOps person who's going to be setting up that database, they can log in. They have some sort of permissions. And then maybe you have a data engineer who's creating transformations, if you even wanna do that. And they can all kind of work together. The data modeling aspect of it is, you know, we provide tools for that. You can use different languages to create transformations.
Right now, we support SQL and TypeScript. We hope to support Python relatively soon as well. So, you know, you're able to work with the kind of tools that you're expecting to do that and then push it into systems that you care about. So, really, that's the high-level answer. Johnny, what do you have to add to that?
[00:34:48] Unknown:
Yeah. We're targeting a glide path where you can kind of come in and, just purely through the UI, capture data from a system you care about and move it into another system that you care about. You know, I've mentioned that collections have JSON schemas. You can very much get into writing your own JSON schemas to ensure that your collections of data are pristine, or you can ignore that aspect, because our connectors generally write that schema for you, and you just use it.
So it's a bit of a choose-your-own-adventure type answer, where you can get value from just connecting data systems that are not currently connected. And as you pull back the layers, I mean, obviously, we're operating in kind of a space where there are sort of deep architectural considerations that an organization might wanna make around how they're moving data throughout their entire platform. You know, of course, we would love it for people to be using Flow for more of that stuff, but we very much encourage and support sort of picking a single use case, starting there, seeing how it works for you, and then taking it step by step.
One other angle I'll throw in here is there's really kind of a hybrid workflow where it's both UI and also GitOps. So you're able to sort of take the work in progress that you have, for a transformation you're building, for example, and then take that to a Git repo and manage it there and share it with your team through that Git repo, and you can go back and forth.
[00:36:20] Unknown:
Yeah. The deployment management was gonna be my next question as far as how do you go from, you know, testing into production and managing that propagation and versioning of the transformations and data flows that you wanna incorporate?
[00:36:35] Unknown:
Yeah. You can either sort of do it strictly through the UI, if that's not something that you wanna take on, you just wanna make a change in the UI and ship it and call it good. That's entirely something you can do. You can also manage this in a Git repo. One other capability of the platform is writing tests. This is actually something that's really hard to do for most transformations. You know, if you're working with dbt, for example, testing is actually pretty challenging to do. One capability that Flow gives you is an ability to sort of write contract tests, where you're essentially saying, like, if I have some data that gets written to this collection up here, this, you know, sort of upstream collection that's going through potentially multiple rounds of transformation, I expect to see this output in this collection over here, and, you know, you can write tests about that.
And Flow will verify them before pushing out any changes that might impact that data flow. So it essentially has CI kind of built into the platform, where, if you're making changes, you can test it locally, and then we will also test it before we set it live, just to make sure that you're not breaking a data flow inadvertently.
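As an illustration of the contract-test idea, a pytest-style sketch: fixture documents are written to an upstream collection and the derived collection is asserted against expected output. `build_pipeline`, the collection names, and the helper methods are hypothetical stand-ins, not Flow's actual test syntax.

```python
def test_clicks_roll_up_by_user():
    """Contract test: documents written upstream must produce the expected rollup."""
    pipeline = build_pipeline()          # hypothetical: wires capture -> derivation

    pipeline.ingest("acmeCo/clicks", [
        {"user_id": "u1", "clicks": 1},
        {"user_id": "u1", "clicks": 2},
        {"user_id": "u2", "clicks": 5},
    ])

    derived = pipeline.read("acmeCo/clicks-by-user")
    assert derived == [
        {"user_id": "u1", "clicks": 3},
        {"user_id": "u2", "clicks": 5},
    ]
```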
[00:37:51] Unknown:
And then for the case where you are dealing with a source system that doesn't understand incremental flows or push-based flows, so not CDC, but thinking in terms of API sources. So being able to pull from a HubSpot API or the Salesforce API, for instance, being able to map between that required batch-oriented flow and the streaming system. And in particular, I'm curious if you have seen any movement in these different SaaS systems toward a more push-based or pull-friendly API design, versus we only offer a batch bulk extract interface, and that's all there's ever going to be.
[00:38:39] Unknown:
Absolutely. I mean, to name one, Salesforce, which you just mentioned, we have a real-time connector for Salesforce that is entirely end-to-end streaming. HubSpot as well has incremental updates. So for a lot of these SaaS APIs, it is a polling-based approach, but it's polling to fetch sort of changed data since the last polling interval, and, you know, different people have different views on this. But depending on where you set your polling interval, it starts to look a lot like a real-time stream. You're hand waving a little bit around the definition of real time. So for these APIs, essentially, yeah, in some respects, they are polled. Some APIs, you're just doing a full refresh every time, and then others are truly streaming.
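A minimal Python sketch of that cursor-based polling pattern against an incremental SaaS API; the endpoint, parameter names, and response shape are assumptions for illustration.

```python
import time
import requests

def poll_changes(base_url: str, api_key: str, interval_s: float = 30.0):
    """Poll an incremental API for records changed since the last cursor."""
    cursor = "1970-01-01T00:00:00Z"          # start of history; persist this in practice
    while True:
        resp = requests.get(
            f"{base_url}/records",            # hypothetical endpoint
            params={"updated_since": cursor}, # hypothetical incremental parameter
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        for record in resp.json().get("results", []):
            yield record                      # hand each change to the pipeline
            # ISO-8601 timestamps compare lexicographically, so max() advances the cursor.
            cursor = max(cursor, record["updated_at"])
        time.sleep(interval_s)                # a short interval looks a lot like a stream
```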
[00:39:22] Unknown:
And so in terms of your experience of building Estuary, working with your customers, seeing some of the ways that streaming data is being integrated into different platforms and data flows, what are the most interesting or innovative or unexpected ways that you've seen your technology used?
[00:39:40] Unknown:
What I would say is something I was not expecting. We have an HTTP file connector, which allows for just sort of fetching arbitrary HTTP URLs. And something that has been a little surprising to me, though obvious in retrospect, is how many users essentially just use that connector to grab from a slew of different bespoke REST APIs. So that's proven to be a pretty easy way for people to sort of integrate SaaS systems or APIs that we don't have first-party, first-class connectors for into the platform.
Another is we have a Google Sheets materialization, which is keeping a Google Sheet up to date in real time. And that was originally built as kind of a cute toy example of, like, what Flow is capable of, that it could maintain a materialized view over, like, a fire hose of data and just kind of keep it up to date in a Google Sheet. But, you know, I built this for a demo, and yet users are using it in production quite a bit. Yeah. That's probably the first thing that they do. They try the Google Sheets connector and just, like, see a view of their data, make sure stuff is working.
[00:40:50] Unknown:
Yeah. I think that's right. Like, one thing that we learned was it's really important to have a toy use case available, a toy that you can play with, and, you know, obviously, in hindsight, it makes a ton of sense. No one's gonna go in and hook up their production database to some SaaS system that they just found on the Internet. Right? Like, they want to try it out and do stuff like that with it. So definitely something we see.
[00:41:17] Unknown:
And I was just thinking of another question that I'm interested in understanding a bit more about, which is, for those CDC use cases, particularly for transactional databases, so being able to pull those write-ahead logs from a Postgres or MySQL, one of the challenges that I've often heard related is being able to reconstitute those very minute details into something meaningful at the other end, where, particularly for the case, if you have a transaction in the database that ultimately ends up failing and getting rolled back, how do you ensure that you capture that in the end result, and being able to turn these very minute details into a representation that you can actually run analysis on without having to spend a lot of engineering effort dealing with some of these edge cases. I'm wondering if there are just any off-the-shelf libraries or built-in capabilities in Estuary or some of these other warehouses that help with that translation from those low-level details into a working set of data?
[00:42:19] Unknown:
Sure. So two parts to this. One is that databases are pretty good about only exposing committed transactions in the write-ahead log that our connectors are reading, or generally that you're reading through logical replication. Another aspect of this is time. If you have a capture of a database where you've got multiple tables that are being captured, and those are landing into multiple different collections, and a transformation or a materialization that is reading from multiples of those different collections of data, you really want sort of the progress of that materialization or that transformation to be in wall-clock time order, based on the time at which rows were added to those various collections, so that you're seeing the effect. You know, you don't want a customer order to show up before the product record that you would join against it does. So, you know, one particular answer to your question is that when reading data, we are doing sort of a live streaming shuffle of data from these different collections.
And while there are knobs to control this, generally we're always reading in time order. So you're
[00:43:30] Unknown:
seeing or transforming or materializing documents in the wall-clock time order in which they were originally written, regardless of which particular collection of data they're in. And to add to that a little bit, so we kinda mentioned that the system has an underlying data lake. And the way that that underlying data lake actually works is it's an immutable log. So it has every single event: the deletions, the, you know, updates, everything. And then we put the burden of constructing that into something that a human can read onto a materialization. So the materialization can then take those events, which are a lot more granular and low-level detailed, and kind of reconstruct
[00:44:10] Unknown:
the view of the world from it. Probably getting a little too deep in the weeds here, but another interesting use case is for people who are very heavily invested in microservices, or maybe you're streaming information from multiple different data stores that in aggregate will give you the information that you want, but in isolation aren't terribly meaningful. So maybe you have an ecommerce service that captures the payment info, and then you have a shipping service that will capture the, you know, bill of lading, and you actually want to make those two data flows dependent on each other in the end system. And I'm curious if there's any capability to be able to note that dependency between those different data flows within Estuary to ensure that they are materialized in conjunction at the output.
[00:44:57] Unknown:
Yeah. When you build a materialization within Flow, a materialization is not one-to-one with a single source of data. A materialization is actually multiple sources of data, multiple collections, which are being materialized together by a single task into a destination. So a materialization connector within Flow is not one-to-one with a single source of data. The materialization connector is actually taking multiple sources of data that it's been configured with and keeping them up to date collectively in an endpoint system.
So if you have multiple sources of data, where you might have, you know, these different microservices that you're referring to, and you're materializing them together, then the guarantee that's made is that all of those tables are updating in the overall wall-clock time order at which those events happened. So that's a big part of the answer here.
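The ordering guarantee being described can be pictured as a timestamp merge across collections. A small Python sketch, assuming each collection is already ordered by a `written_at` field (an illustrative name):

```python
import heapq
from typing import Dict, Iterable, Iterator

def merge_by_wall_clock(*collections: Iterable[Dict]) -> Iterator[Dict]:
    """Yield documents from several collections in overall written-at time order."""
    # Each input collection is assumed to already be sorted by its write timestamp.
    return heapq.merge(*collections, key=lambda doc: doc["written_at"])

orders = [{"written_at": 2, "type": "order", "sku": "A"}]
products = [{"written_at": 1, "type": "product", "sku": "A"}]

# The product record is seen before the order that would join against it.
for doc in merge_by_wall_clock(orders, products):
    print(doc["type"])
```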
[00:45:51] Unknown:
And in your experience of building Estuary and exploring this streaming integration space, what are the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:46:04] Unknown:
So as mentioned, CDC has been a pretty tough nut to crack. That's been a place where we've had to invest a lot of effort in order to achieve really good outcomes for very large databases. You know, in this space, a lot of people just kind of build on top of Debezium and call it good. We really tried to do that. I promise we tried, and we were kind of dragged kicking and screaming into building our own solution here, just to make sure that very large databases have a good experience of doing backfills. It's not acceptable to fill up the database's disk. Other aspects of this are that building a poll-based capture is fairly, I'm not gonna call it easy, but it's a simple conceptual model that you've gotta connect with. It's sort of reaching out into a system and pulling data from it. But there's a really key capability that also comes up, and you mentioned it a little bit earlier around microservices, where you really want to be able to push data into a capture as well. If you've got, like, a serverless function, for example, that's, you know, interacting with the user through an API that you're providing, and then it has some events that get produced through its action and you wanna put those events somewhere, what's the simplest thing you could do to do that? And I would say that it's probably just, like, calling a webhook, doing an HTTP post to basically dump that event data somewhere.
And so we offer an HTTP ingest connector that is push-based in nature, and making that work was actually quite interesting and got us into this whole, like, connector networking epic of work.
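From the producer's side, push-based ingest really is just a webhook call. A short Python example of what that might look like, with a placeholder ingest URL and credential:

```python
import requests

def push_event(event: dict) -> None:
    """Send one event to a push-based (webhook-style) ingest endpoint."""
    resp = requests.post(
        "https://ingest.example.com/acmeCo/events",   # placeholder ingest URL
        json=event,
        headers={"Authorization": "Bearer <token>"},  # placeholder credential
        timeout=10,
    )
    resp.raise_for_status()

# e.g. called from a serverless function after it handles a user request
push_event({"user_id": "u1", "action": "signup", "ts": "2023-08-01T12:00:00Z"})
```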
[00:47:40] Unknown:
I would add that, you know, we all talk a lot about data quality, but a lot of the time what data teams really want is for their data to keep flowing. And so something that we definitely learned is that you really need to prioritize that and at least provide the tools to make sure that, even if a schema is changing out from under you, there's a way to ensure that that goes downstream, and a way to, like, bypass, you know, standard data quality checks that you might want to impose around schemas and stuff. So automation of that, automation of schema drift making its way downstream into a database, new tables coming downstream. All those things are really, really important, and potentially more important to a lot of data teams than quality is, for instance. And expanding on that slightly,
[00:48:30] Unknown:
there is not a single schema for a collection of data. You actually really want a write-time schema and, potentially, a read-time schema, because with that write-time schema, if you've got a source of data that's not under your control, you kind of want to accept those writes no matter what. And you still might want very strong schema validation, but if you do it at read time, you can ensure that you're not losing any data. And if your pipeline breaks because the read-time schema is strong, you have an opportunity to fix it without losing any data along the way.
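A sketch of the write-time versus read-time schema idea using the jsonschema library: a permissive schema accepts every write so nothing is lost, while a stricter schema is enforced when the data is read or materialized. The field names are made up for the example.

```python
from jsonschema import Draft7Validator

# Permissive write-time schema: accept anything that is at least a JSON object,
# so writes from a source you don't control are never dropped.
write_schema = {"type": "object"}

# Stricter read-time schema: enforced when transforming or materializing.
read_schema = {
    "type": "object",
    "required": ["user_id", "amount"],
    "properties": {
        "user_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
}

write_validator = Draft7Validator(write_schema)
read_validator = Draft7Validator(read_schema)

doc = {"user_id": "u1", "amount": "oops"}       # bad data still gets captured...
assert not list(write_validator.iter_errors(doc))

errors = list(read_validator.iter_errors(doc))  # ...and the problem surfaces at read time,
assert errors                                   # where it can be fixed without data loss
```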
[00:49:01] Unknown:
For people who are building their data systems either from whole cloth or evolving from an existing state, what are the cases where Estuary is the wrong choice?
[00:49:13] Unknown:
I would say a lot of the people that I talk to hear that we have SQL transforms, and they think we can be their analytics system, which is not what we're intended to be. Right? We're not an analytics system. We are a data pipelining platform. So we're used to create long-lived transformations that feed into other systems and populate those. And a batch system is much better suited to answer point-in-time questions, versus a streaming system, most of the time. Right? If you're doing exploratory data analysis, you probably want a batch system versus a streaming system. And the advantage of using a streaming system in conjunction is that you can have a real-time answer. Right? Like, you can have all of your data, including what's happened very recently.
So I'd say that that's, like, 1 of the things that we see often.
[00:50:04] Unknown:
Yeah. We're great at operationalizing the answers to questions that you've already learned how to ask. For something where you're still figuring out the question, an easy ad hoc query capability is probably most suited for what you're after.
[00:50:20] Unknown:
And as you continue to build and evolve the Estuary platform, what are some of the things you have planned for the near to medium term or any projections that you have about the streaming ecosystem as we move forward?
[00:50:33] Unknown:
Yeah. I think the biggest is transformation, and I don't think there's a single right answer here. SQL is obviously great in many cases for adding transformations, and we've invested in it for that reason. However, we also see plenty of people use TypeScript. Python is extremely popular, and I think there's a whole new class of streaming transformations, and just transformations in general, which I'll call, like, AI pipelines, and just grounding it out specifically, like, prompt engineering. If you wanna integrate ChatGPT into an operational pipeline where you're asking it to do something over different context, you've gotta call out to OpenAI in order to do that.
And you can't do that with SQL. So just, you know, capabilities of transformation where you might be integrating vendors and third party services that are helping you with that transformation is another ingredient here as well. So, something I'm excited about is just being able to deliver, like, a rich, you know, canvas of options for doing transformation within the streaming pipelines.
[00:51:41] Unknown:
Are there any other aspects of the work that you're doing at Estuary or the overall space of streaming data integration that we didn't discuss yet that you'd like to cover before we close out the show? I feel like you've done a pretty good job of nailing a lot of the questions that we could have possibly thought of. So, yeah, thanks for having us. Alright. Well, for anybody who wants to get in touch with each of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today? I would say we've spent a lot of time talking about sort of some of the details of batching and streaming data and how we're transforming data and that sort of thing.
[00:52:24] Unknown:
I think the larger work still to be done is democratizing this within organizations and allowing more stakeholders with different technical capabilities to access and leverage the data that exists within an organization. We've spent quite a while as an industry where we've had these, like, data teams that are this narrow roadblock and bottleneck for the needs of their organization. And, you know, the modern data stack has helped here some. Where we really need to move to is enabling much larger cohorts of personas within an organization to effectively work with and understand and discover the data that's available within their org and understand how it was built and why. I would probably say
[00:53:17] Unknown:
that's a very good one. I'd add that a lot of the tools that we've seen come out in the last few years have been very purpose-built for a specific single problem. Right? Like, get data in its exact format from one place to another. And I think those are probably not the way that we're gonna be working with tools in the future. Right? We're gonna be looking at more platform-type approaches, tools that allow you to do more of the holistic data engineering story within a single place. One of the problems that I think we see when we have all these point solutions is that they encourage people to not talk to each other. Right? Like, people do their own little thing. You have software engineers that are producing data, data engineers that are consuming data, and this big gulf that happens maybe between them. It'd be really nice if they were more collaborative and we could bring the two together so that, you know, when we're advancing datasets, we had a much easier way
[00:54:18] Unknown:
to make sure that that's going to work, to filter it downstream, to know how it's going to affect all of our analytics that are highly dependent on that. Yeah. Absolutely. To get rid of this situation of, hey, I added a new column to the database, or, hey, I dropped this column or changed the type, and now everything else downstream is broken.
[00:54:35] Unknown:
Surprise. Yep. Exactly.
[00:54:38] Unknown:
Absolutely. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Estuary. It's definitely a very interesting product and platform. It's great to see folks investing in making streaming data a more viable option for a wider range of industries and people and organizations. So I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thanks for having us, Tobias. Thanks. Good to meet you. Yep. Thanks for having us. Cheers.
[00:55:09] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Guests and Their Backgrounds
Overview of Estuary and Its Origins
Evolution of the Streaming Ecosystem
Challenges and Innovations in Streaming
Adoption of Streaming in Enterprises
Data Orchestration in Streaming
System Architecture of Estuary
Onboarding and Integration with Estuary
API Integration and Real-Time Data
Handling CDC and Transactional Data
Microservices and Data Dependencies
Lessons Learned and Future Directions
Closing Thoughts and Contact Information