Summary
One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch oriented mindset.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Ori Rafael about strategies for building stream and batch processing patterns for data lake analytics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the state of the market for data lakes today?
- What are the prevailing architectural and technological patterns that are being used to manage these systems?
- Batch and streaming systems have been used in various combinations since the early days of Hadoop. The Lambda architecture has largely been abandoned, so what is the answer for today’s data lakes?
- What are the challenges presented by streaming approaches to data transformations?
- The batch model for processing is intuitive despite its latency problems. What are the benefits that it provides?
- The core concept for data orchestration is the DAG. How does that manifest in a streaming context?
- In batch processing idempotent/immutable datasets are created by re-running the entire pipeline when logic changes need to be made. Given that there is no definitive start or end of a stream, what are the options for amending logical errors in transformations?
- What are some of the data processing/integration patterns that are impossible in a batch system?
- What are some useful strategies for migrating from a purely batch, or hybrid batch and streaming architecture, to a purely streaming system?
- What are some of the changes in technological or organizational patterns that are often overlooked or misunderstood in this shift?
- What are some of the most surprising things that you have learned about streaming systems in your time at Upsolver?
- What are the most interesting, innovative, or unexpected ways that you have seen streaming architectures used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on streaming data integration?
- When are streaming architectures the wrong approach?
- What do you have planned for the future of Upsolver to make streaming data easier to work with?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Upsolver
- Hive Metastore
- Hudi
- Iceberg
- Hadoop
- Lambda Architecture
- Kappa Architecture
- Apache Beam
- Event Sourcing
- Flink
- Spark Structured Streaming
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Ori Rafael about strategies for building stream and batch processing patterns for data lake analytics. So, Ori, can you start by introducing yourself? Yes. Thank you, Tobias. My name is Ori Rafael. I'm the CEO and one of the cofounders of Upsolver.
[00:02:12] Unknown:
I'm coming from the data space. Data is basically all I've done before. I started as a database engineer and led the data integration group in my unit in the Israeli intelligence. And so you mentioned that you've been doing data your whole career. Can you just share a bit about how you first got involved in the area and sort of what it is about the space that keeps you interested and motivated to keep working in it? When I arrived at my unit, we weren't allocated to a specific team, so I had to interview with a whole bunch of teams. And the odd one out was the database team. People didn't necessarily like the idea of going to something more support or maintenance oriented, but I liked the fast feedback loop of working with a database. I think it also combines a lot of the different disciplines in computer science. You can think about performance and reliability and concurrency. Like, there's a lot to pack in there. So I think it's a mental challenge. That's what I liked about it, and it's what I like about it still today. Give me the option between 500 lines of SQL or 500 lines of code, and I'll choose the SQL every day of the week. So as you mentioned, you have founded and you've been working at Upsolver, which is a platform for applying streaming patterns to building out data lakes. But can you just start by giving a bit of an overview of the state of the market for data lakes today and some of the prevailing architectural and technological patterns that are being used to manage these types of systems?
Sure. And I think that the term data lake and what it's supposed to do is definitely subjective if you look at what different companies are doing. One pattern, let's say on one side, would be: I have a data warehouse, I'm able to query JSON files from a bucket, and I call that doing a data lake — basically just querying it. And the other side would be going full on Hadoop, today full on Spark, where I'm populating the storage, I'm making the data queryable, I'm doing all my data processing in the lake, and I can even query the data eventually from the lake. I think the open source community today would lean more toward the latter, the open data lake pattern, whether I work with a Hive metastore, Iceberg, or Hudi — there are a few options to build an open data lake. And, yeah, another option would be, let's say, to have the data lake more as a staging area, and from the data lake I'm gonna send data to different targets, a data warehouse being one of them. The third would be: I have a data warehouse, and I just wanna query the data in the lake. That would be the lighter option. So, three options.
[00:04:41] Unknown:
And so the early days of data lakes were dominated largely by Hadoop and the large suite of technologies that grew up around that, and the majority of that was built around the MapReduce paradigm. And so that's definitely very much a batch oriented workflow, but there were cases where streaming updates were necessary and people wanted to be able to incorporate data as it was generated and as it landed in the data lake. So that led to the creation of the Lambda architecture, which has largely been abandoned at this point. But I'm wondering if you can just give a bit of an overview of the state of streaming as an architectural paradigm for being able to load and process data in these data lake architectures and some of the ways that it manifests now and also some of the complications and challenges that people face when they're trying to go to a fully streaming architecture and why we still end up landing in these batch paradigms.
[00:05:38] Unknown:
So first of all, as you said, there was batch, and then companies had a real business need to analyze data in real time. So they added streaming, and streaming started as an additional layer on top of that: I did everything I could with batch, and I would handle the last, most recent slice of data in streaming. One of the things that I remember very well, and I think machine learning really increased the pain around it, is the fact that you're calculating features in two different ways. That's the problem with doing real time machine learning. I still remember looking at Yoni, my partner, going bug hunting where no one believes the code, because calculating the feature in batch gave a different result than calculating it in streaming.
So people started to understand that if they're calculating a feature, it should be calculated exactly as it's going to be used in real time. A nice example is a prediction system for ads. In batch, you could use all the historical data, so all of a user's historical clicks and engagement. But when you go to real time, just because of data pipeline limitations, they couldn't get the last 3 hours of data at prediction time. So now the situation at prediction time is different from what you had during training time. You're committing a foul: you're doing something during training that you cannot do during serving. That's a good example. So I think that when you look at real time machine learning use cases, you definitely have a very strong pain that pushes you to do everything in streaming, as opposed to mixing streaming and batch. If your use case is reporting and you're able to live with a bit of latency, then the competition starts: do I wanna do it with streaming, or do I wanna do it with batch?
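To make the training/serving skew concrete, here is a minimal Python sketch (not Upsolver's implementation) of a hypothetical `clicks_last_3h` feature maintained by a single incremental code path. Because the same object answers both the training-set writer and the online predictor, the feature cannot silently diverge between the two paths, which is the "foul" Ori describes.

```python
from collections import deque
from datetime import datetime, timedelta

class ClicksLast3Hours:
    """One incremental implementation of a hypothetical clicks_last_3h feature,
    fed by the event stream and queried identically when building training
    examples and when serving predictions."""

    WINDOW = timedelta(hours=3)

    def __init__(self):
        self._clicks = {}  # user_id -> deque of click timestamps (event-time order)

    def on_click(self, user_id, event_time):
        self._clicks.setdefault(user_id, deque()).append(event_time)

    def value(self, user_id, as_of):
        window = self._clicks.get(user_id, deque())
        # Expire clicks that have fallen out of the 3-hour window.
        while window and window[0] < as_of - self.WINDOW:
            window.popleft()
        return len(window)

feature = ClicksLast3Hours()
now = datetime(2021, 11, 1, 12, 0)
feature.on_click("user-1", now - timedelta(hours=4))     # outside the window
feature.on_click("user-1", now - timedelta(minutes=30))  # inside the window

# The same call backs both the training-set writer and the online predictor.
print(feature.value("user-1", as_of=now))  # -> 1
```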
But I think that I see, as you said, fewer and fewer companies saying, I wanna have batch and streaming at the same time, because of that pain. I can tell you that in 2017, when I talked to people that had this problem, I felt like I was introducing deep learning to the world. No one was really aware that there is an alternative to the Lambda. By the way, it's called the Kappa architecture — basically using one code path. Maybe the name Kappa architecture is still not that well known, but the pattern is definitely catching on. For people who were
[00:07:53] Unknown:
in that Lambda architecture space and they were trying to bridge these two different patterns, I'm wondering if you can talk through some of the manifestations of the pain and the types of bugs that were introduced by having to have those two different code bases. And I know that there were efforts to try and unify the logic with projects such as Apache Beam to have a single code definition that would run across these different batch and streaming architectures, and just some of the reasons that that unification just never really worked out.
[00:08:23] Unknown:
As you said, the problem is in the name. It's a unification project. I'm trying to say, well, instead of you doing the Lambda architecture, I, the vendor, am gonna do that for you, and I'm gonna do a better job. Of course, there is some truth to it. Like, if you're gonna do it as a vendor and that's all you do, you're gonna do it better, but you still get the same type of problem. You can't really take the user, the risk, out of it. So if I'm a user and I wrote something one way for batch, and the streaming way was a little different, then I'm still gonna come back and have the same problems. When I go to debug these pipelines, I still need to understand one way versus the other way. But I really look forward to a future where streaming is the way you're doing pipelines for both stream and batch data, and then you use queries for the use cases, which we can discuss later, where I can't really do streaming, where it's better to just do one-offs on the data.
[00:09:18] Unknown:
As far as the sort of stickiness of batch as an architectural pattern, despite the fact that streaming as a technological capability has been around for decades at this point, as people are building new systems, they generally will automatically default to doing things in a batch oriented manner just because it's more intuitive. It's easier to think about and reason about because of the ways that we think about the world. And I'm wondering if you can just give some of your perspective as somebody who's been working in the space about why it is that batch continues to be so prevalent despite the fact that streaming architectures have been proven, in many cases, to be more efficient and potentially more accurate.
[00:10:01] Unknown:
I think one huge reason is SQL. Streaming is still tied to code. Usually you have a very specific type of engineer who's gonna do your streaming workloads, and, you know, databases gave the power to the people. The people still have SQL, and they can do batch as opposed to going and opening a ticket for someone. I think that's a very big reason. Maybe I'm a little biased, because this is part of what Upsolver is doing: Upsolver is SQL pipelines, which are continuous, as opposed to code based pipelines. I think that's a very strong reason why people defaulted to batch. The second reason is that the streaming systems were limited. In the beginning, I would use something that would just process an event as it arrived, and that's it. That's great for maybe append only. Then it grew to recent data, starting from Apache Storm, or even the later streaming systems that looked at the context and maybe allowed me to do sessionization and some streaming joins, as long as I could already fit the data for the entire join in memory on the machine that I'm currently working on.
And the data lakes, where you usually put a lot of data, could not do any updates. You were still tied to the append only pattern. So if you wanted to fix your data, you needed to implement your pipeline once and then implement it again for late data. You needed to write two pipelines instead of just one pipeline. All those limitations that caused streaming to be the less functional version of batch were the second reason. You needed to start thinking about your problem differently to accommodate the limitations of the streaming system. And the more you reduce these limitations, the easier it gets to do streaming almost as easily as batch.
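As a rough illustration of how modern engines reduce the "write a second pipeline for late data" problem, here is a sketch using Spark Structured Streaming (one of the engines linked above). The Kafka broker, topic, and output paths are hypothetical placeholders; the point is that a watermark tells a single pipeline how long to keep window state open so late events are folded into the right window instead of requiring a separate late-data pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("late-data-sketch").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical Kafka source; requires the spark-sql-kafka connector on the classpath.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "clicks")
       .load())

clicks = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# The watermark bounds how late an event may be and still update its window,
# so one pipeline absorbs stragglers instead of a second "late data" pipeline.
counts = (clicks
          .withWatermark("event_time", "1 hour")
          .groupBy(window(col("event_time"), "10 minutes"), col("user_id"))
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3://example-bucket/click_counts/")        # hypothetical path
         .option("checkpointLocation", "s3://example-bucket/chk/")   # hypothetical path
         .start())
```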
[00:11:50] Unknown:
In terms of the different ways that streaming and batch are used throughout the data lake, there are a few different stages of the life cycle and transition points. One is the extract and load, where you can use it for doing change data capture from a database or being able to pull updates from an API, although you might still need to do some batching for those types of workloads. And as far as the batch oriented, as-a-service option, there are things like Fivetran that will periodically do a bulk load from source to destination. There is the transformation aspect, where you might be doing some manipulation of data either before it lands in the data lake after you've extracted it, or after you've loaded it into the data lake, where you wanna be able to do some transformations to get it ready for downstream analytics. And, again, you might be able to do that in a streaming context, but a lot of tools will do it in batches of various sizes. Even things like Spark will use microbatches to be able to churn through all this data. And I'm just wondering if you can talk through some of the different applications of streaming in the context of data lakes and throughout the data life cycle, some of the ways that it can be beneficial to use that approach as opposed to batch, and some of the ways that batch processing paradigms can become unwieldy?
[00:13:10] Unknown:
So I think the first thing I would visit would be the event sourcing architecture, which is a very strong fit for streaming. The idea is that you keep a raw copy of all the events. This is a good fit for temporal data, of course. So I keep a time based copy of all my events; this is an append only log. And then the way I'm doing a pipeline is basically to run the computation from that log to the target, to some kind of projection I wanna create. Why do I like event sourcing so much? Because the database has a mutable state. If I take all the data inside the database and start changing it and just give people access to it, I lose that append only log. Why is the append only log so useful? Let's say I'm a researcher, and I have a hypothesis on the data. I can now go back, look at the append only log, check my hypothesis retroactively, and understand the status of my data a few months ago as opposed to the status today. So I can emulate my knowledge of the world for every point in time and try to run a pipeline to check if there's any value in my hypothesis.
Event sourcing is a research accelerator. Once you're able to do that, you see people starting to do more and more replay or reprocessing operations to check their hypotheses. When I started talking about reprocessing, that was the main value I saw, but there is an equal value for operational purposes. Everyone writes pipelines with bugs; all pipelines have bugs in their early versions. And then you go and just rewrite the data again and again. Since this is a data lake type of solution, I can go and just rewrite the data, and I'm not creating any stress on my database because I'm just writing files to object storage, to S3. So I can reprocess my data and fix operational issues without any implications for the user. So event sourcing, for me, is really a must, and I think stream processing goes very, very well with it. I think the second important difference between streaming and batch is the concept of calculating data as materialized views on top of raw data. The way we used to do data analysis with batch is: I have my raw data, and that's not something I usually query directly.
I create a table model, some kind of relational model, and from that I start to run views and queries. This model adds a step of table modeling that keeps the consumers further away from the data. In the past, we didn't have enough compute power, definitely not elastic power, to go and reprocess the data directly from the raw events. But today, we can do that. Suddenly I see more and more companies going from the event log straight into a materialized view that's already joining all the tables or other sources I want together, without going through a relational model. I feel that has also been an accelerating factor for research, because you're basically moving to talk in terms of the business problem. Each materialized view represents a business problem, as opposed to batch where I'm doing two steps. And I really like the materialized view approach.
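A minimal sketch of the event sourcing pattern Ori describes, assuming a hypothetical `order_placed` event type and a local JSON-lines file standing in for the append-only log on object storage. Replaying the immutable log with different cutoffs is what lets you reconstruct "my knowledge of the world" at any point in time, or rebuild a projection (materialized view) after fixing a bug.

```python
import json
from pathlib import Path

EVENT_LOG = Path("events.jsonl")   # hypothetical append-only event log
EVENT_LOG.write_text("")           # start from an empty log for this demo

def append_event(event):
    """Events are only ever appended, never updated in place."""
    with EVENT_LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

def rebuild_view(as_of):
    """Replay the log up to `as_of` to materialize a projection.

    Because the log is immutable, the same replay with a different cutoff
    reproduces exactly what was known at that point in time, which is what
    makes hypothesis-checking and bug fixes cheap: just reprocess."""
    orders_by_user = {}
    with EVENT_LOG.open() as f:
        for line in f:
            event = json.loads(line)
            if event["event_time"] > as_of:
                continue
            if event["type"] == "order_placed":
                user = event["user_id"]
                orders_by_user[user] = orders_by_user.get(user, 0) + 1
    return orders_by_user

append_event({"type": "order_placed", "user_id": "u1", "event_time": "2021-06-01T10:00:00"})
append_event({"type": "order_placed", "user_id": "u1", "event_time": "2021-09-15T09:30:00"})

print(rebuild_view(as_of="2021-07-01T00:00:00"))  # {'u1': 1} -- the world as of July
print(rebuild_view(as_of="2021-12-31T00:00:00"))  # {'u1': 2} -- the world today
```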
[00:16:23] Unknown:
On that point of being able to reprocess data when you have some sort of logical error, or when you want to, you know, build a new materialized view across the entire history of processing. As you mentioned, in the early days of the data lake, that wasn't really feasible because we didn't have the compute capacity, and the algorithmic efficiency wasn't yet to the point where we have it now to make that a tractable problem. Which is why we went with these batch approaches, where we'd say, okay, I made a bug in my code, so now I need to reprocess everything since the beginning of time, and I'll check on it in a week and see if I actually fixed it the right way. So I'm wondering if you can just talk through some of the improvements in technology, both algorithmic and architectural aspects of the underlying technology, that allow streaming to be a viable option now? The first answer is cloud.
[00:17:15] Unknown:
Elastic compute whenever I want it, for the peak when I need it. I grew up in a world where you had to plan for your maximum consumption, not for your average consumption with peaks. That's a very different story. So cloud is definitely the first one. I would say the second is the evolution of streaming technologies that are allowing you to do more and more use cases that are stateful, and therefore actually implement the same thing you do with batch in streaming — extending past the limitation of only working with recent data in streaming.
I guess the third would be just the efficiency of compute, what you can do with it. I remember having a database with a thousand writes per second, and that seemed like a very big deal. Every time I would want to touch an index in the database, I would need to go to operations and plan for a 2 AM shutdown, when the users are not querying the database. Now a single customer is doing 3,000,000 events per second. So there is a big difference in how it goes, and it's not only the software.
[00:18:22] Unknown:
I think the indexing aspect is another interesting point to dig into, because when Hadoop first came onto the scene, you weren't really indexing all of the data in the lake. It was just: you have all these files, and there's some metadata about what's in there. You were able to access individual attributes, but you weren't able to very easily aggregate across them. But, you know, with the advent of things like Parquet and ORC adding columnar storage in the data lake to get better scanning efficiency, and adding indexes in the metadata layer to be able to understand, you know, where the data lives on disk or in S3 as it were, I'm wondering how those other accompanying technological shifts have simplified the work of being able to do massive aggregations in a streaming context, or being able to more efficiently aggregate across a smaller subset of data, versus having to do a full MapReduce
[00:19:20] Unknown:
style approach to being able to get those same data points at the end of it. I think that what you mentioned — by the way, maybe it's my language, but I wouldn't call that indexing — I would call it statistics, metadata that you put in. That allows you to do data skipping when you query the data afterwards. So if I know that for a certain file the maximum value of a column is 9, and I'm filtering for more than 10, then I can just drop that file and not actually read what's in it. I think that the use of those statistics is still early in the market, and I'm talking about the open data lake. If I'm building my data warehouse, I can write statistics for myself and then use them in the query layer. But when you're doing an open data lake, you need to talk about standards; you need your catalog to be able to do it. And the table format projects, by the way, are talking a lot about those topics, about adding those statistics.
But I wouldn't call that indexing — of course, it's super important and good for all the companies and users working with data lakes; queries run faster. But for indexing, you want to be able to join two sources together and not have that annoying limitation of the amount of data that you can hold in memory. And then if your instance fails for some reason, you need to repopulate all the data in memory, or you have lost data — you have lost the ability to join — and it can take you a very long time to spin up a new instance. You can either go with the common indexing solution, which is basically partitioning, putting some data in memory, and hoping for the best, but you hit the wall pretty fast. The solution I've seen a lot of companies adopt is to use a key value store, like Cassandra or RocksDB, and write your index there, but then you basically need to maintain another copy, you need to scale that database, and that database serves your stream processing layer. In Upsolver, this is maybe the piece of technology that required the most research.
We have built a decoupled key value store for the data lake. We basically create an index that stores data on object storage like S3, and then you load the index into memory. And we have done work on the format itself so that you're able to read fully compressed data by key. With columnar data, you can't index it, because you can't retrieve a single row — you can only retrieve the column — so we built a format that gets around that problem. And every time you're doing a join in Upsolver between one stream and another stream, you're basically doing a join with the index. Because the index doesn't have local storage — it's just data on S3 and in memory — it's very easy to manage, and we can deploy those indexes for every one of our customers without the operational overhead. I think this has probably been the piece of software we have built that gave us the biggest leap forward, the biggest advantage out of everything else we've done so far.
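Upsolver's decoupled, S3-backed index is its own technology, but the limitation it works around can be shown with a small sketch: stream-to-stream join state that has to fit in memory. The `MAX_KEYS` cap here is artificially tiny; in practice the same pressure is what pushes teams to bolt on RocksDB or Cassandra as an external state store, with the operational overhead Ori mentions.

```python
from collections import OrderedDict

MAX_KEYS = 2  # artificially tiny cap, standing in for "whatever fits in memory"

# In-memory join state keyed by impression_id. Real engines bound this with a
# time window, or spill it to a key-value store once it no longer fits.
state = OrderedDict()

def on_impression(event):
    state[event["impression_id"]] = event
    if len(state) > MAX_KEYS:
        state.popitem(last=False)  # evict the oldest impression to stay in budget

def on_click(event):
    impression = state.get(event["impression_id"])
    if impression is None:
        # The matching impression was already evicted: the join silently fails,
        # which is the "partition, keep some data in memory, and hope for the
        # best" failure mode.
        return None
    return {"impression_id": event["impression_id"], "campaign": impression["campaign"]}

for i in ("i-1", "i-2", "i-3"):
    on_impression({"impression_id": i, "campaign": "c-9"})

print(on_click({"impression_id": "i-3"}))  # joined successfully
print(on_click({"impression_id": "i-1"}))  # None: evicted before its click arrived
```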
[00:22:14] Unknown:
Going back to the question of being able to reprocess all of your historical events to, you know, build a new materialized view or to correct a logical bug in batch processes, you have a definitive start and end step of, I only load it up to this point of data. So I know that when I'm running this batch job, I'm only going to run, you know, from the beginning of time or from this specific cutoff point to whatever my most recent load happens to have been. Whereas in the streaming context, you always have new data coming in. So when you say, I want to kick off a reprocess of everything, I need to start at whatever the beginning of that happens to be, and then I need to run all the way up through. But I'm continually playing catch up with these new events that are coming through, and I'm wondering what have been some useful approaches to be able to appropriately scale out or parallelize or being able to apply windowing to be able to manage that reprocessing flow so that you don't run behind on new events, but you're able to, you know, reprocess everything and get back up to current time?
[00:23:18] Unknown:
So, basically, this is what Upsolver is doing, how we solve that problem. For us, the data — maybe the file system would be a better word — is defined by time. A minute of data is a task for us. If I got a source file for that minute of data, one of the instances in the Upsolver fleet is gonna take that job and output one file, and that file is gonna look exactly the same every time. The logic that Upsolver runs is deterministic. It's idempotent processing, which means I can't do random, I can't do an API lookup, because every time I run the same processing, I wanna get the exact same results. And the way we guarantee that is by the combination of the file system with the data processing.
Also, the indexes that I described before keep a point in time for each index. So if I'm going to go back and reprocess the data, the index really leverages locality and assumes that it's gonna get the most recent data, but I can give you the sliding window aggregation result from a year ago — I just need to bring it from object storage, so I can do one API call for it. And once I've done that, if I'm processing more data from around the same time, I'm going to hit the cache, and then I won't need to pay the penalty of going back to object storage. Basically, the concept goes back to event sourcing. Every time you run the processing, you're gonna get the exact same result, the same order of lines in the file. If two instances work at the same time, they're gonna write the exact same file, so it doesn't matter that the two are working at the same time. Basically, building a headless architecture, as opposed to having one orchestrator making sure which data is being run, really helped us grow to the scale of the problem.
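A toy illustration of the property Ori describes, with a hypothetical minute-of-events task: because the transform is deterministic (no randomness, no API lookups, no wall-clock reads) and sorts its output, two workers that happen to process the same minute produce byte-identical files, so retries and replays need no central coordinator.

```python
import hashlib
import json

def process_minute(minute, events):
    """Deterministically transform one minute of data into one output file.
    No randomness, no API lookups, no wall-clock reads: re-running the task
    (or two workers racing on it) always yields a byte-identical file."""
    rows = sorted(
        ({"user_id": e["user_id"], "action": e["action"]} for e in events),
        key=lambda r: (r["user_id"], r["action"]),
    )
    lines = [json.dumps({"task_minute": minute})]
    lines += [json.dumps(r, sort_keys=True) for r in rows]
    return "\n".join(lines).encode()

events = [
    {"user_id": "u2", "action": "click"},
    {"user_id": "u1", "action": "view"},
]

out_a = process_minute("2021-11-01T12:00", events)
out_b = process_minute("2021-11-01T12:00", list(reversed(events)))

# Identical bytes mean identical files: whichever worker's write lands, the
# result is the same, so no single orchestrator has to pick a winner.
assert hashlib.sha256(out_a).digest() == hashlib.sha256(out_b).digest()
print("deterministic output:", out_a == out_b)
```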
[00:25:08] Unknown:
In the streaming context, one of the perennial challenges has been around being able to do large joins of data. For new data as it's coming in, it's easy to process events in sequence, and a lot of times you might have, like, a RocksDB or Redis key value store to be able to do data enrichment for a small set of data. But if you're trying to do joins across two large datasets, that's generally been very challenging to do in a streaming context. Not that it's terribly easy to do in a batch context either, but I'm wondering what are some strategies that you and others in the streaming community have established to be able to manage these use cases where you need to aggregate across multiple different datasets?
[00:25:55] Unknown:
I think that this question really refers to the challenge of streaming, both the technical challenge and the fact that you need to think about it differently. Let's say I have a stream of impressions and a stream of clicks that happen for some of these impressions. I wanna join them together and calculate click-through rates. A very simple problem. How would I do that in batch? I would look at the window and say, okay, I'm gonna process this data an hour after I got my last impression, so I know to a certain degree that the click is already going to be there. I'm gonna join within that window, and that's it, I'm gonna write the result. Then you go to a streaming system, and now the streaming system needs to keep all of these impressions and clicks in memory in order to do the join. So you're starting to have conversations with your user.
Well, you can have a 15 minute window between the impression and the click, but it's going to impact performance, and maybe we can go lower or higher. You're starting to optimize something that doesn't really add value to the business. If you ask your customer what that window is supposed to be, they're gonna ask you, can't you just support an infinite window or something like that? Why do I really need to be aware of that? I think that what really helps streaming systems succeed is the ability to start updating lakes. If I'm getting the impression, I'm gonna write the impression. When I get the click, I'm gonna do an update and change the impression to show that there was a click attached to it. Once I can start removing those time window configurations, it's going to become much easier to communicate at the same level with the business people that are actually going to use the data. Because once they see the impression and they know there was a click and you didn't match it, they lose some faith in the system. So even if you manage to kind of force their hand into telling you what that window needs to be, eventually we need to find superior technology to do that. So, for example, can streaming replace batch? Yes — if you can do updates at very high scale on your data lake, and you can just update when the clicks come instead of doing the join mid air, that will really help replace the batch use case with a streaming use case.
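A sketch of the "update instead of join mid-air" idea, with a Python dict standing in for an updatable lake table (in practice something like Hudi or Iceberg with upsert support). Each impression is written immediately; when its click arrives, the row is updated, so click-through rate is just an aggregate over the table and there is no window size to negotiate with the business.

```python
# Stand-in for an updatable lake table keyed by impression_id.
impressions_table = {}

def on_impression(event):
    # Write the impression row right away, with no click attached yet.
    impressions_table[event["impression_id"]] = {**event, "clicked": False}

def on_click(event):
    row = impressions_table.get(event["impression_id"])
    if row is not None:
        row["clicked"] = True  # upsert the existing row instead of a windowed join

on_impression({"impression_id": "i-1", "campaign": "c-9"})
on_impression({"impression_id": "i-2", "campaign": "c-9"})
on_click({"impression_id": "i-1"})

rows = list(impressions_table.values())
ctr = sum(r["clicked"] for r in rows) / len(rows)
print(f"CTR: {ctr:.2f}")  # 0.50, computed without ever choosing a join window
```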
[00:28:10] Unknown:
And so, as far as the differences in how you have to think about streaming versus batch, I'm wondering what have been some of the interesting lessons that you've learned, or that you've seen your customers go through, about how to actually design the structure of the events that are going into these streams and the ways that they wanna be able to access and aggregate that data, and just the ways that influences the overall design of the topologies of the different streams that they're trying to deal with, versus how they might have approached that in a batch workflow, where maybe everything just goes into one location and then they can, you know, do some ad hoc transformation after the fact to make it fit their particular use case. I feel that this is actually kind of an advantage of streaming
[00:28:54] Unknown:
because in batch, you're thinking about your relational model and how you feed the data into the relational model. With streaming, you speak the language of events, and it's more natural for the developers creating the source of data to just send you an event and let you figure it out. You still have the work to do to unpack it into the model that you want, but at least you're starting from an event, and you're not trying to force everything into a very flat shape. Think about the developer taking an object with a bunch of arrays and trying to feed that to the analytics system.
Somewhere you need to unpack that logic, and I think for streaming that's a little more native. You also asked about the different patterns or topics I wanna send the data to, but there is a difference between an application use case and an analytics use case. For an analytics use case, I'm just gonna write the data, and from there I'm gonna write processes that create tables for me, and I don't need to send the data to other topics. Maybe sometimes you do, but in most cases you don't. So, for example, say I want to create a report: I'm gonna send the data to a bucket, and I'm gonna query that with Presto. If I'm doing an application, maybe what I wanna do is run some logic in my pipeline and send an alert, and that alert would be consumed by an application. So the way to communicate with that application is to send the alert into Kafka.
So it really depends on the use case. I've seen architects build holistic event sourcing architectures where, basically, the only way two applications communicate is through the service bus, and they add metadata to each event and know how to do all the governance and management on top of it. That definitely requires a very disciplined and knowledgeable architecture team. I think that's more for the ESB use case — the enterprise service bus — and how to plan that, and less for analytics.
[00:30:55] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. And so for people who already have a batch architecture or a hybrid architecture where they're doing the Lambda approach, what are some options for being able to migrate that into a purely streaming architecture, migrate any existing MapReduce or batch processing logic that they have into a streaming paradigm, and validate the accuracy of that conversion as they go?
[00:32:19] Unknown:
So to validate the accuracy, I would definitely do, like, an apples to apples comparison on the output. The process itself varies according to the streaming system that you choose. In the beginning, we talked about streaming systems usually being code based systems. If I have SQL today that's basically doing my batch process in my database, and I take that SQL and paste it into a code based streaming system, it wouldn't be able to do much — I would need to figure out what to do with my SQL. I can tell you that in Upsolver, since it's a SQL based system, in many cases we take the SQL you ran in batch, we paste it into Upsolver, we edit it a little, and then we just turn it into a new streaming job defined by SQL. So, actually, if you're moving from a batch to a streaming system and you did batch with SQL, not code based MapReduce, I can just take that into Upsolver, and that's been an accelerator for our projects.
If you're doing code, that's a different type of story.
[00:33:28] Unknown:
As organizations are making that transition from a batch to a streaming approach, what are some of the other technological or organizational shifts that come along with that process that are often overlooked or misunderstood?
[00:33:44] Unknown:
I think that, eventually, the business really likes it when you do streaming. They just don't know how hard it is to do. If you tell them, hey, you're gonna get your data in real time and it's going to be fresh, they're very happy about it. If you tell them that they have to live with limitations, like in my impression and click example — you need to define time windows and a whole bunch of stuff to serve the streaming system — then that's a conversation you need to have in a very good way with the business: this is where we are saving cost in the pipeline, and this is where we are serving the business. It's two different conversations, and I don't think there is a single system in the world that will allow you to do any join between any two streams at any cardinality. Eventually, in some of the more edge, high performance, high scale use cases, you have to tell the business what the limitations of the system are. That's the main thing from an organizational perspective, because streaming is not that different for the organization to digest. It's harder for the data engineering organization to digest, and it's harder for the analysts to digest, because the analysts don't write streaming logic, they don't write streaming code. So they have to go to data engineering, and that's definitely a bottleneck. Speaking about Upsolver, we work with analysts that wanna do streaming using SQL. I think that was one of the challenges that we see: once you go streaming, you get a bottleneck in access to data.
[00:35:10] Unknown:
And as far as the overall streaming market and the technologies available, for a while, there was a lot of volatility in terms of which engines were capable of which use cases, and there were a number of different options contending for sort of priority. But it seems like at this point, they have largely kind of settled out into the use cases and usage patterns that they're best suited to, obviously, with, you know, incremental improvements and capability. And I'm wondering what you see as the kind of steady state right now of streaming engines and some of the ways that they can and should be applied in a data lake context?
[00:35:54] Unknown:
I think that today, the most common solutions that I'm seeing, excluding Upsolver, of course, are Spark Structured Streaming and Flink. I almost don't see Storm at all. I don't see a lot of, you know, enterprise streaming analytics software — if you look at the analyst reports, there's a market for streaming analytics, and you see a whole bunch of companies in what used to be called CEP, complex event processing. Those systems are still deployed at many organizations.
But, definitely, the idea of doing it with one of the two open source engines I've mentioned, that's the most popular approach. I think that when you think about the data lake for streaming, one of the big challenges is performance, because small files, which are what you generate if you do stream processing, are really a performance killer for object storage. So we needed to start seeing table formats or other solutions doing compaction instead of putting that burden on the user. Upsolver also does compaction out of the box — every pipeline you define is actually two pipelines, one streaming and one compaction, and the compaction one is just derived from the streaming one. You see solutions like Iceberg and Hudi offering compaction.
So the fact that the data lake allows you to write with basically any tool that you want, because it's just files on object storage, means that you can leverage that to do a very good job in streaming. But it's not just about the streaming system. It's also about how you manage your tables in the lake, and that was an incentive to improve data lake table technology as opposed to only streaming technology.
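The compaction Ori describes can be pictured as a second pipeline that periodically rewrites a partition's many small files into a few large ones. Here is a rough sketch with PyArrow over a hypothetical local layout; on S3 the paths would be object keys, and production systems stream record batches rather than loading a whole partition into memory.

```python
from pathlib import Path

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical partition that a streaming writer has littered with small files.
source = "lake/events/date=2021-11-01/"
target = Path("lake/events_compacted/date=2021-11-01/")
target.mkdir(parents=True, exist_ok=True)

# Read every small Parquet file in the partition...
small_files = ds.dataset(source, format="parquet")
table = small_files.to_table()

# ...and rewrite it as one large file with big row groups, so downstream scans
# stop paying the per-file overhead that kills object storage performance.
pq.write_table(table, target / "part-00000.parquet", row_group_size=1_000_000)
```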
[00:37:35] Unknown:
In your experience of working in the streaming space, I'm wondering what are some of the most surprising things that you've learned about the overall application and technologies of streaming systems.
[00:37:48] Unknown:
So there are some problems that are very easy to explain for streaming, and no one would doubt that streaming is the right approach for them. A good example would be sessionization. Let's say you have devices that are sending heartbeats every 5 minutes, and you define that if in 20 minutes you haven't heard from a device, that device is considered down. Try to write the SQL that solves this problem without first adding some kind of streaming session logic that enriches the data and helps you with that calculation. That's very much not a good fit for batch, so no one would argue that this use case needs to be streaming. The argument starts when you have both options — the option to do batch and the option to do streaming.
That's where, I think, most of the friction between batch and streaming still exists.
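For the heartbeat example, here is a minimal sketch of the streaming-state version (hypothetical device IDs): keep the latest heartbeat per device and re-evaluate the 20-minute gap rule, something that is awkward to express as plain SQL over batch files without first materializing session or gap information.

```python
from datetime import datetime, timedelta

DOWN_AFTER = timedelta(minutes=20)

# State a streaming job would keep: the most recent heartbeat per device.
last_seen = {}

def on_heartbeat(device_id, event_time):
    last_seen[device_id] = event_time

def down_devices(now):
    """Devices whose latest heartbeat is older than the 20-minute threshold."""
    return sorted(d for d, t in last_seen.items() if now - t > DOWN_AFTER)

on_heartbeat("sensor-1", datetime(2021, 11, 1, 12, 0))
on_heartbeat("sensor-2", datetime(2021, 11, 1, 12, 25))

# In a real pipeline this check runs on a timer as well as on each event, so a
# device that goes completely silent is still detected.
print(down_devices(now=datetime(2021, 11, 1, 12, 30)))  # ['sensor-1']
```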
[00:38:42] Unknown:
And as far as the applications of streaming architectures, what are some of the most interesting or innovative or unexpected ways that you've seen them used?
[00:38:52] Unknown:
I think that one thing I found is how much streaming influences costs for companies with a lot of scale. I would expect that if I run something all the time, as opposed to only running it some of the time, then streaming would be more expensive, and I'd be thinking, well, I'm doing streaming, so I'm gonna pay my dues for wanting to get data in real time. But the fact that you're processing one event at a time lets you waste fewer resources in order to do it. I've seen companies take a seven figure data warehouse bill and cut it by sixty something percent just by changing to a streaming paradigm. That was surprising for me: you can be rich and healthy, and you don't have to choose between the two. Other than that, I would just think about business use cases that were very cool, and I especially like those examples. A prospect sends us a question — how can I do this data pipeline? — and it has a streaming challenge in it, and I can spend 4 hours sitting on it trying to find the right solution just because it's kind of fun thinking about it in a different way than just doing a query. Like, they wanna know their device heat map around the world. Upsolver actually had a challenge like that itself, trying to understand which of the tasks we are running have completed and which have not, and we were only able to solve that problem by writing a streaming pipeline.
[00:40:16] Unknown:
One other thing that I forgot to ask about is that in batch contexts, you know, the idea of the DAG is very prevalent, where you say, I need to do this one operation, and then once that's done, I need to use that output to do this other operation, and maybe I need to then split that into two other operations, and so you have this overall orchestration pattern of managing the life cycle of data through these discrete stages. And I'm wondering how that manifests in a streaming context, how you're able to build that same dependency graph, or some of the ways that you're able to eliminate that need for explicit orchestration and just do it entirely in an event based manner, while being able to kind of maintain visibility into all those different interconnections at various stages of the streams.
[00:41:00] Unknown:
I think that would also have been a good answer to one of the previous questions, but it was one of the biggest takeaways with streaming: streaming is an orchestration problem, and we touched on some of that during the call. It's not just about writing the data like you do in batch as a one-off. Here, you're writing small files, and you need to compact them, so there is a dependency. You need to clean intermediate data copies. You need to create an index, join with it, and clean it, and all of that. You need to go and update the metadata store. All of these operations need to happen in a certain sequence so you maintain consistency guarantees for the user — you can't show wrong data. I remember having a conversation with Yoni years ago, and I was telling him, listen, a bunch of customers are asking for a visual pipeline where I can do A and B and C and D and connect, you know, these very big DAGs — so let them build a DAG in Upsolver, why not? And he said, there are over a 100 nodes in a DAG, and that's not even talking about a difficult pipeline.
So who can really understand that? It helped me understand the difference between a pipeline and a declarative pipeline. If you look at a language like Apache Beam, it's a combination of transformation and orchestration in the query language, but the orchestration is too much weight to put on the user's shoulders. They're able to do it — it's not impossible — it just takes a lot of time. A good comparison would be an execution plan in a database. If I go to a database, write a query, and click on the execution plan, I see how the database is planning to execute the query: what's going to be sorted, what's going to be indexed, how the join order is going to happen. All of that planning is the optimizer's work. And when you go to streaming, now it's the user's responsibility to be the pipeline optimizer.
And the idea of declarative pipelines goes very strongly with streaming, because it's just too hard to do all that orchestration yourself, and you're not just orchestrating the stream. You're orchestrating the work on the data itself in the lake — the data lake table, the metadata — all of that is part of the management that the pipeline needs to do. So when we got to streaming, where basically each pipeline is multiple pipelines, I understood this problem needs a whole different approach to orchestration.
[00:43:21] Unknown:
And in your experience of building and growing the Upsolver platform and working so heavily in this streaming ecosystem and the data lake space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:36] Unknown:
The biggest thing I've learned, and I keep learning it year after year, is go with the flow. We can think, after researching this problem for years, that this approach is going to be the easiest one for the users. But then the users need to actually go and learn what you've learned to understand why you chose that option. And streaming syntax is a very good example. When we started Upsolver, I would spend a lot of time educating users about how to do streaming and how to think about the problem. Today, we are investing those cycles in trying to let users write streaming as if it were batch. Basically, it removes many of the limitations you have in streaming, so a user would be able to write it as if they weren't streaming at all — for me, a perfect system would allow me to write batch code and get a streaming solution.
You can never get to a 100%. There are limitations in the world, but if you manage to cover 95%, with the remaining 5% of users needing some solution architect help or something like that to do the work, that would be great. And that's something that we have added and will keep adding to the product. Upsolver 2.0 is gonna be a very large release of Upsolver, and a lot of the problems that are very hard to solve with streaming, you will be able to solve and almost think about as if it were batch. To give you an example: if I do a join between 5 different streams, and I want all of these to update the data, I need to write 5 pipelines, because each stream is going to update the table. Now ask the user to do it — instead of writing 5 joins in a query, you need to write 5 pipelines.
It's too much. So you need to abstract and abstract and abstract until you eventually get to that point. That was probably the biggest lesson — I really like product work. Going with the flow is something you keep learning in the product domain.
[00:45:31] Unknown:
And for people who are interested in adopting streaming for some of their use cases and to build and manage their data lakes or their data infrastructure, what are the cases where streaming might be the wrong approach and they're better served by batch patterns?
[00:45:45] Unknown:
I think that the first step to understand whether you need streaming is whether your data is continuous. So, for example, I have a genomics dataset — I don't need a streaming pipeline for that. That's not a good case. And, actually, earlier today, before our call, I was talking to a customer, and they basically wanted to create monthly reporting, but they didn't wanna see a rolling last 30 days. They just wanted the end of month report that processes the data from the entire month. For that, why run a system that keeps an incremental state for every point in time when you just want to run one query on top of the results? This was actually a case of what not to do, and we recommended they not use Upsolver and a streaming pipeline. We just wrote the whole dataset to S3, they used PrestoDB to query the data, and the query took 7 minutes.
Fine. The earlier streaming approach took hours, because it needed to compact and keep per minute state about the entire stream. That was really the wrong way to solve it. So: if your data is continuous and you wanna query it all the time — not just look at one view and be done — then streaming fits. If you wanna do a one-off, streaming is not a good solution. If it's not a one-off, then try streaming.
[00:47:10] Unknown:
As you continue to build and evolve the Upsolver product, what are some of the things you have planned for the future and some of the ways that you're looking to make streaming data easier to work with? I know you mentioned a few different upcoming capabilities, but I'm wondering if there are any others that you wanna mention or any projects that you're particularly excited for.
[00:47:27] Unknown:
Yeah. Today in Upsolver, the way you create a job is you create it from our user interface, and the transformation you define with SQL. So the SQL is a part of your transformation, part of your pipeline. What we are going to do is create a SQL language so the entire pipeline is going to be SQL based, and if you're doing infrastructure as code, you just have code from Upsolver. So we are treating the product more as infrastructure. I don't think Upsolver was ever really a no code tool, because it was SQL based and very heavy on transformation. But when you see a UI, you think no code. And when you see infrastructure with a CLI and SQL, you think database.
Basically, Upsolver's pipelines are supposed to be all you need between the raw data and a table that's ready to be queried with the answer to the business question that you want. If that's the case, it needs to be a SQL system, and we definitely wanna give the power to the analysts, to the engineers who are not necessarily data engineers. We wanna allow every database user to be able to do streaming on top of the data lake. That's one of the biggest passions that Upsolver has. And as data engineers, we just don't wanna spend so much data engineering time on doing streaming, on doing engineering, which is not adding value to the business. No one needs to write a DAG with a 100 nodes; it just doesn't make sense. The next version of Upsolver is gonna be SQL end to end with a lot of automation around streaming syntax, so you will only need to use it when you absolutely have to and there is a strong cost justification for it.
Are there any other aspects of streaming applications and data lakes, or the ways that you're applying them at Upsolver, that we didn't discuss yet that you'd like to cover before we close out the show? Maybe one comment is that streaming isn't only for data lakes. You can do streaming into warehouses; you just need to do the computation before them. I can think of a recent example where we took a dbt DAG with, like, 5 different stages, and we had to tell the customer, stop thinking about the stages. We are not doing the staged process. Tell me what's your source, and tell me what's the target after all the intermediate processing.
Let's do all of that in the lake and just serve the final result to the warehouse. Now you're actually making the process easier because you don't need to do the orchestration within the warehouse, you're reducing the cost because you're doing it in the lake, and you're able to feed the data in real time because you're feeding less data into the warehouse. I just finished a very long benchmark: we tried to load data into Snowflake in streaming time, and we tried to find any type of ELT solution to feed data into Snowflake in streaming time, and we couldn't do it without preprocessing with a streaming solution on the lake. So whatever you're gonna use, whether it's a warehouse or a lake, it doesn't mean you can't do streaming. You just need to do the streaming before the data hits your target.
[00:50:24] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:39] Unknown:
But I'm biased, I'm very biased, since I very much believe in what we are doing for data pipelines, and I think SQL data pipelines, which are declarative, are very big news. Let me choose something not in my domain, and I would say that data quality is a big one. I'm very happy to see open source projects and systems that put data quality front and center, because you see companies not doing basic stuff, like checking the number of output records that you have, or checking that you don't have null values in a column that's not supposed to have them. There's a lot of tedious automation around data quality, and it would be nice if every company could just take a data quality plug in and put it on top of its pipeline, on top of its database. It's never been solved in a strong enough way within the platform where you store the data. So
[00:51:35] Unknown:
I think that data quality is here to stay; data quality platforms are here to stay. Alright. Well, thank you very much for taking the time today to join me and share your experience of working in the data lake ecosystem, helping to make streaming more accessible, and sharing some of your experiences of the ways that it compares to batch patterns. I definitely appreciate the time and insight that you've shared, and I hope you enjoy the rest of your day. Thank you, I enjoyed this as always. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Guest Introduction: Ori Rafael
Ori's Background in Data Engineering
Overview of Data Lakes and Market Trends
Evolution from Hadoop to Modern Data Lakes
Challenges and Benefits of Streaming Architectures
Why Batch Processing Remains Prevalent
Applications of Streaming in Data Lakes
Technological Advances Enabling Streaming
Indexing and Metadata in Data Lakes
Reprocessing Historical Data in Streaming Context
Designing Event Structures for Streaming
Migrating from Batch to Streaming Architectures
Current State of Streaming Engines
Orchestration in Streaming Context
Lessons Learned from Building Upsolver
When Streaming Might Not Be the Right Approach
Future Plans for Upsolver
Final Thoughts and Contact Information