Feldera: Bridging Batch and Streaming with Incremental Computation

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies

and anomalies in real time right at the source.

Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly.

Want to stop issues before they hit production?

Learn more at dataengineeringpodcast.com/datafold

today.

As a listener of the Data Engineering podcast, you clearly care about data and how it affects your organization and the world.

For even more perspective on the ways that data impacts everything around us, you should listen to Data Citizens Dialogues, the forward thinking podcast from the folks at Calibra.

You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone.

They address questions around AI governance, data sharing, and working at global scale among others.

In particular, I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast moving field.

While data is shaping our world, DataCitizens

Dialogues is shaping the conversation.

Subscribe to DataCitizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.

Your host is Tobias Macy, and today, I'm interviewing Leonid Rishik, Lalit Suresh, and Michae Boudiu about Feldera, an incremental compute engine for continuous computation of data, ML, and AI work. So, Leonid, can you start by introducing yourself?

Sure. Hello. I'm Leonid.

I'm the CTO

of folder. I am a computer scientist by training. I got my PhD from the University of New South Wales. And, over the years, I've done research on operating systems, programming languages, networks. And I spent the last 7 or 8 years building, incremental computer engines like folder.

And, Lalit, how about yourself?

Hi, everyone. I'm Lalit. I'm the CEO at foldera. Like,

Leonid and Mihai here, I'm also a computer scientist by training with a PhD from t o Berlin. And before this, I was at VMware Research with these guys. My background is mainly distributed systems,

cloud,

and networking, also with some dabbling into things like formal methods.

And, Mihai, how about yourself?

Hi. My name is Mihai Budio. I

got a PhD from Carnegie Mellon University

doing computer architecture at the time, and then I moved to Microsoft Research in Silicon Valley where I was for about 10 years where I worked on, big data platforms,

large scale machine learning,

computer security.

Then I spent a little time at the networking startup and then,

about 7 years at, VMware Research. And the the last year has been,

about, Faldera

when we founded the startup, where I am the chief science officer.

And in terms of Faldera,

can you give a bit of an overview about what it is and some of the story behind how it got started and why you all decided to invest so much of your time and energy into building and growing it? Yeah. Sure. So what failure is is it's basically a very, very, very fast query engine.

And what it does is this thing called incremental computation. And so to understand incremental computation,

I usually like to talk about batch computing as a reference. So in batch computing, the way we know it through Spark and Snowflake and whatnot, right, you write a SQL query and then, you know, 100 of thousands of cores wake up, go over all of this data, gives you back an answer. Now if you run that same SQL query one second later, these engines will typically do almost exactly the same work it just did a second ago, even though you've only accumulated one second worth of changes in the meantime. And you can think of incremental computation as solving that inefficiency. Right? So you try to avoid these recomputations

in its entirety by very intelligently keeping a memory of the work you already did in the past. And when your data changes, any insert, update, or delete happens, you incrementally update all the views and computations you're maintaining over that data, which makes it not just very fast and efficient, but it also makes it good for both batch and real time analytics in one shot. As to, you asked about the story as well, right? Like, so as to how we got here, again, like Leonid mentioned, so Leonid and Mihai, somewhere around

2018, started working on this incremental computation

thread at VMware, and it had pretty good impact inside VMware, the most the best of which was in this product called VMware Skyline. It took down what previously used to take 2 days as a batch computation

to give customers some insight about their

infrastructure

down to seconds.

Right? This is like thousands of standing queries being incrementally updated on terabytes of daily new data with the p 99 of seconds. Right? So that's kind of the power of incremental compute that's available at hand. And based on the lessons learned building that, they also

went back to the drawing board and this and came up with this new mathematical foundation called DBSP, which is at the heart of FELTA. That's where we derive a lot of our superpowers from.

And with DBSP, what we can do is evaluate arbitrarily complex SQL completely incrementally. Right? Like, that's kind of the wall we broke. And that paper has been winning awards, so

that twin success of actual

product and real world impact as well as the research breakthrough

prompted a bunch of us to fund Feller a year and a half ago. And so here we are.

One of

the other technologies and projects that comes to mind, speaking of this incremental compute of SQL, is the materialized

project, which I know has some of the same foundational

theories as far as the incremental computation it's implemented in Rust. It is focused on these streaming and continuously updated queries. So I'm wondering if you can talk a bit about some of the ways that you think about your comparison to that project.

So let me mention, Shavik, very quickly.

So

one of the founders of Materialise is on a coauthor on the paper that, Lalit mentioned, and I've known

Frank for 20 years. We were colleagues at Microsoft Research. So I was very familiar with his work on this topic. Now Leonid will explain.

Yeah. So, this,

is not a coincidence.

So indeed, the first generation

of this technology that we build back at VMware

was based on differential data flow, which is the

same library and the same formalism that Materialise is based on. So we are very familiar with that. We worked with it for, a number of years. At some point, we found ourselves, you know, pushing the boundaries and wanting to extend it in ways that it wasn't exactly designed to, to work. So this is where this DBSP and the next generation of the technology came out of. I have a lot of respect and admiration for both differential data flow and and materialize, and it's definitely part of our prediction.

And another aspect of Feldera is that it seems to have at least some overlap with areas such as federated

query engines, such as Trino and Presto and Dremio

as well as data warehousing as far as being able to do arbitrary SQL across large volumes of data and these stream processors

such as Flink and Spark for being able to do computation across unbounded data flows. And I'm curious if you can talk to some of the ways that you think about Fildera, how it sits in that Venn diagram, and some of the unique capabilities that it has by virtue of being either a subset or superset of any or all of those. Of course. Fildera is not a data warehouse. Let me get that one out of the way. If you need somewhere to stick, 10 petabytes of data for 10 years, you probably need something else. We can store data. We can query it on demand, but this is not our 40. So what folder is, it is a query engine. It is a query engine that works on both

streaming and batch data. And, the problem it solves is we make it as easy to query streams,

as it is to query batch data. So if you think about any modern database, be it ClickHouse, Snowflake, TagDB,

you name it. It gives you this wonderful user experience for anybody who knows SQL, even if they don't have IT background, can write arbitrary SQL queries, and you can expect the database to process them correctly, efficiently,

and give you, you know, strongly consistent results. And there's nothing like that for streaming. Systems like Flink SQL or or or Spark Streaming, even if they provide some kind of SQL or SQL like interface,

they give you none of that user experience. Right? You have to be an expert. You have to know exactly how the engine works on the inside to be productive with those tools. And, you know, even then, you only get weekly consistent results, which, you know, in plain English means you cannot really trust them. And probably they will use a huge amount of resources,

to run those queries. So foldera tries to turn this around. So with foldera, you should be able to write arbitrary SQL and,

see it, executed correctly with strong consistency and good performance,

on on changing data. And then the other kind of issue that anybody who has worked with existing streaming technology

is very familiar with is that there isn't really such a thing as purely streaming use case. Just like there is no such thing as purely batch use case, by the way. You always have some kind of mix. You know, maybe you have your telemetry data or your, credit card transactions coming from Kafka, that's your streaming. But you always have some database sitting on the side with your user data or your device device data. And,

any kind of interesting analytics will always combine the 2. You're gonna be joining and aggregating data across both data sources. And if you have an engine that's only good at dealing with streaming data, it's just not gonna cut it. And even if all your data is streaming, you still have to solve this backfill problem. Right? When your system starts up, you have this large historical data set that you have to ingest and process before you can start producing,

new results, for the new data. And it turns out that if instead of looking at it as like, the streaming problem, that works with discrete events, if instead of that, you have this general purpose

incremental computer engine, then it can do batch and streaming into any combination of the 2.

Because, you know, batch data is just a really huge change that happens to have all the data. For people who are trying to wrap their heads around the use cases that Fildera powers, how it fits into their existing data architectures,

I'm wondering what you have seen as the

either types or maybe specific instances of technologies

or system architectures that can either be completely replaced by a folderra,

and what are the systems that you see folderra as being additive to?

So there are kind of 3 classes of situations, I would say, where you want to reconsider,

your current data stack and, you know, replace some of it with Fildera. Number 1 is when you find yourself abusing

your batch

processing engine, to do incremental computation. So maybe you have, you know, your Snowflake, and it's great at those nightly batch jobs. But then you need for some use case, you need results quickly or even in real time, and you start running those batch jobs more and more often. It becomes very expensive and still slow.

Or maybe you have ClickHouse, and it's great at giving you the real time responses for operational kind of queries. But you're starting to throw more analytical workloads, more complicated queries at it, and it just cannot keep up. In both cases, you probably want to, replace

that part of your of your workload with something that connects you to incremental computation like folder.

The second class of systems, I would say,

is

a build your own

architectures where people usually start with a fairly simple streaming use case. You know, they build some microservice

in whatever Java or or c or Rust. And it kind of works, but then they quickly discover that as they move to more interesting applications, it's

you you have all these problems computing and changing data, managing state, managing out of order events. And, you know, at this point, just like you wouldn't build your own batch database for your own use case, you shouldn't be building, your own streaming database.

So you should you should be using something like Fildera. And I guess the 3rd the 3rd kind of class of use cases,

is people's talk with stream processors like Fink and Spark SQL Spark Streaming, and they discover that they don't quite live up to the promise. Because, you know, maybe you don't just happen to have a Flink contributor on your team, and as a result, you are unable to be productive with Flink. And, basically, you would like to have that experience of just being able to write SQL and get get results without being an expert. So I guess that's another. For the second part of your question,

where folder is complementary.

So folder fits very nicely in your existing data stack, both, streaming and batch. So if you already have, you know, your data streaming through Kafka or Pub Sub, we will meet you where you are and we'll ingest the data and add analytics on top of it. And, likewise, if you have your data in in Postgres or Snowflake or wherever you do your, you know, transactional operational and batch analytics,

we will also happily work with that and ingest the data and do real time processing on top of that. Since you asked where it's additive, right,

even for the use cases like Leonid mentioned where there are certain workloads you shouldn't be trying to attempt on a warehouse, we are talking about moving that workload to failure and not so much you're replacing the warehouse at that point. Right? So I would say you still need your historical warehouses. You still need your bat systems around for those type of, use cases. Felterra really people come to us to take load off of them

efficiently using Felterra. Right? So you might tee your traffic through Felterra, compute some of those expensive aggregates and roll ups incrementally, and then maintain those results in your warehouse for the rest of your batch infra to pick up. So it's a very good complement that Another aspect of stream processing is that there are often cases where you need to do some

arbitrary compute logic using a general purpose programming language. Oftentimes, that ends up being Java or Scala because that's what we have available for stream processing

in in the large.

And I'm wondering what you see as maybe some of the cases where people who are using there want to be able to use some of that arbitrary logic and some of the ways that you think about the role of SQL in being able to be either extended to address some of those use cases or some of the,

interplay between Feldera and that more general purpose compute approach to stream processing?

In Feldera, SQL is really a front end towards where we derive all of this raw power from, which is DBSP. Right? And DBSP is currently written as a Rust crate. And with Phaler, we can both support

SQL with Rust UDFs, and we are also thinking of extending it to allow people to just embed Rust in their pipelines as well. So at least in the initial phases, it's just gonna be SQL and Rust.

I would leave it to Leonid or and Mihai to comment on what it looks like to go beyond that. Yeah. Modern SQL is is a fairly extensible language, through EDS in particular.

So, yes, today we support Rust EDS and let let let's you write high performance extensions,

to your incremental pipelines. We plan to build APIs for other languages

on top of that. And given the fact that it is in Rust, I can see an opportunity as well for being able to integrate with the Arrow ecosystem, particularly things like substrate, maybe even data fusion, and being able to leverage

that's happening in that ecosystem

to broaden the scope of what Fildera is capable of. It's it's funny you should mention this, because, indeed, Fildera has both the incremental engines that, you know, we built ourselves as a rust rate, but it also now has an ad hoc query engine that lets you, query data in kind of in batch mode. Ad hoc queries. It's a data inside folder.

And that one is built on top of data fusion. And in addition to that, we, already support error based ingress and egress, so we can work with data in error and per cap formats.

So in addition to data fusion that you just mentioned for some of the ad hoc capability,

you said earlier that the core foundation

of Feldera is based on this DBSP

project. I'm wondering if you can talk a bit more about how that works, some of the capabilities that it unlocks, and how that core foundational element has fed into the ways that you're thinking about Feldera.

So the BSP is actually an acronym,

which is also PAN. It's a combination of, digital signal processing, which is DSP

and DB, which is a database.

Because, it's really,

at at its core, it's actually a mathematical formalism

that,

rethinks the database as a signal processing system or as a stream processing system. We can choose s to mean other stream or signal processing. And we really, take inspiration from the digital signal processing

theory, which is widely used even, you know, when now when we record this podcast, it's processing

sound as a signal. So this kind of software used everywhere. And we say, you should

think even of databases,

not as a immutable objects that sometimes change, but as a stream of snapshots

that continuously evolve. And if you take this view, this theory actually tells you how you can,

incrementally evolve with the database efficiently. So the DBSP

formalism

is actually a way of thinking of incremental computation on databases.

Database incrementally changes every time you do inserts or updates, and you should think of views that are maintained on database as changing in the same way. And, DBSP theory gives you a recipe. If you write a query that maintains a view, it tells you how you can compile it into a program that incrementally maintains the view, which means this program looks at the changes to the database and automatically directly computes the changes to the view instead of recomputing the view from scratch from every database version. Now this is a mathematical formalism, and sometimes, you know, theory doesn't match practice, but we actually built DBSP also into a rust crate to show it's actually extremely practical.

So there is an underlying run time, this DBSP run time, which sits be below the SQL query engine, which implements the DBSP formalism exactly

in, in Rust. And then the SQL compiler is just a facade, which takes SQL programs and compiles them into incremental programs that run on top of DBSP.

With the fact that it's purely an incremental engine, I imagine that that also

gives you a lot of capability as far as being able to do historical time travel to say what is the data as of this specific point in time and being able to travel back to a different state of the incremental

computation

for doing sort of sort of historical analysis.

And I'm curious if you can talk to a little bit about what that time horizon looks like for

a practical maintenance of history.

So, the BSP was optimized for this problem of incremental view maintenance. It always gives you the most current version of the view based on the current version of the database. If you want to do time travel queries, you can. There are some SQL constructs, for example, Azov Joins

that, were designed exactly for this type of purpose. They were designed for temporal databases and for querying the database as it looked at a certain moment in time. But if you want to do time travel queries, you will have to pay, because you will have to store all historical data that you might have a query. Whereas,

the DBSP use case is really optimized to keep only as much state as necessary

for any future

updates of the database, but it will,

discard, and this is actually the power of the model. It will discard the data that will never influence

future outputs.

And speaking now to the overarching

I'm wondering if you can talk to some of the broader architecture, the ways that the

I'm thinking is taking advantage of DBSP

under the hood and

how you think about what is in scope versus out of scope

for Feldera as an end user

technology.

So in Feldera, we the main value is the incremental computation. Right? So we want to keep anything that's not directly allowing people to express incremental computations out of scope. So we the abstraction we offer users is what's called a pipeline. A pipeline is basically

the DBSP based incremental computation

plus connectors and some infrastructure that allows these pipelines to interact with the external world. Right? Get data from different sources,

do the computation,

and sync data to different destinations. That's what we call a pipeline. And so, Ferdera

as a platform is entirely designed to allow users to define,

run, and manage pipelines. And so there's a very thin control plane called the pipeline manager where it exposes our REST API

and this API allows people to define pipelines.

It invokes the SQL compiler

that Mihai mentioned. Pretty much just takes SQL, generates our us program, compiles it, and that's the binary that we generated is ultimately your pipeline with some connectors

linked to it. Right? Now depending on the

form factor that you're using, this like, how we run these pipelines will vary. Like, if you use our open source container, the pipelines run inside the same container as the pipeline manager. If you use the enterprise offering, we actually allow you to schedule and run them on a cluster. And then with this REST API, we've also built some tooling around Ferdera for users to interact with the engine meaningfully. So there's a web console, which is what most people use our sandbox or our it's usually the first point of contact with Ferdera, right, where you can run these pipelines,

write SQL, and so on. But usually people graduate at some point to using the APIs directly. Like, there's a Python SDK, there's a CLI, and you can also just use the REST API directly to write your automation

around Felvera. Connectors are the primary way we interact with different

external storage and data sources. Right? Kafka warehouses, CDC streams,

even just sort of raw HTTP in and out. And as you have gone from the initial

implementation of DBSP as that core incremental engine

evolved it into Falda as it exists today.

What are some of the ways that the

overarching

goals and scope have evolved from when you first started thinking about Falda to where you are today? So, initially, we set to do SQL very well. So we want to do all of SQL incrementally.

Then, it turns out that,

some some things can be done in SQL, but some things require extensions. And we try to

not extend SQL as much as possible, but,

the theory behind the BSP, which, is a theory about computing on streams, tells us exactly what can be done in a language like SQL and what cannot be done. So we actually have a clear separation between these 2. So,

we have started adding

features in a very controlled way, which allow you to go behind what SQL can express. So, streaming like computations,

for example,

the ability to

say something about the arrival order of the data. Because if the system knows that the data never shows up too much out of order, it can do much more efficient garbage collection of, its internal state. And, we are also extending SQL in other ways. For example,

there's all that, recursive

computations in SQL are very limited by the standard.

So we want to provide a much richer,

and powerful interface towards doing recursion.

From that recursion perspective as well, I know that one of the challenges of building any sort of complex system that relies on SQL is some of that aspect of reusability

or being able to modularize

different

components of SQL similar to how you would have a a package in a program language that you can install. And, interested in understanding a bit more about some of the ways that the recursion is implemented and some of the ways that that helps to address some of that aspect of reuse and modularity

of those SQL queries and being able to improve the effectiveness of a team who is building on top of Filbera and being able to build some of those core primitives that can be reimplemented in different contexts.

Well, a lot of it has been said about the limitations of SQL, and there's a lot of efforts to improve SQL. So, for example, there's this language called PRQL,

which is really a beautiful design. And,

conceptually, there will be no difficulty

writing a front end for the compiler which takes PRQL and generates DBSP code.

But one thing we have learned in the past is that people are not willing to learn new languages.

There must be a very strong motivation to adopt a new language. So our goal is really to to look as much as SQL as as possible.

And it turns out that you can actually stretch SQL quite far. So one thing that's considered an anti pattern in regular database is that but we can do it well. And this really helps with modularity is building nested views.

So with

a traditional SQL database, nested views means you're adding to your processing latencies. The more the more you nest, the slower it becomes. With Fildare, you can build arbitrarily nested views. You can have dozens of views built on top of each other.

All of this will still be evaluated in real time, which means that you can actually break up your complex logic into many simpler views. Beyond that, you're probably going to do things like, you know, generate your SQL queries from maybe some other description like dbt.

Yeah. As as Mihai said, there are wonderful language designs nowadays that are in many ways better than SQL, but SQL is a language that everybody knows and loves, including people who are not computer scientists, and we have to meet them where they are as the best we can. And, you know, SQL during the ages has been extended with a lot of extra

functionality. It has window functions, rich data types like arrays and maps and JSON. We we support those. It has table valued functions.

So,

SQL keeps evolving

to adapt to the requirements,

and our goal is to support, as much as that as possible.

Yeah. That that has long been one of the both benefits and pain points of SQL is figuring out which SQL am I actually dealing with here. Even if it says ANSI SQL, there are a few different versions of that. So the different dialects are a challenge, but it is true that there have been a lot of evolutions as far as capabilities that the majority of database engines have adopted.

Well, I want to make a a comment that portability in SQL, it sounds like a lofty goal, and it's I I think it's attainable. Each database really has a small

differences from other database, so it's impossible

to exactly port code. But at least the the high level constructs are the same in all all dialects. So

As far as the incremental computation, you mentioned the garbage collection of that internal state

for people who are figuring out what their deployment looks like for FELDERA. I'm wondering if you can talk a bit to some of the state storage aspects,

how it integrates with different things like open table formats, or does it require disk storage, object storage, and some of those other systems level architecture

aspects of running a filter instance?

Yes. So let me let me maybe say a few words about state in, streaming systems in general because this is one area that, seems to confuse a lot of people,

including people who build these systems,

unfortunately. So there is this perception that if you're doing streamings and, you know, you're just bumping the wire without any state. Data streams in, data streams out. Nothing gets stuck in the middle. But this is kind of the exact opposite of what's going on because

the way incremental computation works

is by memorizing and reusing previous computation results. So to do this job, folder stores

all kinds of indexes in both input tables and all kinds of intermediate

views and intermediate operators as well. So you have to be really good at storing and accessing the state in real time, which is why we codesigned folder with its storage layer. So it currently uses,

local SSD or NVMe. It's kind of a key value story. You could think of it that way. It's implemented as an LSM tree. It's written at Rust with

as few overhead as possible

and codesigned with the algorithms in folder. So that's our primary storage. We are also building

2nd tier storage layer that will be able to offload state to

something like s 3, to an object store like s 3. And then,

I'm gonna I think I'm gonna let Lenny talk about integration with,

lake house and other data formats.

Yeah. So

we're pretty big fans of these open table formats because it means that if someone comes to us, they already have their data in one of these things. We don't need to ask them to run any kind of auxiliary service to get value out of their data or FELDRO. Right? It's just it's just a table that's sitting around on s 3, and you can query it. And we have a very nice demo that you can see in our documentation

where we compute over delta tables. And one of the reasons we like that demo so much is because you can use our delta table connector both for computing on the entire snapshot of the state, but you can also do the snapshot and follow,

which covers both backfill and the

streams of changes that you can then incrementally compute over. And even better is the fact that as an integration point, it's great because your data is already sitting there. You can feed that as input to Ferdera, but then

you can maintain the outputs back as one of these tables as well, right? And from the vantage point of your spark or Databricks

clusters,

these just looks like any other table that you could then pick up from the rest of your infra, right? So it's a very convenient integration point, these open table formats, and it does not need any external services to be run by the user. So we are always hard on recommending this

path to,

users if if it's an option for them.

And for people who are adopting Feldera, they're starting to incorporate it into their

data flows, data processing

cases, building data assets on top of it. I'm wondering if you can talk to what a typical workflow looks like for an engineer building with Faldara and then how that scales to the team level of being able to collaborate across those different data flows.

Usually, it starts with a handful of engineers trying out our open source,

offering or our sandbox.

Most people try to first set up some tables to get the connector set up to get data into failure. And from there on, there's usually different options. People write ad hoc queries to define what kind of use they want to write, and then they copy paste those into actual views and they build these things

incrementally.

Over time, what usually happens is they do some benchmarking. We have pretty good tools inside Filedera to allow you to benchmark even without setting up connectors. The transition, once you go from the open source form factor to the enterprise offering, which you mentioned, like, you want a team to collaborate on it. Right? So ultimately and once you make that transition,

you'll be able

to have an entire team define a number of pipelines each individually

run and kept isolated from one another on the enterprise form factor pretty much. So the workflows usually start simple. Start on your laptop. We always say like, first run it on your laptop, see the value, work with us to make sure that you're hitting the performance milestones that you want. We can typically hit millions of events per second on a laptop depending on your queries. So that's usually where we ask people to get a taste first.

As far as the philosophy

of the open source and enterprise

divide. I'm curious of how you're thinking about

what the

core capabilities are, what makes sense to live in the open source,

and what are the payment gated features that provide the sustainability of the project and business?

Yeah. So one thing we didn't mention is that the whole cofounding team has a pretty good background in open source development, both as, like, project founders of very successful projects like OpenVSwitch.

We are cofounder, Ben Pfaff, and even as maintainers and contributors to other projects as well. And so we bring a lot of that energy into Feldera as well. So the open core side of the project is really designed for small teams and individuals

to get started with. It's basically a single container that has everything you need, right? All the SQL, all the connectors,

and so on. The enterprise offering is much more suited for production use, where if you want to run Felver on a cluster

and make use of a pool of workers' worth of resources

with isolation between the jobs, things like this. Right? Like, features that are more suited for production use, that's where we usually ask people to switch to the enterprise version rather than trying to build all of that info on their own, with the open source. As to, you know,

you've used the word sustainability for the project. Right? Like, the keyword is balance, and, this remains something that we're paying attention to.

Another aspect of

Felderra, when you look at the landing page, it advertises

its utility in machine learning and AI use cases beyond just the pure data processing incremental state management. And I'm curious if you can talk to some of the ways that the aspects of Felderra make it conducive to those use cases

and some of the ways that it builds on top of those core data engineering capabilities

and

allows for those more

mathematics heavy or, you know, machine learning and AI use cases that are growing in popularity?

Indeed,

real time ML, real time feature engineering is probably one of the biggest applications. Although the platform is horizontal, it has applications in many domains.

So the use case here is you have your streaming data coming in real time. You want to feed this data into your model. The most common class of use cases is some kind of threat prevention, like fraud detection, for example, where you want answers really quickly so you can stop that

illegal payment, for example. But you're not gonna feed your raw data into the ML model. It just doesn't have those events. They just don't have enough context with them. So what you do instead is you you aggregate,

enrich, join, transform those data, create these feature vectors, and feed the this rich feature feature vectors into your model. And all these transformations, they can be described in SQL, And being able to lose them in real time with very low latency is something that you cannot really do without,

a technology like foldera. But there's also a twist in this story, which is that, you also need to be able to run the exact same queries in offline mode on training your model. Right? This is when you have an array of historical data

that you want to feed into your training process,

train the model, look at the accuracy,

modify your feature queries, try again. And so this capability to be able to run the same

analytics on streaming and batch data give you guaranteed exactly the same results.

So in the ML world, it's known as online, offline

feature parity,

and that's also something you simply cannot get with any other platform I can think of. So, yeah, it's a it's a great tool for real time feature engineering that,

gives you very low latency, real time processing, and, perfect

online offline parity.

So may I mention something?

I would like to add to the previous answer on open source. I I want to mention that the DBSP core library in Rust is licensed under an MIT license, and the SQL to DBSP compiler is based on Apache call site,

and it's also it's licensed, we're using an Apache license.

So the the core components of our project are all open source with the liberal licenses.

And in fact, Mihai is one of the principal maintainer of our soft call site now because of his numerous contributions.

Yeah. It's definitely

another project that has helped to broaden the scope of applicability

for SQL on a larger variety of different compute substrates.

And as you have been building Feldera,

working with your customers and community members, what are some of the most interesting or innovative or unexpected ways that you've seen in Feldera used or even the underlying DBSP

engine?

I would

say we're seeing folks in the blockchain community you use Veltera right now, and this isn't traditionally what I would call MLAI or data. We're seeing some interest in people using us for incremental computations with graphs, and this is, again, a place where being able to compute recursively

comes into the picture. So these are all new things that are showing up that we're quite excited by. One of our, colleagues is also building a spreadsheet

using Ferdara, which is also something that I wouldn't have traditionally associated with the

typical use case.

And a spreadsheet actually makes sense in the sense of, like, when you edit a cell, all the formulae that reference that cell are supposed to

update incrementally. Right? So from that vantage point, it makes sense when you think about it, but it's not something that comes to mind right off. And in your experience of

building

the Filvera technology,

the business around it, what are some of the most interesting or unexpected or challenging lessons that you've each learned in that process?

Well, I I discovered that I guess it's unsurprising, but I discovered that,

educating people about the possibility,

of this technology is probably the hardest part. There are kind of 2 classes of users. There are people who are not really familiar with streaming, and they think in terms of batch jobs and, you know, batch world.

And so teaching them about this completely different mode of evaluating your queries,

incrementally in real time, it's a big it's a big mental leap for them. But even worse,

folks who have experience with streaming analytics with things like, Spark Streaming or Flink,

they, you know, they got burned and they,

see streaming as this kind of clunky

technology that, know, is really hard to use, never quite works, requires,

some very deep expertise and a lot of resources. And

convincing them that there is a better way that these issues are more of a problem of those specific tools rather than the approach

is pretty challenging. So I think, yeah, that's one of the things you definitely had to deal with in this last year and a half.

Yeah. I'd echo what, David mentioned. I think messaging and working on sort of educating users

on, who come to us based on some preconceived

biases about how streaming is supposed to work. And, first of all, teaching them that incremental computation is a generalization over batch and streaming. Right? Anything you can do in one of them, you should be able to do with us, even better. I think sort of getting that lift has basically been an interesting challenge for us.

So one thing I learned is that marketing can be much harder than, writing code. So what happens is that,

streaming is very popular. If you actually Google it, you'll find, you know, hundreds of hits. Everybody seems to offer it in some shape or another, and it's very hard to tell how the competing offers,

differentiate

each other. So,

and, you know, the whole founding team is, very technically oriented. We are all, software engineers at heart. So, marketing is something we have to learn about more. And we very much appreciate the opportunity given by the podcast to tell the world about this. Especially the transition from research to company building has been quite the learning curve. Let's put it that way.

Yeah. The sales and marketing aspect is definitely something that often gets underestimated

by people coming from a technology background of, oh, I I built this really awesome thing. Everybody wants to use it. Oh, wait. Nobody knows it exists. That way. Yeah. Exactly. Exactly.

Yeah. And

for people who are coming up against challenges

with stream processing,

incremental state management,

what are the cases where a foldera is the wrong choice and maybe they would be better suited with one of these other streaming engines or different technology architecture.

So I do think that folder supersedes these other systems. It's, simply better. But if what you're doing is not streaming, if if you need batch processing, long term storage, ad hoc querying capabilities that, you know, scale to enormous datasets, then you're better served by something like Snowflake,

or one of Databricks or one of the other lakehouse

or data warehouse tools.

And as you continue to build and iterate on Fildera, what are some of the things you have planned for the near to medium term or any particular

projects or features that you're excited to explore?

I'd say road map wise, I think we really want Fildera to handle increasingly bigger and bigger scales of data. Right? So things like

interaction with the checkpointing to s 3 automatic scale out. Compute storage is already disaggregated, but we wanna disaggregate all the way to external storage as well. This is kind of the, I would say, short to medium term road map. As for long term road map, I don't know if Mihai or Leonid also want to comment. So, you know, the power of mathematics is actually it allows you to predict the future. So,

that's it's very interesting. So the DBSP theory tells us that incremental view maintenance, which is the way databases have been doing incremental computation,

is not that hard of a problem. So it's been a problem that people have been working on for 40 years, but, a solution is inside. So it it can essentially do all of SQL incrementally. That's what the theory tells us. So, I actually can predict that, maybe in a decade. You know, idea sometimes take a very long time to to propagate. So maybe a decade is a little optimistic. But in a maybe 2 decades, let's say, all databases will be incorporating this kind of functionality.

So,

incremental computation and the traditional database competitions are not different things. They can be unified, and this unification will happen. And as part of this prediction,

I also claim that,

custom streaming engines, they must either evolve in very, very different ways or they will disappear completely. There's no reason for them to exist. Databases can't do this very well. It's a very interesting prediction, so I, look forward to seeing how it gets proven out. I definitely think that a lot of the streaming systems that we've had up till now

seem to be a lot more complex than they need to be, both from the implementation

and the usage perspective. So I definitely look forward to seeing them continue to be more user friendly, easier to implement, easier to integrate. Are there any other aspects of the work that you're doing on Fildera, this overall space of combining

streaming and batch processing, incremental computation,

and the applications for it that we didn't discuss yet that you'd like to cover before we close out the show? So as I'd like to mention, one thing is,

systems that people build are not really designed for incremental change very well. So a database, for example, you you know, you can update, insert, or delete,

and that is an incremental change. But there's no easy way,

and this is one of the places where Ferdera has to struggle most is to actually explain to the consumers

what the changes are. Many consumers are simply not designed to accept negative changes, for example, which are deletions. Turns out that there is no way in SQL to delete if you have a table which has a multiset, can have multiple copies of a row to delete only one copy of a row while keeping the other ones. So even database weren't designed properly

to accept changes. And people build extremely complicated systems, like change data capture systems, just to migrate changes from one database to the other. I claim also that there's unnecessary complexity in this whole ecosystem,

and, taking a unified approach of describing changes uniformly across all these systems will simplify this ecosystem dramatically.

And it can be done. We know this can be done. I I definitely look forward to that as well because as you said, change data capture is one of those things that everybody says you need this, but when you actually start thinking about how to implement it, it turns into a giant morass of complexity

that will work maybe 80% of the time, and then the other 20% of the time, you're wondering why you ever went down that road. Exactly. And the and the reason is because this is really a capability that should be built into a database, like every database,

and it's not. So people build it on top. It's all kinds of ugly extensions. And, yeah, this is not hard for me. A database should give you a service. You register a view, and you say, let tell me what's new in this view. Potentially, you know, the entire database you could think of. We can register for many views. And then there there should be a service built into the database. You shouldn't need to, you know, read the logs and reverse engineer what's in the logs. It's just wrong.

And so from that perspective, how do you see DBSP potentially

either supplementing or replacing the concept of write ahead

logs? Well, you know, the write ahead logs are necessary for, durability,

so those won't go away. But people abuse the logs for extracting the change information from the database. That use of the logs, is unnecessary.

So one beauty of Ferdara's design is that

the changes we receive from the outside world, like in inserts, updates, deletes, the representation of tables, the representation of views, the changes to tables, the changes to views, and even the on disc

journal and these formats are all z sets, which is what the data structure we use to describe changes.

Right? So this can be done. If

you design from the ground up to compute on changes, this is what the world should look like. But, you know, everything outside FireEye doesn't look that way today, and we have to find ways to bridge bridge that gap. Alright. Well, hopefully, this is the first step towards popularizing

that and getting those capabilities integrated in more places. So for anybody who wants to get in touch with each of you and follow along with the work that you're all doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Interestingly, the the discussion we just had is what I would claim.

And it's kind of funny if you just look at

all the software we're building out there today. Right? Like, take your food delivery apps to your flight booking apps to any kind of payment portal you use, every credit card transaction swipe, every card going from a to b, all of these are small

changes happening continuously to the world around us. Right? And when they show up behind the scenes in some data center as

changes to an existing data model, suddenly,

all the tooling that we've built over decades

has no concept of computing on changes. We see Feldera as basically solving that problem at a fundamental level. Right? And hopefully, that will hopefully, we can drive that change. Well, thank you all very much for taking the time today to join me and share the work that you're doing on Filvera. It's definitely very interesting technology. It's great to see a new angle on these overall problems of being able to manage incremental computation,

streaming, and batch unification,

and, exposing that all through a

interface that is understandable

and,

tractable

using SQL. So appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thank you for having us. We very much appreciate. Thank you. Thank you very much.

Much. Thank you for listening, and don't forget to check out our other shows.

Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host atdataengineeringpodcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.