Summary
The term "real-time data" brings with it a combination of excitement, uncertainty, and skepticism. The promise of insights that are always accurate and up to date is appealing to organizations, but the technical realities to make it possible have been complex and expensive. In this episode Arjun Narayan explains how the technical barriers to adopting real-time data in your analytics and applications have become surmountable by organizations of all sizes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I’m interviewing Arjun Narayan about the benefits of real-time data for teams of all sizes
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your conception of real-time data is and the benefits that it can provide?
- types of organizations/teams who are adopting real-time
- consumers of real-time data
- locations in data/application stacks where real-time needs to be integrated
- challenges (technical/infrastructure/talent) involved in adopting/supporting streaming/real-time
- lessons learned working with early customers that influenced design/implementation of Materialize to simplify adoption of real-time
- types of queries that are run on materialize vs. warehouse
- how real-time changes the way stakeholders think about the data
- sourcing real-time data
- What are the most interesting, innovative, or unexpected ways that you have seen real-time data used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Materialize to support real-time data applications?
- When is real-time the wrong choice?
- What do you have planned for the future of Materialize and real-time data?
Contact Info
- @narayanarjun on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Your host is Tobias Macey. And today, I'm interviewing Arjun Narayan about the benefits of real time data for teams of all sizes. So, Arjun, can you start by introducing yourself?
[00:01:52] Unknown:
Hi. I'm Arjun Narayan. I am the cofounder and CEO of Materialize,
[00:01:56] Unknown:
the streaming database for all your real time needs. And do you remember how you first got started working in data?
[00:02:02] Unknown:
I did a PhD in databases, which I stumbled into, which is a hard thing to stumble into. But that's really sort of how this journey started. I was interested in computer science. I wanted to do more computer science. I applied to some PhD programs, specifically around distributed systems, networking, and, you know, back then, I guess it was called big data. And as I got closer to the weeds in all of those problems, I repeatedly had the thought that these problems are database problems that the database people have thought really hard about for multiple decades. And maybe we should just be using the tried and true solutions, and we should make the databases more scalable rather than getting people into distributed systems sort of rolling their own database, and I got more and more opinionated over time that way.
And then I loved it. And so I worked for a small Series A startup at the time, Cockroach Labs, which is now, you know, a large, wonderful company. And that was sort of my introduction into databases.
[00:02:58] Unknown:
Now we're talking about the use of real time data because as we have increased the scalability and performance of the data systems that we're running, we're starting to say, okay, now we can actually do this faster. And so to frame the conversation, I'm wondering if you can give your conception of what real time data is and what that means and some of the benefits that it can provide versus the, you know, juxtaposition of what people will generally term batch systems.
[00:03:28] Unknown:
Yeah. Totally. So the first thing is that real time data has, like, an enormous amount of baggage associated with it, because people typically associate the sort of desired end goal, which is data that is fresh and fast and up to date, with the difficulty in the implementation details of building the horrendous pipelines that can even deliver that end result. And people go, oh, I don't know if I wanna deal with real time data. But if you just sort of step back and ask, do you want your data to be up to date and fresh? If I could wave a magic wand and say, you know, all of your data was up to date all the time, pretty much everyone would take that deal. Right? Particularly as long as the tools would allow you to sort of do arbitrary historical time travel. Right? Then real time data is always better, sort of strictly better. Right? The problem is that so far, when people have tried to adopt real time data, they have had to get their hands dirty in increasingly unpleasant ways and deal with a lot of complex building blocks that they don't necessarily have to in batch. Right? So in batch data, you have these tremendous amounts of tools and ecosystem that really help you wrangle all of your data with a lot of ease. And one of these sort of classic tool slash ecosystem things is SQL. Right? You just write a SQL query.
Tons of stuff happens behind the scenes. You don't really know or care. Right? You're like, I don't know if this takes a bunch of distributed systems under the hood, microservices in the cloud. Like, I type SQL. I get answer. Like, this is a wonderful way to live, right, for the vast majority of people. And, you know, until now, that has not existed in the world of real time with the exact same fidelity and ease of use as it has in batch. People have had to build Kafka clusters and write microservices and do all that stuff manually.
And if we could move away from that paradigm, if we can make real time as easy as batch, my contention is most people would choose real time, because why not? As long as it was as easy and as cheap, you know? The one situation where I could see people not wanting real time, or where they would prefer batch tools, is in the world of data exploration. Right? So when you are doing sort of data science, or you are in this exploration mode where you are simply thinking about, what am I even looking at? This looks funny, that sort of moment. You typically want to reduce as many moving parts as possible, which means you might wanna fix the dataset. Right? So let's just look at how the world was at midnight last night, and then let's probe and look for correlations. In this exploratory mode, I think batch will always be better simply because there are a variety of optimizations and tools and techniques for making these ad hoc exploratory queries very fast and responsive that a batch system can optimize for. But beyond that, any sort of recurring pipeline, any sort of job that runs every night, if you fix the query, my contention would be if a query is ever run a second time, then real time is always gonna be preferable for the user to batch.
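To make the "run a second time" point concrete, here is a minimal sketch of the same rollup written first as a batch query and then as a continuously maintained view, assuming a hypothetical orders table and Materialize-style CREATE MATERIALIZED VIEW support; the names and columns are illustrative, not from the episode.

```sql
-- Batch habit: rerun this rollup on a schedule and accept that the result
-- is stale between runs.
SELECT customer_id,
       sum(amount) AS total_spend,
       count(*)    AS order_count
FROM orders
GROUP BY customer_id;

-- Real-time habit: declare the same query once as a materialized view and let
-- the engine keep the result incrementally up to date as new orders arrive.
CREATE MATERIALIZED VIEW customer_spend AS
SELECT customer_id,
       sum(amount) AS total_spend,
       count(*)    AS order_count
FROM orders
GROUP BY customer_id;

-- Consumers just read the view; freshness becomes the engine's problem.
SELECT * FROM customer_spend WHERE total_spend > 1000;
```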
[00:06:43] Unknown:
Yeah. It's definitely a good framing for it. And as far as the kind of typical challenges that are associated with streaming, before we get too much further into kind of the meat of the discussion, I'd like to frame some of the ways that that manifests and maybe some of the ways that modern systems have been able to either circumvent or paper over those challenges. So things that come to mind are: it's very difficult to be able to join across moving datasets. It's very difficult to know what the proper window sizes are if you're trying to run an aggregate. It's very difficult to be able to do kind of long time span historical aggregates with moving data, because you have to decide, okay, you know, if I wanna compute the state of the world, you know, from the beginning of time, then that's gonna take a long time, and by the time it finishes, the answer will have changed. And so I'm wondering if we can maybe talk to some of those classical issues and some of the ways that we're able to either ignore or resolve or kind of paper over some of those challenges.
[00:07:46] Unknown:
Yes. Many of these concerns that you have raised typically are those faced by practitioners that have adopted streaming technologies and are trying to build something in streaming. Streaming today typically means sort of deploying a Kafka cluster. That's pretty much the way that people move data from point a to point b. And then using some sort of stream processing tool like Kafka Streams or Apache Flink or something of that sort in order to do the query processing. And these stream processing platforms have sort of proto SQL layers on top, which do not support the full fidelity of the sort of standard SQL that a batch database would support.
The problem that folks typically encounter is that the stream processing systems fail to scale at meaningful throughputs of messages per second and of aggregate historical data size for queries that are very stateful. So, typically, what people start doing is they start arbitrarily reducing the amount of state that the system has to wrangle so it doesn't fall over. So they typically impose some kind of window size: sort of, let's do this query, but let's look at the aggregate sum or join or something of that sort over a recent historical window, let's say the last 10,000 messages. Right? Like, what this does is it restricts the system from having to hold all the messages all the time, and so it's able to make some amount of progress without falling over.
The problem is that may not be the query that I want. Right? Like, if I'm doing a join on some primary key, there may not be a join between two Kafka topics where you're guaranteed to find the match in those last 10,000 messages. Right? You have restricted the set of messages for consideration too much compared to the actual semantics of the messages that are flowing through the system. So when you start to have to deal with what are fundamentally implementation details because of the inherent limitations of the system, you can't reason about just the semantics of the query you're trying to get done. Right? So you are no longer in this realm of, like, I am an analyst. I am thinking purely about the business needs. Like, I have this information here, this other information there. I wanna join these two things together, and I want the result, and I want it to be up to date. Right? Like, if you're able to sort of think in that declarative way without having to think about, oh, well, now I have to consider the window sizes so that the system doesn't fall over, then you can actually make a lot of progress. And that is what batch systems are wonderful at. Right? Like, in the batch system world, you wouldn't have to think about, you know, whether that was in the last 10,000 messages, because a large cloud native batch data warehouse will just gobble up billions of messages for breakfast and give you the output of your join.
These challenges that come from the inherent limitations of some specific stream processors, in my contention, and this is an opinionated sort of biased contention, have been holding us back. Right? We need to get to a world where the underlying streaming systems are at par or close to par with the capabilities of batch data warehouses, where the user doesn't have to think about these things. What does that mean? It has to be scalable and able to deal with extremely large amounts of state management at the end of the day, such that the user doesn't even have to care about state management. Right? So streaming state management needs to basically no longer be a consideration, and people arbitrarily imposing window size restrictions on the histories of their Kafka topics needs to stop being a thing. Right? And as long as that is a thing, streaming will never be as easy as batch.
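As a hedged illustration of the window-size problem described above, assuming two hypothetical streams already exposed as relations (orders and shipments), the query the analyst actually means has no window in it; the windowed variant is only a workaround for a processor that cannot hold enough state.

```sql
-- What the analyst means: match every order with its shipment whenever the
-- shipment arrives, and keep the result continuously up to date.
CREATE MATERIALIZED VIEW order_status AS
SELECT o.order_id,
       o.customer_id,
       s.shipped_at
FROM orders o
LEFT JOIN shipments s ON s.order_id = o.order_id;

-- The workaround a capacity-limited stream processor forces on you: only
-- consider "recent" events, so matches that straddle the window boundary are
-- silently lost. (Illustrative only; windowing syntax differs across engines.)
SELECT o.order_id, s.shipped_at
FROM orders o
JOIN shipments s ON s.order_id = o.order_id
WHERE o.created_at > now() - INTERVAL '1 hour'
  AND s.shipped_at > now() - INTERVAL '1 hour';
```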
[00:11:36] Unknown:
As far as the types of organizations or teams that have, to this point, been adopting real time, I'm wondering if you see any commonalities or general themes of the motivating factors that have pushed them to making that engineering investment and making it tractable, and some of the ways that you can see the evolution of real time adoption continuing as these systems become more mature and sophisticated and easier to operate?
[00:12:08] Unknown:
Yeah. So, today, the only people who have been able to meaningfully deploy and manage and wrangle streaming infrastructure, and this has been getting better. Life has been getting sort of better for folks. It used to be, 5 years ago, you needed an entire team to manage your on prem Kafka clusters, and the move to sort of cloud native, fully managed and hosted Kafka services means that you can do it with a leaner setup. But even today, the folks who run and manage streaming infrastructure tend to be on the production infrastructure side. It is not something that is accessible to a business analytics team, the kind of team that would use a Snowflake or a Redshift. Right? The folks who are productive in adding meaningful business value, doing analytics, using fully managed services like Snowflake and Redshift cannot today be productive here, and they have to sort of cross over to the production team to help deploy some real time infrastructure and perhaps even write some Java code for some Kafka Streams application or some Flink application that sort of gets munged into that and orchestrated by some Kafka connectors. You've crossed teams entirely here. Right? And I forget the provenance of this joke, but, basically, anything that requires two VPs to commit resources basically means it's never gonna get done. Right? So we're talking about a two VPs problem at this point.
And the infrastructure needs to get easy enough that it can be handled by the analytics team. And that doesn't mean that that's the end goal here. Everyone can benefit, including on the infrastructure side. The infrastructure side also buys and uses hosted OLTP databases. Right? So we need to get our streaming infrastructure to be as simple as an OLTP database or a cloud data warehouse, such that the number of productive users can go up dramatically.
[00:14:00] Unknown:
The other interesting angle of this topic is when we're talking about real time data, we're generally talking about the need to be able to have up to date information for some sort of product or business purpose, and I'm wondering if you can talk to who you see as the general consumers of that real time data and maybe if there are any kind of common personas that you see across those use cases.
[00:14:24] Unknown:
That's a wonderful question. Let me start by talking about the amazingness of batch data warehouses. Right? Like, batch data warehouses are used by a variety of personas, but typically analysts or data engineers who are producing some result for human consumption. Right? So as long as there's a human in the loop, batch actually works pretty darn well today. And while real time may be a nice to have, and, you know, if it was click, click, click, just as easy as batch, they might prefer the dashboard to sort of be constantly up to date. Today, if you're going to produce a report that goes to a human being, batch works great. Where real time starts to become a must have, not just a nice to have, is when you want to embed that into an automated action. So you are going to take the result of this SQL query and do something automated. That may be automated marketing. Right? That may be automated segmentation of your users, automated email activation. And today, what ends up happening is the analytics team has these insights, and they're able to, say, push this data back into Salesforce for human, you know, sales reps to take a look at and make sort of these business judgment calls.
But they are unable to deploy this in an automated way without having to go back to that infrastructure engineering team. And that is a persona. The analyst, the business owner, the sort of line-of-business owner that wants to take a manual action and turn it into an automated action and fold it back into their core application, that really is the persona that wants real time infrastructure today that is on par with the batch infrastructure that they are productive with and use happily.
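One hedged sketch of what turning an insight into an automated action can look like in SQL: maintain the segment as a view, then stream its changes back out so a downstream service can react. The tables, the topic, and the CREATE SINK clauses are assumptions (Materialize-style syntax that varies by version), not details from the episode.

```sql
-- Continuously maintain the segment we want to act on.
CREATE MATERIALIZED VIEW at_risk_users AS
SELECT user_id, count(*) AS bad_experiences
FROM support_tickets
WHERE severity = 'high'
GROUP BY user_id
HAVING count(*) >= 2;

-- Stream changes to that segment back out to Kafka so an automated consumer
-- (marketing automation, email activation, etc.) can react without a human
-- in the loop. Connection and format clauses are version-specific sketches.
CREATE SINK at_risk_users_sink
FROM at_risk_users
INTO KAFKA CONNECTION kafka_conn (TOPIC 'at-risk-users')
FORMAT JSON
ENVELOPE DEBEZIUM;
```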
[00:16:15] Unknown:
Beyond just the infrastructure aspects of being able to manage the flow of data, you know, whether that's running a Kafka cluster, what are the ways that the end to end stack needs to be properly integrated or architected to be able to source and generate and process these real time flows, versus just a periodic batch job of reach out to this other system, grab a bunch of information, and pull it into this other system?
[00:16:52] Unknown:
The first one is just moving the data in real time from all the various systems where the data already lives, right? So no data system is in a vacuum. You have to first start by pulling data out of other systems. A huge amount of data today lives in OLTP databases, right, a Postgres, a MySQL, an Oracle. And so change data capture ends up being an important, sort of day one, aspect of real time data. Right? So batch ETL no longer really works, because if you've already had to wait for your data to be exfiltrated sort of once an hour or once a day, there's no potential for real time computation downstream. Right? So just moving your data in real time out of your OLTP systems using a change data capture system. Debezium is a popular one, but many of these databases also have native sort of log replication things that you can scrape off of or push into Kafka.
The second one is existing real time data that already exists. Right? So a classic example of this is Segment web events. Right? Like, you can just get real time Segment events that are coming off of your website or something of that sort. And then you typically want to think about the downstream computations that you are going to do. And over here, I have an opinionated take, which is most people, and this is not an absolute binding universal, but I think most people, most of the time, can get most of the things they want done with just SQL. So what they really want is a system that can connect to all of these input sources, be that databases or web events or arbitrary Kafka events, and then they just wanna write some SQL queries. And today, they do that in batch fine. Right? So you can take all of the stuff and load it into a Snowflake or a Redshift and then write SQL queries. And our contention is people just wanna write that same SQL and have the result of that SQL query just always be fresh.
And that is what Materialize does. Right? So Materialize is a database that under the hood is a full sort of stream processor. But from the user experience level, it's just them writing Postgres queries that stay up to date. You can't get everything done this way. Like, there are times where you do want to write some kind of imperative code. Not everything fits SQL, and not everything fits SQL ergonomically. Right? Like, there's some things that you could wrangle into SQL, and, just like that line, your scientists were so preoccupied with whether or not they could, they never stopped to think whether or not they should. Right? Like, you can write some god awful SQL that you wish you could have written in some other language. But there are also computations that the current sort of SQL standards just don't fit, imperative sort of paradigms.
And those things are always gonna have to live as custom microservices. Right? But the contention that I would make is that 90% plus of what you were forced to push into microservices to get real time, or simply live with as high latency batch, can be done using SQL on top of your streaming inputs.
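A minimal sketch of that "connect the sources, then just write SQL" flow, assuming Materialize-flavored DDL (a Postgres CDC source plus a Kafka source); the connection objects, decoding details, and all table and column names are illustrative assumptions that vary by version.

```sql
-- Change data capture from the OLTP database (here: a Postgres publication).
CREATE SOURCE pg_repl
FROM POSTGRES CONNECTION pg_conn (PUBLICATION 'app_pub')
FOR ALL TABLES;

-- Data that is already streaming, e.g. website events landing in Kafka.
-- (Decoding is simplified; depending on version the payload may arrive as a
-- single jsonb column that you project out in a view first.)
CREATE SOURCE web_events
FROM KAFKA CONNECTION kafka_conn (TOPIC 'web-events')
FORMAT JSON;

-- Downstream, the work is "just SQL": a join across both worlds, kept fresh
-- as either side changes.
CREATE MATERIALIZED VIEW signups_with_first_visit AS
SELECT u.id, u.email, min(e.received_at) AS first_seen
FROM users AS u
JOIN web_events AS e ON e.user_id = u.id
GROUP BY u.id, u.email;
```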
[00:19:52] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudder. As far as the work of being able to consume these real time data streams, once you have the underlying infrastructure, you've got the integration set up.
Are there any kind of educational aspects that you found necessary for data teams or analysts to be able to construct the way that they're thinking about the problem in a way that is more conducive to working with a continuous flow of data versus what might be more kind of natural or familiar to them coming from a batch world where rather than saying, you're going to construct this query where you're aggregating across all data in the entire system as opposed to, you need to think about this differently because what you're actually trying to answer is what is the state of the world right now versus what has been the state of the world across all time? So not everything that can be computed efficiently
[00:21:19] Unknown:
in sort of a one shot computation can be translated into something that can be efficiently incrementally maintained. A good way to sort of get some intuition around this is to think of something where the answer set just changes dramatically second by second. Right? You are pulling out the result of some query where if you ran it this second and then you reran it the next second, you're just gonna get a wildly different result. Right? So you do need to think about those queries where the result set changes in some manageable way, because you are going to be doing work, or you're gonna be asking whatever system you're using, be it Materialize or something else, to do work proportional at the very minimum to the rate of evolution of the result set. Right? You might genuinely want a fire hose of downstream computation that changes very quickly, but then you're gonna need sort of the amount of resources, you know, in terms of CPU cores or what you will, that is able to handle that. The key piece of education that we have to do is that it is worth thinking about what you actually want in this result set, because a lot of the time, people, you know, tend to think about it in terms of the amount of inputs, and it's really in terms of resource allocation. We need resources to deal with the amount of inputs that are changing.
And if you reframe it in terms of the amount of outputs that change, there are opportunities to use vastly fewer compute resources and have something that may have a sort of fire hose of data coming in, but is able to quickly look at that data, make some incremental changes, and discard the data, and hold much less state than people think is necessary, because incremental maintenance of the query results using resources proportional to the outputs is much, much, much less than sort of hoovering in all the data, computing everything, throwing away 99% of it, emitting an output, and then sort of redoing it in a non incremental sort of rinse repeat fashion.
And that means that people come in with an expectation, or they've been trained by their batch systems and by the sort of hoover-the-data, rinse-repeat batch pipelines that they've built, and we need to educate them in thinking sort of incrementally.
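A small sketch of the "think in terms of how fast the output changes" lesson, using a hypothetical page_views stream; both statements are ordinary CREATE MATERIALIZED VIEW declarations, and the difference is entirely in how big and how volatile the result set is.

```sql
-- Fire hose in, small result out: the result has one row per (page, day), so
-- the state the engine keeps is proportional to that small output; raw events
-- can be folded into the counts and then discarded.
CREATE MATERIALIZED VIEW daily_page_counts AS
SELECT page, date_trunc('day', viewed_at) AS day, count(*) AS views
FROM page_views
GROUP BY page, date_trunc('day', viewed_at);

-- Fire hose in, fire hose out: the result changes as fast as the input does,
-- so no amount of incremental cleverness avoids per-event work downstream.
CREATE MATERIALIZED VIEW enriched_views AS
SELECT v.*, u.plan
FROM page_views v
JOIN users u ON u.id = v.user_id;
```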
[00:23:42] Unknown:
Another aspect of this is the kind of current state of the world for streaming has been fairly, I don't want to say static, but it has been complicated for a long time. And I'm wondering what you see as the contribution of that state to the general attitude of teams towards their appetite for even starting to think about adopting real time or incorporating real time into their applications, and what they see as that kind of barrier to entry or initial required investment, and maybe some of the ways that the reality is no longer what they assume it to be.
[00:24:23] Unknown:
We encounter this all the time. Right? Like, in fact, we go out of our way to not say streaming, because the associations that people have with the word streaming are that of grief and despair and pain. Right? I don't think it is inaccurate to say that sort of streaming today is roughly where Hadoop was, you know, 15 years ago, which is this giant confusing mess of things that you sort of have to constantly wrap your head around and deploy and staff up teams of, you know, tens of engineers to even just keep running. This is only gonna go away with results. Right? So people have to be able to use real time systems, get real time results where the underlying engine is a stream processor, without really thinking about stream processing or streaming.
And it's only once that happens for a large enough ecosystem of folks who no longer have that prior association with streaming are things gonna change. Right? I think there's a lot of scar tissue and a lot of existing associations that it's simply gonna be a matter of time before that no longer is the case. That said, I do want to take a moment to talk about the fact that existing streaming systems, at least they exist. Right? Like, they are delivering a lot of value to a lot of folks who are able to build and deploy these systems. Right? So the same as with the Hadoop ecosystem. Right? There were lots of companies that had full at scale deployments of clusters and such and were getting value out of those systems.
The problem with that world was it required that you had an organization that could hire a hundred Hadoop experts. Right? And not every organization can do that. Right? The democratization of the ability to use big data and get results and be sort of data driven, so to speak, really came about because with cloud native batch data warehouses, organizations that couldn't hire Hadoop engineers got the capabilities of organizations that could hire Hadoop engineers. Right? And I think the same thing is true of streaming. Right? So if you are a streaming first company, great example, Netflix.
Right? Netflix can get whatever they want done using streaming, because they can hire the best of the best, and they can build and maintain whatever they need to maintain. Not every company has that luxury. Right? If you are a company that is not in Silicon Valley, that cannot sort of snap your fingers and have your recruiting team hire, you know, 20 streaming fluent microservice architects, you just have no options today.
[00:27:12] Unknown:
As you have worked with organizations and they've started to incorporate real time data into their products, into their analytics, I'm curious what you have seen as the ways that that changes the way that they think about the information that they need, the types of analyses that they are interested in, and the ways that data is applied and incorporated across the organization?
[00:27:40] Unknown:
The first is that folks tend to get pulled back into production; there's a sort of gravitational pull. Right? So what has in the past had analysts sort of sitting only near the exhaust of the data and then pushing things back to human stakeholders throughout the organization turns into a closed continuous loop, which means that folks who have not had to think about the problems of running production systems have to start thinking about production systems and thinking about their deliverables in a production system. What this means is, if you have previously been a stakeholder that has only dealt with cloud data warehouses and then handing data back, you know, manually, human to human, to other members of your organization, and now you are delivering something in an automated fashion, you now have to think about an on call rotation, getting alerts, getting paged if these systems go down. That is something that the production folks have always had from day one. Right? Like, if you are a DBA, you've always had this. And so that's, I think, the biggest mindset shift. It's sort of, as you go to real time, as you go to production, you have to involve a set of considerations that you may not have had to have before. And that is, I think, an ecosystem wide journey with things like reverse ETL or pushing things back into production systems. Folks have to sort of change the way they think. The second one is the expanded capabilities that they can take advantage of. We've had some sort of funny or, I think, you know, pleasant interactions where folks have artificially restricted themselves by sort of staying in the narrow lanes that their streaming systems have allowed them in the past.
And once they realize that those, you know, restrictions are lifted, they have a wonderful set of opportunities in front of them. The biggest one of these is joining data from multiple data sources. Right? This is something that has been extremely difficult using, you know, previous generation streaming tools. Once you can start to look at a join between your production OLTP data and the log from your Segment web events, right, the list of possibilities, the features you can build, widely expands. And so people oftentimes go into sort of this flurry of, oh, I can do this. I can do this. I can do this. That's very pleasant to see as somebody who builds data systems, to just see the possibility space expand.
[00:30:08] Unknown:
To your point about kind of managing state and the other earlier question about computing the history of the world, another aspect of real time and streaming that has been a kind of mainstay of the space for a while is the need to have two separate systems in order to effectively combine historical analysis with real time updates, where you have your batch system where you do these big expensive computations where you can compute across everything, and then you have your streaming system which is just for the most recent data, and then you have to periodically age things out of that into your batch systems.
And there have been evolutions of that with things like Pulsar and Kafka having the ability to tier their storage so that it's not all sitting hot on disk. They're able to archive that out to S3, but then still be able to use the Kafka cluster and the Kafka APIs or Pulsar APIs to, you know, work across that historical data as well. And I'm curious what you see as the current state of the art for being able to work across the real time and the historical data and maintain that statefulness, and kind of what the trade offs are of either still requiring two different systems to be able to be effective, or if you're able to actually integrate everything into one system, one interface, and one kind of way of working with the data.
[00:31:38] Unknown:
I will take the opinionated stance that that is horribly broken, and people should never have to sort of manually think about aging, or about building a Lambda architecture that has a manual reconciliation sort of process to move data from real time to batch. I do not think there will be one system that people use to solve all of their problems. Right? The key distinction I would make is people should choose different tools because they have different business needs. Right? And they have different teams that have different workflows for which different tools are suited. I think it's wonderful that there are many databases and many data platforms for different folks, you know, but it should be based off of actual sort of different sets of stakeholders having different sort of UX or workflow needs, not because the systems under the hood sort of need to have that data migrated. Right? When a single stakeholder has to use multiple systems, I think that does nothing but slow them down. I think real time data platforms like Materialize will sit side by side with systems like Snowflake or Redshift, because there are different stakeholders and there are different data pipelines.
For instance, BI tools that are doing data exploration. Right? They're probably gonna run off of the batch system. But if you have to build some sort of manual reconciliation thing that's pulling data from your Snowflake and from Materialize, or from a batch system and a real time system, because the real time system can't handle the scale and the dataset size of the batch system, that is a flaw. That is bad. Right? So we take the belief that the real time system should be able to deal with all the state management, with the seamless, completely transparent usage of cloud object stores like S3 to deal with terabytes and terabytes of state, without the user having to sort of lift a finger. And that is a good thing, and users should demand that of all of their real time infrastructure.
[00:33:41] Unknown:
And so as you are working with your customers at Materialize and communicating with other players in the ecosystem, I'm wondering what you see as the broader trends and evolutions in the space and the kind of underlying technologies, and some of the ways that the feedback that you've gotten from working with some of your early customers and design partners has influenced the way that you think about the implementation of the Materialize platform and the ways that you want to expose the interfaces and shape the interactions with the underlying data for these end users?
[00:34:18] Unknown:
I think the biggest shift is this sort of tendency for everything to go to prod. Right? Like, the fact that we have to, you know, support and enable users who are new to running prod services to start thinking in a prod services mindset. Right? So I don't mean in terms of us having an on call sort of production alerting system. That goes without saying. I mean users who have gone from not having to do that, because they run batch jobs that they sort of check in the morning to see if they succeeded or if they failed and hit rerun or something of that sort, to these stakeholders having to have their own notion of alerting and sort of real time continuous testing, real time continuous integration testing before they change a query, because there's essentially a live migration that needs to happen, because they have these running services and dependencies.
Because they've moved from batch to real time. Right? These are new considerations, and these are challenges that we have had to sort of help educate people through. It's also very gratifying, because they are now able to build and deploy production services that before would have to roll through another team, right, where they would say this is a thing the business needs to do, and then they'd have to sort of make a case that this other team would need to be staffed, the one that already has the alerts, that already has the on call rotation, that already manages the streaming infrastructure, and sort of petition that team to do it. It's gratifying to enable folks to self serve and build these production real time services without needing to cross that team boundary, but then there's also the challenge of having to educate them through the hard parts
[00:35:58] Unknown:
of that. In your work of building Materialize, I actually had one of your co-founders, Frank, on a few years ago, fairly early on in your product launch. And I'm curious if you can talk to some of the ways that the product focus and the ways that you think about building this real time substrate have evolved over the past few years, from when you first launched to where you are
[00:36:25] Unknown:
today? That's a wonderful question, because we have had a dramatic improvement, or upgrade, to Materialize through making it a cloud native service since you talked to Frank. Right? So when we first started Materialize, all we wanted was to show that SQL materialized views were even possible on top of real time changing data. Right? So Materialize was a binary. It ran on a single machine, and it simply connected to Kafka clusters. And it could handle a decent amount of volume, but it was fundamentally, even though it could horizontally scale out, a single service. Right? And we have built a next generation cloud native Materialize with separated storage and compute, using cloud object store and persistent storage, that supports multiple use cases that all can share the data but can stay isolated from each other and independently scale. What this fundamentally means is separate compute clusters that can sort of feed into each other, that can stay siloed, that can be replicated so that you can have redundancy in the case of faults or outages.
And all of this is new, and we recently announced it just a couple of months ago in our early access. We have users that have come from the all on premise world, where they were themselves running and orchestrating multiple VMs, each with its own independent Materialize binary, into using a single unified system that allows them to share the underlying inputs while having isolated compute experiences. And that is what's new since you talked to Frank.
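For a rough picture of what shared inputs with isolated compute looks like, here is a hypothetical sketch using the cluster and replica DDL that the cloud platform's early access introduced; the cluster names, sizes, and exact clause shapes are assumptions that vary by release.

```sql
-- Two compute clusters that share the same underlying sources and storage but
-- are isolated from each other; the second carries two replicas for redundancy.
CREATE CLUSTER analytics REPLICAS (r1 (SIZE 'small'));
CREATE CLUSTER serving REPLICAS (r1 (SIZE 'medium'), r2 (SIZE 'medium'));

-- Work is pinned to a cluster, so a heavy analytical view cannot starve the
-- low-latency serving workload.
SET cluster = analytics;
CREATE MATERIALIZED VIEW daily_rollup AS
SELECT page, count(*) AS views
FROM page_views
GROUP BY page;
```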
[00:38:08] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high cardinality joins, aggregations, upserts, and window operations. Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose.
Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs. In terms of the way that you think about the space, I'm curious if there are any early ideas or assumptions about what was possible or what people wanted out of the space that have been challenged or invalidated as you have gone through this journey of building the platform, building the business, and working with customers, and then going through this rearchitecture of the underlying technology?
[00:39:37] Unknown:
So when we first built Materialize, it had no persistence layer. Right? Like, it depended on upstream data sources being endlessly replayable. Typically, you know, this meant a Kafka cluster with infinite historical retention, or it required a Kafka cluster where the compaction settings that it used were completely compatible with replaying the computation. Materialize was a purely in memory, ephemeral system. This did not work for a lot of people, for a variety of reasons, one being sort of tight coupling to the underlying Kafka system. We always had a long term plan of building a persistent system for Materialize, and we quickly came to the realization that this was in fact a fundamental requirement for Materialize's wider adoption. So our cloud native platform comes with persistence by default. Right? Persistence using the cloud native S3 object store, sort of infinitely scaling persistence while still being low latency.
That was, I would say, the biggest shift between sort of when we first started launching Materialize and putting that in front of customers, and Materialize today. In your
[00:40:52] Unknown:
experience of building the platform and working with customers and working with other players in this ecosystem? What are some of the most interesting or innovative or unexpected ways that you've seen real time data applied?
[00:41:04] Unknown:
One of the wonderful things about building a data system is having a business that is horizontal, where you're not coming in tied to a specific industry vertical, which means you get exposed to a wide variety of use cases that come from folks that do have this vertical specific expertise. Right? And so we find ourselves constantly delighted by the things that are possible. And I think this is true of all databases, which are sort of ubiquitously used across all sorts of verticals. The biggest one that we find, and this may not have been new to other folks at Materialize, but to me personally, was the fact that Materialize allows our users to unify their machine learning workflows and do real time sort of online prediction very seamlessly. Right? So, typically, without a system like Materialize, what you would have to do as a user would be to do some sort of training, extract some features, and then you would take those features and, you know, the weights, and then build a separate system that was streaming that would do prediction based off of the now streaming inputs, and then feed that into some kind of cache, and then have that cache be hit by your production application. Right? So now you've got, you know, a training system. You've got a stream processor.
You've got a Redis cache. You have some sort of orchestration workflow to sort of periodically recalculate the feature weights in a batch offline fashion, extract those weights, and then update the streaming pipeline. And all of that complexity is gone, so you can just focus on the core machine learning, by unifying batch and streaming, because you can fundamentally run those complicated training queries on the same system that's doing the online feature prediction. That to me was, I would say, the most personally gratifying.
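A hedged sketch of that unified feature-serving pattern: features are maintained as a view over the raw streams, and the "prediction" is just another view the application reads directly, with no separate stream processor or cache to keep in sync. The tables, the feature definitions, and the linear weights are all illustrative assumptions, not details from the episode.

```sql
-- Feature values maintained continuously from the raw event stream
-- (time-windowing of features is elided to keep the sketch simple).
CREATE MATERIALIZED VIEW user_features AS
SELECT user_id,
       count(*)                 AS order_count,
       coalesce(sum(amount), 0) AS total_spend
FROM orders
GROUP BY user_id;

-- Online scoring as a view: a simple linear combination whose weights came
-- out of an offline training job; the numbers here are placeholders.
CREATE MATERIALIZED VIEW churn_scores AS
SELECT user_id,
       0.8 - 0.02 * order_count - 0.001 * total_spend AS churn_score
FROM user_features;

-- The application issues ordinary point lookups against the view.
SELECT churn_score FROM churn_scores WHERE user_id = 42;
```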
[00:43:02] Unknown:
In your work of building the Materialize platform and exploring the technology and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:43:13] Unknown:
Well, I'd like to say I was mostly eyes wide open starting this journey. I had a lot of benefit from seeing and being a part of building CockroachDB, from when it was, you know, a single node database that would fall over every day to a, you know, tens of nodes, extremely resilient cluster. But the number one challenge, I think I mostly went into this eyes wide open, but there's always more than you think. It's a long journey. Right? It just takes a very long amount of time to build a rock solid production database from scratch. None of this is built in a single sprint. None of this is built with a single team.
And so recruiting, building, and managing a large engineering organization that's building a product that has a very long time before you get, you know, market signal is a challenging process. It's very different from a business where within six months, you can put this in front of customers and then iterate with that customer feedback. Right? It took us many years with zero customers to build the very first version, because nobody buys 80% of a database. Right? Like, you really do have to get that fit and finish to a very, very high bar before you will have your first customer. And so that is a challenging process, and I like to think I went into it mostly eyes wide open.
But it always takes, you know, 20% longer than even your most pessimistic assumption.
[00:44:51] Unknown:
For people who are weighing their options about how to approach their use of data and the way that they architect their data systems, what are the cases where real time is the wrong choice?
[00:45:03] Unknown:
I think real time does not make sense for data exploration. I touched upon this a little bit earlier. But if you are in this genuine exploratory mindset where you don't even really know what you're looking for, and you're sort of poking around and trying to understand your users better, let's say you've just been tasked with an open ended problem, I think a batch system is simply gonna be better, because you are going to be running an extremely high number of one off queries. You know? You're gonna be sitting there going, what is that correlated with that? What is this column when I join it with this other table? Right? And I think a real time system will confuse rather than clarify, because in that mode, you want to reduce the number of variables that are changing.
And the dataset itself moving, the ground shifting under your feet, is one more complication you do not need. It is only once you have come to some conclusion from that exploratory process, where you've said, this is the thing that predicts customer churn: when customers have two bad experiences with my company in a short period of time, we need to immediately do x y z, or else they are never going to use our service. You'll come to some conclusion like that. Right? So that is the point at which real time becomes the right choice. But if you introduce real time into that data exploratory process, it will not help. It will simply hurt.
[00:46:29] Unknown:
As you continue to work in this space and build out and evolve the Materialize platform and business, what are some of the things you have planned for the near to medium term, and some of the ways that you're looking to improve that end user experience around real time data?
[00:46:46] Unknown:
There's always larger scale. There's always more performance work that one can do. We're quite proud of what we've built. We are still in early access. Next year, we will make Materialize generally available, which means that anyone anywhere in the world can sort of click, click, sign up and get access to Materialize. There's a lot of scaling and operational work that goes on behind the scenes. So a lot of that is really making Materialize more accessible to the widest number of users possible, making it available on every cloud service. Right? So today, Materialize is AWS only. There are also some cool features that are on our to do list. Right? So today, Materialize is SQL only, and I wanna sort of plant the seed that, you know, things like user defined functions and triggers and recursive queries, things that are, you know, on the more bold and ambitious sort of end of the spectrum, are things that we are still to build. Right? I'm particularly personally excited about recursive queries. Right? Our underlying stream processor is very capable of handling meaty recursive queries, and most SQL databases have sort of weak, if at all existent, support for WITH RECURSIVE, and I'd like to change that. So that's something that we really wanna do in the coming years.
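Since WITH RECURSIVE comes up here as a roadmap item rather than a shipped feature, the following is just a standard-SQL example of the kind of query meant, walking a hypothetical employees table to compute each person's depth in an org chart.

```sql
-- Classic recursive query: traverse a hierarchy stored as (id, manager_id).
WITH RECURSIVE org_chart (id, name, depth) AS (
    SELECT id, name, 0
    FROM employees
    WHERE manager_id IS NULL          -- the root of the hierarchy

    UNION ALL

    SELECT e.id, e.name, o.depth + 1
    FROM employees e
    JOIN org_chart o ON e.manager_id = o.id
)
SELECT id, name, depth
FROM org_chart
ORDER BY depth, name;
```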
[00:48:04] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:20] Unknown:
I think the biggest gap is, while there have been some advances in bringing good software development practices, sort of repeatable builds, repeatable testing, I still think that data tooling isn't quite at the place of software tooling, where we have, you know, repeatable builds, continuous integration, continuous data validation, continuous data testing. We can make more progress there, such that building and maintaining data applications has the same level of quality that we can assure our customers, or our internal customers, as deploying software does. Right? So we really need to get to the point where it's like a fully automated GitHub with continuous integration tests, nightlies, immediate ability to roll back, you know, bugs, immediate alerts and notifications when there's sort of a data quality issue.
[00:49:13] Unknown:
And I think we're making progress there as an ecosystem, but there's still more work to do. Well, thank you very much for taking the time today to join me and share the work that you're doing at Materialize to help improve the experience and capabilities for people who are working with real time data. It's definitely a very interesting and constantly evolving space. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much for having me.
[00:49:44] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Data Complexity
Interview with Arjun Narayan
Real-Time Data vs Batch Systems
Challenges in Streaming Data
Adoption of Real-Time Data
Infrastructure for Real-Time Data
Educational Aspects for Data Teams
Current State of Streaming Systems
Impact of Real-Time Data on Organizations
Combining Historical and Real-Time Data
Trends and Evolutions in Real-Time Data
Evolution of Materialize Platform
Lessons Learned in Real-Time Data
When Real-Time Data is the Wrong Choice
Future Plans for Materialize
Biggest Gaps in Data Management Tooling
Closing Remarks