Summary
Data has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect their interactions with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset are doing to make it easier for engineers to build those experiences.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Shruti Bhat about the growth of real-time data applications and the systems required to support them
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what is driving the adoption of real-time analytics?
- Architectural patterns for real-time analytics
- Sources of latency in the path from data creation to end user
- End-user/customer expectations for time to insight
- Differing expectations between internal and external consumers
- Scales of data that are reasonable for real-time vs. batch
- What are the most interesting, innovative, or unexpected ways that you have seen real-time architectures implemented?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rockset?
- When is Rockset the wrong choice?
- What do you have planned for the future of Rockset?
Contact Info
- @shrutibhat on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Rockset
- Embedded Analytics
- Confluent
- Kafka
- AWS Kinesis
- Lambda Architecture
- Data Observability
- Data Mesh
- DynamoDB Streams
- MongoDB Change Streams
- Bigeye
- Monte Carlo Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health statuses. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today. That's S-I-F-F-L-E-T.
Your host is Tobias Macey, and today I'm interviewing Shruti Bhat about the growth of real time data applications and the systems required to support them. So, Shruti, can you start by introducing yourself?
[00:01:59] Unknown:
Hey, Tobias. Thanks for having me here. Hey, everyone. I'm Shruti Bhat. I'm chief product officer at Rockset, the real time analytics platform built for the cloud.
[00:02:09] Unknown:
You were on the show a while ago when we first talked about Rockset shortly after you had launched it. But for folks who haven't listened back to that episode, which I'll link in the show notes, can you give a bit of an overview about how you first got started working in data?
[00:02:23] Unknown:
So I was previously at VMware and then at another startup which got acquired by Oracle. And once I was at Oracle talking to all the data teams, the biggest question people were asking was, how do I go from data to app? Right? People had figured out how to build a lake, how to use transactional databases, how to do big data analytics, how to do warehouses. But this new phenomenon of building data products or data apps was just starting, and that is still a big challenge. So I was looking around trying to figure out a good answer, and I met Venkat and Dhruba, who were coming out of Facebook, where they built the Facebook newsfeed.
If you think about it, the Facebook newsfeed is one giant data app. It takes a lot of real time information on who's clicking on what, takes all of your historical information on who are your friends, what have you liked in the past, and then builds this very personalized news feed for you. The news feed is a great example of personalization. So that's really how I got into it. And since then, I think
[00:03:24] Unknown:
we've been seeing a whole lot of new data apps and data products being built. In this question of real time data and to your point about building applications on top of these datasets, what are the kind of main use cases for these low latency datasets and the types of turnaround time that people are expecting and maybe how that factors into whether they are an internal versus an external consumer of that data?
[00:03:51] Unknown:
Oh, great question. Yeah. Putting it as two categories, user facing analytics and then internal operational analytics, is a very clear way of thinking about it. On the user facing analytics side, your end consumers are your customers. So it's embedded in your product. It is sometimes showing up as live dashboards for your customers. Oftentimes, it's alerts for your customers. It could be personalization and recommendation. So ecommerce site: your customer is there. They purchased some things in the past. They're clicking through some things. How do you recommend the right product before the user leaves that session? That's one great example.
Logistics and delivery tracking. I mean, everything that Uber and Instacart and all the guys out there have done, that's a great example. But we see even a lot more of that in supply chain. For example, one of our largest customers, Command Alkon, we published a story. They track 80% of the cement mixers in the United States. And if you think about a cement mixer, it's constantly spinning. If it's raining, you gotta reroute it. If your contractor or your crew is late, you gotta reroute it. So it makes sense. Right? Logistics delivery tracking for your end customers embedded in your application. That's a great example.
More user facing analytics: we see stuff like in game personalization. Everybody's buying a shield right now? Tell people to go buy a sword. Well, that's in game monetization and personalization. So this is very common on the user facing analytics side. But even internal analytics, it's not always BI dashboards. On the internal side, I think of it as: if you have analysts building weekly reports, great. That's not a real time thing. You should be doing that on a warehouse. But if you have people on the ground making day to day decisions, that's operational analytics. And as an example of this, we have a major fintech company that's doing anomaly detection and fraud detection.
So this is one of the top 3 buy now, pay later companies. You have millions of transactions happening across thousands of merchants. How do you catch that anomaly and have your risk team take action in real time? That's one example of internal operational analytics. Still a data app, but it's internal. Another customer, one of the major airlines, I can't mention them, but you know who they are. Well, you know, every time your flight is delayed and they're trying to reroute you, what do you think they're using? They have to look up so many different things to see where is the flight right now, what crew do I have, which flight is overbooked, what's the cheapest way to reroute you to your destination in the fastest time? That's internal operational analytics. So all of these are great examples of user facing analytics and operational analytics where low latency queries matter and/or real time data matters. But low latency queries always matter.
[00:06:48] Unknown:
And in this context of real time data and being able to build analytics on top of it, the overall space of streaming data and the different technologies available to support that have been years in the making and they still, in some cases, feel like they're getting to the point of maturation, but maybe not quite there yet. And of course, that's going to be different depending on what company you're at or what technologies you're using. And in this question of real time data and access to it, I'm wondering if you've seen that the growth of adoption for these real time analytics use cases has been driving the improvement in those technologies, or if it's the other way around where the presence of those technologies and the realization that this is a possibility is driving the adoption of real time data?
[00:07:35] Unknown:
It's a little bit of a loop. It's a flywheel. Right? It's not one or the other. Both of them are kinda happening together. I remember when I first talked to the Confluent team behind Kafka, I think 5, 6 years ago, they kinda laid it out in terms of phases, and they said phase 0 was already done, where a lot of people were collecting real time data. But they were just getting into phase 1, which is, finally, how do you get value out of that? And how do you actually build apps on top of that? And I think we're seeing a lot of that phase 1 today, where they're still building these apps. And phase 2 is where you go from that to a lot of automation, where the default expectation is not a real time dashboard. The default expectation is you have a program that monitors things for you, taps you on the shoulder, and tells you what to do, and that is phase 2. We're not even fully there yet. We're still in phase 1, where people are just starting to harness the real time data and make decisions on that in real time. And we see it go both ways. We come into accounts where they invested so much in, say, Kinesis or Kafka or even, I would say, CDC streams. So it's not the event stream, but CDC streams coming from Mongo or DynamoDB.
So real time data, you know, little nuance: it's not always coming from Kafka and Kinesis as an event stream. Sometimes it's coming as a CDC stream from Postgres, you know, from Oracle. Mongo's done it beautifully. Dynamo streams is beautiful. But we see people invested in or really bought into CDC streams and event streams, and that's one of the reasons they're looking at how to make decisions in real time using that. But oftentimes, it even goes the other way. We come in. We prove that you can reduce the cost of your overall operation. You can get, you know, more speed. Well, suddenly, they'll go and invest more in their real time infrastructure. So it goes both ways, I would say.
[00:09:37] Unknown:
In the space of real time analytics, whether that's embedded analytics for end users or low latency analytics for internal consumers, you know, for instance, the flight rerouting capability that you were mentioning, what are some of the common architectural patterns that teams have settled on to be able to build out and support these use cases, and maybe some of the points of friction or operational complexities that come about as a result of these distributed systems problems that are inherent to working with data, particularly at high velocity and high volume?
[00:10:13] Unknown:
Yeah. Yeah. The scale definitely makes a big difference here. So the biggest friction, I'll start with the friction, what not to do, and then go into what to do. What not to do is don't try to do things the batchy way. Because, you know, in the batch world, certain things that we've learned, we've so ingrained into ourselves, and they don't work in the real time world. So you have to almost unlearn some of the things. Start from first principles. Go back to first principles. And the minute you get into first principle thinking, you unlock a ton of value. What do I mean by this? Take, for example, preprocessing.
Take, for example, data modeling. How should you think about it? We've all learned in the batch world, the more you invest in preprocessing and data modeling, that's the right way to go, because that's how you operate at scale. That's the clean way of doing things. If you don't want things to break downstream, if you wanna control your costs at massive scale, in the batch world you're being taught that you have to do a lot of this: do the data modeling upfront, invest in that upfront, do a lot of denormalization if you must, do the preprocessing, do the pre aggregations offline, because that's how you save cost.
In the real time world, a lot of those things you have to question and go back to first principles. What I mean by that is, the more hops you add, the two things you're doing are: a, you're introducing more data latency, and, b, you're adding a ton of complexity, because in real time, once things break, it's very, very hard to go back and, you know, fix your real time data pipelines if you have a very complex, multi hop kind of real time data pipeline. So I would say there are a couple of things to keep in mind. One is simplicity scales, complexity doesn't.
The more real time you're going, go simple. Right? Have fewer moving parts, because that's how you achieve scale, and that's how you achieve speed at scale. And when you say fewer moving parts, that also means really question where and how you do your preprocessing and your data modeling and your aggregations. So I'll give you a great example. The fintech company that I was talking about, they had built this whole thing, anomaly detection, on a batch pipeline with their previous approach. An amazing, savvy team, and they brought it down to 6 hours, but their business is still exposed to risk for 6 hours. And the cost was crazy high because they're taking massive volumes of data and running through these multiple hops.
When Rockset came in, we looked at the whole thing and said, wait. Cut that step. Cut this step. Cut this step. And we're still doing pre aggregation. So it's not that we're not doing pre aggregations, but here's the new architecture that they have. They've gone from the stream, so it's Kinesis, to Rockset directly. But as they go into Rockset, it's all schemaless. It's all JSON. So they do not do any data modeling. They do not do any schemas, completely schemaless. So they completely eliminated the need for, you know, dealing with schemas.
It's very deeply nested JSON, but they've eliminated having to unnest their JSON. Again, this is super important. You wanna eliminate these steps. They don't unnest their JSON anywhere else. And they still do what we call roll ups or pre aggregations, but they do them in real time. Now, actually, last time we spoke, we didn't have this new capability we've since added, which is you can do a SQL transform in Rockset before the data is indexed and stored. So as the Kinesis stream comes in, they're rolling it up, and roll ups basically allow them to reduce their storage cost, I think, by almost 10x.
But the beauty of it is, now in real time, like, the data's arriving within 1 to 2 seconds. They've pre aggregated the data. The only two parts they have here are Kinesis and Rockset, and now they're running SQL queries on it to catch anomalies. Now on top of that, yes, they've built alerts and triggers and some really interesting things, but this is the end to end pipeline. It's Kinesis, Rockset, and whatever they wanna use on their end to alert whenever there's a fraud. So you see, it's become a very massive scale real time system, but with very few moving parts.
And when we did the TCO analysis, they actually cut their cost, I think, almost by half, and they're able to achieve the speed where, instead of 6 hours, they get alerted within 1 to 2 seconds of something happening. It's a very different architectural pattern.
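To make that architecture concrete, below is a minimal Python sketch of the ingest-time rollup pattern described here: raw stream events are folded into per-merchant, per-minute aggregates as they arrive, and the anomaly check runs as a simple query over the rolled-up state. The event fields, the one-minute window, and the threshold are hypothetical; in Rockset the rollup step would be expressed as a SQL transform applied before the data is indexed, not as application code.

```python
# A minimal sketch of the "aggregate at ingest" pattern, not Rockset's actual
# rollup API. Field names (merchant_id, amount), the one-minute window, and
# the threshold are hypothetical.
from collections import defaultdict
from datetime import datetime, timezone

# rolled-up state: (merchant_id, minute_bucket) -> running totals
rollups = defaultdict(lambda: {"txn_count": 0, "total_amount": 0.0})

def ingest(event: dict) -> None:
    """Fold one raw stream event into a per-merchant, per-minute rollup.

    Storing only the rollup (instead of every raw event) is what drives the
    roughly 10x storage reduction mentioned in the episode.
    """
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    bucket = ts.replace(second=0, microsecond=0)   # one-minute granularity
    key = (event["merchant_id"], bucket)
    rollups[key]["txn_count"] += 1
    rollups[key]["total_amount"] += event["amount"]

def anomalous_merchants(threshold_per_minute: int = 1000):
    """The query side: flag merchants whose per-minute volume spikes."""
    return [
        (merchant, bucket, agg["txn_count"])
        for (merchant, bucket), agg in rollups.items()
        if agg["txn_count"] > threshold_per_minute
    ]

# Example: two moving parts only, the stream feeding ingest() and the query
# that alerts. No separate batch preprocessing hop.
ingest({"ts": 1660000000, "merchant_id": "m-42", "amount": 19.99})
print(anomalous_merchants())
```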
[00:14:54] Unknown:
For organizations that have already invested in building out analytical capabilities, whether that's by building a data warehouse or a data lake, or whether they're doing analytics directly off of their application architectures, what are some of the motivating factors for moving into this real time space, and some of the considerations that they're typically working through as far as what are the overall costs going to be, do I have the necessary talent pool to be able to support this infrastructure and support these approaches? You know, what are the changes that we need to make in terms of how we think about what types of analytics we're providing or what we're building? And, you know, is it generally an either/or, or do people generally settle on, okay, we are going to use batch for these use cases and real time for those use cases? Just some of those considerations and the breakdown that they go through as they start to think about what capabilities they can offer in a more real time manner.
[00:15:52] Unknown:
Oh, it's definitely both. There's a place for both. It really depends on your use case. I know, you know, every data engineer has probably heard this a thousand times: pick the right tool for the job, match the tool to the use case. The biggest consideration I'm seeing, though, in this economy, one thing has changed, and I hope all the data engineers and data architects listening are paying attention to what's happening in the economy and how they need to get their projects approved in the next 6 months. It's gonna be very different from what was true in the last 6 months. We're already seeing this. If you look now, suddenly, there's a lot more power in the hands of CFOs.
Whereas in the past, you could do a lot of innovation projects that got approved by your CTOs or CPOs. I'm myself in a product role, but suddenly, I have to really pay attention to what's happening on the financial side, and it's a different equation now. So if you wanna get your projects approved, I would say going forward, it is gonna come down to what is the price performance. If you can't prove that, you know, you can achieve the same thing with lower cost and better performance, it's gonna be incredibly hard to make those shifts in the next 6 months. What we see is, going back to your earlier point, the use cases are at a place where people are already trying to do this, and consumers are demanding this.
If it's user facing analytics, you're doing it because your customers are demanding certain features. Your customers are demanding certain analytical capabilities in your product. And at that point, you have a couple of choices. You either try to build it on your existing warehouse, say, you know, something like Snowflake or Redshift, or you try to build it on your existing transactional database, like, say, Mongo or Postgres or even Oracle, if you will. You only have these 2 things in your toolkit if you're not using something like Rockset already. And what we found is that when you try to do it on either of these things, for user facing analytics particularly, the price performance is just not there. Why do I say that?
With warehouses, for example, we've gone in and seen this in actual customer scenarios. For user facing analytics, two things are different. One, you have a lot of frequent updates. And when you have a lot of frequent updates and you try to do this on your warehouse, very quickly you'll find that the warehouse does these very expensive merge operations. Why? Because it was designed for a batch world, and it's assuming that you're gonna batch your updates. So no wonder, if you don't batch your updates and you try to send it very frequent updates, you blow through your credits in no time. And we've had a lot of customers do this and say, oh, I'm forced to take my CDC stream and batch it in 15 minute or 30 minute or 1 hour increments because otherwise, my warehouse cost goes through the roof. So that's one consideration.
The other big consideration is your queries. With user facing analytics, your queries are different. You're no longer doing these very scan oriented, scan intensive queries. I'll give you an example. If you're running a weekly report, you might say, what is the average ASP in Europe this week compared to last week or, you know, compared to last year? We are comparing, like, ASP across regions. Well, that's very commonly scan oriented. You know? You have to scan through everything to compute an average. Sure, a warehouse does a really good job there. But think about personalization use cases, logistics tracking use cases: tell me everything you know about Tobias right now, from, you know, his purchase history to his click stream to his interests.
That's a very selective query. And if you try to do this on a warehouse, you can see why you're wasting a ton of compute, because it does brute force scanning for the most part. So this is the real problem, which is, with the new kind of data apps or data products that developers are building, suddenly, they're not analyst style access patterns. Your data access pattern is different. Your data load pattern is different. Developers deal with data differently. So if you have a developer working with it and you're trying to force them to use a warehouse, no wonder your cost is blowing up, because it's just not the right tool for the job.
So this is where, you know, we've seen, yes, if you're using it for an analyst doing a weekly report, you're probably getting the best price performance, because you can have them, you know, spin up that warehouse, just run the ad hoc queries, spin it down, and you're working on yesterday's data. Perfect. But if you're dealing with developers, that whole paradigm is broken. No wonder your price performance is not working for you. So when we've gone in, we've actually seen up to, and I'm gonna say up to, it's a very marketing claim, but I've seen up to half the cost and double the performance simply by switching from a warehouse to something like Rockset, which is a real time analytics platform. And I don't say just Rockset. I would say, you know, choose the right real time analytics platform that works for you, but choose the right tool for the job. A warehouse is probably burning a hole in your pocket. And if you can make a case to your CFO that you're gonna double the performance and cut the bill in half, that project is getting approved.
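As an illustration of the access-pattern difference described here, below are two hypothetical queries over the same data: a scan-heavy analyst report versus the highly selective, per-user lookup a data app issues on every page load. The table and column names are invented for the example and are not tied to any specific warehouse or to Rockset.

```python
# Two hypothetical queries to make the access-pattern difference concrete.
# Table and column names are made up for illustration.

# Analyst-style, scan-oriented: touches (almost) every row to compute an average.
weekly_report_sql = """
SELECT region, AVG(selling_price) AS avg_asp
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY region;
"""

# App-style, highly selective: everything about one user, right now.
# With an index on user_id this reads a handful of rows; brute-force scanning
# the same tables on every page load is where the compute bill explodes.
user_profile_sql = """
SELECT o.order_id, o.item, o.selling_price, c.page, c.clicked_at
FROM orders o
JOIN clickstream c ON c.user_id = o.user_id
WHERE o.user_id = :user_id
ORDER BY c.clicked_at DESC
LIMIT 50;
"""

print(weekly_report_sql, user_profile_sql)
```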
[00:21:21] Unknown:
To your point about choosing the right real time platform, what are the different capabilities or feature sets, or maybe operational capabilities or operational models, that teams need to be thinking about as they're making those determinations? So, you know, whether it's compliance reasons or the kinds of interfaces, inputs, and outputs that are supported, just what are the attributes of these different platforms that are going to vary, and how do teams need to think about which one is going to be the right fit for their needs?
[00:21:55] Unknown:
Yeah. Great question. I think I didn't answer your previous question on what the training for the team should be, or how you decide based on people capability. So that's one for sure. But let's build it out. You know, I'm jamming with you here. If you are evaluating something for user facing analytics or your internal operational analytics, it's a real time platform. What should your eval look like? How do you set up the right POC? How do you ask the right questions to your vendor? So I would definitely put price performance at the top of the list. And again, price performance is the right way to think about it. Don't think of it just as cost or TCO or performance only.
The way you think of price performance is almost like miles per gallon, which is, you know, we were just talking about this before. In the old world, when gas was cheap, you could just say, I don't care about miles per gallon. I just want the fastest car. Today, that doesn't work anymore. Gas is crazy expensive. You should care about miles per gallon. Similarly, you know, CFOs suddenly have very tight budgets, so you should care about price performance. And price performance basically means, see how much compute you're spending. The biggest thing when it comes to these workloads is compute.
You're not gonna really burn a hole in your pocket with storage. You know, storage today is so commoditized. And also, most of the user facing analytics and operational analytics projects that we've seen, they don't have petabyte scale datasets. They tend to have, you know, tens to hundreds of terabytes. If it's a petabyte scale dataset, chances are you're doing offline, you know, year over year kind of analysis, and that should live in your warehouse. But for user facing and operational analytics, for real time analytics, we see tens to hundreds of terabytes, especially because you're doing roll ups, pre aggregations, and you're doing retention policies.
If you manage it the right way, storage is not the thing. Compute is the thing that's really blowing up your cost. So set up a price performance test and really measure, for every hour of compute, whatever you wanna peg it at, 4 CPUs, 8 CPUs: for every hour of 8 CPUs I spend, what kind of performance am I getting? Every minute, every hour, however you wanna measure it. So pick your dataset, run it on the different vendors, and say, this is the real price performance for compute.
And that's almost like a compute efficiency metric that you should care about, because this applies to real time analytics and even to warehouses. I mean, everybody who's spending a ton of money on Snowflake will tell you it's not the storage, it's the compute that is really expensive. So build that compute efficiency metric. Anytime you're running a Snowflake warehouse 24 by 7, you should ask yourself, is this, you know, the right workload? Maybe I should go do a compute efficiency comparison against some real time analytics platform. Build price performance as your number one thing. Compute efficiency matters.
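A rough sketch of that "miles per gallon" calculation: the function below just turns a POC measurement (queries served, CPUs, hours, price per CPU-hour) into queries per compute-hour and dollars per million queries, which is the kind of apples-to-apples number you can put in front of a CFO. The figures in the example call are made up purely to show the arithmetic.

```python
# A rough sketch of the compute-efficiency metric described above. The numbers
# in the example call are hypothetical.

def compute_efficiency(queries_served: int,
                       cpus: int,
                       hours: float,
                       dollars_per_cpu_hour: float) -> dict:
    compute_hours = cpus * hours
    cost = compute_hours * dollars_per_cpu_hour
    return {
        "queries_per_compute_hour": queries_served / compute_hours,
        "dollars_per_million_queries": cost / queries_served * 1_000_000,
    }

# Hypothetical POC run: 8 vCPUs for 24 hours serving 2M queries at $0.05/vCPU-hr.
print(compute_efficiency(queries_served=2_000_000, cpus=8, hours=24,
                         dollars_per_cpu_hour=0.05))
```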
The second one is what you called out, which is people. You have the people that you have. You have the skills that you have. Yes, they can learn new things, but these people already probably know SQL. So, you know, put some of the requirements down. SQL matters not only because of people, but also because it's the thing that everybody's standardizing on in the industry today. If it doesn't speak SQL natively, if it's anything that says "SQL like," it's not SQL. So, yeah, you really wanna put down what your people are capable of. And generally, to me, that comes down to SQL.
Again, continuing on the people theme, there's the question: are your people in the business of managing infrastructure or managing data? All the open source tools are amazing because they let you go super deep. They let you look at the code. But if you're not manipulating that code, if you're not, like, an active open source contributor, what it ends up doing is putting a lot of burden on the team to manage that infrastructure. You know? Open source is amazing. We are not open source. Rockset is not open source. But we study a lot of the open source tools, and we think they're amazing if you're running in the data center on prem, because now you can control your hardware. Now you can eke every bit of price performance out of it.
In the cloud, it's a different story. If you're running it in the cloud, again, this is where you should ask yourself, where am I gonna run this? If I'm running it on prem and I wanna control the environment that much, open source makes sense. If I'm running it in the cloud and I can't spin up and spin down instances every minute, every second, I'm actually doing something wrong. So that's the philosophy we take at Rockset, which is we're fully managed in the cloud. We have auto scaling ingestors, auto scaling pods, auto scaling compute.
The reason we do that is you shouldn't have to worry about anything. We should be, you know, not only spinning up but also spinning down every second. That's how you get, you know, again, going back to compute efficiency. So three things we touched on: price performance; people, which is, you know, SQL; and then operational efficiency, which is, what is it that matters? Do you want to manage it on prem, or do you wanna do this in the cloud? And then going back to security and compliance, that's a big one. I think we run into this a lot, especially if you're using managed services like Rockset. You wanna ask all the compliance questions.
You wanna ask the security questions. You wanna ask about, you know, all the good stuff like private link. Do you connect to my VPC via private link? So a bunch of security questions. Encryption: do you encrypt everything, you know, at rest, in flight? What's going on there? You have to understand what the key management is, and there are lots of good things on the encryption side. But, generally, what we've seen is, if the compliance is in place, like SOC 2 Type 2, HIPAA, all the compliances in place, and these people are passing the security reviews of all the big companies, they probably have the right systems in place. And, you know, if I go back to our early years, we didn't have some of these compliance things in place. So we weren't ready to sell to the enterprise. Over the last few years, we have gotten there.
So by now, we have a security team. We have the whole enchilada down. So you should be able to go back, ask them all the right questions, put your security team in touch with their security team, and as a data engineer, not worry so much about it. Just say, hey, as long as you pass my company security review, you're good to go. And then the last thing I would say is what I mentioned earlier, which is, let's say, all of these make sense. Price performance makes sense. You know, you have SQL, so you have interoperability with your ecosystem.
You have the operational, you know, cloud model that works for you. You have the security and compliance. But the last one that's super important is what I touched upon earlier, which is, how do you minimize the number of moving parts? Because this is, I think, the thing that trips up people the most as they're going into this world of real time. Simplicity, simplicity, simplicity. So if you find that you have to do 10 hops, something's wrong. That's not gonna scale. You're going to constantly be debugging your real time data pipeline. But if you find that you can build a simple thing with three pieces, that is a very, very scalable model. So if you need to denormalize, something's broken.
Do you really need to, you know, use third party systems, whether it's for data modeling or for schema management? Ask yourself if something's broken. The fewer moving parts you have, the cleaner it is, and now you know it's something that scales, because you're not building for today. You're building for the next few years.
[00:30:09] Unknown:
The biggest challenge with modern data systems is understanding what data you have, where it is located and who is using it. SelectStar's data discovery platform solves that out of the box with a fully automated catalog that includes lineage from where the data originated all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your DBT, Snowflake, Tableau, Looker or whatever you are using and select star will set everything up in just a few hours. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
And on that question of simplicity and the number of moving parts and also to your earlier point about the size of the data that you're working with in these real time systems is the question of when you're talking about real time, generally, you care about what has been happening in the past few seconds, minutes, hours. You know, what are the reasonable time horizons that people generally settle on as far as how much of the data do I want to push into my real time workflows and be able to surface in these embedded analytics use cases or real time analytics use cases?
And then how much of it do I then age out into these batch infrastructures, whether it's the data lake or the data warehouse? And then that introduces the question of whether we just recreated the Lambda architecture with some newer technologies and, you know, all of the problems that go along with that.
[00:31:38] Unknown:
So on the question of what you should keep in your real time systems, you know, if you look at it purely from the lens of time series data, yes, a lot of those questions pop up, because when you think real time, you think time series. But I wanna challenge that a little bit, because time series, yes, it is one part of it. When you talk about sensor data, when you talk about click stream data, again, it goes back to event streams. And when you talk about event streams, sure, there's a lot of "I only care about the last 7 days." Let me give you an example for sensor data. Let's say you're UPS. Right? And you have these smart drop boxes. Whenever somebody drops something off, you wanna reroute the truck immediately to go pick that package up in the most efficient way.
Suddenly, somebody adds a new field called temperature, and temperature starts showing up. And, you know, like I said, if you have fewer moving parts, none of your pipelines will break. Temperature will just propagate through your stream and into, say, something like Rockset, and you can start working with it. But the real question is, do you care about temperature every second? Do you care about temperature 5 times a second, which is how often the sensor is sending it? Probably not. You only care about, well, if somebody's shipping wine, you wanna be careful about the temperature in that drop box. But every 5 minutes, that's good enough.
Maybe even every hour is good enough, and that's a determination that every customer makes. So it's not just 7 days or how old. It's also at what granularity you store it. So when it's event streams and time series, you have to ask yourself two questions. One is, how much data do I want to store? 7 days, 1 day, or even, many times, many months or many years. That's okay, as long as you're storing it at the right granularity. So we have people who are aggregating it on a per day basis. And if you start aggregating on a per day basis, you can actually store many months of data. Or sometimes they'll say, I want to store the last 7 days at a certain granularity.
And anything older than that, I can store at, you know, certain different granularities. And all of that should be seamlessly handled in your analytics platform. So suddenly, it's not about only send the most recent data here and send the rest there. It's really about, again, rethinking your architecture and saying, for time series data, you should be able to set it up for the right granularity and also think about retention in the right way. If you do both, oftentimes with event streams, you're okay. But the second big one I talked about is CDC streams. This doesn't apply as often to CDC streams, because CDC streams are coming from your transactional database.
They're getting updates, and this is where, you know, upserts matter. Again, with Rockset, as you know, Tobias, we went upserts first. This is why we're one of the few real time analytics platforms talking about CDC streams so vocally. With CDC streams, anything that doesn't handle upserts is super inefficient, because you're doing merges. Elasticsearch, same problem. Snowflake, same problem. I mean, any other system you take, they're not doing upserts. That's a problem. Now with CDC streams, it's not the Lambda architecture anymore, because you're bringing in almost everything that's happening in your transactional database, and that volume is oftentimes not that big. You're only bringing the right tables. So I might need, for example, let's take personalization.
I only need the purchases that Tobias has made to be able to, you know, personalize your experience in the ecommerce platform. Maybe you have a lot more tables in your transactional system. I don't need those. So you selectively bring in the tables and the fields. Like, you only bring in the things that you need from your transactional database. And there, roll ups don't matter, because it's not time series. And retention policies don't matter, because I need to keep all the transactions that you've ever made. But, again, the volume of that data is very, very small. We're never talking petabytes in a transactional system.
We're talking very small data.
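Since the difference between upserts and batch merges can be hard to picture, here is a minimal sketch of a CDC consumer applying changes in place, keyed by primary key. The event shape is hypothetical (loosely modeled on common CDC formats) and is not Rockset's, MongoDB's, or DynamoDB's actual wire protocol.

```python
# A minimal sketch of why upserts matter for CDC streams: each change event
# carries a primary key, and the consumer applies it in place instead of
# accumulating batches to merge later. The event shape is hypothetical.

orders: dict = {}   # primary key -> current row

def apply_cdc_event(event: dict) -> None:
    """Apply one insert/update/delete from a CDC stream, in place."""
    key = event["key"]
    if event["op"] == "delete":
        orders.pop(key, None)
    else:  # "insert" and "update" both become an upsert
        orders[key] = {**orders.get(key, {}), **event["after"]}

# The analytics copy stays a second or two behind the transactional database,
# with no 15-minute batch-and-merge step in front of it.
apply_cdc_event({"op": "insert", "key": "order-1",
                 "after": {"status": "placed", "total": 42.0}})
apply_cdc_event({"op": "update", "key": "order-1",
                 "after": {"status": "shipped"}})
print(orders["order-1"])   # {'status': 'shipped', 'total': 42.0}
```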
[00:35:54] Unknown:
To that question of data volume, what are the kind of breaking points between, I need to run this analysis in a batch fashion using my data warehouse or my data lake, or this is a small enough volume of data where it makes sense to run this in a real time environment because I can, you know, recompute these aggregations or recompute these analyses fast enough to be able to account for the newer data that's flowing in while building off of the pre aggregations that I've done or the, you know, rolling windowed aggregations that I'm managing. And then as you do get outside those windows for these different reasons that we were just discussing, you know, how to manage either aging that out into the batch system or whether you see folks typically doing a double write where they will propagate it to both the batch and the real time environments at the same time?
[00:36:48] Unknown:
Yeah. Great question. So, typically, it's not based on the data size. At least whenever we've spoken to customers, it's more about whether you need those low latency queries, whether it's user facing analytics, whether it's operational analytics. And who is gonna consume this is the first question we ask. If it is for an analyst doing ad hoc queries, or for a data scientist doing offline training, that is a batch use case. That's the best, you know, way to think about it. But if it's operational analytics, it's people on the ground making day to day decisions, or it's user facing analytics where it's embedded inside your product and it's developers building apps on it.
Then no matter the size of the data, there is a way to do this in real time where you can make it much more price performant. And, again, it goes back to what granularity, how do you do this rolling window, how do you drop the fields that are not necessary. And it always comes down to if this is the use case, it can be done in real time at a much lower cost, oftentimes, than doing it on a batch system. It constantly trips people up because they think real time is more expensive. And every time we've gone in and done this, we've been able to go and, like, cut the cost in half compared to doing it on a batch system simply because you're using the right tool for the job.
And if you do it in the right way with the rolling windows and bringing the right data in, suddenly your cost is actually much lower, and you're getting the performance you need. So I would say don't tie it to the data size or data volume. I've literally seen what I jokingly call or lovingly call data torrents. They're not data streams anymore. Right? They're like many, many terabytes a day. And the more you try to do that in a batch fashion, the more your cost is actually going up. But if you bring it into a real time system that's built for this where, you know, it's got all the time series optimizations.
It's got the rolling windows. It's got the ability to store only the copies you actually need at the right granularity. It's got the aging built in. You can actually handle that volume of data at much lower cost. So it's driven by the use case, that's the first thing. It's also driven by the queries, not just the data. I think this is my biggest challenge, and, you know, I will be very transparent. When I say real time analytics, people often think real time data and think of it as data never stops coming. But the thing that people forget is the queries.
In this world, queries never stop coming. This is high QPS. This is data apps. This is "the queries never sleep," unlike analysts, you know, who go to sleep and the queries stop. So it's not just that data never stops coming. If you really have these use cases, the queries never stop coming. That is the number one thing in picking a real time analytics solution. And, again, if the queries never stop coming, that means your warehouse is running 24 by 7. That's a tell. That's a big tell. Every time you're doing that, you're probably misusing it. The other big one is the pattern of your queries. Like, if you have a lot of selectivity, you have WHERE clauses in your queries, and you're sending it to a warehouse that's doing scans, then you're probably getting poor price performance.
So it's both. Data never stops coming, and queries never stop coming. Hopefully, I answered your question. I might have gone a little bit off topic.
[00:40:19] Unknown:
No. It's definitely good. The point about the continuity of queries, and not just the continuity of data, is an interesting one as well, and one that, as you said, is not something that people generally think about when they're exploring this problem of real time data. And I'm curious what you have seen as the ways that people take advantage of the information that that constancy of queries provides to them, particularly for the embedded analytics use case where you're exposing this interactive capability of being able to explore the data that is pertinent to that customer: how the ways that they are exploring that data give you more information about what they care about and how you might think about, you know, driving your own product direction?
[00:41:08] Unknown:
Yeah. So the types of queries coming in are very interesting, for so many different reasons. One is, you can see, for example, if it's, you know, very selective queries, if you have a lot of WHERE clauses, that's an obvious one. But the second one is, again, with customer facing analytics, this is where developers and product managers are really trying to iterate towards what the customer needs, and it's not always obvious. Right? Whereas if you're building reports for your executive, you already know what your executive wants, because the executive comes in and says, I need to be able to report on blah.
But in this new world, the queries not only never stop coming, they never stop changing. This is again a real tell, because developers are iterating, and product managers, I'm a PM, so I can say this, really don't know what the users want. They're trying to iterate towards it as quickly as they can. They can learn, but they don't know on day 1. Right? And this is the agile way of building products, so you have to embrace it, which means you have to give them flexibility. You cannot say, I'm gonna have this very rigid data model, and you've asked me for this particular query, so I'm just gonna optimize the heck out of it for that one query. Because they'll ship the product, and the next day the customer says, oh, I wish I could ask this question this way, and suddenly you come back to a whole new access pattern.
So the classical knowledge in data engineering has been: understand your access patterns and optimize the shit out of it. Am I allowed to say that word? The heck out of it, for your customers. Well, what do you do if your internal customers are developers and they keep coming back to you? Queries never stop changing. So I'm actually saying embrace that. Embrace the flexibility of queries. Embrace the flexibility of your data model. Embrace that agile way of working. And to do that, again, you need to be able to say, whether it's search, aggregations, joins, no matter what the pattern of the queries, it should just be fast out of the box. It should be price performant.
You can't paint yourself into a corner, because you know your PMs are gonna come with a road map improvement, like, one month down the line, and you better be ready for it. So how do you embrace that agility? There has to be a word for this; I haven't come across it. I hear a lot about data modeling in the modern data stack. There's this whole debate happening, but what it doesn't anticipate is that your queries are going to constantly keep changing on you. Your data is going to constantly keep changing on you in the world of data apps, and you should actually embrace that. This is our approach. Right? The flexibility of queries. How do you get that? This is why we're indexing everything.
We're saying, if we index everything and the database indexes are themselves mutable, then data changes? No problem. You go update it in place. Queries change? No problem, because it's already been indexed for search, aggregations, and joins. You don't have to go build new indexes or obsess about, what am I gonna do now that they suddenly wanna do a join with that new dataset? You gotta kinda embrace that.
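To illustrate the "index everything" idea in the abstract, here is a toy sketch: every field of every JSON document, including nested fields, goes into a field-to-value-to-documents index, so an ad hoc filter on any field becomes a lookup instead of a scan. This is a conceptual illustration only, not how Rockset's converged index is actually implemented.

```python
# Toy "index everything" sketch: field -> value -> {doc_id}. Conceptual only.
from collections import defaultdict

index = defaultdict(lambda: defaultdict(set))
docs = {}

def index_doc(doc_id, doc, prefix=""):
    """Index every field of a (possibly nested) JSON document."""
    if not prefix:
        docs[doc_id] = doc
    for field, value in doc.items():
        path = f"{prefix}{field}"
        if isinstance(value, dict):      # nested JSON fields get indexed too
            index_doc(doc_id, value, prefix=f"{path}.")
        else:
            index[path][value].add(doc_id)

def where_equals(field, value):
    """Any field can back a WHERE clause without upfront index planning."""
    return set(index[field][value])

index_doc("d1", {"user": "tobias", "interest": {"topic": "baseball cards"}})
index_doc("d2", {"user": "shruti", "interest": {"topic": "databases"}})
print(where_equals("interest.topic", "baseball cards"))   # {'d1'}
```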
[00:44:24] Unknown:
As you were discussing the constant changing of the queries and what people are trying to explore, it brought to mind the other major trend that's been happening in the data ecosystem around data quality, data observability, sort of lineage, and also the question of governance and compliance and all of these nonfunctional requirements for the data. And I'm wondering how that manifests in this real time world because there's been a growing consensus about how to think about this in the world of data lakes and data warehouses. I'm wondering how you're seeing that manifest in these kind of end user facing applications.
[00:45:04] Unknown:
Yeah. We're absolutely seeing it. Data observability especially becomes even more important, because think about where observability started. Right? It started with DevOps. And now you're talking about developers building products on top of this. So observability has become super important, and it's only extended in the sense that suddenly end to end latency matters. It's not just what you've seen in the data lake and data warehouse world, but it's even more. It's end to end data latency. The way we've worked with this is, literally, we have a metrics endpoint which, you know, people plug into their Datadog or Prometheus, and they basically monitor Rockset like they monitor any production database.
So observability has become even more critical. Data quality, for sure, matters. But again, in the real time world, the best thing is, whether it's coming from an event stream and you're making these decisions based on real time data, or it's coming from a CDC stream, oftentimes there's also another copy of the data somewhere. Like you mentioned, you know, you might be doing dual writes. Sometimes you're also storing it in the lake for eternity for compliance reasons or, you know, for your own historical analysis that you might wanna do. So you're also doing that. So there is a copy of the data somewhere.
We also look at, for example, in our case, CDC streams coming in from Postgres or Mongo. Well, great. That is your system of record, but, say, Rockset is now syncing to that, staying in sync with that. We become the source of truth. So when people are doing analytics, the source of truth is often Rockset. And, again, data quality matters a lot. However, in this real time world, the way we do it is we make sure that you are able to constantly sync with your data source and that you're only 1 to 2 seconds behind. And, you know, in the event stream world, oh my god. So many questions around eventual consistency.
How do you handle out of order events? All of these things matter, and this is why I keep saying a real time analytics system should be able to handle all of that natively. You can't hand-wave your way around it.
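As a small illustration of treating the real time analytics layer like any other production service, here is a sketch that scrapes a Prometheus-style metrics endpoint and flags when end-to-end data latency drifts past a threshold. The URL, metric name, and threshold are placeholders, not Rockset's actual endpoint or metric names; in practice Prometheus or the Datadog agent would handle this scraping for you.

```python
# Sketch: poll a Prometheus-format metrics page and alert on data latency.
# The endpoint URL and metric name below are hypothetical placeholders.
import urllib.request
from typing import Optional

METRICS_URL = "http://localhost:9090/metrics"          # hypothetical endpoint
LATENCY_METRIC = "ingest_end_to_end_latency_seconds"   # hypothetical metric
THRESHOLD_SECONDS = 5.0

def scrape_latency(url: str = METRICS_URL) -> Optional[float]:
    """Parse a Prometheus exposition page for the latency gauge, if present."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith(LATENCY_METRIC):
                return float(line.rsplit(" ", 1)[-1])
    return None

if __name__ == "__main__":
    latency = scrape_latency()
    if latency is not None and latency > THRESHOLD_SECONDS:
        print(f"ALERT: data is {latency:.1f}s behind the source "
              f"(threshold {THRESHOLD_SECONDS}s)")
```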
[00:47:24] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. In your experience interacting with customers and other folks who are building these real time applications, what are some of the most interesting or innovative or unexpected ways that you have seen these real time architectures implemented, or uses of these real time data streams?
[00:48:46] Unknown:
Oh my god. Customers are always surprising us. Of the really interesting examples that come to mind, I thought this whole, like, risk analytics thing was super interesting, because we didn't even realize it, but as we were talking to multiple people in that fintech company, it became clearer and clearer that it's not just fraud. Like, when people say risk, they think of fraud immediately. But it's not just fraud. You know? There are also other types of risk that your business is exposed to which you don't even realize. For example, let's say Apple Pay stopped working in West Africa.
That's not necessarily a fraudulent transaction. But by the time they figured that out on their platform and went and fixed it, 6 hours have gone by, they have lost 6 hours of revenue from West Africa, and that could be millions of dollars. So risk is not always fraud. Risk analytics is becoming more and more interesting, because, yes, it catches fraud, but it also catches a lot of, you know, lost revenue opportunities. And that's been super interesting as we're seeing how people are now doing risk analytics. Again, going back to the economy and these times, people are thinking about risk very differently.
They're thinking about risk to their revenue. They're thinking about risk to, you know, even the sales ops world. A major risk is, as the economy changes and your sales team is figuring it out, everybody's adapting to the new world. There's a major risk if you find out only after you've closed your quarter that you really, you know, are so behind on where you need to be. You need to know that every week, every month, so that you can adapt. In uncertain times, there's a lot of risk, and we're going into the most uncertain time, I think, for a lot of us.
And in uncertain times, risk is everything. It's sales ops. It's marketing ops. It's finance ops. It's really knowing what's happening in your business and adapting what you're doing every day. So that's been really interesting for us, when people suddenly took the term risk analytics and moved it from fraud detection to all of the other things in uncertain times. It's been fascinating to watch how they're using real time to prevent their business from going off the rails or getting derailed by the economy. So that's been one.
What are other really interesting ones? I keep sharing some of these very interesting ones. I'll share a real customer example, Whatnot. This is so cool to see. This is live streaming. I don't know if you folks have heard of it, but it's a very cool company doing live streaming. So this is ecommerce where people get on live. It's a buy, sell marketplace with live video streaming. So, again, new ways of engaging people in ecommerce. And their use case was so fascinating, because how do you do recommendations for live streams? Because you don't have a lot of history. Right? As the stream is happening, it's becoming more and more popular, and you have to recommend the right one. I need to know that Tobias is really into, I don't know, baseball cards, so I can recommend to you that there's a live stream happening about baseball cards that's becoming really popular right now.
And that kind of stuff is super hard to do, and they've actually published a really cool blog on how and why they moved away from Elasticsearch to Rockset for this live streaming example. Because you would think of Elasticsearch for this use case. Right? That's the default that you go to. But suddenly, they wanna join a lot of data, and Elasticsearch doesn't do joins. So they started using Rockset, and that I thought was a very, very cool, interesting use case: just moving away from Elasticsearch to Rockset and doing a bunch of joins. On the logistics tracking side, my favorite example continues to be, you know, again, this is me, I guess, because whenever I see these underrepresented, I wanna say underrepresented in the data world perhaps, industries like heavy construction. It's not digitized. I love the fact that they're digitizing something like heavy construction infrastructure, building better roads, better bridges, by using real time tracking for cement mixers.
That's really cool. Being able to join data from what your contractors are doing on-site and weather information so you can reroute cement trucks in real time. Think of how much money that saves. And we're not talking houses. You know? We're talking bridges and we're talking, you know, roads and all the heavy construction. Well, that's a massive number of contractors and massive taxpayer money going into it. And we're really proud to save taxpayer money by digitizing heavy construction. So we see a lot of this kind of stuff, which, in the grand scheme of things, Facebook has. You know? Uber already has this. What we want is to bring it to the people who don't have it. The people... I know this is a data engineering show, and I want a lot more data engineers in the industry.
Reality is there aren't enough data engineers to go around. So what about that construction company out in the Midwest that cannot hire thousands of data engineers? How can we give the two data engineers they do have superpowers to go run this kind of massive scale operation with price performance and all the good stuff?
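To make the Whatnot-style recommendation example above concrete, here is a minimal sketch of the join-heavy pattern it describes: combining live stream popularity with per-user interests in a single SQL query. The table and column names are hypothetical, and SQLite stands in for a real-time analytics database purely for illustration.

```python
# Toy illustration of the join-heavy recommendation pattern described above.
# Table/column names are hypothetical; a real deployment would run a similar
# query against a real-time analytics database, not SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stream_events (stream_id TEXT, category TEXT, viewed_at INTEGER);
CREATE TABLE user_interests (user_id TEXT, category TEXT);

INSERT INTO stream_events VALUES
  ('s1', 'baseball_cards', 1000), ('s1', 'baseball_cards', 1001),
  ('s1', 'baseball_cards', 1002), ('s2', 'sneakers', 1001);

INSERT INTO user_interests VALUES ('tobias', 'baseball_cards');
""")

# Join live popularity (events in a recent window) with per-user interests so
# we recommend streams that are trending *and* relevant to this user.
rows = conn.execute("""
SELECT e.stream_id, COUNT(*) AS recent_views
FROM stream_events e
JOIN user_interests u ON u.category = e.category
WHERE u.user_id = ? AND e.viewed_at >= ?
GROUP BY e.stream_id
ORDER BY recent_views DESC
LIMIT 5
""", ("tobias", 1000)).fetchall()

print(rows)  # e.g. [('s1', 3)]
```

The point of the sketch is the shape of the query, not the engine: as soon as the recommendation needs a join between fast-moving event data and another dataset, a document search index becomes an awkward fit.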
[00:54:20] Unknown:
In your experience of working in this space of real time analytics and fast data streams and building and managing and directing the Rockset product, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:35] Unknown:
I think the most interesting one I've learned is that almost everything we do, we think of initially as, you know, user workflows or, you know, other projects, but they always come back to price performance. We do a bunch of things on the ingest side. For example, rollups are a great example. Initially, we were thinking, how do we make it easy for people? We start with ease of use and we see the workflows. We see what people are struggling with. And so how do we make it easy for people to do these, you know, constant rollups or real time aggregations? But at the end of the day, it's about price performance.
It's about that compute efficiency. Because, yes, ease of use matters, and, yes, saving people time matters. But as we built out the project, the biggest thing for us was, wait a minute, this is again about compute efficiency, because if we aggregate your data in this way, your queries are much faster, because you just move that aggregation from the query to the ingest in real time. So your queries are much faster, and the compute cost is cut in half. So it always comes down to price performance and compute efficiency in this world. And that, I think, has been the biggest learning and, of course, the most challenging thing we've had to do because, yes, everybody wants real time. You know? I'll give you an example.
10 years ago, if you had asked me, do I want everything shipped to me the next day? I would have said yes with a big asterisk and said, not if it costs me $50 for shipping. I'm not gonna do it. Right? I want it, but I can't afford it. So I would maybe do it for one or two things 10 years ago. But today, I want free shipping for everything, right? Think of the kind of stuff that we expect, free shipping and, you know, next day shipping. You would never have done this 10 years ago, and the only reason it's possible is because you don't have to pay through the nose for it. So that's the big moment for us, which is everybody wants real time for more and more use cases. Everybody wants low latency queries.
Fast is better than slow any day. Right? I mean, who doesn't want fast queries? And when people say I don't need it, what they really mean is I can't afford it. And the only way that you can change the game is by making it so compute efficient that it's even more efficient than doing it in the batchy way. And the minute we do that, which we've already done in a bunch of use cases, the minute you can prove that, now the data engineer can say, I'm giving you double the performance at half the cost. That is a win win for the data team and the consumers.
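As a rough illustration of the rollup idea described above, here is a minimal sketch of moving an aggregation from query time to ingest time. The event shape and the minute-level grain are assumptions for the example, not Rockset's actual rollup configuration.

```python
# Minimal sketch of moving aggregation from query time to ingest time (the
# "rollup" idea). Event fields and the minute-level grain are illustrative
# assumptions, not any particular product's rollup syntax.
from collections import defaultdict

# Ingest path: instead of storing every raw event, keep a running aggregate
# keyed by (metric, minute). Each incoming event updates one small counter.
rollup = defaultdict(lambda: {"count": 0, "sum": 0.0})

def ingest(event):
    key = (event["metric"], event["ts"] // 60)  # minute-level grain
    bucket = rollup[key]
    bucket["count"] += 1
    bucket["sum"] += event["value"]

# Query path: reads touch the small rollup table, not the raw event stream,
# so the same question costs far less compute at query time.
def avg_per_minute(metric):
    return {
        minute: b["sum"] / b["count"]
        for (m, minute), b in rollup.items()
        if m == metric
    }

for i in range(120):
    ingest({"metric": "latency_ms", "ts": i, "value": float(i % 10)})

print(avg_per_minute("latency_ms"))  # {0: 4.5, 1: 4.5}
```

The trade-off is the one described in the conversation: a little extra work on every ingested event buys much cheaper, much faster queries, which is where the price-performance win comes from.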
[00:57:25] Unknown:
For folks who are interested in exploring these real time analytics use cases, what are the situations where Rockset is the wrong choice?
[00:57:32] Unknown:
It goes back to the two things. If your queries are, you know, weekly reports and you only go and run them once a week, it's absolutely the wrong choice. Right? We're not built for analysts doing weekly reports. We're built for developers building data products. So, yes, your analyst might come to you saying, can you make this go faster? And you might be tempted to go put it on something like Rockset because, oh, it's so much faster and cheaper, but not really, because we're not built for those. And what do I mean by we're not built for those? We are indexing your data because we're anticipating that you're gonna have a lot of queries. And if you're only gonna do that one query a week, you know, something like Presto, where you pay through the nose for that one query, is actually the right thing to do, because you don't have a lot of queries and you should maybe be optimizing for something else. Another analogy I like to use is think of retailers. I work with a lot of retailers these days. Why do they have distribution centers as well as retail stores?
A distribution center is actually called a warehouse. So, you know, funny analogy. Why do you need a physical warehouse as well as a store? Because you're optimizing for two different things, and you still need that warehouse to pack away a lot of your boxes for infrequent use. You're only going and pulling a box at a time, infrequently. Once a week, you go to the distribution center and pull out a box to ship to your retail store. That's the perfect use of your warehouse, right, because you're storing things somewhere where the dollar per square foot is very cheap.
Same thing with a data warehouse: your dollar per gig is very, very low. It's a great place to pack away a lot of data for infrequent use. On the Rockset side, that's not what we're built for. So if you're doing infrequent, analyst-style queries once a week, I would recommend not doing it there. On the other side, what is a retail store optimized for? It's optimized for the customer access patterns. It optimizes for the best experience for your customers, optimizes for a lot of foot traffic coming into the store, optimizes for your revenue. That's a retail store. And similarly, we're optimized for compute efficiency. We're optimized for that low latency customer experience where queries never stop coming and data never stops coming. So you cut your compute cost, but, you know, you're making a trade off with your storage cost. And that's why, again, it always comes back to the right tools for the job.
[01:00:15] Unknown:
As you continue to build and evolve and grow the Rockset product, what are some of the things you have planned for the near to medium term, and maybe anything specific to these real time applications?
[01:00:25] Unknown:
Lots of enhancements coming. The biggest one, I would say, is we continue to push the envelope on price performance. We're so excited about some of the benchmarks we recently published, and we're hoping to publish more of these as we're seeing what's actually happening. These benchmarks are hard because it depends on the access patterns of your queries. So while these are still benchmarks, we've been thinking maybe we can actually publish some of the actual queries, with customers' permission, and show that for these access patterns you'll get, you know, a better bang for your buck. So looking forward to pushing the envelope there and publishing some of the actual, very exciting results we're seeing. This whole, like, you know, cut your cost in half and double your performance kind of thing.
The other big one, I'll give you a sneak preview, because my PR team, which is listening, probably is not gonna like it if I announce the whole thing. We're announcing something pretty soon. But it's this whole notion of, as you go to massive scale, now you have multiple use cases. Right? The data mesh is famous for decentralized access, allowing every team to have its own access. So how do you have your cake and eat it too? How do you allow people to process data in real time, but also isolate the different use cases so that each of them can have their own access patterns? And this comes back to compute efficiency. This comes back to compute isolation across different use cases.
So we have some really cool stuff coming here, which gives you your real time data, centralized access to that real time data, but also isolates compute for the different use cases. I'm intentionally not giving you a name for it. I'm not telling you what it is because we're gonna announce it with a big bang, hopefully very soon.
[01:02:13] Unknown:
Are there any other aspects of this real time data applications ecosystem and the ways that you're addressing it at Rockset that we didn't discuss yet that you'd like to cover before we close out the show?
[01:02:26] Unknown:
I think the thing that I keep going back to is when you think about real time data, don't think only event streams. Really think about CDC streams because, I kid you not, this is the most exciting thing that I've seen with transactional databases. I think DynamoDB Streams nailed it. Mongo now has change streams, which is really amazing. Postgres, MySQL, Oracle, I think they've all started to talk about this a lot more. But tap into those CDC streams. It is unbelievable what you can do when you start tapping into your CDC streams. And as you tap into your CDC streams, pay attention to how your downstream system handles updates.
Make sure it's mutable. Make sure it handles upserts, because the minute you have CDC streams, you are gonna get inserts, updates, and deletes. It's no longer insert only. So really pay attention to that mutability. Really pay attention to upserts. And that is very, very interesting in terms of the use cases it unlocks. So the only thing is don't think only event streams and time series data. Think CDC streams, because CDC streams are also real time streams.
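For readers following along, here is a minimal sketch of the upsert handling described above when applying a CDC stream to a mutable downstream store. The change event shape is an assumption for illustration; real CDC formats such as Debezium's differ in detail.

```python
# Minimal sketch of applying a CDC stream to a mutable downstream store.
# Once CDC is in play you receive inserts, updates, and deletes, so the
# destination must upsert rather than append. Event shape is assumed.
state = {}  # primary key -> latest version of the row

def apply_cdc_event(event):
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Upsert: overwrite whatever version of the row we had before.
        state[key] = event["row"]
    elif op == "delete":
        state.pop(key, None)

changes = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "pending"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "key": 1},
]
for change in changes:
    apply_cdc_event(change)

print(state)  # {} -- the row was inserted, updated, then deleted
```

An append-only destination replaying the same three events would report a stale or phantom row, which is exactly why mutability and upsert support matter downstream of CDC.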
[01:03:43] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:57] Unknown:
I would say it's still that kind of observability of the end to end, you know, especially in the real time world. What we see is you're seeing more data apps, you're seeing more data products. There's a lot of work happening in terms of, you know, being able to monitor across hops, but still, end to end data latency is hard to monitor. We provide as much information as we can, and still, some of the things that customers are constantly debugging are, 'Wait, what happened on this system? What happened on my source? What happened on my destination?' I think every data engineer has run into this at some point or another.
And if you want to build production systems on this, it's not enough to have low latency. It's not enough to have compute efficiency. You need to have that level of data observability and monitoring across the whole path. And that, I think, is still developing. A lot of work is being done. I think Bigeye and Monte Carlo are great tools out there, but I still think there's a lot of work to be done. So I'd love to see more and more work there, and we are hoping to work more closely with these vendors too.
[01:05:05] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experiences working in the space of real time analytics and embedded use cases and the architectural and logistical challenges of being able to build and maintain these systems. I appreciate all the time and energy that you and your team at Rockset are putting into supporting these use cases. So thank you again for your time, and I hope you enjoy the rest of your day. Of course. Thank you so much for having me here.
[01:05:39] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Data Complexity and Quality Issues
Interview with Shruti Bhat: Real-Time Data Applications
Use Cases for Low Latency Datasets
Growth and Adoption of Real-Time Analytics
Architectural Patterns and Operational Complexities
Motivations for Real-Time Analytics
Evaluating Real-Time Platforms
Data Retention and Granularity in Real-Time Systems
Batch vs. Real-Time Analysis
Data Quality and Observability in Real-Time Systems
Innovative Uses of Real-Time Architectures
Lessons Learned in Real-Time Analytics
Future Plans for Rockset